
Human-Like Playtesting with Deep Learning


Stefan Freyr Gudmundsson, Philipp Eisen, Erik Poromaa, Alex Nodet, Sami Purmonen,
Bartlomiej Kozakowski, Richard Meurling, Lele Cao
AI R&D, King Digital Entertainment, Activision Blizzard Group, Stockholm, Sweden
Abstract—We present an approach to learn and deploy human-
like playtesting in computer games based on deep learning from
player data. We are able to learn and predict the most “human”
action in a given position through supervised learning on a
convolutional neural network. Furthermore, we show how we can
use the learned network to predict key metrics of new content
— most notably the difficulty of levels. Our player data and
empirical data come from Candy Crush Saga (CCS) and Candy
Crush Soda Saga (CCSS). However, the method is general and
well suited for many games, in particular where content creation
is sequential. CCS and CCSS are non-deterministic match-3
puzzle games with multiple game modes spread over a few
thousand levels, providing a diverse testbed for this technique.
Compared to Monte Carlo Tree Search (MCTS) we show that this
approach increases correlation with average level difficulty, giving
more accurate predictions as well as requiring only a fraction of
the computation time.
Index Terms—deep learning, convolutional neural network,
agent simulation, playtesting, Monte-Carlo tree search
In recent years, game developers have increasingly
adopted a free-to-play business model for their games. This
is especially true for mobile games (see e.g. [1], [2]). In the
free-to-play business model, the core game is available free of
charge and revenue is created through the sales of additional
products and services such as additional content or in-game
items. Therefore, game producers tend to continuously add
content to the game to keep their users engaged and to be
able to continuously monetize on a game title. For this to
work out, it is important that the new content lives up to the
quality expectations of the players.
The difficulty of a game has a considerable impact on
a user’s perceived quality. Denisova et al. [3] argue that
challenge is the most important player experience. In trying
to create the desired experience with regards to the difficulty,
game designers estimate the players’ skill and set game
parameters accordingly. Mobile game companies usually have
sophisticated tracking techniques in place to monitor how
users interact with their games. This way, measures that reflect
the difficulty of the game can be monitored once content has
been released to players.
However, if new content were released directly to
players of the game, they could be exposed to
content that does not live up to their quality expectations and
might abandon the game as a consequence. Therefore, game
designers usually let new content be playtested and tune the
parameters in an iterative manner based on data obtained from
those tests before releasing the new content to players [4], [5].
Playtesting can be carried out by human test players who are
given access to the new content before it is released. However,
human playtesting has the disadvantage of introducing
latency and cost into the development process. Game designers
need to wait for the results from the test players before they
can continue with the next iteration of their development
process. Additionally, results from test players might not lead
to appropriate conclusions about the general player population
as the populations’ skill levels can differ.
In an attempt to tackle these disadvantages, several approaches
for automatic playtesting have been proposed [6]–[9].
Isaksen et al. simulate playing levels using a simple
heuristic and then analyze the level design using survival
analysis. Zook et al. use Active Learning to automatically
tune game parameters to achieve a target value in human
player performance. Poromaa, similarly to Isaksen, proposes
an approach where the playtest is carried out using a
Monte-Carlo Tree Search (MCTS) algorithm to simulate game play.
Silva et al. evaluate a competitive board game by letting
general (MCTS and A*) and custom AI agents play against
each other.
The methods above, however, do not consider data that can
be gathered from content that has been released earlier, when
simulating game play. We hypothesize that taking this data
into account could lead to a game play simulation closer to
human play, and therefore to better estimates of the difficulty
of new content. More specifically, we built a prediction model
that predicts moves from a given game state. This model
is trained on moves that were executed by players on the
previously released content. Once trained, the model acts as
a policy, suggesting which move to execute given a game
state, for an agent simulating game play. The state-of-the-art
methods for predicting a move from a given state are based on
Convolutional Neural Networks (CNNs) [10]–[12]. A CNN is a
specific type of Neural Network (NN) that is very well suited
for data that comes in a grid-like structure [13]. Since the data
of the problem at hand has a grid-like structure and is similar
to data used in state-of-the-art research, CNNs appear to be the
most promising approach for the task at hand. Therefore, we
rely on this approach for this research.
In our investigation, the player data is the essential part.
Hence, we have to limit our research and empirical results to
the games from where we can gather the required data, Candy
Crush Saga (CCS) and Candy Crush Soda Saga (CCSS). It
remains to be tested on other types of games. However, the
approach only requires a state representation which can be
processed by a CNN, a discrete actions space and a substantial
amount of play data. The same method was used in AlphaGo
[11] and the success of agents based on reinforcement learning
in Atari games [14] with a similar state-action setup suggests
that we can do the same for many other types of games.
Interestingly, in the Atari games, the main input is the pixels
of the screen so the grid-like structure can even apply to
pixels. In several games, the action space can be very large,
although discrete. Preprocessing of the action space might be
needed without necessarily reducing the quality of the agent.
For example, in bubble shooter games, e.g. Bubble Witch 3
Saga, Panda Pop, one might shrink the action space to 180
angles or define actions from possible destinations on the
board. In linker games where the order of the links does not
necessarily matter, e.g. Blossom Blast Saga, the combination
of linked squares can grow exponentially with the length of
the link where most combinations result in the same effect on
the game and could, therefore, be defined as the same action.
Bubble shooters, linker games, and clicker games (e.g. Toy
Blast, Toon Blast) are three very popular casual game types
where we think our approach could do very well.
In our case, the difficulty of levels is the key metric. In
practice, a human-like agent can give us many more metrics
from the gameplay, e.g. score distribution and a distribution
of the number of moves needed to succeed. Moreover, it can
become a vital part of the Quality Assurance (QA) workflow,
being able to explore the relevant game space to a much larger
extent than humans or random agents.
Currently, we train the agent on all the gameplay we gather.
Consequently, the agent learns by averaging over all the
players’ policies. The policy of different players can be quite
different and the result of an average policy does not have to
represent the average result of different policies. The results
suggest that there is, nevertheless, significant knowledge to
be gained from the average policy. With player modeling
or "personas" [15]–[17] we could learn policies for different
clusters of players and build agents for each cluster that better
predict the different policies.
A. Casual Games Genre and Match-3 Games
Casual games are a big part of the gaming industry and
the genre has been growing very fast with gaming moving
increasingly to mobile devices. For casual games on mobile
devices it is common that the content generation is sequential,
i.e. new content/levels are added to the game as the players
progress. The frequency of new content can vary from every
few months up to once a week. One of the biggest game types
in the casual game genre is match-3 puzzle games with a few
hundred million monthly active users [18].
CCS and CCSS are two versions of a match-3 game.
They have a 2D board of tiles which may be left empty or
filled with different items (regular and special candies) and
blockers (e.g. chocolate and locked candy). A legal action is
a vertical/horizontal swap of two adjacent game items that
results in a vertical/horizontal match of at least 3 items of
the same type or that are special candies. When included in
a match, special candies remove more items from the board
than just the candies that are part of the match. The empty
tiles are then filled by items dropping down from above. If
there are no items above an empty tile it is filled with a new
random item. The diversity of this game is further enriched
by different game modes, e.g. score level and timed level and
behaviours of special items.
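The matching rule described above can be sketched in a few lines. This is a simplified illustration and not the games' actual logic: the board is a hypothetical list of strings with one character per candy colour, the function names are invented for this example, and special candies, blockers, and empty tiles are ignored.

```python
def has_match(board, r, c):
    """Return True if the item at (r, c) sits in a horizontal or vertical run of 3 or more."""
    rows, cols = len(board), len(board[0])
    item = board[r][c]
    for dr, dc in ((0, 1), (1, 0)):  # horizontal direction, then vertical
        run = 1
        for sign in (1, -1):  # walk both ways from (r, c)
            rr, cc = r + sign * dr, c + sign * dc
            while 0 <= rr < rows and 0 <= cc < cols and board[rr][cc] == item:
                run += 1
                rr, cc = rr + sign * dr, cc + sign * dc
        if run >= 3:
            return True
    return False


def is_legal_swap(board, a, b):
    """A swap is legal if the tiles are adjacent and the swap creates a match."""
    (r1, c1), (r2, c2) = a, b
    if abs(r1 - r2) + abs(c1 - c2) != 1:
        return False  # only vertical/horizontal neighbours can be swapped
    grid = [list(row) for row in board]
    grid[r1][c1], grid[r2][c2] = grid[r2][c2], grid[r1][c1]
    return has_match(grid, r1, c1) or has_match(grid, r2, c2)


# Hypothetical 3x4 board with no match present before the swap.
board = ["RRGR",
         "GBRB",
         "BGBG"]
print(is_legal_swap(board, (0, 2), (0, 3)))
print(is_legal_swap(board, (1, 0), (1, 1)))
```

Swapping the last two tiles of the top row lines up three "R" items, so the first call reports a legal move, while the second swap creates no run of three.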
B. Contributions and Paper Organization
This paper presents an approach to estimate level difficulty
in games by simulating a gameplay policy CNN learned from
human gameplay. Our main contributions are:
• a deep CNN architecture for training agents that can play
the games at hand like human players;
• a generic framework for estimating the level difficulty of
games using agent simulations and binomial regression;
• extensive experimental evaluations that validate the effectiveness
of our framework on match-3 games and imply
practical suggestions for implementation.
In the upcoming sections, we start with related work and
continue to present our proposed approach, followed by thorough
experimental evaluations. Finally, conclusions are
drawn in section VI after a short discussion about future work.
Playtesting in games is used to understand the player experience
and can have different perspectives, difficulty balancing
and crash testing being two common examples. Player experience
can be measured with various metrics [3], [19], [20]. In
our context, the main focus is on playtesting as a means of
balancing the difficulty of content. To automate playtesting,
diverse heuristic-based approaches have been adopted to
construct gameplay agents (e.g. [7], [21], [22]). Agents based on
Monte-Carlo Tree Search, as proposed in [8], [23], [24], are
generic and need little game-specific knowledge. Silva et al.
[9] argued that game-specific agents usually outperform both
standard MCTS and A* agents. In line with that view,
attempts to customize agent heuristics began to emerge;
representative examples include [25], [26]. Although
hand-crafted agents can succeed on one specific game, their
performance does not transfer to other games [27]: different
agents perform best in different games, which makes it
difficult to create agents for multiple games with little
effort. One straightforward (but inefficient) solution is the
ensemble method: the authors of [28] investigated the relative
performance of 7 algorithms to formulate their approach to
general game evaluation, and [29] show that there is no
"one-fits-all" AI algorithm available yet in General Video Game
Playing. More broadly, this problem calls for a more generic
form of intelligent agent that is capable of learning the salient
features embodied in different games by analyzing human-play
patterns and/or directly interacting with game engines.
Although training agents from move patterns (e.g. [30])
has been seen for over a decade, the recent advances in deep
learning techniques have moved the methodologies of this
kind beyond manual feature engineering, towards a setting of
end-to-end supervised learning from raw game-play data. For
instance, Runarsson [31] directly approximated a policy for
the game of Othello using binary classification. The works of [10]
and [12] reported their CNN-based approaches achieving a
prediction accuracy of 44.4% and 42% respectively on a Go
dataset; Silver et al. [11] (AlphaGo) managed to improve the
prediction accuracy to 55.7%.
In a more recent paper, Silver et al. [32] managed to create
an agent that could outperform the best previous artificial
and human players in the game of Go. They proposed
a novel method of Reinforcement Learning (RL) coined
AlphaGo Zero, which used progressive self-play without the
aid of any human knowledge. In the field of multi-agent
collaboration, Peng et al. [33] introduced a bidirectionally
coordinated network with a vectorized extension of the
actor-critic formulation, which managed to learn several effective
coordination strategies in StarCraft¹. The goal of the last two
approaches, however, differs from our goal in that they try
to outperform, not simulate, human players. This can lead to
move patterns that are different from even those of the best
human players.
Fig. 1. An example game board of CCS encoded as 102-channel 2D input.
Our approach suggests using an agent to simulate human
gameplay, creating a metric of interest. Then we relate the
values of that metric observed during the simulation with the
values of the same metric observed by actual, human players.
As mentioned above, the metric of interest in this paper is the
difficulty measured as the average success rate.
Intuitively, the more similar the strategy of the agent is to
that of human players, the more should values observed during
the simulation relate to the values observed from real human
players. We, therefore, suggest training a CNN on human
player data from previous levels to act as policy for an agent
to play new, previously unseen levels.
We benchmark this approach against an approach using
MCTS [34]. MCTS agents are well suited where the game
environment is diverse and difficult to predict. For example,
they are the state-of-the-art in General Game Playing (GGP)
[35] and were a key component for the improvement of Go
programs (e.g. [11], [32], [34], [36], [37]). They are search
based as the agent simulates possible future states with
self-play, building an asynchronous game tree in memory in the
process, until it reaches the end of the search time and chooses
an action to perform [30]. Our previous non-human playtesting
was done with MCTS agents [8].
¹ A game published by Blizzard™:
A. CNN agent
A CNN-based agent sends the state to a policy network
which gives back a probability vector over possible actions. It
can be used to play greedily, picking the action
with the highest probability in each state, and thus plays much
faster than the MCTS agent. The training of the network is
done with supervised learning from player data from previous
levels. Therefore, the policy network learns the most common
action taken by the players in similar states.
CNNs are well suited for capturing structural correlations
from data in grid-like topology [13], [38] which is often the
structure of a game board, especially in casual games and
match-3 games. Hence, we use a customized CNN (Fig. 2a)
as our agent, which predicts the next move greedily using the
current game state (i.e. board layout) as input. In this section,
CCS is used as an exemplary match-3 game facilitating the
explanation of the CNN agent.
B. Representation of Input and Output
The game board state, as the input of the CNN, is represented
as a 9×9 grid with 102 binary feature planes as demonstrated
in Fig. 1. When 0-padded and stacked together, those feature
planes form a 102-channel 2D input to the network. There are
4 types of input channels:
1) 80 item channels: “1” for existence of the corresponding
item, “0” otherwise.
2) 20 objective channels: all “1”s when the corresponding
objective (e.g. creating n striped candies) is still unfulfilled,
or all “0”s.
3) 1 legal-move channel: “1” for a tile that is part of a
legal move, “0” otherwise.
4) 1 bias channel: a plane with “1” for every tile to allow
learning a spatial bias of the game board. It can be thought
of as a heat map of moves.
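The four channel groups can be illustrated with a small encoder. This is a sketch under assumptions, not the authors' code: the channel order (items, objectives, legal moves, bias), the integer ids for items and objectives, the possibility of several items on one tile, and the function name `encode_state` are all hypothetical, and zero-padding is omitted.

```python
BOARD = 9                      # board side length from the text
N_ITEM, N_OBJECTIVE = 80, 20   # channel counts from the text


def encode_state(items, unfulfilled, legal_tiles):
    """Build 102 binary 9x9 planes.

    items:       {(row, col): [item ids present on that tile]}
    unfulfilled: set of objective ids that are not yet fulfilled
    legal_tiles: set of (row, col) tiles that take part in a legal move
    """
    def plane(value=0):
        return [[value] * BOARD for _ in range(BOARD)]

    # 80 item channels: 1 where the corresponding item exists.
    planes = [plane() for _ in range(N_ITEM)]
    for (r, c), ids in items.items():
        for item_id in ids:
            planes[item_id][r][c] = 1

    # 20 objective channels: all ones while the objective is unfulfilled.
    planes += [plane(1 if k in unfulfilled else 0) for k in range(N_OBJECTIVE)]

    # 1 legal-move channel.
    legal = plane()
    for r, c in legal_tiles:
        legal[r][c] = 1
    planes.append(legal)

    # 1 bias channel: ones everywhere.
    planes.append(plane(1))
    return planes


state = encode_state(
    items={(0, 0): [3], (4, 5): [3, 17]},
    unfulfilled={2},
    legal_tiles={(0, 0), (0, 1)},
)
print(len(state), len(state[0]), len(state[0][0]))
```

The result is a list of 102 planes of size 9×9, which a real implementation would typically hand to the network as one tensor.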
Since the moves (output of the network) are horizontal/vertical
swaps of 2 items, they are encoded as a scalar by enumerating
the inner edges of the game grid (Fig. 2b), resulting in 144
possible moves.
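The count of 144 follows from enumerating the inner edges of a 9×9 grid: 9 × 8 horizontal swaps plus 8 × 9 vertical swaps. The sketch below makes that concrete; the particular ordering (all horizontal edges first, then all vertical ones) is an assumption made for illustration and may differ from the ordering in Fig. 2b.

```python
def enumerate_moves(size=9):
    """List every swap of two adjacent tiles as a pair of (row, col) coordinates."""
    moves = []
    for r in range(size):
        for c in range(size - 1):
            moves.append(((r, c), (r, c + 1)))  # horizontal swap
    for r in range(size - 1):
        for c in range(size):
            moves.append(((r, c), (r + 1, c)))  # vertical swap
    return moves


MOVES = enumerate_moves()
MOVE_INDEX = {move: index for index, move in enumerate(MOVES)}  # move -> scalar label

print(len(MOVES))
```

With this mapping, each tracked player move becomes an integer class label, which is the form a softmax classifier expects.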
C. Network Architecture
The architecture is selected as a result of the prestudy using
both the play data from MCTS-agents and human-play data
[39]. The architecture we chose consists of 12 convolutional
layers. We found that a 3×3 kernel with stride 1
performs well for all convolutional layers. As less complex
models are favored during deployment due to faster inference
and training time, we empirically discovered that using only
35 filters for the first 11 convolutional layers is sufficient to
maintain a relatively high accuracy. To obtain move predictions
from previous convolutional layers, we followed [41]: the
last convolutional layer uses exactly 144 filters (to match
the number of possible moves) that are fed into a Global
Average Pooling (GAP) layer (generating 144 scalars) right
before the softmax function. It is also worth mentioning that
adding the classic Fully-Connected (FC) layers with dropout
regularization [38], [42] performed worse than GAP. Despite
the common application of the Rectified Linear Unit (ReLU)
activation function in many prominent CNN architectures (e.g.
[10]–[12], [38], [42]), we opted to use the Exponential Linear
Unit (ELU) function [43] instead, because it improved the
validation accuracy by about 2.5%. In addition, we also
experimented with batch normalization [44] and residual networks
[45] but neither managed to provide better generalization.
Fig. 2. Illustration of (a) the network architecture with the specification of each layer (3×3 convolutions with 35 filters, a final 3×3 convolution with 144 filters, average pooling, and a softmax output of size 144), and (b) the encoding of moves. Figures are adapted from [39], [40].
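To give a feel for how small this network is, the sketch below counts the weights of eleven 35-filter layers followed by one 144-filter layer, and shows how global average pooling turns 144 feature maps into a probability vector over moves. It is a plain-Python illustration, not the authors' implementation; the presence of a bias term per filter and the function names are assumptions.

```python
import math


def conv_params(c_in, c_out, k=3, bias=True):
    """Number of trainable weights in one k x k convolutional layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)


# 102 input channels, eleven layers with 35 filters, one layer with 144 filters.
channels = [102] + [35] * 11 + [144]
total_params = sum(conv_params(a, b) for a, b in zip(channels, channels[1:]))


def gap_softmax(feature_maps):
    """Average each feature map to one scalar, then apply a softmax."""
    pooled = [sum(map(sum, fm)) / (len(fm) * len(fm[0])) for fm in feature_maps]
    top = max(pooled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in pooled]
    norm = sum(exps)
    return [e / norm for e in exps]


# Hypothetical 144 x 9 x 9 activations, only to exercise the function.
maps = [[[i / 10.0] * 9 for _ in range(9)] for i in range(144)]
probs = gap_softmax(maps)
print(total_params, len(probs))
```

Because GAP has no weights of its own, the whole classification head adds nothing to the parameter count beyond the final convolution.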
D. Prediction Models
The strength of an agent lies in its prediction ability, i.e.
how accurately it can predict the difficulty of new content
— a new level. To compare our agents, we must, therefore,
build prediction models based on historical data and compare
the prediction performance on new content. We measure the
difficulty as the success rate, calculated as the ratio between
the total number of successes and the total number of attempts.
For CCS and CCSS we use data from 800 levels to build the
prediction model. Then we predict the difficulty of the subsequent
200 levels, which were not seen during the training of the
CNN policy network. Gathering the data for the MCTS attempts
was the limiting factor: time allowed for running
on 1,000 levels. We try to build the best possible prediction
model for each agent. The results are therefore inevitably
subject to human choices in the prediction modeling. However,
we do not think this biases the comparison. The prediction
models are based on binomial regression using level type
features and the agents’ success rates as inputs. The model
type and input features should not favor the CNN agents over
the MCTS agent.
We use three different measures to compare the predictive
power: 1) mean absolute error (MAE) between the estimated
success rate and the actual success rate, 2) the percentage of
test points lying outside the 95% prediction bands, and 3)
standard deviation (STDDEV) of random effects. Measuring
from the perspective of generalization capability, we are in
favour of predictions for new levels with as little error or bias
as possible. The anticipated prediction error can be expressed
through the STDDEV of random effects and MAE and the
bias is indicated by the percentage of points outside of the
95% prediction bands.
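Two of the three measures are simple to compute once predictions and prediction bands are available; the sketch below shows one way to do so. The success rates and band limits are made-up numbers used only for illustration, and the function names are hypothetical. The standard deviation of random effects comes out of the fitted regression model and is not reproduced here.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between estimated and actual success rates."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def fraction_outside_band(actual, lower, upper):
    """Share of test levels whose actual success rate falls outside its band."""
    outside = sum(1 for a, lo, hi in zip(actual, lower, upper) if not lo <= a <= hi)
    return outside / len(actual)


# Hypothetical values for four test levels.
predicted = [0.40, 0.55, 0.20, 0.70]
actual = [0.45, 0.50, 0.35, 0.70]
lower = [0.30, 0.45, 0.10, 0.60]
upper = [0.50, 0.65, 0.30, 0.80]

mae = mean_absolute_error(predicted, actual)
out_of_band = fraction_outside_band(actual, lower, upper)
print(mae, out_of_band)
```

For a well-calibrated 95% band, the second number should sit near 0.05 on a large test set; a much higher value points to bias in the predictions.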
E. Binomial Regression
Prior to building a statistical model that expresses the
players’ success rate $\rho_{\text{player}}$ using the agent success rate $\rho_{\text{agent}}$, we
observe that they need not map linearly to each other,
for the following reasons:
• The agent’s and the players’ performance depend in different
ways on the game mode and the features present on the board.
• Players show a higher success rate in the presence of game
features requiring deeper strategic thinking.
• The agent is much less random than the players, because
(a) the agent is a single player while human players
belong to a large group of millions of individuals playing
with different skills and strategies; (b) agents follow their
own policy to the letter, which leads to highly correlated results.
• The average success rate observed for players is limited
in its value. The same does not hold for a single player
or a single agent. The observed relationship between the
agent and players cannot hold for levels where the agent
needs many more attempts to succeed than the average
observed for the population.
• We have observed that the agents and players exhibit
different sensitivity to increased difficulty. The difference
does not need to be linear.
In addition to limitations explained above, the model chosen
needs to support rate values — ranging from 0 (when the
agent fails on all attempts) to 1 (when the agent succeeds
on all attempts) and the prediction, including the prediction
uncertainty, needs to stay within this range of values. For that
reason, we model the relationship $\rho_{\text{player}} \sim \rho_{\text{agent}}$ with binomial
regression.
To account for the difference in difficulty in the presence
of different board elements we add features available on the
board as covariates so that the model becomes:
$$\operatorname{logit}(\rho_{\text{player},i}) = \beta_0 + \beta_1 \cdot f(\rho_{\text{agent},i}) + \sum_{j} \beta_j x_j^{\langle i \rangle} + \epsilon_i \qquad (1)$$
where $f$ denotes a transformation of $\rho_{\text{agent},i}$ making it linear in
logit scale; $i$ is the index of observations; $x_j^{\langle i \rangle}$ denotes the $j$-th
feature for the $i$-th observation; and $\epsilon_i$ is the error term.
The data we model is aggregated per level, i.e. each data
point is represented by the average success rate for players
and the average success rate obtained for the agents. The
problem with this approach is that binomial regression imposes
a certain limit on uncertainty. The expected variance of the
observed success rate is formulated as $\operatorname{Var}(\rho) = \rho(1 - \rho)/n$,
where $n$ represents the number of attempts. For both ends of
the success rate range (i.e. $\rho = 0$ or $\rho = 1$), the expected
variance is 0 and for the middle point (i.e. $\rho = 0.5$) it takes
its maximum value and becomes $0.25/n$. The dispersion of
data collected exceeds the limits imposed by the binomial
model. This phenomenon is known as overdispersion [46] and
if not taken into account it causes problems with inference.
In particular it leads to underestimation of the uncertainty
around the estimated parameters and as a consequence to
biased predictions.
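A quick numerical check makes the overdispersion argument concrete. The sketch below compares the spread of observed success rates with the binomial expectation ρ(1 − ρ)/n. The five rates are invented and are meant to stand for repeated batches of n attempts on one level; the helper names are hypothetical. A ratio well above 1 signals overdispersion.

```python
from statistics import mean, variance


def binomial_variance(rho, n):
    """Expected variance of a success rate measured over n independent attempts."""
    return rho * (1 - rho) / n


def dispersion_ratio(rates, n):
    """Observed sample variance divided by the binomial expectation at the mean rate."""
    return variance(rates) / binomial_variance(mean(rates), n)


# Hypothetical success rates from five batches of 100 attempts each.
rates = [0.30, 0.50, 0.42, 0.61, 0.37]
ratio = dispersion_ratio(rates, n=100)

print(binomial_variance(0.5, 100), ratio)
```

If the attempts were independent draws with a common success probability, the ratio would hover around 1; correlated attempts push it higher, which is the situation the random effect in the improved model is meant to absorb.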
The common strategies to account for overdispersion include
adding new features and transforming the existing features,
neither of which solves our problem since we know that
the overdispersion is caused by the agent behavior which tends
to be self-correlated. In statistical literature, such a situation
is described as clustered measurements [46]. In our case, a
cluster is a game level for which we observe an average
success rate. Such data is assumed to be affected by two
random processes: (a) within-cluster randomness (uncertainty
of the measurement of $\rho$ for a single game level) that is already
captured by the term $\epsilon$ in equation (1); and (b) between-cluster
randomness (uncertainty resulting from data being correlated),
which we model by introducing the term $\kappa$ to (1) and hence obtain
the improved model:
$$\operatorname{logit}(\rho_{\text{player},i}) = \beta_0 + \beta_1 \cdot f(\rho_{\text{agent},i}) + \sum_{j} \beta_j x_j^{\langle i \rangle} + \epsilon_i + \kappa_i \qquad (2)$$
where both random terms are expected to be normally
distributed around zero. This kind of model belongs to the family
of generalized linear mixed models. Here, $\kappa_i$ is a random
effect and all other features are fixed effects. The model
estimates all parameters (i.e. $\beta_0$, $\beta_1$ and $\beta_j$) together with
the variance of κ. The prediction of our model has two parts:
the deterministic part expressed by the logit model and the
random part expressed by κ. Since there is no closed-form
solution for obtaining prediction bands resulting from (2), we
obtain them instead by simulation and bootstrapping. The final
result of the modeling is shown in Fig. 4 and Table II.
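The simulation step can be sketched as follows: compute the fixed part on the logit scale, add random draws of the level effect κ, map back to a success rate, and read off percentiles. This is a simplified illustration under assumptions. The coefficients, the feature vector, and the standard deviation of κ are invented; the transformation f is taken to be the logit; and the sketch ignores both parameter uncertainty (which the bootstrapping would capture) and binomial sampling noise.

```python
import math
import random


def logit(p):
    return math.log(p / (1 - p))


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


def prediction_band(rho_agent, features, beta0, beta1, betas, sigma_kappa,
                    draws=10000, seed=0):
    """Return (lower, point, upper) for the predicted player success rate.

    rho_agent must lie strictly between 0 and 1 because logit is used as f.
    """
    rng = random.Random(seed)
    fixed = beta0 + beta1 * logit(rho_agent) + sum(b * x for b, x in zip(betas, features))
    # Simulate the random level effect kappa ~ Normal(0, sigma_kappa).
    sims = sorted(inv_logit(fixed + rng.gauss(0, sigma_kappa)) for _ in range(draws))
    return sims[int(0.025 * draws)], inv_logit(fixed), sims[int(0.975 * draws)]


# Hypothetical coefficients and a level with two binary features.
lower, point, upper = prediction_band(
    rho_agent=0.3, features=[1, 0],
    beta0=0.5, beta1=0.8, betas=[-0.2, 0.4], sigma_kappa=0.5,
)
print(lower, point, upper)
```

Because the noise is added before the inverse logit, the band always stays inside the interval from 0 to 1, which is the property that motivated the choice of binomial regression.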
Binomial regression with random effects describes the data
well but it is also possible to model the same data, log-transformed,
with linear regression. The drawback of linear
regression is that it lacks interpretation when the prediction
goes over 1 or below 0. Also, because the relationship cannot
be assumed to be strictly linear, as explained at the beginning
of this subsection, the variance of the model is higher than the
variance estimated with binomial regression.
Fig. 3. Three flow charts describing the overview of our proposed approach: (a) train a CNN agent on tracked data from levels ≤ 2150; (b) fit a binomial regression model on 800 levels ≤ 2150; (c) predict the human success rate for 200 new levels > 2150 with the binomial regression model.
In this section, we evaluate the proposed approach on the
games at hand. We compare MCTS results to the CNN results
for CCS and additionally show prediction accuracy for CCSS.
We did not build an MCTS agent for CCSS as this is a
non-trivial and time-consuming task in a live game on a game
engine not optimized for simulating agents. Thus, such a
comparison to MCTS for CCSS is not possible. Therefore, we
describe the details of the setup for the CNN agent in CCS and
show results for CCSS where they apply. The overall approach
for both games is identical.
A. Briefs of Human-Player Datasets
The data required to train the human-like agents is gathered
by tracking the state-action pairs (samples) from approximately
1% of players, selected at random, during about 2
weeks. At the time we collected the data, there were
about 2,400 levels released in CCS. For training the CNN
agents, we use 5,500 state-action pairs per level for the level
range of 1 to 2,150, obtaining a dataset with nearly $1.2 \times 10^7$
samples. The data set is split into 3 subsets: training set (4,500
samples per level), validation set (500 samples per level), and
test set (500 samples per level).
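The dataset size and the per-level split can be checked with a few lines. The sketch assumes the 5,500 samples of a level are shuffled before being cut into the three subsets; the text does not say how samples are assigned, so the shuffling and the function name are assumptions.

```python
import random


def split_level_samples(samples, n_train=4500, n_val=500, n_test=500, seed=0):
    """Shuffle one level's state-action pairs and cut them into three disjoint subsets."""
    assert len(samples) >= n_train + n_val + n_test
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


SAMPLES_PER_LEVEL, LEVELS = 5500, 2150
total_samples = SAMPLES_PER_LEVEL * LEVELS  # 11,825,000, i.e. nearly 1.2e7

# Integers stand in for the state-action pairs of a single level.
train, val, test = split_level_samples(range(SAMPLES_PER_LEVEL))
print(total_samples, len(train), len(val), len(test))
```

Splitting within each level keeps every level represented in all three subsets, so validation accuracy reflects the same level mix as training.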
B. Evaluation Procedures and Settings
The experiment design is illustrated in Fig. 3. Prior to the
recent applications of deep learning in developing intelligent
game agents (e.g. [11], [12], [32]), MCTS variants (e.g.
[8], [25]–[27]) served as one of the mainstream approaches for
simulating gameplay (as discussed in section II), and MCTS
still plays a key role in many of the recent deep learning
applications. Therefore, we use Poromaa’s implementation of
MCTS [8] as a benchmark agent in our experiments. The
implementation considers partial objective fulfillment when a
roll-out does not lead to a win, instead of just binary values
(win or loss). The MCTS agent is much more time consuming
than the CNN agent. We want to understand how well the CNN
agent does compared to the MCTS agent. The MCTS agent
runs with 100 attempts on each level to get an estimate of
$\rho_{\text{MCTS},i}$ but at the same time it runs 200 self-play simulations
before taking a decision about which move to make for each
position. In our comparison we run the CNN agent with 100
attempts and also with 1,000 attempts as a proxy for what
could be more than 100,000 attempts if we want to compare
the total number of simulations².

Table I. Best validation accuracy achieved by the different network architectures.

Network architecture         Hyper-parameters searched   Validation accuracy (%)
12Conv+ELU+GAP (a)           α, BS                       32.35 (best)
12Conv+ReLU+FC+Dropout (b)   α, BS, p                    25.72
12Conv+ReLU+GAP (c)          α, BS                       28.59
20ResBlocks+ELU+GAP (d)      α, BS                       30.01
Random Policy (*)            N/A                         16.67

(a) The selected network architecture (Fig. 2a) that achieved the best validation accuracy.
(b) A network consisting of 12 convolutional layers with ReLU activation functions, followed by 2 dropout regularized FC layers.
(c) Same as (a) except that ReLU is used in all convolutional layers.
(d) The convolutional layers used in (a) were replaced by 20 residual blocks of two convolutional operations and a shortcut connection around these two.
(*) Baseline: an entirely random policy for choosing game moves.
Training one version of our CNN model takes about 24
hours on a single machine with 6 CPUs and one Nvidia
Tesla K80 GPU. Game-play simulations are executed using
32 CPUs in parallel. All computational resources are allocated
on demand from a cloud service provider. The selected network
architecture (Fig. 2a: 12Conv+ELU+GAP) is a result
of the pre-study on data (state-action pairs) generated from
game-play by MCTS agents. We experimentally evaluated
4 different network architectures, each of which requires a
hyper-parameter search of learning rate ($\alpha$), batch size (BS),
and dropout keep probability ($p$). The values used when
conducting a hyper-parameter grid search are respectively
$\alpha \in \{5 \times 10^{-5}, 1 \times 10^{-4}, 5 \times 10^{-4}\}$, $\text{BS} \in \{2^7, 2^8, 2^9\}$, and
$p \in \{0.4, 0.5, 0.6\}$. We report the best validation accuracy
achieved by different network architectures in Table I. We
found that a learning rate of $5 \times 10^{-4}$ and a batch size of
$2^9$ lead to the best results.
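The grid search can be sketched as a Cartesian product over the candidate values, reading the learning rates as 5×10⁻⁵, 1×10⁻⁴, 5×10⁻⁴ and the batch sizes as 2⁷, 2⁸, 2⁹. The dictionary keys, the function names, and in particular `toy_validation_accuracy` are hypothetical; in practice the scoring function would train a network and return its validation accuracy.

```python
from itertools import product

LEARNING_RATES = [5e-5, 1e-4, 5e-4]
BATCH_SIZES = [2 ** 7, 2 ** 8, 2 ** 9]
KEEP_PROBS = [0.4, 0.5, 0.6]


def grid(uses_dropout):
    """All hyper-parameter combinations; keep probability only matters with dropout."""
    if uses_dropout:
        return [{"lr": lr, "batch_size": bs, "keep_prob": p}
                for lr, bs, p in product(LEARNING_RATES, BATCH_SIZES, KEEP_PROBS)]
    return [{"lr": lr, "batch_size": bs}
            for lr, bs in product(LEARNING_RATES, BATCH_SIZES)]


def best_config(configs, validation_accuracy):
    """Pick the configuration with the highest validation accuracy."""
    return max(configs, key=validation_accuracy)


def toy_validation_accuracy(cfg):
    # Hypothetical stand-in for "train the network and measure accuracy".
    return cfg["lr"] * 1000 + cfg["batch_size"] / 10000


best = best_config(grid(uses_dropout=False), toy_validation_accuracy)
print(len(grid(False)), len(grid(True)), best)
```

Architectures without dropout need 9 training runs and the FC+Dropout variant needs 27, which matters when a single run takes about 24 hours.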
C. The Training Performance of CNN-based Agents
From this section onward, we base our analysis on the best-performing
architecture (i.e. 12Conv+ELU+GAP) from the previous
section. Agents using that network architecture are trained with
the data obtained by tracking human players. The validation
and test accuracy reached around 47% for CCS and 48% for
CCSS. Comparing the validation accuracies we notice that
CNN agents trained on real human-player data performed
almost 50% better than the ones trained on data produced by
MCTS agents. We tried to improve the accuracy by adding
complexity to the model in the form of more filters per layer
as well as adding a layer of linear combination after GAP,
but none of those added components made any significant
improvement to the network’s performance.
² MCTS: 100 attempts, 30 moves per attempt, 200 simulations per move.
D. The Performance of Simulating Game Play
We can now use the trained CNN agent as a policy
evaluating all actions $a \in A$ given a state $s$, where $A$ is
the set of actions. The action with the highest probability,
$\arg\max_{a \in A} P(a \mid s)$, is then executed by the agent. The MCTS
agents use 200 simulations to make one move in one state.
This number was shown in [8] to produce good results in a
tolerable amount of time. Selecting and executing an action
leads to a new state $s'$ with a new set of possible actions
$A'$. The available actions are then again evaluated by the
respective agents. This loop of executing an action given a
state is continued until a terminal state is reached (either
fulfillment of the objective or out of moves).
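The loop described above can be sketched with a toy environment. Everything here is hypothetical: `ToyLevel` is a stand-in for a game level, `toy_policy` replaces the trained network with a fixed probability vector, and restricting the choice to legal actions is an assumption. The point is only the control flow: pick the most probable action, apply it, and repeat until a terminal state, then average wins over many attempts.

```python
import random


class ToyLevel:
    """Hypothetical level: reach `target` points within `moves` moves."""

    def __init__(self, target, moves, rng):
        self.target, self.moves_left, self.score, self.rng = target, moves, 0, rng

    def legal_actions(self):
        return [0, 1, 2]

    def step(self, action):
        self.moves_left -= 1
        # Non-deterministic outcome, loosely mimicking random candy drops.
        self.score += action + self.rng.randint(0, 1)

    def done(self):
        return self.score >= self.target or self.moves_left == 0

    def won(self):
        return self.score >= self.target


def toy_policy(level):
    """Stand-in for the policy network: a fixed probability per action."""
    return [0.2, 0.3, 0.5]


def play_attempt(level, policy):
    """Play greedily until the objective is fulfilled or moves run out."""
    while not level.done():
        probs = policy(level)
        action = max(level.legal_actions(), key=lambda a: probs[a])
        level.step(action)
    return level.won()


def success_rate(make_level, policy, attempts=100):
    wins = sum(play_attempt(make_level(), policy) for _ in range(attempts))
    return wins / attempts


rng = random.Random(0)
easy = success_rate(lambda: ToyLevel(10, 10, rng), toy_policy)
hard = success_rate(lambda: ToyLevel(100, 10, rng), toy_policy)
medium = success_rate(lambda: ToyLevel(25, 10, rng), toy_policy)
print(easy, medium, hard)
```

The resulting per-level success rate plays the role of the agent success rate that feeds the regression model.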
E. Comparing Predictions
The predicted values for the 200 test levels and the associated
prediction bands are shown in Fig. 4. The graphs compare
prediction accuracy for the CNN agents and the MCTS agent.
The CNN agents played both CCS and CCSS while the MCTS
agent played only CCS. We also report the impact of the
number of attempts on the prediction. For the CNN
agents the prediction is based on 100 or 1,000 attempts and
for the MCTS agent the prediction is based on 100 attempts.
Table II summarizes the models. We see that MAE is lower
for CCS than for CCSS and that, on CCS, the CNN agent has a
lower MAE than MCTS with both 100 and 1,000 attempts (4.9%
and 4.0% versus 5.4%). The prediction band is also much wider
for MCTS, indicating that the CNN agent is a stronger
predictor. It is interesting to see that for the MCTS agent
the ratio of predictions outside the 95% prediction band is
close to the expected 5%, while the out-of-band ratio is much
higher for CNN. This is partly due to the wider prediction
band for MCTS, but it also suggests that MCTS is quite robust
to the evolution of the game. Note that the game is evolving
with every new level, sometimes introducing new elements
which the CNN has not been trained on. Therefore, for optimal
prediction performance the CNN must be retrained when new
elements are introduced to the game, but that was not done
here. MCTS and any model predicting player difficulty
measures would need to retrain their prediction model for new
game elements; additionally, the CNN agents need new
tracked data to update the policy.
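The two summary measures discussed here, MAE and the out-of-band ratio, can be computed as in the sketch below. It assumes the predicted success rates and the lower and upper limits of the 95% prediction band are already available for each level; the numbers are made up for illustration and are not data from the experiments.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute gap between predicted and observed success rates."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def out_of_band_ratio(observed, lower, upper):
    """Share of levels whose observed value falls outside the prediction band."""
    outside = sum(
        1 for o, lo, hi in zip(observed, lower, upper) if not lo <= o <= hi
    )
    return outside / len(observed)


if __name__ == "__main__":
    # Illustrative values only (four hypothetical levels).
    predicted = [0.30, 0.50, 0.70, 0.20]
    observed = [0.35, 0.45, 0.90, 0.20]
    half_width = 0.10  # assumed constant band half-width, for simplicity
    lower = [p - half_width for p in predicted]
    upper = [p + half_width for p in predicted]

    print(mean_absolute_error(predicted, observed))
    print(out_of_band_ratio(observed, lower, upper))
```

A wide band lowers the out-of-band ratio almost by construction, which is why the two measures are best read together, as in the comparison between MCTS and CNN above.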
[Figure 4, three panels: (a) CCS: MCTS 100 Att/Lvl, (b) CCS: CNN 1000 Att/Lvl, (c) CCSS: CNN 1000 Att/Lvl. Axes: agent success rate (transformed) and player success rate (transformed).]
Fig. 4. The success rate obtained by different agents plotted against the success rate of human players. Note that the values have been transformed. The scale is
the same in all plots. The actual success rates for players are considered sensitive information and are therefore removed from the plots. The grey shaded areas
indicate the 95% prediction band. The graphs provide a visual overview of the models' prediction performance. The uncertainty band is narrower for the CNN
agents and thus the players’ performance is predicted more precisely.

TABLE II
Agent   Att/Lvl   Game   MAE    Out-of-band ratio   STDDEV
MCTS    100       CCS    5.4%   4%                  53%
CNN     1,000     CCS    4.0%   11%                 35%
CNN     100       CCS    4.9%   24%                 33%
CNN     1,000     CCSS   5.7%   17%                 38%
CNN     100       CCSS   6.6%   23%                 35%

The policy that the CNN agent is learning is the average
policy of all the players. It would improve difficulty
predictions if we could learn different policies representing
different kinds of players. Creating player "personas" based
on different policies to represent clusters of similar players
has the potential to greatly increase the understanding of
levels. With different "personas" we could measure how often
different policies agree and how certain the move predictions
are, possibly indicating the different experiences players have,
e.g. whether a level needs a high level of strategy or not. It
could also improve the prediction to play non-deterministically
with the CNN policy, with probabilities given by the CNN
prediction output or ε-greedy. The architecture and
hyper-parameters of the CNN can likely be improved, which would
be interesting to investigate further, especially with more
data. Practically, it is important to measure other key metrics,
which can be done in much the same way as for the difficulty.
We have already done this for score distribution and move
distribution in CCS, CCSS and other games. For Procedural
Content Generation (PCG) [47], the proposed agent can be a
critical ingredient in the generation loop, for example by
providing a fitness function for an evolutionary algorithm in a
search-based approach [48]. Finally, we have indirectly been
using the CNN agent for Quality Assurance. Playing with an
agent which visits tens of thousands of the most relevant
states in a level’s state space has proven valuable and could
prove to be an interesting research topic in its own right.
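The two non-deterministic alternatives mentioned above, sampling from the CNN output distribution and ε-greedy selection, could look roughly like this. It is a sketch of the general techniques, not something evaluated in this paper; the probability dictionary and the move names are hypothetical stand-ins for the network's output over the available moves.

```python
import random


def sample_action(probs, rng):
    """Draw an action with probability proportional to the policy output."""
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def epsilon_greedy_action(probs, epsilon, rng):
    """With probability epsilon pick uniformly at random, else pick the argmax."""
    if rng.random() < epsilon:
        return rng.choice(list(probs))
    return max(probs, key=probs.get)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical policy output for three available moves.
    probs = {"swap_left": 0.2, "swap_right": 0.7, "swap_up": 0.1}
    print(sample_action(probs, rng))
    print(epsilon_greedy_action(probs, epsilon=0.1, rng=rng))
```

Sampling follows the predicted distribution of human moves directly, whereas ε-greedy keeps the most likely move as the default and adds a tunable amount of uniform exploration.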
Inspired by the recent advancement of deep learning
techniques, mostly in the domain of computer vision, we proposed
a framework for estimating level difficulty of match-3 games,
the core of which is essentially a CNN-based agent trained on
human-player data. However, the method is general and well
suited for many games, in particular where content creation is
sequential. Our approach outperformed the state-of-the-art
MCTS-based agents by a large margin in both prediction accuracy
and execution efficiency.
In CCS we can now estimate the difficulty of a new level
in less than a minute and can easily scale the solution at
a low cost. This compares to the previous 7 days needed
with human playtesting on each new episode of 15 levels.
This completely changes the level design process: level
designers now have more freedom to iterate on the design and
can focus more on innovation and creativity than before.
Internally, we have also tried this approach on a game in
development using rather limited playtest data. Nevertheless,
we were able to train a decent agent, albeit a much noisier one
than in CCS and CCSS, which has helped a lot with the iterative
process of game development. Since we ran the experiments
presented in this paper, we have used the CNN agent for more
than a year, for more than 1,000 new levels in CCS. The
prediction accuracy has been stable, and when new game features
have been introduced it has been easy to retrain the agent to
learn the new feature and continue predicting the difficulty.
[1] J. Hamari, N. Hanner, and J. Koivisto, “Service quality explains why
people use freemium services but not if they go premium: An empirical
study in free-to-play games,” International Journal of Information
Management, vol. 37, no. 1, pp. 1449–1459, 2017.
[2] K. Alha, E. Koskinen, J. Paavilainen, and J. Hamari, “Free-to-Play
Games: Professionals’ Perspectives,” in Proceedings of Nordic Digra.
Gotland, Sweden, 2014, pp. 1–14.
[3] A. Denisova, C. Guckelsberger, and D. Zendle, “Challenge in
digital games: Towards developing a measurement tool,” in
Proceedings of the 2017 CHI Conference Extended Abstracts on
Human Factors in Computing Systems, ser. CHI EA ’17. New
York, NY, USA: ACM, 2017, pp. 2511–2519. [Online]. Available:
[4] M. Seif El-Nasr, A. Drachen, and A. Canossa, Game Analytics. London:
Springer, 2013.
[5] A. Drachen and A. Canossa, “Towards gameplay analysis via gameplay
metrics,” in Proceedings of the 13th International MindTrek Conference:
Everyday Life in the Ubiquitous Era on - MindTrek ’09. New York,
USA: ACM Press, 2009, p. 202.
[6] A. Zook, E. Fruchter, and M. O. Riedl, “Automatic Playtesting for Game
Parameter Tuning via Active Learning,” Foundations of Digital Games,
[7] A. Isaksen, D. Gopstein, and A. Nealen, “Exploring game space using
survival analysis,” in Proceedings of the 10th International Conference
on the Foundations of Digital Games. Pacific Grove, CA, 2015.
[8] E. R. Poromaa, “Crushing Candy Crush: Predicting Human Success Rate
in a Mobile Game using Monte-Carlo Tree Search,” Master’s thesis,
KTH Royal Institute of Technology, 2017.
[9] F. Silva, S. Lee, and N. Ng, “AI as Evaluator: Search Driven Playtesting
in Game Design,” in Proceedings of AAAI. Phoenix City, USA, 2016.
[10] C. Clark and A. Storkey, “Training deep convolutional neural networks
to play go,” in International Conference on Machine Learning. Lille,
France, 2015, pp. 1766–1774.
[11] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den
Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanc-
tot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever,
T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis,
“Mastering the game of Go with deep neural networks and tree search,”
Nature, vol. 529, no. 7587, pp. 484–489, 1 2016.
[12] K. Shao, D. Zhao, Z. Tang, and Y. Zhu, “Move prediction in Gomoku
using deep learning,” in Proceedings of IEEE Youth Academic Annual
Conference of Chinese Association of Automation (YAC). Hefei, China,
2017, pp. 292–297.
[13] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521,
no. 7553, pp. 436–444, 2015.
[14] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade
learning environment: An evaluation platform for general agents,” J.
Artif. Intell. Res. (JAIR), vol. 47, pp. 253–279, 2013.
[15] C. Holmgård, A. Liapis, J. Togelius, and G. N. Yannakakis, “Evolving
personas for player decision modeling,” in 2014 IEEE Conference on
Computational Intelligence and Games, pp. 1–8.
[16] ——, “Personas versus clones for player decision modeling,” in En-
tertainment Computing – ICEC 2014, Y. Pisan, N. M. Sgouros, and
T. Marsh, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2014,
pp. 159–166.
[17] C. Holmgård, M. C. Green, A. Liapis, and J. Togelius, “Automated
playtesting with procedural personas through MCTS with evolved
heuristics,” CoRR, vol. abs/1802.06881, 2018. [Online]. Available:
[18] M. T. Omori and A. S. Felinto, “Analysis of motivational elements of
social games: a puzzle match 3-games study case,” International Journal
of Computer Games Technology, vol. 2012, p. 9, 2012.
[19] C. Guckelsberger, C. Salge, J. Gow, and P. Cairns, “Predicting player
experience without the player.: An exploratory study,” in Proceedings
of the Annual Symposium on Computer-Human Interaction in Play,
ser. CHI PLAY ’17. New York, NY, USA: ACM, 2017, pp. 305–315.
[Online]. Available:
[20] N. Shaker, S. Asteriadis, G. N. Yannakakis, and K. Karpouzis, “Fusing
visual and behavioral cues for modeling user experience in games,”
IEEE Trans. Cybernetics, vol. 43, no. 6, pp. 1519–1531, 2013. [Online].
[21] D. Churchill, A. Saffidine, and M. Buro, “Fast heuristic search for rts
game combat scenarios.” in Proceedings of the 8th AIIDE. Palo Alto,
California, 2012, pp. 112–117.
[22] D. Perez, S. Samothrakis, and S. Lucas, “Knowledge-based fast evo-
lutionary MCTS for general video game playing,” in Proceedings of
IEEE Conference on Computational Intelligence and Games (CIG).
Dortmund, Germany, 2014, pp. 1–8.
[23] A. Zook, B. Harrison, and M. O. Riedl, “Monte-carlo tree search
for simulation-based strategy analysis,” in Proceedings of the 10th
Conference on the Foundations of Digital Games, 2015.
[24] S. Devlin, A. Anspoka, N. Sephton, P. Cowling, and
J. Rollason, “Combining gameplay data with monte carlo tree
search to emulate human play,” 2016. [Online]. Available:
[25] T. Imagawa and T. Kaneko, “Enhancements in monte carlo tree search
algorithms for biased game trees,” in Proceedings of IEEE Conference
on Computational Intelligence and Games (CIG). Tainan, Taiwan,
2015, pp. 43–50.
[26] A. Khalifa, A. Isaksen, J. Togelius, and A. Nealen, “Modifying MCTS
for Human-like General Video Game Playing,” in Proceedings of IJCAI.
New York, USA, 2016, pp. 2514–2520.
[27] A. Mendes, J. Togelius, and A. Nealen, “Hyper-heuristic general video
game playing,” in Proceedings of IEEE Conference on Computational
Intelligence and Games (CIG). Santorini, Greece, 2016, pp. 1–8.
[28] T. S. Nielsen, G. A. Barros, J. Togelius, and M. J. Nelson, “General
video game evaluation using relative algorithm performance profiles,”
in Proceedings of European Conference on the Applications of Evolu-
tionary Computation. Copenhagen, Denmark, 2015, pp. 369–380.
[29] P. Bontrager, A. Khalifa, A. Mendes, and J. Togelius, “Matching
games and algorithms for general video game playing,” in Proceedings
of the Twelfth AAAI Conference on Artificial Intelligence and
Interactive Digital Entertainment, AIIDE 2016, October 8-12, 2016,
Burlingame, California, USA., 2016, pp. 122–128. [Online]. Available:
[30] R. Coulom, “Efficient selectivity and backup operators in Monte-Carlo
tree search,” Computers and games, vol. 4630, pp. 72–83, 2007.
[31] T. P. Runarsson and S. M. Lucas, “Preference learning for move
prediction and evaluation function approximation in Othello,” IEEE
Transactions on Computational Intelligence and AI in Games, vol. 6,
no. 3, pp. 300–313, 2014.
[32] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang,
A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering
the game of go without human knowledge,” Nature, vol. 550, no. 7676,
pp. 354–359, 2017.
[33] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang,
“Multiagent bidirectionally-coordinated nets for learning to play starcraft
combat games,” arXiv preprint arXiv:1703.10069, 2017.
[34] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling,
P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton,
“A survey of monte carlo tree search methods,” IEEE Transactions on
Computational Intelligence and AI in games, vol. 4, no. 1, pp. 1–43,
[35] M. Świechowski, H. Park, J. Mańdziuk, and K.-J. Kim, “Recent
advances in general game playing,” The Scientific World Journal, vol. 2015,
pp. 1–22, 2015.
[36] S. Gelly and D. Silver, “Monte-carlo tree search and rapid action value
estimation in computer go,” Artificial Intelligence, vol. 175, no. 11, pp.
1856–1875, 2011.
[37] ——, “Combining online and offline knowledge in uct,” in Proceedings
of the 24th International Conference on Machine learning. Oregon,
USA, 2007, pp. 273–280.
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification
with Deep Convolutional Neural Networks, 2012.
[39] P. Eisen, “Simulating human game play for level difficulty estimation
with convolutional neural networks,” Master’s thesis, KTH Royal Insti-
tute of Technology, 2017.
[40] S. Purmonen, “Predicting game level difficulty using deep neural net-
works,” Master’s thesis, KTH Royal Institute of Technology, 2017.
[41] M. Lin, Q. Chen, and S. Yan, “Network in network,” in Proceedings of
ICLR. Banff, Canada, 2014.
[42] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for
Large-Scale Image Recognition,” Information and Software Technology,
vol. 51, no. 4, pp. 769–784, 9 2014.
[43] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and Accurate
Deep Network Learning by Exponential Linear Units (ELUs),” in
Proceedings of ICLR. Vancouver, Canada, 2016, pp. 1–13.
[44] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” in Proceedings of
International Conference on Machine Learning. Lille, France, 2015,
pp. 448–456.
[45] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition. Las Vegas, USA, 2016, pp. 770–778.
[46] J. Hinde and C. G. Demétrio, “Overdispersion: models and estimation,”
Computational Statistics & Data Analysis, vol. 27, no. 2, pp. 151–170,
[47] N. Shaker, J. Togelius, and M. J. Nelson, Procedural Content Generation
in Games, 1st ed. Springer Publishing Company, Incorporated, 2016.
[48] V. Volz, J. Schrum, J. Liu, S. M. Lucas, A. Smith, and S. Risi, “Evolving
Mario Levels in the Latent Space of a Deep Convolutional Generative
Adversarial Network,” ArXiv e-prints, May 2018.
... Within the game research community, ACB is regarded as a promising solution to address this challenge by leveraging AI players and generators (i.e., balancers) agents as game testers and designers, respectively. In ACB, two repetitive phases are involved in automating game design tasks and quality assurance: (1) generating new game content with machine learning (PCGML) [5] methods and (2) evaluating the content through play-testing with an AI player. These automated sequences allow generator models to be trained through extensive trial-and-error iterations, surpassing the limited trials feasible with human testers. ...
... The robustness is measured with the performance ratio that inc-/decreasing in an unseen environment. We adopt the win rate metric to evaluate an agent's performance in an environment, and the win rate is a widely used metric for agent-based play-testing [2], [47]; the win rate is calculated by dividing the number of wins by total number of episodes. ...
Full-text available
The balance of game content significantly impacts the gaming experience. Unbalanced game content diminishes engagement or increases frustration because of repetitive failure. Although game designers intend to adjust the difficulty of game content, this is a repetitive, labor-intensive, and challenging process, especially for commercial-level games with extensive content. To address this issue, the game research community has explored automated game balancing using artificial intelligence (AI) techniques. However, previous studies have focused on limited game content and did not consider the importance of the generalization ability of play-testing agents when encountering content changes. In this study, we propose RaidEnv, a new game simulator that includes diverse and customizable content for the boss raid scenario in the MMORPG games. Additionally, we design two benchmarks for the boss raid scenario that can aid in the practical application of game AI. These benchmarks address two open problems in automatic content balancing, and we introduce two evaluation metrics to provide guidance for AI in automatic content balancing. This novel game research platform expands the frontiers of automatic game balancing problems and offers a framework within a realistic game production pipeline. The open-source environment is available at a GitHub repository.
... Within the game research community, ACB is regarded as a promising solution to address this challenge by leveraging AI player and generator (i.e., balancer) agents as game testers and designers, respectively. The ACB process involves two repetitive phases to automating game design tasks and quality assurance: (1) generating new game content using machine learning methods known as procedural content generation with machine learning (PCGML) [5] In ACB, there are two repetitive phases to automating game design tasks and quality assurance: (1) generating new game content with machine learning (PCGML) [5] methods and (2) evaluating the content through playtesting with an AI player. These automated sequences allow generator models to be trained through extensive trial-and-error iterations, surpassing the limited trials feasible with human testers. ...
... To evaluate the change of difficulty fairly, To evaluate the change of difficulty fairly, we made the population by extracting representative scores for a given level. The difficulty of level can be determined using various game logs such as churn rate [44], [45], success rate [2], [46]. Herein, we evaluate the level of difficulty based on the win rate of games. ...
Full-text available
The balance of game content significantly impacts the gaming experience. Unbalanced game content diminishes engagement or increases frustration because of repetitive failure. Although game designers intend to adjust the difficulty of game content, this is a repetitive, labor-intensive, and challenging process, especially for commercial-level games with extensive content. To address this issue, the game research community has explored automated game balancing using artificial intelligence (AI) techniques. However, previous studies have focused on limited game content and did not consider the importance of the generalization ability of playtesting agents when encountering content changes. In this study, we propose RaidEnv, a new game simulator that includes diverse and customizable content for the boss raid scenario in MMORPG games. Additionally, we design two benchmarks for the boss raid scenario that can aid in the practical application of game AI. These benchmarks address two open problems in automatic content balancing, and we introduce two evaluation metrics to provide guidance for AI in automatic content balancing. This novel game research platform expands the frontiers of automatic game balancing problems and offers a framework within a realistic game production pipeline.
... Lastly, actual player play-traces can also be used to learn how actual players play, which was demonstrated by Gudmundsson et al. [2], where a convolutional neural network action selection policy is learned from the play-traces. However, one issue with using playtraces is that these data are not always available, for example for a newly released game with little or no player data, or technical limitations such as cost of storage or tracking issues. ...
... The results so far suggest that there is a correlation between the agent behaviour with player completion rate. Unlike other works that try to model and predict the precise player metric (c.f., [2]) using the rank correlation can instead be used to give an estimate on how a level is compared to other levels. For example, it may show that a certain level is one of the top 10% most difficult levels. ...
In this work we investigate whether it is plausible to use the performance of a reinforcement learning (RL) agent to estimate the difficulty measured as the player completion rate of different levels in the mobile puzzle game Lily's Garden.For this purpose we train an RL agent and measure the number of moves required to complete a level. This is then compared to the level completion rate of a large sample of real players.We find that the strongest predictor of player completion rate for a level is the number of moves taken to complete a level of the ~5% best runs of the agent on a given level. A very interesting observation is that, while in absolute terms, the agent is unable to reach human-level performance across all levels, the differences in terms of behaviour between levels are highly correlated to the differences in human behaviour. Thus, despite performing sub-par, it is still possible to use the performance of the agent to estimate, and perhaps further model, player metrics.
... In addition to MCTS, methods based on Deep Learning are receiving attention. Gudmundsson et al. (Gudmundsson et al. 2018) trained a convoluted neural network that takes player data from Candy Crush Saga (CCS) and Candy Crush Soda Saga (CCSS) as input to predict human actions in a given position. Their experiments show that this algorithm works well in CCS and CCSS, and it is also suitable for many games, especially where content creation is sequential. ...
We introduce a new reward function direction for intrinsically motivated reinforcement learning to mimic human behavior in the context of computer games. Similar to previous research, we focus on so-called ``curiosity agents'', which are agents whose intrinsic reward is based on the concept of curiosity. We designed our novel intrinsic reward, which we call ``Cautious Curiosity'' (CC) based on (1) a theory that proposes curiosity as a psychological definition called information gap, and (2) a recent study showing that the relationship between curiosity and information gap is an inverted U-curve. In this work, we compared our agent using the classic game Super Mario Bros. with (1) a random agent, (2) an agent based on the Asynchronous Advantage Actor Critic algorithm (A3C), (3) an agent based on the Intrinsic Curiosity Module (ICM), and (4) an average human player. We also asked participants (n = 100) to watch videos of these agents and rate how human-like they are. The main contribution of this work is that we present a reward function that, as perceived by humans, induces an agent to play a computer game similarly to a human, while maintaining its competitiveness and being more believable compared to other agents.
... The recent success of deep neural networks and Reinforcement Learning (RL) has prompted extensive studies in their potential for automated playtesting. Supervised deep learning has been used to simulate humanlike playtesting to predict level difficulty [14]; however, the method is limited by the need for player data at a scale not plausible for new games in production. At the same time RL has been proposed to augment the game testing framework, introducing self-learning to testing to find game exploits and bugs [15]. ...
Reinforcement learning has been widely successful in producing agents capable of playing games at a human level. However, this requires complex reward engineering, and the agent's resulting policy is often unpredictable. Going beyond reinforcement learning is necessary to model a wide range of human playstyles, which can be difficult to represent with a reward function. This paper presents a novel imitation learning approach to generate multiple persona policies for playtesting. Multimodal Generative Adversarial Imitation Learning (MultiGAIL) uses an auxiliary input parameter to learn distinct personas using a single-agent model. MultiGAIL is based on generative adversarial imitation learning and uses multiple discriminators as reward models, inferring the environment reward by comparing the agent and distinct expert policies. The reward from each discriminator is weighted according to the auxiliary input. Our experimental analysis demonstrates the effectiveness of our technique in two environments with continuous and discrete action spaces.
... In this combination, the authors prove that for certain game genres, it is possible to obtain both state and code coverage detection. Using the same basic method, other works exploit the idea of training agents by using reward functions to punish their actions when they deviate too much from human behaviors, as shown in (Gudmundsson et al., 2018), (Tastan and Sukthankar, 2011), and (Glavin and Madden, 2015). A novelty in the same area, presented in (Bergdahl et al., 2020), , , extends the previous ideas by attempting to create a heatmap of the situations analyzed at each point in the testing process. ...
Conference Paper
Full-text available
The increasingly popular no- or low-code paradigm is based on functional blocks connected on a graphical interface that is accessible to many stakeholders in an application. Areas such as machine learning, DevOps, digital twins, simulations, and video games use this technique to facilitate communication between stakeholders regarding the business logic. However, the testing methods for such interfaces that connect blocks of code through visual programming are not well studied. In this paper, we address this research gap by taking an example from a niche domain that nevertheless allows for full generalization to other types of applications. Our open-source tool and proposed methods are reusing existing software testing techniques, mainly those based on fuzzing methods, and show how they can be applied to test applications defined as visual interaction blocks. Specifically for simulation applications, but not limited to them, the automated fuzz testing processes can serve two main purposes: (a) automatically generate tests triggered by new stakeholder changes and (b) support tuning of different parameters with shorter processing times. We present a comprehensive motivation plan and high-level methods that could help industry reduce the cost of testing, designing, and tuning parameters, as well as a preliminary evaluation.
... Montezuma's Revenge is used to evaluate the exploratory capacities of RL agents [7]. CNNs are used to build human-like AI agents for automated game testing [27]. Human player datasets have been used to train CNN agents in match-3 gaming. ...
Full-text available
Professional StarCraft game players are likely to focus on the management of the most important group of units (called the main force) during gameplay. Although macro-level skills have been observed in human game replays, there has been little study of the high-level knowledge used for tactical decision-making, nor exploitation thereof to create AI modules. In this paper, we propose a novel tactical decision-making model that makes decisions to control the main force. We categorized the future movement direction of the main force into six classes (e.g., toward the enemy’s main base). The model learned to predict the next destination of the main force based on the large amount of experience represented in replays of human games. To obtain training data, we extracted information from 12,057 replay files produced by human players and obtained the position and movement direction of the main forces through a novel detection algorithm. We applied convolutional neural networks and a Vision Transformer to deal with the high-dimensional state representation and large state spaces. Furthermore, we analyzed human tactics relating to the main force. Model learning success rates of 88.5%, 76.8%, and 56.9% were achieved for the top-3, -2, and -1 accuracies, respectively. The results show that our method is capable of learning human macro-level intentions in real-time strategy games.
... The authors use metrics to predict the score. Gudmundsson et al. [6] tested the difficulty of a game level using autonomous agents to simulate gameplay and the "success rate" against human players. They wanted to predict the difficulty of a new game level automatically. ...
Full-text available
As the complexity and scope of games increase, game testing, also called playtesting, becomes an essential activity to ensure the quality of video games. Yet, the manual, ad-hoc nature of game testing leaves space for automation. In this paper, we research, design, and implement an approach to supplement game testing to balance video games with autonomous agents. We evaluate our approach with two platform games. We bring a systematic way to assess if a game is balanced by (1) comparing the difficulty levels between game versions and issues with the game design, and (2) the game demands for skill or luck.
Conference Paper
Full-text available
Generative Adversarial Networks (GANs) are a machine learning approach capable of generating novel example outputs across a space of provided training examples. Procedural Content Generation (PCG) of levels for video games could benefit from such models, especially for games where there is a pre-existing corpus of levels to emulate. This paper trains a GAN to generate levels for Super Mario Bros using a level from the Video Game Level Corpus. The approach successfully generates a variety of levels similar to one in the original corpus, but is further improved by application of the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Specifically, various fitness functions are used to discover levels within the latent space of the GAN that maximize desired properties. Simple static properties are optimized, such as a given distribution of tile types. Additionally, the champion A* agent from the 2009 Mario AI competition is used to assess whether a level is playable, and how many jumping actions are required to beat it. These fitness functions allow for the discovery of levels that exist within the space of examples designed by experts, and also guide the search towards levels that fulfill one or more specified objectives.
Full-text available
This paper describes a method for generative player modeling and its application to the automatic testing of game content using archetypal player models called procedural personas. Theoretically grounded in psychological decision theory, procedural personas are implemented using a variation of Monte Carlo Tree Search (MCTS) where the node selection criteria are developed using evolutionary computation, replacing the standard UCB1 criterion of MCTS. Using these personas we demonstrate how generative player models can be applied to a varied corpus of game levels and demonstrate how different play styles can be enacted in each level. In short, we use artificially intelligent personas to construct synthetic playtesters. The proposed approach could be used as a tool for automatic play testing when human feedback is not readily available or when quick visualization of potential interactions is necessary. Possible applications include interactive tools during game development or procedural content generation systems where many evaluations must be conducted within a short time span.
Full-text available
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo's own move selections and also the winner of AlphaGo's games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100-0 against the previously published, champion-defeating AlphaGo. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.
A key challenge of procedural content generation (PCG) is to evoke a certain player experience (PX), when we have no direct control over the content which gives rise to that experience. We argue that neither the rigorous methods to assess PX in HCI, nor specialised methods in PCG are sufficient, because they rely on a human in the loop. We propose to address this shortcoming by means of computational models of intrinsic motivation and AI game-playing agents. We hypothesise that our approach could be used to automatically predict PX across games and content types without relying on a human player or designer. We conduct an exploratory study in level generation based on empowerment, a specific model of intrinsic motivation. Based on a thematic analysis, we find that empowerment can be used to create levels with qualitatively different PX. We relate the identified experiences to established theories of PX in HCI and game design, and discuss next steps.
We introduce the "exponential linear unit" (ELU) which speeds up learning in deep neural networks and leads to higher classification accuracies. Like rectified linear units (ReLUs), leaky ReLUs (LReLUs) and parametrized ReLUs (PReLUs), ELUs alleviate the vanishing gradient problem via the identity for positive values. However, ELUs have improved learning characteristics compared to the units with other activation functions. In contrast to ReLUs, ELUs have negative values, which allows them to push mean unit activations closer to zero like batch normalization but with lower computational complexity. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect. While LReLUs and PReLUs have negative values, too, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value with smaller inputs and thereby decrease the forward propagated variation and information. Therefore, ELUs code the degree of presence of particular phenomena in the input, while they do not quantitatively model the degree of their absence. In experiments, ELUs lead not only to faster learning, but also to significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers. On CIFAR-100, ELU networks significantly outperform ReLU networks with batch normalization, while batch normalization does not improve ELU networks. ELU networks are among the top 10 reported CIFAR-10 results and yield the best published result on CIFAR-100, without resorting to multi-view evaluation or model averaging. On ImageNet, ELU networks considerably speed up learning compared to a ReLU network with the same architecture, obtaining less than 10% classification error for a single crop, single model network.
Challenge is arguably the most important experience that players seek in digital games. However, without a measure of how challenged players feel during the act of play, it is hard to design games that are neither too easy nor too hard and, therefore, truly enjoyable. Especially in industry, challenge is dominantly assessed by means of manual play testing in ad-hoc trials. The aim of this research is to create a more systematic, complete, and reliable instrument to evaluate the level of players' experienced challenge in games in the form of a questionnaire. This paper presents the key results from an extensive literature survey which will inform further development. We survey definitions of challenge, challenge types, and their relation to player experience based on the observations of game designers. We furthermore draw from empirical findings in a diverse range of fields such as game studies, human computer interaction (HCI) and artificial intelligence (AI).
Monte Carlo Tree Search (MCTS) has become a popular solution for controlling non-player characters. Its use has repeatedly been shown to be capable of creating strong game playing opponents. However, the emergent playstyle of agents using MCTS is not necessarily human-like, believable or enjoyable. AI Factory Spades, currently the top rated Spades game in the Google Play store, uses a variant of MCTS to control non-player characters. In collaboration with the developers, we collected gameplay data from 27,592 games and showed in a previous study that the playstyle of human players significantly differed from that of the non-player characters. This paper presents a method of biasing MCTS using human gameplay data to create Spades playing agents that emulate human play whilst maintaining a strong, competitive performance. The methods of player modelling and biasing MCTS presented in this study are generally applicable to digital games with discrete actions.
This paper examines the performance of a number of AI agents on the games included in the General Video Game Playing Competition. Through analyzing these results, the paper seeks to provide insight into the strengths and weaknesses of the current generation of video game playing algorithms. The paper also provides an analysis of the given games in terms of inherent features which define the different games. Finally, the game features are matched with AI agents, based on performance, in order to demonstrate a plausible case for algorithm portfolios as a general video game playing technique.
Real-world artificial intelligence (AI) applications often require multiple agents to work in a collaborative effort. Efficient learning for intra-agent communication and coordination is an indispensable step towards general AI. In this paper, we take the StarCraft combat game as the test scenario, where the task is to coordinate multiple agents as a team to defeat their enemies. To maintain a scalable yet effective communication protocol, we introduce a multiagent bidirectionally-coordinated network (BiCNet ['bIknet]) with a vectorised extension of the actor-critic formulation. We show that BiCNet can handle different types of combat under diverse terrains with arbitrary numbers of AI agents on both sides. Our analysis demonstrates that, without any supervision such as human demonstrations or labelled data, BiCNet can learn various types of coordination strategies that are similar to those of experienced game players. Moreover, BiCNet is easily adaptable to tasks with heterogeneous agents. In our experiments, we evaluate our approach against multiple baselines under different scenarios; it shows state-of-the-art performance and has potential value for large-scale real-world applications.