Interface Design Optimization as a Multi-Armed Bandit
Derek Lomas, Jodi Forlizzi, Nikhil Poonwala, Nirmal Patel, Sharan Shodhan,
Kishan Patel, Ken Koedinger, Emma Brunskill
Carnegie Mellon University
School of Computer Science
5000 Forbes Ave
“Multi-armed bandits” offer a new paradigm for designing
user interfaces in collaboration with AI and user data. To
help designers understand the potential, we present the
results of two experimental comparisons between bandit
algorithms and random assignment. Our studies are
intended to show designers how bandit algorithms are able
to rapidly explore an experimental design space and then
automatically select the optimal design configuration.
Our experimental results show that bandits can make data-
driven design more efficient and accessible to interface
designers, but that human participation is essential to ensure
that AI systems optimize for the right metric. Based on our
results, we introduce several design lessons that help keep
human design judgment “in the loop”. As bandits expose
players to significantly fewer “low-performing” conditions,
we discuss ethical implications for large-scale online
experimentation in education. Finally, we consider the
future for “Human-Technology Teamwork” in AI-assisted
design and scientific inquiry.
Educational games; optimization; data-driven design; multi-
armed bandits; continuous improvement;
ACM Classification Keywords
H.5.m. Information interfaces and presentation (e.g., HCI):
Miscellaneous.
The purpose of this paper is to provide empirical evidence
that can help designers understand the promise and perils of
using multi-armed bandits for interface optimization.
Unlike the days of boxed software, online software has a
continuous stream of users that makes it possible to support
regular updates at a rapid pace. As part of this shift,
companies now run thousands of online controlled
experiments (i.e., A/B tests) to evaluate the efficacy of
different design decisions. How is this changing
Today’s designers must integrate their skills within larger
organizations that can measure the impact of their designs
with great precision. Who can argue with clear, empirical
data showing that a particular design is objectively better in
the eyes of users? Yet, designers must be prepared to
engage in an ongoing dialogue about what “good design”
really is. While one design may perform better in an A/B
test, it is not always the best design for the larger objectives
of the organization. That is, when design becomes driven
by metrics, there must be ongoing debate about whether the
metrics are actually measuring the ultimate goals. This
questioning of performance metrics becomes increasingly
important in an age of automated design optimization.
This paper addresses some of these concerns by outlining a
set of experiments with different design optimization
algorithms. We present empirical data showing how bandit
algorithms can increase the efficiency of experiments,
lower the costs of data-driven continuous improvement and
improve overall utility for subjects participating in online
experiments. However, our data also show some of the
dangers of relying on artificial intelligence that operates
without human input. Together, our evidence can help
designers understand why design optimization can be seen
as a “multi-armed bandit” problem and how bandit
algorithms can be used to optimize user experience.
The optimization of interfaces based on individual user
characteristics is a significant sub-field within HCI. One
challenge in the field has been the availability, at scale, of the
evaluation functions that drive optimization. For instance, SUPPLE, a
system for algorithmically rendering interfaces, initially
required cost/utility weights that were generated manually
for each interface factor. Later, explicit user preferences
were used to generate these cost/utility weights. Other
algorithmic interface optimization approaches have used
cost functions based on estimated cognitive difficulty, user
reported undesirability, user satisfaction ratings and even
aesthetic principles [8, 35].
Online user data and A/B testing provide an alternative
source for UX evaluation. Hacker  contrasts UX
optimization for individual users (student adaptive) with the
optimization of interfaces for all users (system adaptive). In
Duolingo (a language learning app used by millions),
student adaptive UX is supported by knowledge models,
whereas Duolingo’s system adaptivity is a longer term
process that unfolds in response to large-scale A/B tests.
A/B testing is also widely used by Google, Amazon,
Yahoo, Facebook, Microsoft and Zynga to make data-
driven UX decisions on topics including surface aesthetics,
game balancing, search algorithms and display advertising.
With that framing, A/B testing is indeed a widespread
approach for the optimization of interface design.
Design Space Optimization
A design space is the combination of potential design
factors. Design spaces are used to describe the
parametric multiplication of design factors in many areas,
including microprocessor design and wireless signaling.
Herb Simon described how it was possible to separate
the generation of design alternatives (a design space) from the
search within the space for an optimal or satisfactory
design, based upon some requirements. To identify these
optimal designs, Glaser describes an optimization
process in terms of finding the set of values that maximizes
the desired outcome, based on a fitness function that
describes the design parameters, constraints and goals.
In this paper, we focus on the optimization of a game
design space, or the combination of all game design factors
within a particular game. The total design space of a
game (which represents the total possibilities) can be
distinguished from the far smaller instantiated design
space, which consists of the variables that are actually
instantiated in the game at a given time. Games are a rich
area for design space exploration as they often have many
small variations to support diverse game level design.
During game balancing, designers make many small
modifications to different design parameters or design
factors. Thus, the instantiated design space of a game will
often include continuous variables like reward or
punishment variables (e.g., the amount of points per
successful action or the effect of an enemy’s weapon on a
player’s health) and categorical game design variables (e.g.,
different visual backgrounds, enemies or maps).
Increasingly, online games use data to drive the balancing
or optimization of these variables.
Even simple games have a massive design space: each
additional design factor multiplies the total
number of variations (“factorial explosion”). For instance,
previous experiments with Battleship Numberline, the
game presented in this paper, varied 14 different design
factors for a total instantiated design space of 1.2 billion
variations. Clearly, a game’s design space can quickly
surpass the number of players who might test it, eliminating
the possibility of fully testing the design space. However,
the objective of design is not to completely test a design
space but rather to find a satisfactory optimum within it.
Typically, this is done through the subjective judgment of a
designer, but human judgment is often wrong. Given
the opportunity of a large online population, multi-armed
bandit algorithms can be an efficient mechanism for
exploring and optimizing large design spaces.
Multi-Armed Bandits and Design
The Multi-Armed Bandit problem refers to the
theoretical challenge of maximizing winnings if presented
with a row of slot machine arms, where each arm has a
different and unknown rate of payout. The solution to a
multi-armed bandit problem is a strategy for selecting the
optimal arm to maximize payoff, specifically, a policy for
which arm to select, given the prior history of pulls and
payoffs. The objective is normally to minimize regret, often
defined as the difference between the reward actually earned and
the reward that the best-paying arm would have produced (which
arm is best is, of course, unknown a priori).
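One common way to write this notion down is sketched here. The symbols are assumptions of the sketch, not notation taken from this paper: let \mu_a be the unknown mean payout of arm a, \mu^{*} the mean payout of the best arm, and a_t the arm pulled at step t. The expected cumulative regret after T pulls is then:

```latex
R_T = T\mu^{*} - \sum_{t=1}^{T} \mu_{a_t}, \qquad \mu^{*} = \max_{a} \mu_{a}
```

A policy that always pulled the best arm would have R_T = 0; every pull of a worse arm adds the gap between that arm's mean and \mu^{*}.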
This paper frames design decision-making in UX optimization
as a multi-armed bandit problem: Each variation of a
UX design can be viewed as a slot machine “arm” where
the payoff (e.g., engagement) is unknown. In data-driven
design and continuous improvement, a designer’s goal will
be to balance the exploration of new designs (that may or
may not be effective) with the exploitation of designs that
are known to be effective. This framing allows us to draw
upon the extensive machine learning research on algorithms
for optimizing the selection of arms in multi-armed bandits.
Solving Multi-Armed Bandit Problems
If one doesn’t know which slot machine has the highest
payout rate, what can be done? For instance, one could
adopt a policy of “explore first then exploit”. In this case,
one could test each machine n times, until the
average payout of a particular arm is significantly higher
than the rest of the arms; then this arm could be pulled
indefinitely. This “explore first then exploit” approach is
similar to traditional A/B testing, where subjects are
randomly assigned to different conditions for a while and
then one design configuration is chosen as the optimal.
However, this approach is often “grossly inefficient” (p.
646, ): for instance, imagine the extreme example
where one arm produces a reward with every pull while all
other arms don’t pay out at all. In a typical experiment, the
worst arm will be pulled as many times as the
best arm. Furthermore, the stopping conditions for
A/B tests are quite unclear, as few online experimenters
properly determine power in advance. As a result, an
experimenter runs the risk of choosing the wrong condition too quickly.
Background on Multi-Armed Bandits
In the past few years, multi-armed bandits have become
widely used in industry for optimization, particularly in
advertising and interface optimization. Google, for
instance, uses Thompson Sampling in their online analytics
platform. Yahoo uses online “contextual bandits”
to make personalized news article recommendations. Work
has also been done to apply bandits to basic research, as
well as applied optimization. Liu et al., for instance,
explored the offline use of bandits in the context of online
educational games, introducing the UCB-EXPLORE
algorithm to balance scientific knowledge discovery with
There are many multi-armed bandit algorithms that
sequentially use the previous arm pulls and observations to
guide future selection in a way that balances exploration
and exploitation. Here we present a non-comprehensive
review of approaches to the bandit problem in the context
of online interface optimization. Longer reviews
Gittins Index: This is the theoretically optimal strategy for
solving a multi-armed bandit problem. However, it is
computationally complex and hard to implement in
Epsilon-First: This is the classic A/B test. Users are
randomly assigned to design conditions for a set time
period, until a certain number of users are assigned, or until
the confidence interval between conditions reaches a value
acceptable to the experimenters (e.g., test until p<0.05).
After this time, a winner is picked and then used forever.
Epsilon-Greedy: Here, the policy is to randomly assign x%
of incoming users, with the remainder assigned to the
highest performing design. A variation known as Epsilon-
Decreasing gradually reduces x over time; in this way,
exploration is emphasized at first, then exploitation.
These algorithms are simple but generally perform worse
than other bandit techniques.
Probability Matching: Players are assigned to particular
conditions with a probability proportionate to the success
rate of the condition. This includes Thompson
Sampling (which predates randomized experiments) and
other Bayesian sampling procedures; these are often used in
adaptive clinical trials in medicine.
UCB1: After testing each arm once, users are assigned to
the condition with the highest upper confidence bound.
Upper confidence bounds are conceptually similar to upper
confidence limits, but they do not statistically assume that
the rewards have a normal distribution. UCB was
chosen for the present study due to its strong theoretical
guarantees, simplicity and computational efficiency.
UCL: In this paper, we introduce an illustrative approach to
bandit problems using upper confidence limits for
assignment. This algorithm is presented to help those
familiar with basic statistics to understand the logic
underlying the UCB algorithm. This algorithm operates by
calculating the upper confidence limit of a design condition
after first randomly assigning a condition for a period of
time (e.g., 25 game sessions); every new player thereafter is
assigned to the condition with the highest upper confidence
limit.
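To make two of the simpler policies above concrete, here is a minimal Python sketch of Epsilon-Greedy and of Thompson Sampling. It is illustrative only and is not the code used in this study. The function names are hypothetical, and the Thompson Sampling version assumes a binary reward (success or failure) with a uniform Beta(1, 1) prior, which differs from the count-based engagement metric used in this paper.

```python
import random

def epsilon_greedy(means, epsilon, rng=random):
    """Return an arm index: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))                    # explore: uniform random arm
    return max(range(len(means)), key=lambda i: means[i])   # exploit: highest mean so far

def thompson_sample(successes, failures, rng=random):
    """Return the arm whose draw from its Beta posterior is largest.

    Assumes binary rewards and a Beta(1, 1) prior for every arm.
    """
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])
```

With epsilon set to 0 the first policy is purely greedy; with epsilon set to 1 it is equivalent to random assignment. Thompson Sampling needs no such parameter, because arms with little data have wide posteriors and therefore still win some draws.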
A Real-World Illustration
Consider a hypothetical game designer who wants to
maximize player engagement by sending new players to the
best design variation: A, B or C (Figure 1). How should she
allocate the next 1000 players to the game? She could pick
the single condition that has the highest average
engagement at that time. Alternatively, she could have a
policy of sending players to the design condition with the
highest mean (“C”). However, this “greedy” approach
doesn’t take into account that A’s mean might be higher
than C’s, as there isn’t enough information yet.
Figure 1: Example data comparing three different design
conditions. While “C” has the highest mean, “A” has the
highest upper confidence limit. The policy of picking the
condition with the highest upper confidence limit can
maximize returns by balancing exploration with exploitation.
So instead, our designer could have a simple policy of just
choosing the design that has the highest upper confidence
limit. This policy would result in testing the engagement of
A with additional players. Eventually, the additional
information gathered from A would shrink the confidence
interval to the point where the policy would begin to pick C.
Practically, this policy of picking the higher upper
confidence limit has the effect of choosing designs that
have either a higher mean (exploitation) or insufficient
information (exploration). This policy was instantiated as
the UCL-bandit in order to help communicate the multi-
armed bandit problem to a general audience. It is not
expected to be a practical algorithm for solving real-world
System for Bandit-Based Optimization
We present a system for integrating bandit-based
optimization into a software design process with the
following goals: 1) Support the use of data in an ongoing
design process by communicating useful information to
game designers; 2) Automatically optimize the design space
identified by designers in order to reduce the cost of
experimentation and analysis; 3) Reduce the cost of
experimentation to the player community by minimizing
exposure to low-value game design configurations.
Our data-driven UX optimization method extends previous
optimization research involving user preferences,
aesthetic principles, and embedded assessments.
Moreover, we extend previous applications of bandits for
offline educational optimization by demonstrating the
value of bandits as a tool for online optimization.
Our first “meta-experiment” compares three different
algorithms for conducting online experimentation (we use
the term “meta-experiment” because it is experimenting
with different approaches to experimentation).
The players of an educational game were first randomly
assigned to one of these experiments. Then, according to
their assignment method, the players were assigned to a
game design within the experimental design space of
We hypothesize that, in comparison to random assignment,
one of the multi-armed bandit assignment approaches will
result in greater overall student engagement during the
course of the study. Specifically, we hypothesize that multi-
armed bandits can automatically optimize a UX design
space more efficiently (i.e., with less regret) than random
assignment and also produce benefits for subjects.
H1: Multi-Armed Bandits will automatically search
through a design space to find the best designs
H2: Automatic optimization algorithms will reduce the cost
of experimentation to players
To test our hypotheses, we simultaneously deployed three
experiments involving 1) random assignment, 2) UCB1
bandit assignment or 3) UCL-95% bandit assignment. The
UCL-95% bandit had a randomization period of 25,
meaning that all design variations needed to have 25 data
points before it started assignment on the basis of upper
Calculating UCB1: In this paper, we calculate the upper
confidence bound of a design “D” as the adjusted mean of
D plus the square root of (2 x log(n) / n_D), where n is the total
number of games played and n_D is the total number of
games played in design condition D. The adjusted mean must
lie between 0 and 1, so we divided each mean by the maximum
engagement allowed (100).
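The calculation described above can be sketched in Python as follows. This is a minimal illustration rather than the code deployed in the study: the function names and the `stats` structure are hypothetical, and the sketch assumes that log is the natural logarithm, as in the standard UCB1 formulation.

```python
import math

def ucb1(mean_engagement, n_total, n_design, max_engagement=100.0):
    """Upper confidence bound for one design condition.

    mean_engagement: mean trials played in this design
    n_total:  total games played across all designs (n)
    n_design: games played in this design (n_D)
    """
    adjusted_mean = mean_engagement / max_engagement   # rescale to the 0-1 range
    return adjusted_mean + math.sqrt(2.0 * math.log(n_total) / n_design)

def choose_ucb1(stats, max_engagement=100.0):
    """stats maps a design name to (mean_engagement, games_played)."""
    for name, (_, n_design) in stats.items():
        if n_design == 0:
            return name                                # test each arm once first
    n_total = sum(n for _, n in stats.values())
    return max(stats, key=lambda d: ucb1(stats[d][0], n_total, stats[d][1],
                                         max_engagement))
```

Because n_D appears in the denominator, a design that has been played rarely receives a larger bonus than a design with the same mean that has been played often, which is what drives exploration.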
Calculating UCL: The upper confidence limit is calculated
as the mean + 1.96 x standard error (for 95% confidence)
or the mean + 3.3 x standard error (for 99.9% confidence).
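A minimal Python sketch of this calculation, together with the randomization period, is given below. It is an illustration under stated assumptions, not the deployed system: the function names are hypothetical, the standard error is assumed to be the sample standard deviation divided by the square root of the sample size, and assignment is assumed to be uniformly random among all designs until every design has enough data. The example data are invented to mirror the situation in Figure 1.

```python
import math
import random
import statistics

Z_95 = 1.96    # multiplier given in the text for 95% confidence
Z_999 = 3.3    # multiplier given in the text for 99.9% confidence

def upper_confidence_limit(trials, z=Z_95):
    """Mean plus z standard errors (sample standard deviation assumed)."""
    mean = statistics.mean(trials)
    std_err = statistics.stdev(trials) / math.sqrt(len(trials))
    return mean + z * std_err

def choose_ucl(observations, z=Z_95, randomization_period=25, rng=random):
    """observations maps a design name to a list of trials played per session."""
    if any(len(xs) < randomization_period for xs in observations.values()):
        return rng.choice(list(observations))   # still in the randomization period
    return max(observations,
               key=lambda d: upper_confidence_limit(observations[d], z))

# Hypothetical data: C has the higher mean, A the higher upper limit.
example = {"A": [10, 12, 30, 8], "C": [16, 17, 18, 17]}
chosen = choose_ucl(example, randomization_period=4)
```

Here `chosen` is "A": its mean (15) is lower than C's (17), but its wide confidence interval gives it the higher upper limit, so the policy gathers more data on A before settling.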
The Design Space of Battleship Numberline
Battleship Numberline is a simple online Flash game that
involves estimating fractions, decimals and whole numbers
on a number line. Players attempt to blow up enemy ships
or submarines (the xVessel design factor) by successfully
estimating the location of different numbers (xItem_Sets)
on a number line. The “ship” design variant requires users
to type a number to estimate the location of a visible ship
on a line; “sub” requires users to click on the number line to
estimate the location of a hidden submarine target. Ship and
sub targets can be large (and easy to hit) or small (and
challenging); their size (xPerfectHitPercent) represents the
level of accuracy required for a successful hit. For example,
when 95% accuracy is required, the target is 10% of the
length of the number line—when 90% accuracy is required,
the target is 20% of the line.
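The two examples above suggest that the target spans the allowed error on both sides of the correct location, so its width is twice the error tolerance. The following Python sketch encodes that inferred relationship; the function name is hypothetical and the formula is an assumption drawn from the examples, not a stated rule of the game.

```python
def target_fraction_of_line(perfect_hit_percent):
    """Fraction of the number line covered by the target.

    Assumes the target extends by the error tolerance on each side
    of the correct location.
    """
    tolerance = 1.0 - perfect_hit_percent / 100.0
    return 2.0 * tolerance
```

Under this assumption, 95% accuracy gives a target covering 10% of the line and 90% accuracy gives 20%, matching the examples in the text.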
Figure 2: Samples showing variations of the xVessel and the
xPerfectHitPercent design factor. xVessel varies between
submarines (hidden until a player clicks on the line to estimate
a number) and ships (players type in a number to estimate
where it is on the line). The xPerfectHitPercent design factor
varies the size of the target, in terms of the accuracy required
to hit it. Above, we show 70%, 95%, 80% and 97% accuracy
required for a hit in Battleship Numberline.
The combination of these and other design factors
constitutes the design space of Battleship Numberline.
The purpose of the meta-experiment is to evaluate different
methods for exploring and then selecting the optimal design
condition within this design space. In this paper, we only
explore the xVessel and xPerfectHitPercent design factors.
Over the course of 10 days, 10,832 players of Battleship
Numberline on Brainpop.com were randomly assigned to
three different experimentation algorithms: random, upper
confidence bound (UCB), and upper confidence limit
(UCL). The UCL-95% bandit had a randomization period
of 25 trials before assigning by UCL. Within each
experimentation condition, players were assigned to 1 of 6
different conditions, a 2x3 factorial involving the xVessel
(clicking vs. typing) and the size of targets
(xErrorTolerance, where larger is easier to hit).
After players made a choice to play in the domain of whole
numbers, fractions or decimals, they were given a four-item
game-based pretest . They were then randomly assigned
to an experimental method, which assigned them to one of
12 experimental conditions. If players clicked on the menu
and played again, they were reassigned to a new
When players enter the game, the server receives a request.
It then allocates conditions to these incoming requests
based on a round-robin method of assignment, that starts at
the experimental level and then at the condition level. For
instance, the player would be assigned to random
assignment and then assigned to one of the random
assignment conditions; the next player in the queue would
be assigned to UCB. The UCB algorithm would then
determine which game condition they would receive.
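The two-level allocation described above might be sketched as follows. This is a hypothetical reconstruction, not the server code: the class and function names are invented, each experiment is represented by a callable that returns a design condition, and "Ship90" is a made-up condition name used only for the example.

```python
from itertools import cycle

def make_round_robin_policy(conditions):
    """Policy that serves conditions in a fixed repeating order."""
    order = cycle(conditions)
    return lambda: next(order)

class RoundRobinAssigner:
    """Rotate incoming requests across experiments, then ask the
    selected experiment's policy for a design condition."""

    def __init__(self, experiments):
        # experiments: dict mapping experiment name -> zero-argument policy
        self._order = cycle(experiments.items())

    def assign(self):
        name, policy = next(self._order)
        return name, policy()

assigner = RoundRobinAssigner({
    "random": make_round_robin_policy(["Sub90", "Ship90"]),
    "UCB": lambda: "Sub90",   # stand-in for a bandit's current choice
})
```

Successive calls to `assigner.assign()` alternate between the two experiments, so each experiment receives the same share of incoming players regardless of which conditions it chooses to deploy.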
Our experimental system was built on Heroku and
MongoDB. It was conceived with the intention of providing
a data analytics dashboard for designers to monitor the
experiment. The dashboard shows running totals of the
number of times each arm is pulled (i.e., the number of
times a design condition was served to a player), the mean
number of trials played in each condition (our measure of
engagement), the upper confidence limit and the upper
confidence bound of each condition. Our dashboard also
provided a mechanism for designers to click on a URL to
experience for themselves any particular design (i.e., any
arm in the experiment). These links made it easy to sample
the design space and gain human insight into the data. We
also provided a checkbox so that a particular arm could be
disabled, if necessary. Finally, our system made it easy to
set the randomization period and confidence limit for the
Optimization relies on an outcome evaluation metric 
that drives the decisions made between different design
configurations. In this study, our key outcome metric of
engagement is the number of estimates made, on average,
within each design variation. Choosing an appropriate
outcome variable is essential, because this metric will be
used to determine which conditions are promoted.
In Figure 3 and Table 1, we confirm H2 by showing that
players in the bandit conditions were more engaged; thus,
bandits reduced the cost of experimentation to subjects by
deploying fewer low-engagement conditions.
Figure 3: Shows how the bandit-based experiments garnered
more student engagement, as measured by the total amount of
The UCL and the UCB bandit experiments produced 52%
and 24% greater engagement, respectively, than the experiment
involving random assignment. Our measure of regret between
experiments is shown in Table 1 as the percent difference in
our outcome metric between the different experiments and
the optimal policy (e.g., Sub90, which had the highest
average engagement). This shows that the UCL-25
experiment (one of the bandit algorithms) achieved the
lowest regret of all experiments.
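As a sketch of how such a percent-difference regret could be computed, the following Python function compares an experiment's mean engagement with that of the optimal policy. The function name is hypothetical, and expressing the difference relative to the optimal mean is an assumption of this sketch; the text does not specify the denominator.

```python
def percent_regret(experiment_mean, optimal_mean):
    """Shortfall of an experiment's mean engagement relative to the
    optimal policy, as a percentage of the optimal mean."""
    return 100.0 * (optimal_mean - experiment_mean) / optimal_mean
```

For example, with the 22.7 average trials reported for Sub90 as the optimal mean, an experiment that averaged half that engagement would have a regret of 50%, and an experiment that deployed only Sub90 would have a regret of 0%.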
Table 1: Comparison of Meta-Experiment. * 22.7 average
trials for Sub90. Assuming same number of games logged.
8,645 total logged out of 10,832 total served.
H1, the hypothesis that bandits can automatically optimize
a UX design space, was confirmed with evidence presented
in Figure 4. These data show that random assignment
equally allocated all 6 conditions whereas both the UCB
bandit and the UCL bandit allocated subjects to conditions
preferentially. The reason for this unequal allocation is the
difference in the efficacy of the conditions for producing
player engagement, as seen in Figure 5.
Figure 4: Random assignment experimentation equally
allocates subjects whereas both bandit-based experiments
allocate subjects based on the measured efficacy of the
conditions. Total Games Played are the number of players
assigned to each design condition.
Figure 5 shows the means and 95% confidence intervals of
each arm in the three experiments. The Y-axis is our
measure of engagement: the average number of trials
played in the game within that condition. All experiments
identify Sub90 as the most engaging condition. In this
variant of the game, players needed to click to estimate a
number on the number line and their estimates needed to be
90% accurate to hit the submarine. All bandit-based
experiments deployed this condition far more than other
conditions (as seen in Figure 4).
Figure 5: The means and 95% confidence intervals of the
experimental conditions. Note that the Upper Confidence
Limit experiment focused data collection on the condition with
the highest upper confidence limit. Y-Axis represents
Engagement, as the total number of trials played.
Note the long confidence intervals on the right side of
Figure 5: these are a result of insufficient data. As can be
seen in Figure 4, these conditions each had fewer than 30 data
points. However, if any of the confidence intervals were to
exceed the height of Sub90’s condition in the UCL
experiment, then UCL would deploy those again.
Figure 6: This graph shows the allocation of game sessions to
the different target sizes and the variation in the average
number of trials played over time.
After running the first meta-experiment, our results clearly
supported the value of bandits. However, UCL never tested
additional design variations after “deciding” during the
randomization period that Sub90 was the most engaging
condition (Figure 6). While the benefit of the Sub90
outcome was confirmed by the random experiment, this behavior
does not illustrate the explore-exploit dynamic of a bandit
algorithm. Therefore, to introduce greater variation and
bolster our discussion, we ran a second meta-experiment.
In this meta-experiment, we compared random assignment
with two variations on the Upper Confidence Limit bandit.
We retained our use of the greedy 95% confidence limit
bandit but also added a more conservative 99.9%
confidence limit bandit. The difference between these two
is the parameter that is multiplied by the standard error and
added to the mean: for 95% we multiply 1.96 times the
standard error while for 99.9% we multiply 3.3. We expect
this more conservative and less greedy version of UCL to
be more effective because it is less likely to get stuck in a
H3: UCL-99.9% will tend to deploy a more optimal design
condition than UCL-95% as a result of greater exploration.
In the second experiment, we focused on submarines, which
we found to be more engaging than ships (or, in any case,
resulted in more trials played). We eliminated the smallest
size (which was the worst performing) and added a broader
sample: 95%, 90%, 80%, 70%, 60% accuracy required for a
hit. The largest of these, requiring only 60% accuracy for a
hit, was actually 80% of the length of the number line.
Although we felt the targets were grotesquely large, they
actually represented a good scenario for using online
bandits for scientific research. We’d like to collect online
data to understand the optimal size, but we’d want to
minimize how many students were exposed to the
suboptimal conditions. We were sure that this target size
was too large, but could the bandits detect this?
H4: Bandits will deploy “bad” conditions less often than
Figure 7: Four of the five variations in target size in
experiment 2. The largest submarines (at top, 60 and 70)
appeared to be far too big and too easy to hit. However, they
were generally more engaging than the smaller sized targets.
In short, we found very mixed evidence for both H3 and
H4. The more conservative bandit did not appear to explore
longer (Figure 10) nor did it deploy a more optimal design
(Figure 9), although it did achieve greater engagement than
UCL-95 (Table 2). Additionally, while the bandits did
achieve less regret than the random assignment condition
(Table 2), the conditions deployed were so bad that we got
phone calls from Brainpop inquiring whether there was a
bug in our game.
In Figure 9, we show that the least chosen arm of UCL-95%
was the most chosen arm of UCL-99.9%, and vice versa!
Moreover, the second most-picked design of both bandits
was the very largest target, which seemed to us to be far too
large. Finally, the optimal arm, according to random
assignment, was hardly in the running inside the bandit
experiments. What accounts for this finding?
Table 2: Compares each experiment to the optimal policy:
Sub70 according to the random assignment experiment.
* 29.04 average trials for sub70 in the random condition.
Figure 8: The means and confidence intervals of each
condition within each experiment. The random assignment
experiment reveals an inverted U-shaped response, with the
largest performing less effectively than the second largest.
Note the major discrepancies in the order of the items between
the different experiments.
Figure 9: The “bad design” (Sub60) did surprisingly well in
the bandits, where it was assigned second most often – and far
less than the optimal design, according to random assignment.
First, all of the design variations in this experiment
performed reasonably well (Figure 8) on our engagement
metric, even the “bad” conditions. Without major
differences between the conditions, all experimental
methods performed relatively well. But, digging in, there
were other problems that have to do with the dangers of
Figure 10 shows the allocation of designs over time. The X-
axis represents game sessions over time (i.e., “arm pulls”, as
each game session requests a design from the server). The
smoother line shows the mean number of trials played over
time (i.e., engagement over time). Note that the randomly
assigned mean varies significantly over time, by nearly
50%! This reflects the “seasonality” of the data; for
instance, the dip in engagement between 2000 and 3000
represents data collected after school ended, into the
evening. And the rise around 3000 represents a shift to the
morning school day.
Figure 10: For meta-experiment 2, this graph shows the
allocation of game sessions to the different target sizes and the
variation in the average number of trials played over time.
Therefore, the average player engagement in the bandits
can be expected to be affected by the same seasonal factors
as those affecting random assignment, yet also vary for
other reasons. In particular, the average engagement of
these bandits will be affected by the different conditions
they are deploying. Possible evidence of this phenomenon
can be noted in the concurrent dip in engagement around
4200 – a dip that is not present in random assignment. This
dip occurs during the morning hours, so one plausible
explanation is that the morning population is not as engaged
by the designs deployed by the bandits as other
populations are. It could, for example, be that students in
the morning are more engaged by learning-related
challenges or even that the lack of challenge presented by
the large ships is unappealing. This might be because
students in the morning have greater ability than those who
play in the evening.
In both meta-experiments, all bandit algorithms tended to
test different design conditions in bursts. Note for instance
that the sub 70% in size was explored briefly by UCL-95%
around 2500-3000 – yet this period was the time when all
conditions were faring the worst. So, this condition had the
bad luck of being sampled at a time when any condition
would have a low mean. This shows the limitations of
sequential experimentation in the context of time-based
We cannot confirm our hypothesis that the UCL-99.9%
bandit actually explored more than UCL-95%. Visually, it
appears that UCL-99.9% was busy testing more conditions
more often than UCL-95%. However, both bandits fell into
a local optimum at roughly the same time, based on when
each began exclusively deploying one condition.
Our studies explore the optimization of the design space of
an educational game using three different multi-armed
bandit algorithms. Supporting the potential advantage of
using bandit algorithms over standard random experimental
design, bandits achieved higher student engagement during
the course of the experiment.
However, we were surprised to find that the Sub60 condition
(the condition that was absurdly large; see Figure 7) was one of
the most engaging conditions. This condition was included
for the purpose of testing the ability of the bandits to
identify it as a poor condition. However, Sub60 turned out
to be one of the top performing conditions! This
discrepancy indicates that our metric for engagement
(number of trials played) may be the wrong metric – or that
we, as designers, are wrong about what the users want.
Our results also point to practical issues that must be
understood and resolved prior to adopting bandit-based
optimization over random assignment. While we expect that
over the long term, bandits will fluidly adjust to periodic
variations, the temporal non-stationarity of the data is likely
to reduce their efficacy.
The bandits collected more data about leading design
conditions than random assignment, resulting in tighter
confidence intervals around the means. Indeed, the
tightening of confidence intervals (or bounds) is the
primary mechanism for balancing exploration-exploitation
in this paper. However, these tightened confidence intervals
do not indicate that the bandit estimates are more accurate,
at least in the short term. The tightening of the confidence
intervals is deceptive, as the data are not collected
simultaneously – a violation of assumptions in the
underlying statistics. The next section illustrates why the
non-simultaneous collection of data impacts the ability of
bandits to make valid inferences from their exploration.
Dangers of sequential experimentation in the real world
Time-based variations appear to significantly affect bandit
exploration validity. During the randomization period
(roughly, from 500-1500 in Figure 10), the UCL-95%
bandit gave the 60% submarine the highest mean and the
90% submarine one of the lowest. After the randomization
period, the bandit continued to allocate most users to the
60% submarine. We noted that average engagement was
less during the night than during the day, however, players
tended to play less during the night than during the day. As
a result, this pushed down the estimates for the 60%
submarine. Then, because few players had played the 90%
submarine, the bandit started allocating players during the
day, when the average mean had increased. This then
resulted in the mean for the 90% submarine being pushed
up higher than that of any other arm.
These fluctuations due to contextual time-of-day factors
have a much bigger effect on the bandits than the random
assignment. So, even though much more data was collected
about particular arms, it is not necessarily more trustworthy
than the data collected by random assignment. While it is
likely that conservative UCB-style bandits would
eventually identify the highest performing arm if they were
run forever, these time-based variation effects can
significantly reduce their performance.
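The time-of-day effect described above can be illustrated with a minimal simulation sketch. It assumes two arms with identical true engagement and a hypothetical baseline that is higher during the day than at night; all names and numbers are illustrative assumptions and are not taken from our data. Sampling one arm only at night and the other only during the day produces a large spurious gap, whereas random assignment within each period does not.

```python
import random
from statistics import mean


def simulate(n_per_period=2000, seed=0):
    """Compare bursty sequential sampling with random assignment.

    Both arms have the same true effect; only the time of day differs.
    """
    rng = random.Random(seed)
    # Hypothetical mean number of trials per session in each period.
    baseline = {"night": 8.0, "day": 12.0}

    def session(period):
        # Engagement depends on the period only, never on the arm.
        return rng.gauss(baseline[period], 3.0)

    # Sequential (bursty) sampling: arm A is tested at night, arm B by day.
    burst = {
        "A": [session("night") for _ in range(n_per_period)],
        "B": [session("day") for _ in range(n_per_period)],
    }

    # Random assignment: every session in either period is a coin flip.
    randomized = {"A": [], "B": []}
    for period in ("night", "day"):
        for _ in range(n_per_period):
            randomized[rng.choice("AB")].append(session(period))

    def gap(data):
        return mean(data["B"]) - mean(data["A"])

    return gap(burst), gap(randomized)


burst_gap, random_gap = simulate()
print(f"bursty sampling gap: {burst_gap:.2f}")
print(f"random assignment gap: {random_gap:.2f}")
```

Under these assumptions, the bursty estimate reflects the day/night baseline rather than any difference between the arms, which is the same mechanism that appears to have inflated the 90% submarine's mean.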
Constructing bandits that operate using confidence intervals
is conceptually similar to running experiments and
constantly testing for significance and then running the
condition that is, at the time, significant. However,
significance tests assume that sample size was estimated in
advance. While our bandits were “riding the wave of
significance” they were susceptible to the tendency to be
overconfident in the present significance of a particular
condition. This is a major problem in contemporary A/B
testing as well.
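The cost of repeatedly testing for significance can be sketched with a small simulation. It is an illustration under stated assumptions, not a model of our bandits: two identical conditions (an A/A test) with unit-variance normal outcomes, a z-test at the 5% level, and a hypothetical experimenter who checks the result after every 10 observations per group.

```python
import math
import random


def z_stat(sum_a, sum_b, n):
    # Two-sample z statistic for equal group sizes and known unit variance.
    return (sum_a / n - sum_b / n) / math.sqrt(2.0 / n)


def false_positive_rates(n_experiments=500, n_max=400, check_every=10, seed=1):
    """Return (rate with repeated peeking, rate with one final test)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_experiments):
        sum_a = sum_b = 0.0
        crossed = False
        for n in range(1, n_max + 1):
            # Both groups are drawn from the same distribution (no effect).
            sum_a += rng.gauss(0.0, 1.0)
            sum_b += rng.gauss(0.0, 1.0)
            if n % check_every == 0 and abs(z_stat(sum_a, sum_b, n)) > 1.96:
                crossed = True  # "significant" at some interim look
        peeking_hits += crossed
        fixed_hits += abs(z_stat(sum_a, sum_b, n_max)) > 1.96
    return peeking_hits / n_experiments, fixed_hits / n_experiments


peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives with peeking: {peeking_rate:.2f}")
print(f"false positives with a fixed sample size: {fixed_rate:.2f}")
```

Although no real difference exists, acting on the first significant interim result declares a winner far more often than the nominal 5% rate of the single fixed-sample test.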
Our goal was to run an empirical study to illustrate how
bandits work, not to advance algorithms themselves. Thus,
we could have used more powerful algorithms. For
instance, our bandit algorithms did not allow for the sharing
of information between arms or make use of previously
collected data, which would be especially useful for
continuous game design factors (such as the size of our
targets) or larger design spaces. To this end, we might have
considered approaches making use of fractional factorial
designs, linear regressions or Scott’s suggestion of a “multi-
armed mafia”. Recent work combining Thompson
Sampling and Gaussian Processes is promising.
Our bandit experiments ran for just a few days in total. All
A/B tests will be susceptible to “seasonality” effects, not
just bandits. Therefore it is always preferable to run a
smaller proportion of concurrent users for a longer period
of time than a large proportion of users for a shorter time.
Kohavi recommends running A/B tests for at least two
weeks to account for seasonality. We did not model the
benefits of these algorithms over time. Seasonality will
differentially affect different arms in the bandit condition,
but not in the random condition. If the bandit is too greedy
(like the 95% confidence interval bandit in the first
experiment), then this could have serious effects on the
optimality of the arms chosen.
Unlike UCB1, both UCL bandits had a tendency to fall into
a local maximum for a long period of time, without
exploration. This is likely to be a property of UCL rather
than UCB simply being more conservative in its approach.
In UCB1, the confidence bound of a neglected arm widens as the
total number of plays increases, which eventually forces the
algorithm to revisit it, whereas the UCL confidence intervals
have no such natural tendency to change.
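The difference between the two kinds of index can be made concrete with a short sketch. The UCB1 index follows Auer et al. [2] and assumes rewards scaled to [0, 1]. The UCL index shown here is an assumed normal-theory form (mean plus z times the standard error) used for illustration only; the function names and example values are hypothetical.

```python
import math


def ucb1_index(mean, pulls, total_pulls):
    # UCB1 (Auer et al., 2002): the bonus depends on the total number of plays.
    return mean + math.sqrt(2.0 * math.log(total_pulls) / pulls)


def ucl_index(mean, std, pulls, z=1.96):
    # Assumed upper confidence limit: depends only on this arm's own samples.
    return mean + z * std / math.sqrt(pulls)


# A neglected arm with 20 pulls, while play continues on the other arms.
for total in (100, 1_000, 10_000):
    ucb = ucb1_index(0.5, 20, total)
    ucl = ucl_index(0.5, 0.2, 20)
    print(f"total plays={total:>6}  UCB1 index={ucb:.3f}  UCL index={ucl:.3f}")
```

In this sketch the UCB1 index of the neglected arm keeps rising with the total number of plays, so the arm is eventually selected again, while the interval-based index stays fixed until the arm itself is sampled.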
The data in our sample are not normal; they follow a
distribution that may be negative binomial or beta. While
the UCB1 algorithm does not rely on an underlying
assumption of normality, both UCL algorithms do.
However, by making stronger parametric assumptions about
the payoff distributions, we achieved good online results
without requiring additional online tuning. However, if not
for our desire to provide clear results for publication, we
likely would have increased the confidence limit parameter
to make the bandits even less conservative.
In support of the broader research community, we intend to
make the data from these bandit experiments available
online at pslcdatashop.org. Given that seasonality
affected the performance of the bandit algorithms, having
real-world data may be useful for others who seek to
contribute more reliable bandit algorithms for UI
We present this work to help designers understand some of
the dangers of automated experimentation. Our results show
how easy it is to optimize for the wrong metric. Indeed, it is
not that maximizing engagement is wrong, per se; however,
when taken to the extreme, it produces designs that are
unquestionably bad. (How do we know that the large
designs are objectively bad, and not just a matter of taste?
Well, we got a call from Brainpop.com, telling us that they
were getting emails from teachers asking about a broken
game.)
Bandits are very capable of optimizing for a metric – but if
this is not the best measure of optimal design, the bandit
can easily optimize for the wrong outcome. For example, it
is much easier to measure student participation than it is to
measure learning outcomes, but conditions that are most
highly engaging are often not the best for learning (e.g.,
students perform better during massed practice, but learn
more from distributed practice). In our study, the super
large ship was engaging, but was unlikely to create the best
learning environment.
With an increase in data-driven design, it is important that
designers develop a critical and dialectical relationship to
optimization metrics. We recommend making it as easy as
possible for designers to personally experience design
variations as they are optimized so that metrics alone are
not the sole source of judgment. While this may interfere
with the desire for automated design, it is nevertheless
critical to ensure that the automation is designing for the
In the original design of Battleship Numberline, the
assumed optimal design of the target size was 95%
accuracy – as in, this was delivered as a final design.
However, this level of difficulty turns out to be significantly
less engaging than other levels. This suggests that during a
design process, designers can also provide parameters for
design space exploration—in addition to providing a single,
assumed optimal design. These “Fuzzy Designs” can then
be explored and optimized using online data.
Finally, we recommend using bandits that make greater use of
randomization. Even epsilon-greedy bandits, while
inefficient, are likely to be far more robust in practice than
bandits that do not always randomly sample to some extent.
Randomized Probability Matching and Thompson
Sampling are also likely to be less susceptible to seasonality
effects. In part, this is because they will tend to continue to
test all arms, but test the lower performing arms with a
lower probability. As a result, during “seasonal” variation,
more conditions will get a chance to shine or dim within
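The two randomized selection rules mentioned above can be sketched as follows. This is a minimal illustration under simplifying assumptions: rewards are treated as binary (for example, whether a session passed some engagement threshold), Thompson Sampling uses a Beta posterior with a uniform prior, and the per-arm counts are hypothetical and held fixed so that only the selection behavior is shown.

```python
import random


def epsilon_greedy(means, epsilon, rng):
    # With probability epsilon explore uniformly; otherwise exploit the best mean.
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])


def thompson(successes, failures, rng):
    # Draw one sample from each arm's Beta posterior and play the largest draw.
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])


rng = random.Random(0)
successes = [12, 10, 8]  # hypothetical successes per arm
failures = [8, 10, 12]   # hypothetical failures per arm
means = [s / (s + f) for s, f in zip(successes, failures)]

eg_counts = [0, 0, 0]
ts_counts = [0, 0, 0]
for _ in range(5000):
    eg_counts[epsilon_greedy(means, 0.1, rng)] += 1
    ts_counts[thompson(successes, failures, rng)] += 1

print("epsilon-greedy selections:", eg_counts)
print("Thompson Sampling selections:", ts_counts)
```

Both rules keep sending some traffic to every arm. Epsilon-greedy spreads its exploration evenly across arms, whereas Thompson Sampling tests lower-performing arms with a lower probability that reflects the remaining uncertainty about them.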
The purpose of this paper is to provide empirical evidence
that can help designers understand the promise and perils of
using “multi-armed bandits” for interface optimization. To
be clear, this paper does not intend to introduce a newer,
better, faster bandit algorithm!
In summary, we show how user interface design can be
framed as a multi-armed bandit problem. In two large-scale
online meta-experiments, we illustrate new methods for
optimizing game UI using crowd-sourced data collected
from players. We then empirically evaluated the online
application of multi-armed bandit optimization methods in
a real-world context involving the maximization of student
engagement in an educational game.
We demonstrate that bandit-based experimental design
systems can automatically optimize a design space based
upon an established outcome metric. In the process, we
show that bandits can maximize the utility of
experimentation to students by minimizing exposure to
low-value game design configurations. By balancing
exploration and exploitation, bandits help maximize the
benefits to both designers and users.
However, our data illustrate several major challenges for
data-driven designers. First of all, we show that
optimization methods that do not involve random
assignment have the potential to cause issues with the
validity of data. Secondly, we show that automatic
optimization is susceptible to optimizing the design for the
wrong metric.
We provide mixed evidence about the ability of bandits to
reduce the cost of experimentation in a data-driven design
process. Part of our goal was to eliminate the need to
constantly analyze data and make judgments about
statistical significance. While bandits address this need
nicely, we discovered that monitoring and human judgment
may be necessary to ensure that bandits are actually
optimizing for the right metric. While this is important in
A/B testing as well, bandits create a stronger potential for
an experiment to automatically optimize for the wrong
metric.
Ethical considerations of online research
There are significant ethical issues that accompany large-
scale online experiments. Recent online experiments (viz.
the infamous “Facebook Mood Experiment”) caused
global distress, in large part due to a perceived conflict
between the pursuit of basic scientific knowledge and the
best interests of unknowing users.
To this end, this paper addresses one specific ethical issue
that is present with online and offline educational research:
the notion of “fairness” in experimentation. Experimental
education research necessarily requires that some subjects
be given access to certain educational resources that are not
available to other subjects. Even cross-over experimental
designs (where all subjects get access to the same resources
but in a different order) can be problematic for students
feeling “left out”. While experiments are fundamentally
required to determine which resources cause what
outcomes, it is not always necessary to continue
experiments to the end.
In a typical experiment involving random assignment, each
condition receives the same number of subjects, as the
“best” condition is not known a priori. In contrast, as it
becomes likely that some conditions are better than others,
bandit algorithms could selectively allocate subjects to
these higher performing conditions. As a result, these
algorithms may deliver greater benefits to participants in
large online studies.
Whereas random assignment optimizes for the precision of
measurement, bandit algorithms optimize for outcomes.
Bandit algorithms, therefore, have the potential to improve
upon random assignment within applied research settings
where the goal is to find "the best design", rather than
measuring precisely how bad the alternatives are. However,
bandits have also been shown to have application for
scientific research.
For the future-oriented, it may be helpful to view bandits as
simple machine learning algorithms that represent the early
stages of complex Artificially Intelligent systems. With this
view, one may assume that automated explorations of
design spaces will achieve greater and greater capabilities
with time. For instance, our bandit algorithms treated
continuous variables as discrete variables; if we sought to
test a slightly different target size, the algorithm would need
to start collecting data from scratch. Similarly, our
algorithms were unable to discover or take advantage of daily
population changes in engagement, where players appeared
more engaged with difficult games earlier in the morning.
Eventually, optimization algorithms may be able to identify
While computer scientists will surely develop more
advanced algorithms, in other areas of AI system design, it
is often useful to think about designing the larger human-
computer system. Don Norman has described this as the
“human-technology teamwork framework.” According to
this framework, designers should aim to integrate human
and computer capabilities by first identifying the unique
capabilities of each. Therefore, future work should consider
specific ways in which user data and experimentation
systems can best leverage human contextual understanding
as well as AI-based search. Such human-AI systems are
predicted to be more capable of generating effective
interface optimizations and innovations than human or AI
alone.
While this paper focused on the use of online
experimentation to drive interface optimization (applied
research), there are also unique opportunities for scientific
discovery. In the learning sciences, for instance, with
millions of learners engaged in online digital learning
environments, scientists can now conduct learning science
experiments on a massive scale. Indeed, pure optimization
experiments may well lead to new generalizable insights by
accident. This may lead to a deeper integration of
basic science (i.e., improving theory) and applied research
(i.e., improving user outcomes).
However, it is unclear how the scale of online research will
be adopted by the scientific community. While recruiting
100 students to participate in an educational experiment can
take weeks, online research can provide thousands of
subjects in a single day [25,33]. This massive scale poses
new challenges to scientists who wish to efficiently benefit
from this vast increase in experimental data. For instance,
how would a typical psychology research group change
their approach if they had a thousand subjects show up to
their laboratory every day, for a year?
Future researchers who build on existing work with
bandits in scientific discovery may also find it useful to
consider how the emerging framework of “Human-
Technology Teamwork” can guide the integration of
artificial intelligence with scientific inquiry. Eventually,
such work may even contribute solutions to the problem of
writing good scientific papers. After all, even if a scientific
team could design and run a meaningful online experiment
every single day (perhaps with the aid of more complex
multi-armed bandits), how could they ever hire enough
graduate students to write the papers? Alas.
But, putting aside the issues of scientific contribution, it is
worth considering what it would mean to have AI-human
systems conducting thousands of experiments to determine
what interface designs are most engaging and attractive to
users. Wouldn’t this lead to computer interactions that are
so addictive and compelling that they consume most of the
average person’s waking hours? In some ways, of course,
Facebook, Google, and other major corporations have
already become these AI-human systems. So, if their
interfaces are compelling now, just imagine what is to
come.
1. Agarwal, D., Chen, B.C., and Elango, P. (2009)
Explore/Exploit Schemes for Web Content
Optimization. Ninth IEEE International Conference on
Data Mining, 1–10.
2. Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002)
Finite-time Analysis of the Multiarmed Bandit Problem.
Machine Learning, 235–256.
3. Berry, D. (2011) Adaptive Clinical Trials: The Promise
and the Caution. Journal of Clinical Oncology: Official
Journal of the American Society of Clinical Oncology
29, 6. 603–6.
4. Brezzi, M. and Lai, T.L. (2002) Optimal learning and
experimentation in bandit problems. Journal of
Economic Dynamics and Control 27, 1. 87–108.
5. Card, S., Mackinlay, J., & Robertson, G. (1990). The
design space of input devices. ACM CHI
6. Chapelle, O., & Li, L. (2011). An empirical evaluation
of Thompson sampling. In Advances in neural
information processing systems (pp. 2249-2257).
7. Drachen, A. and Canossa, A. (2009) Towards Gameplay
Analysis via Gameplay Metrics. ACM MindTrek, 202–
8. Fogarty, J., Forlizzi, J., and Hudson, S.E. (2001)
Aesthetic Information Collages: Generating Decorative
Displays that Contain Information. ACM CHI
9. Gajos, K., & Weld, D. S. (2005). Preference elicitation
for interface optimization. ACM UIST (pp. 173-182).
10. Gajos, K., Weld, D., and Wobbrock, J. Decision-
Theoretic User Interface Generation. AAAI, (2008),
11. Gittins, J. (1979) Bandit Processes and Dynamic
Allocation Indices. Journal of the Royal Statistical
Society. Series B., 148–177.
12. Glaser, R. (1976). Components of a psychology of
instruction: Toward a science of design. Review of
Educational Research, 46(1), 1–24.
13. Hacker, S. (2014) Duolingo: Learning a Language
While Translating the Web. PhD Thesis, Carnegie
Mellon University School of Computer Science. May
14. Hauser, J.R., Urban, G.L., Liberali, G., and Braun, M.
(2009) Website Morphing. Marketing Science. 28, 2,
15. Khajah, M., Roads, B. D., Lindsey, R. V, Liu, Y., &
Mozer, M. C. (2014). Designing Cognitive-Training
Games to Maximize Engagement, 1–9.
16. Koedinger, K. R., Booth, J. L., Klahr, D. (2013)
Instructional Complexity and the Science to Constrain It
Science. 22 November 2013: Vol. 342 no. 6161 pp. 935-
17. Koedinger, K. R., Baker, R. S., Cunningham, K.,
Skogsholm, A., Leber, B., & Stamper, J. (2010). A data
repository for the EDM community: The PSLC
DataShop. Handbook of educational data mining, 43.
18. Kohavi, R., Longbotham, R., Sommerfield, D., and
Henne, R.M. (2008) Controlled experiments on the web:
survey and practical guide. Data Mining and Knowledge
Discovery 18, 1 140–181.
19. Kohavi, R., Deng, A., Frasca, B., Longbotham, R.,
Walker, T., and Xu, Y. (2012) Trustworthy Online
Controlled Experiments: Five Puzzling Outcomes
20. Kramer, Adam DI, Jamie E. Guillory, and Jeffrey T.
Hancock. (2014) Experimental evidence of massive-
scale emotional contagion through social networks.
21. Lai, T. (1987) Adaptive treatment allocation and the
multi-armed bandit problem. The Annals of Statistics;
22. Lai, T., & Robbins, H. (1985). Asymptotically efficient
adaptive allocation rules. Advances in Applied
Mathematics, 6, 4–22.
23. Li, L., Chu, W., Langford, J., & Schapire, R.E. (2010) A
Contextual-Bandit Approach to Personalized News
Article Recommendation. WWW
24. Liu, Y., Mandel, T., Brunskill, E., & Popovic, Z. (2014)
Trading Off Scientific Knowledge and User Learning
with Multi-Armed Bandits. Educational Data Mining
25. Liu, Y., Mandel, T., Brunskill, E., & Popovic, Z. (2014)
Towards Automatic Experimentation of Educational
Knowledge. ACM CHI
26. Lomas, D., Patel, K., Forlizzi, J. L., & Koedinger, K. R.
(2013) Optimizing challenge in an educational game
using large-scale design experiments. ACM CHI
27. Lomas, D. Harpstead, E., (2012) Design Space
Sampling for the Optimization of Educational Games.
Game User Experience Workshop, ACM CHI
28. Lomas, J. D. (2014). Optimizing motivation and
learning with large-scale game design experiments
(Unpublished Doctoral Dissertation). HCI Institute,
Carnegie Mellon University.
29. Norman, D. (in preparation) Technology or People:
Putting People Back in Charge. Jnd.org
30. Scott, S. (2010) A modern Bayesian look at the multi-
armed bandit. Applied Stochastic Models in Business
and Industry, 639–658.
31. Scott, S. (2014) Google Content Experiments
32. Simon, H. (1969). The Sciences of the Artificial.
33. Stamper, J. C., Lomas, D., Ching, D., Ritter, S.,
Koedinger, K. R., and Steinhart, J. (2012) The rise of the super
experiment. EDM, 196–200.
34. Vermorel, J. and Mohri, M. (2005) Multi-armed bandit
algorithms and empirical evaluation. Machine Learning:
ECML 2005, 437–448.
35. Yannakakis, G. N., & Hallam, J. (2007). Towards
optimizing entertainment in computer games. Applied
Artificial Intelligence, 21(10), 933-971.