
Interface Design Optimization as a Multi-Armed Bandit Problem

J. Derek Lomas1, Jodi Forlizzi2, Nikhil Poonwala2, Nirmal Patel2, Sharan Shodhan2, Kishan Patel2, Ken Koedinger2, Emma Brunskill2

The Design Lab1, UC San Diego, 9500 Gilman Drive
dereklomas@gmail.com

HCI Institute2, Carnegie Mellon University, 5000 Forbes Ave
{forlizzi,krk,ebrun}@cs.cmu.edu

ABSTRACT

“Multi-armed bandits” offer a new paradigm for the AI-

assisted design of user interfaces. To help designers

understand the potential, we present the results of two

experimental comparisons between bandit algorithms and

random assignment. Our studies are intended to show

designers how bandit algorithms are able to rapidly

explore an experimental design space and automatically

select the optimal design configuration. Our present focus is

on the optimization of a game design space.

The results of our experiments show that bandits can make

data-driven design more efficient and accessible to interface

designers, but that human participation is essential to ensure

that AI systems optimize for the right metric. Based on our

results, we introduce several design lessons that help keep

human design judgment in the loop. We also consider the

future of human-technology teamwork in AI-assisted design

and scientific inquiry. Finally, as bandits deploy fewer low-

performing conditions than typical experiments, we discuss

ethical implications for bandits in large-scale experiments

in education.

Author Keywords

Educational games; optimization; data-driven design; multi-armed bandits; continuous improvement.

ACM Classification Keywords

H.5.m. Information interfaces and presentation (e.g., HCI):

Miscellaneous.

INTRODUCTION

The purpose of this paper is to provide empirical evidence

that can help designers understand the promise and perils of

using multi-armed bandits for interface optimization.

Unlike the days of boxed software, online software makes it

easy to update designs and measure user behavior. These

properties have resulted in the widespread use of online

design experiments to optimize UX design. On any given

day, companies run thousands of these online controlled

experiments to evaluate the efficacy of different designs

[21]. These experiments are changing the nature of design

by introducing quantitative evidence that can serve as a

“ground truth” for design decisions. This evidence,

however, often conflicts with designer expectations. For

instance, at Netflix and Google [22,33] only about 10% of

tested design improvements (which were, presumably,

introduced as superior designs) actually led to better outcomes.

What counts as “good design” is necessarily relative. However, a “better” design is often defined as a design with better outcome metrics. Yet, the relationship between metrics and “good design” is not always clear. Design experiments seemingly show that particular designs are objectively better. However, these metrics are not always conclusive measures of value, though they can appear to be. Even when a particular design gets better results in an A/B test, it is not always the best design for the larger objectives of the organization [9,22].

This paper addresses these issues through a set of

experiments involving different design optimization

algorithms. We present empirical data showing how bandit

algorithms can increase the efficiency of experiments,

lower the costs of data-driven continuous improvement and

improve overall utility for subjects participating in online

experiments. However, our data also show some of the

limitations of relying on AI without human oversight.

Together, our evidence can help designers understand why

design optimization can be seen as a “multi-armed bandit”

problem and how bandit algorithms can be used to optimize

user experiences.

RELATED WORK

The optimization of interfaces based on individual user

characteristics is a significant sub-field within HCI. One

challenge in the field has been the availability of these

evaluation functions at scale. For instance, SUPPLE [12], a

system for algorithmically rendering interfaces, initially

required cost/utility weights that were generated manually

for each interface factor. Later, explicit user preferences were used to generate these cost/utility weights.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
CHI'16, May 7-12, 2016, San Jose, CA, USA
Copyright is held by the owner/authors. Publication rights licensed to ACM. ACM 978-1-4503-3362-7/16/05…$15.00
DOI: http://dx.doi.org/10.1145/2858036.2858425

Making Interfaces Work for Each Individual
#chi4good, CHI 2016, San Jose, CA, USA
4142

Other

algorithmic interface optimization approaches have used

cost functions based on estimated cognitive difficulty, user

reported undesirability, user satisfaction ratings and even

aesthetic principles [11, 42].

Online user data and A/B testing provide an alternative

source for UX evaluation. Hacker [16] contrasts UX

optimization for individual users (student adaptive) with the

optimization of interfaces for all users (system adaptive). In

Duolingo (a language learning app used by millions),

student adaptive UX is supported by knowledge models,

whereas Duolingo’s system adaptivity is a longer term

process that unfolds in response to large-scale A/B tests.

A/B testing is also widely used by Google, Amazon,

Yahoo, Facebook, Microsoft and Zynga to make data-

driven UX decisions on topics including surface aesthetics,

game balancing, search algorithms and display advertising

[22]. With that framing, A/B testing is indeed a widespread

approach for the optimization of interface design.

Design Space Optimization

Design space optimization involves finding the set of

design factor parameter values that will maximize a desired

outcome. This follows Herb Simon’s [38] notion that

designing can be understood in terms of generating design

alternatives and searching among those alternatives for

designs that best satisfy intended outcomes. The concept of

“design spaces” has been used to formally represent design

alternatives as a multiplication of independent design

factors [7]; for instance, the color, size and font of text are

independent design factors. While the notion of design

space exploration as a mechanism for optimization is

commonly used in areas like microprocessor design, there

is also a rich history of the concept in HCI, where design

spaces have been used to express designs as a possibility

space, rather than as a single final product [33].

In this paper, we focus on the optimization of a game

design space, or the multiplication of all game design

factors within a particular game [30]. The complete design

space of a game (all possible variations) can be

distinguished from the far smaller instantiated design

space, which consists of the game variables that are actually

instantiated in the game software at a given time. Games

are a rich area for design space exploration as they often

have many small variations to support diverse game level

design. During game balancing, designers make many small

modifications to different design parameters or design

factors. Thus, the instantiated design space of a game will

often include continuous variables like reward or

punishment variables (e.g., the amount of points per

successful action or the effect of an enemy’s weapon on a

player’s health) and categorical game design variables (e.g.,

different visual backgrounds, enemies or maps).

Increasingly, online games use data to drive the balancing

or optimization of these variables [21].

Even simple games have a massive design space: every

design factor produces exponential growth in the total

number of variations (“factorial explosion”). For instance,

previous experiments with the game presented in this paper

varied 14 different design factors and dozens of factor

levels [29]; in total, the instantiated design space included

1.2 billion variations! Thankfully, design isn't about exhaustively testing a design space to discover the optimum, but rather about finding the most satisfactory design alternative in the available time. Designers often use their imaginative

judgment to evaluate design alternatives, but this is error

prone [22]. Thus, this paper explores multi-armed bandit

algorithms as a mechanism for exploring and optimizing

large design spaces.

Multi-Armed Bandits and Design

The Multi-Armed Bandit problem [25] refers to the

theoretical problem of how to maximize one’s winnings on

a row of slot machines, when each machine has a different

and unknown rate of payout. The solution to a multi-armed

bandit problem is a selection strategy; specifically, a policy

for which machine’s arm to pull to maximize winnings,

given the prior history of pulls and payoffs. The objective is

normally to minimize regret, which is often defined as the

difference between the payoff of one’s policy and the

payoff of selecting only the best machine – which, of

course, is unknown a priori.

This paper frames UX optimization design decision-making

as a multi-armed bandit problem [25]: Each variation of a

UX design can be viewed as a slot machine “arm” where

the payoff (in our case, user engagement) is unknown. We

view data-driven designers as needing to balance exploring

designs that are potentially effective with the exploitation of

designs that are known to be effective. This framing allows

us to draw upon the extensive machine learning research on

algorithms for solving multi-armed bandit problems.

Solving Multi-Armed Bandit Problems

If one doesn’t know which slot machine has the highest

payout rate, what can be done? For instance, one could

adopt a policy of “explore first then exploit”. In this case,

one could test each machine n times until the

average payout of a particular arm is significantly higher

than the rest of the arms; then this arm could be pulled

indefinitely. This “explore first then exploit” approach is

similar to traditional A/B testing, where subjects are

randomly assigned to different conditions for a while and

then one design configuration is chosen as the optimal.

However, this approach is often “grossly inefficient” (p.

646, [36]): for instance, imagine the extreme example

where one arm produces a reward with every pull while all

other arms don’t pay out at all. In a typical experiment, the

worst arm will be pulled just as many times as the best arm. Furthermore, the stopping conditions for

A/B tests are quite unclear, as few online experimenters

properly determine power in advance [21]. As a result, this

runs the risk of choosing the (wrong) condition too quickly.


A Real-World Illustration

Consider a hypothetical game designer who wants to

maximize player engagement by sending new players to the

best design variation: A, B or C (Figure 1). How should she

allocate the next 1000 players to the game? She could adopt a “greedy” policy: always send players to the design condition with the highest observed mean (“C”). However, this approach doesn't take into account that A's true mean might be higher than C's, as there isn't enough information yet.

Figure 1: Example data comparing three different design

conditions. While “C” has the highest mean, “A” has the

highest upper confidence limit. The policy of picking the

condition with the highest upper confidence limit can improve

returns by balancing exploration with exploitation.

So instead, our designer could have a simple policy of just

choosing the design that has the highest upper confidence

limit. This policy would result in testing the engagement of

A with additional players. The additional information

gathered from A would either show that condition A

actually does have a higher mean than condition C, or it

would shrink the confidence interval to the point where the

policy would begin to pick C again.

Practically, this policy of picking the higher upper

confidence limit has the effect of choosing designs that

have either a higher mean (exploitation) or insufficient

information (exploration). This policy was instantiated as

the UCL-bandit in order to help communicate the multi-

armed bandit problem to a general audience. It is not

expected to be a practical algorithm for solving real-world

bandit problems.

Background on Multi-Armed Bandits

In the past few years, multi-armed bandits have become

widely used in industry for optimization, particularly in

advertising and interface optimization [22]. Google, for

instance, uses Thompson Sampling in their online analytics

platform [37]. Yahoo [26] uses online “contextual bandits”

to make personalized news article recommendations. Work

has also been done to apply bandits to basic research, as

well as applied optimization. Liu et al. [27], for instance,

explored the offline use of bandits in the context of online

educational games, introducing the UCB-EXPLORE

algorithm to balance scientific knowledge discovery with

user benefits.

There are many multi-armed bandit algorithms that

sequentially use the previous arm pulls and observations to

guide future selection in a way that balances exploration

and exploitation. Here we present a non-comprehensive

review of approaches to the bandit problem in the context

of online interface optimization.

Gittins Index: This is the theoretically optimal strategy for

solving a multi-armed bandit problem. However, it is

computationally complex and hard to implement in practice [14,6].

Epsilon-First: This is the classic A/B test. Users are

randomly assigned to design conditions for a set time

period, until a certain number of users are assigned, or until

the confidence interval between conditions reaches a value

acceptable to the experimenters (e.g., test until p<0.05).

After this time, a winner is picked and then used forever.

Epsilon-Greedy: Here, the policy is to randomly assign x%

of incoming users, with the remainder assigned to the

highest performing design. A variation known as Epsilon-

Decreasing gradually reduces x over time; in this way,

exploration is emphasized at first, then exploitation [2].

These algorithms are simple but generally perform worse

than other bandit techniques.
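As a minimal sketch (illustrative Python; the function names and structure are ours, not taken from the paper's system), Epsilon-Greedy and its decreasing variant might look like:

```python
import random

def epsilon_greedy(means, epsilon=0.1):
    """Epsilon-Greedy selection: with probability epsilon, assign the
    next user to a random design (explore); otherwise assign them to
    the design with the highest observed mean reward (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(means))                # explore
    return max(range(len(means)), key=lambda i: means[i])  # exploit

def epsilon_decreasing(means, n_total, epsilon0=1.0):
    """Epsilon-Decreasing variant: the exploration rate shrinks as the
    total number of assignments n_total grows."""
    return epsilon_greedy(means, epsilon=epsilon0 / (n_total + 1))
```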

Probability Matching: Players are assigned to a particular

condition with a probability proportionate to the probability

that the condition is optimal [36]. Thus, some degree of

random assignment is still involved. Probability matching

techniques include Thompson Sampling [36,37,38] and

other Bayesian sampling procedures that are used in

adaptive clinical trials [3].
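A minimal Thompson Sampling sketch for binary ("success"/"failure") rewards, assuming a uniform Beta(1,1) prior over each arm's payout rate (illustrative code, not the paper's implementation):

```python
import random

def thompson_select(successes, failures):
    """Thompson Sampling: draw a plausible payout rate for each arm
    from its Beta posterior, then assign the next user to the arm with
    the highest draw. Over many users, each arm is selected with
    probability proportional to the probability that it is optimal."""
    samples = [random.betavariate(s + 1, f + 1)   # Beta(1,1) prior
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])
```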

UCB1: After testing each arm once, users are assigned to

the condition with the highest upper confidence bound.

With slight abuse of notation, in this paper we will use the word “limit” to imply that the data is assumed to be generated from a normally distributed variable, and “bound” to make no assumption on the underlying data distribution process (except that there exist bounds on the possible data values). UCB was chosen for the present study due to its strong theoretical guarantees, simplicity and computational efficiency.

UCL: In this paper, we introduce an illustrative approach to

bandit problems using upper confidence limits for

assignment, which is intended to help those familiar with

basic statistics to understand the logic underlying the UCB

algorithm. Again, here we slightly abuse the term “limit” to

compute confidence intervals that assume the data is

normally distributed. The benefit of doing this is that the

confidence intervals are typically much tighter than if one

makes no such assumption (as in UCB). This algorithm

operates by calculating the upper confidence limit of a

design condition after first randomly assigning a condition

for a period of time (e.g., 25 game sessions); every new

player thereafter is assigned to the condition with the

highest upper confidence limit.


Recall that an X% confidence interval [a,b] around a

parameter of interest is computed from data such that if the

experiment was repeated many times, at least X% of the

time the true parameter value would lie in the computed

interval for that experiment. For example, this parameter

could be the mean outcome for a design condition. The

confidence limits are precisely the two edges of the interval,

a and b. Confidence intervals can be computed under

different assumptions of how the data is generated given the

parameter of interest (e.g. sampled from a normal

distribution with a particular mean, etc.).

System for Bandit-Based Optimization

We present a system for integrating bandit-based

optimization into a software design process with the

following goals: 1) support the use of data in an ongoing design process by communicating useful information to game designers; 2) automatically optimize the design space identified by designers in order to reduce the cost of experimentation and analysis; and 3) reduce the cost of experimentation to the player community by minimizing exposure to low-value game design configurations. Our

data-driven UX optimization method extends previous

optimization research involving user preferences [12],

aesthetic principles [11], and embedded assessments [13].

Moreover, we extend previous applications of bandits for

offline educational optimization [27] by demonstrating the

value of bandits as a tool for online optimization.

Measures

Optimization relies on an outcome evaluation metric [12]

that drives the decisions made between different design

configurations. Choosing an appropriate outcome variable

is essential, because this metric determines which

conditions are promoted. In this paper, our key outcome

metric for engagement is the number of trials (estimates)

that are made, on average, within each design variation.

EXPERIMENT 1

Our first “meta-experiment” compares three different

algorithms for conducting online experimentation (we use

the term “meta-experiment” because it is experimenting

with different approaches to experimentation). The players

of an educational game were first randomly assigned to one

of these experiments. Then, according to their assignment

method, the players were assigned to a game design within

the experimental design space of Battleship Numberline.

We hypothesize that, in comparison to random assignment,

one of the multi-armed bandit assignment approaches will

result in greater overall student engagement during the

course of the study. Specifically, we hypothesize that multi-

armed bandits can automatically optimize a UX design

space more efficiently than random assignment and also

produce greater benefits and lower costs for subjects. Costs

occur when participants receive sub-optimal designs.

H1: Multi-Armed Bandits will automatically search

through a design space to find the best designs

H2: Automatic optimization algorithms will reduce the cost

of experimentation to players

To test our hypotheses, we simultaneously deployed three experiments involving 1) random assignment, 2) UCB1 bandit assignment, or 3) UCL-95% bandit assignment. The UCL-95% bandit had a randomization period of 25, meaning that all design variations needed 25 data points before the algorithm began assigning on the basis of the upper confidence limit.

Calculating UCB1: In this paper, we calculate the upper confidence bound of a design “D” as the adjusted mean of D (the value must lie between 0 and 1, so we divided each mean by the maximum engagement allowed, 100) plus the square root of (2 x log(n) / n_D), where n is the total number of games played and n_D is the total number of games played in design condition D.
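The calculation above can be sketched in Python as follows (illustrative function names, not the paper's code):

```python
import math

def ucb1_score(mean_trials, n_d, n_total, max_engagement=100):
    """UCB1 index for design condition D: the mean number of trials is
    rescaled into [0, 1] by dividing by the maximum engagement allowed
    (100), then the exploration bonus sqrt(2 * ln(n) / n_D) is added."""
    return mean_trials / max_engagement + math.sqrt(2 * math.log(n_total) / n_d)

def ucb1_select(mean_trials, counts):
    """Assign the next player to the condition with the highest upper
    confidence bound; each arm is assumed to have been tried at least
    once, so every count is >= 1."""
    n_total = sum(counts)
    return max(range(len(counts)),
               key=lambda i: ucb1_score(mean_trials[i], counts[i], n_total))
```

Note how a rarely tried condition receives a large exploration bonus even when its observed mean is mediocre.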

Calculating UCL: The upper confidence limit is calculated

as the mean + standard error x 1.96 (for 95% confidence)

and mean + standard error x 3.3 (for 99.9% confidence).
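A sketch of the UCL calculation, assuming the raw per-player engagement values for a condition are available (illustrative names):

```python
import math

def ucl(values, z=1.96):
    """Upper confidence limit of a condition's mean: mean + z * SE,
    where SE is the sample standard deviation divided by sqrt(n).
    z = 1.96 yields the 95% limit; z = 3.3 the 99.9% limit."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return mean + z * math.sqrt(var / n)
```

The UCL bandit then assigns each new player to the condition whose upper confidence limit is currently highest.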

The Design Space of Battleship Numberline

Battleship Numberline is an online game about estimating

fractions, decimals and whole numbers on a number line.

Players attempt to blow up different targets by estimating

the location of different numbers on a number line. The

“ship” variant (in the xVessel factor) requires users to type

a number to estimate the location of a visible ship on a line.

The “sub” variant requires users to click on the number line

to estimate the location of a target number indicating the

location of a hidden submarine (the sub is only shown after

the player estimates, as a form of grounded feedback [40]).

Figure 2: Variations of the xVessel and the xPerfectHitPercent

design factors. xVessel variants are submarines (hidden until a

player clicks on the line to estimate a number) and ships

(players type in a number to estimate where it is on the line).

xPerfectHitPercent varies the size of the target, in terms of the accuracy required to hit it. The above targets require accuracies of 70% (largest), 95%, 80% and 97% (smallest).

The size of the targets (xPerfectHitPercent) represents the

level of accuracy required for a successful hit. For example,


when 95% accuracy is required, the target is 10% of the

length of the number line—when 90% accuracy is required,

the target is 20% of the line. Thus, big targets are easier to

hit and small targets are more difficult.
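The mapping from required accuracy to target size follows directly: a player may miss by up to (1 - accuracy) on either side of the true location, so the target spans 2 * (1 - accuracy) of the line, consistent with the examples in the text. A one-line sketch:

```python
def target_width(accuracy):
    """Fraction of the number line covered by a target that requires
    the given accuracy for a hit: the player may miss by (1 - accuracy)
    on either side, so the width is 2 * (1 - accuracy)."""
    return 2 * (1 - accuracy)
```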

The combination of these, and other, design factors

constitute the design space of Battleship Numberline. The

present study evaluates different methods for exploring and

then selecting the optimal design condition within this design

space. In this paper, we only explore the xVessel and

xPerfectHitPercent design factors.

Procedure

Over the course of 10 days, 10,832 players of Battleship

Numberline on Brainpop.com were randomly assigned to

three different experimentation algorithms: random, upper

confidence bound (UCB), and upper confidence limit

(UCL). The UCL-95% bandit had a randomization period

of 25 trials before assigning by UCL. Within each

experimentation condition, players were assigned to 1 of 6

different conditions, a 2x3 factorial involving the xVessel

(clicking vs. typing) and the size of targets (xPerfectHitPercent, where larger targets are easier to hit).

After players made a choice to play in the domain of whole

numbers, fractions or decimals, they were given a four-item

game-based pretest [29]. They were then randomly assigned

to an experimental method, which assigned them to one of

12 experimental conditions. If players clicked on the menu

and played again, they were reassigned to a new

experiment.

When players enter the game, the server receives a request. It then allocates conditions to incoming requests using a round-robin method of assignment, first at the experiment level and then at the condition level. For instance, one player would be assigned to random assignment and then to one of the random assignment conditions; the next player in the queue would be assigned to UCB, and the UCB algorithm would then determine which game condition they receive.

Experimental System

Our experimental system was built on Heroku and MongoDB. It was conceived with the intention of providing a data analytics dashboard for designers to monitor the

experiment. The dashboard shows running totals of the

number of times each arm is pulled (i.e., the number of

times a design condition was served to a player), the mean

number of trials played in each condition (our measure of

engagement), the upper confidence limit and the upper

confidence bound of each condition. Our dashboard also

provided a mechanism for designers to click on a URL to

experience for themselves any particular design (i.e., any

arms in the experiment). These links made it easy to sample

the design space and gain human insight into the data. We

also provided a checkbox so that a particular arm could be

disabled, if necessary. Finally, our system made it easy to

set the randomization period and confidence limit for the

UCL algorithm.

RESULTS

In Figure 3 and Table 1, we confirm H2 by showing that

players in the bandit conditions were more engaged; thus,

bandits reduced the cost of experimentation to subjects by

deploying fewer low-engagement conditions.

The UCL and the UCB bandit experiments produced 52% and 24% greater engagement, respectively, than the experiment involving random assignment. Our measure of regret between

experiments is shown in Table 1 as the percent difference in

our outcome metric between the different experiments and

the optimal policy (e.g., Sub90, which had the highest

average engagement). This shows that the UCL-25

experiment (one of the bandit algorithms) achieved the

lowest regret of all experiments.

Figure 3: The bandit-based experiments garnered more student engagement (total number of trials played).

Meta-Experimental Condition     Total Games Logged   Total Trials Played   % Difference from Optimal
Random Assignment               2,818                42,835                36%
UCB Assignment                  2,896                53,274                20%
UCL-25 Assignment               2,931                65,206                2%
Optimal Policy (Sub90)          2,931*               66,534*               0%

Table 1: Comparison of the meta-experiments. * Assumes 22.7 average trials for Sub90 and the same number of games logged. 8,645 total games were logged out of 10,832 total served.

H1, the hypothesis that bandits can automatically optimize

a UX design space, was confirmed with evidence presented

in Figure 4. These data show that random assignment allocated subjects equally across all 6 conditions, whereas both the UCB

bandit and the UCL bandit allocated subjects to conditions

preferentially. The reason for this unequal allocation is the

difference in the efficacy of the conditions for producing

player engagement, as seen in Figure 5.

Figure 5 shows the means and 95% confidence intervals of

each arm in the three experiments. The Y-axis is our

measure of engagement: the average number of trials

played in the game within that condition. All experiments

identify Sub90 as the most engaging condition. In this

variant of the game, players needed to click to estimate a


number on the number line and their estimates needed to be

90% accurate to hit the submarine. All bandit-based

experiments deployed this condition far more than other

conditions (as seen in Figure 4).

Figure 4: Random assignment experimentation equally

allocates subjects whereas both bandit-based experiments

allocate subjects based on the measured efficacy of the

conditions. Total Games Played is the number of players

assigned to each design condition.

Figure 5: The means and 95% confidence intervals of the

experimental conditions. Note that the Upper Confidence

Limit experiment focused data collection on the condition with

the highest upper confidence limit. Y-Axis represents

Engagement, as the total number of trials played.

Figure 6: This graph shows the allocation of game sessions to

the different target sizes and the variation in the average

number of trials played over time.

Note the wide confidence intervals on the right side of Figure 5: these are a result of insufficient data. As can be seen in Figure 4, these conditions each had fewer than 30 data

points. However, if any of the confidence intervals were to

exceed the height of Sub90’s condition in the UCL

experiment, then UCL would deploy those again.

EXPERIMENT TWO

After running the first meta-experiment, our results clearly

supported the value of bandits. However, UCL never tested

additional design variations after “deciding” during the

randomization period that Sub90 was the most engaging

condition (Figure 6). While the superiority of the Sub90 condition was confirmed by the random assignment experiment, this outcome does not illustrate the explore-exploit dynamic of a bandit algorithm. Therefore, to introduce greater variation and

bolster our discussion, we ran a second meta-experiment.

In this meta-experiment, we compared random assignment

with two variations on the Upper Confidence Limit bandit.

We retained our use of the greedy 95% confidence limit

bandit but also added a more conservative 99.9%

confidence limit bandit. The difference between these two

is the parameter that is multiplied by the standard error and

added to the mean: for 95% we multiply 1.96 times the

standard error while for 99.9% we multiply 3.3. We expect

this more conservative and less greedy version of UCL to

be more effective because it is less likely to get stuck in a

local optimum.

H3: UCL-99.9% will tend to deploy a more optimal design

condition than UCL-95% as a result of greater exploration.

Figure 7: Four of the five variations in target size in

experiment 2. The largest submarines (at top, 60 and 70)

appeared to be far too big and too easy to hit. However, they

were generally more engaging than the smaller sized targets.

In the second experiment, we focused on submarines, which

we found to be more engaging than ships (or, in any case,

resulted in more trials played). We eliminated the smallest

size (which was the worst performing) and added a broader

sample: 95%, 90%, 80%, 70%, 60% accuracy required for a

hit. The largest of these, requiring only 60% accuracy for a

hit, was actually 80% of the length of the number line.

Although we felt the targets were grotesquely large, they

actually represented a good scenario for using online

bandits for scientific research. We’d like to collect online

data to understand the optimal size, but we’d want to


minimize how many students were exposed to the

suboptimal conditions. We were sure that this target size

was too large, but could the bandits detect this?

H4: Bandits will deploy “bad” conditions less often than

random assignment.

RESULTS

We found mixed evidence for both H3 and H4. The more

conservative bandit did not appear to explore longer (Figure

10) nor did it deploy better designs (Figure 8), although it

did achieve greater engagement than UCL-95% (Table 2).

Additionally, while the bandits did achieve less regret than

the random assignment condition (Table 2), the conditions

deployed were so bad that we got phone calls from

Brainpop inquiring whether there was a bug in our game!

Figure 8: The “bad design” (Sub60) did surprisingly well in the bandits, where it was assigned second most often, far more often than the design that random assignment identified as optimal.

Meta Experimental Conditions | Total Games Logged | Total Trials Played | % Diff. from Optimal
Random Assignment      | 1,950  | 46,796  | 21%
UCL-95%                | 1,961  | 47,312  | 20%
UCL-99.9%              | 1,938  | 49,836  | 14%
Optimal Policy (Sub70) | 1,950* | 56,628* | 0%

Table 2: Compares each experiment to the optimal policy (Sub70, according to the random assignment experiment). * Assumes 29.04 average trials for Sub70, its mean in the random condition.
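The percentages in the last column are consistent with computing each experiment's shortfall relative to its own observed total, against the optimal policy's 1,950 games × 29.04 trials/game ≈ 56,628 trials. A minimal sketch of that arithmetic (an assumption about how the column was derived; the table itself does not state the formula):

```python
# Reproduce the "% Diff. from Optimal" column of Table 2.
# Assumption: the optimal total is 1,950 games x 29.04 trials/game (Sub70's
# mean under random assignment), and the percentage is the shortfall
# relative to each experiment's observed total.
optimal_total = 1950 * 29.04  # ~56,628 trials

observed = {
    "Random Assignment": 46796,
    "UCL-95%": 47312,
    "UCL-99.9%": 49836,
}

for name, total in observed.items():
    pct_diff = 100 * (optimal_total - total) / total
    print(f"{name}: {pct_diff:.0f}%")  # 21%, 20%, 14%
```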

Figure 9: The means and confidence intervals of each

condition within each experiment. The random assignment

experiment reveals an inverted U-shape, where the largest

target is less engaging than the second largest. The rank order

of condition engagement varies over the experiments.

In Figure 8, we show that the least chosen arm of UCL-95% was the most chosen arm of UCL-99.9%, and vice versa!

Moreover, the second most-picked design of both bandits

was the very largest target, which seemed to us to be far too

large. Finally, the optimal arm, according to random

assignment, was hardly in the running inside the bandit

experiments. What accounts for this finding?

First, all of the design variations in this experiment

performed reasonably well (Figure 9) on our engagement

metric, even the “bad” conditions. Without major

differences between the conditions, all experimental

methods performed relatively well. But, digging in, there

were other problems that have to do with the dangers of

sequential experimentation.

Figure 10: For meta-experiment 2, this graph shows the

allocation of game sessions to the different target sizes and the

variation in the average number of trials played over time.

Figure 10 shows the allocation of designs over time. The X-

axis represents game sessions over time (i.e., “arm pulls”, as

each game session requests a design from the server). The

smoother line shows the mean number of trials played over

time (i.e., engagement over time). Note that the randomly

assigned mean varies significantly over time, by nearly

50%! This reflects the “seasonality” of the data; for

instance, the dip in engagement between sessions 2000 and 3000

represents data collected after school ended, into the

evening. And the rise around 3000 represents a shift to the

morning school day.

Therefore, the average player engagement in the bandits

can be expected to be affected by the same seasonal factors

as those affecting random assignment, yet also vary for

other reasons. In particular, the average engagement of

these bandits will be affected by the different conditions

they are deploying. Possible evidence of this phenomenon

can be noted in the concurrent dip in engagement around

4200 – a dip that is not present in random assignment. This

dip occurs during the morning hours, so one plausible



explanation is that the morning population is not as engaged by the designs deployed by the bandits as other populations are. It appears that students in the morning are more engaged by challenge, possibly because they have higher ability, on average.

In both meta-experiments, all bandit algorithms tended to

test different design conditions in bursts. Note for instance

that Sub70, the optimal condition, was explored briefly by UCL-95% around sessions 2500-3000, yet this period was the time when all conditions were faring the worst. So, this

condition had the bad luck of being sampled at a time when

any condition would have a low mean. This shows the

limitations of sequential experimentation in the context of

time-based variations.
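This hazard can be shown with a toy simulation (hypothetical numbers, not the study's data): two designs with identical true engagement end up with very different observed means when one is sampled uniformly across the day and the other in a burst during a low-engagement period.

```python
# Illustrative simulation (not the study's data): two designs with the SAME
# true engagement profile, but one is sampled mostly during a low-engagement
# period. A sequential experimenter that samples in bursts underestimates
# that design; interleaved random assignment does not.
import random

random.seed(0)

def trials_played(hour):
    # Engagement depends on time of day: evenings are less engaged.
    base = 24 if 8 <= hour < 16 else 12
    return max(0, random.gauss(base, 3))

# Arm A: sampled uniformly across the day (like random assignment).
arm_a = [trials_played(h % 24) for h in range(1000)]
# Arm B: sampled in a burst during the evening (like a bandit's burst).
arm_b = [trials_played(random.choice(range(16, 24))) for _ in range(1000)]

mean_a = sum(arm_a) / len(arm_a)
mean_b = sum(arm_b) / len(arm_b)
# Despite identical true engagement, the burst-sampled arm looks worse.
print(f"uniform mean: {mean_a:.1f}, burst mean: {mean_b:.1f}")
```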

We cannot confirm our hypothesis that the UCL-99.9%

bandit actually explored more than UCL-95%. Visually, it

appears that UCL-99.9% tested more conditions

more often than UCL-95%. However, both bandits fell into

a local optimum at roughly the same time, based on when

each began exclusively deploying one condition.

GENERAL DISCUSSION

Our studies explore the optimization of the design space of

an educational game using three different multi-armed

bandit algorithms. In support of using bandit algorithms

over standard random experimental design, we consistently

found that bandits achieved higher student engagement

during the course of the experiment. While we focused on

game design factors for optimization, this work is relevant

to any UI optimization that seeks to manipulate independent

variables to maximize dependent variables.

Our work is notable, in part, for the problems it uncovered

with automated data-driven optimization. For instance, we

were surprised to find that one of the most engaging

conditions was Sub60 (the absurdly large condition in

Figure 7), despite the fact that it was included for the

purpose of testing the ability of the bandits to identify it as a

poor condition. This discrepancy indicates that our metric

(number of trials played) may be the wrong metric to

optimize. Alternatively, the metric might be appropriate,

but we (and Brainpop) might be wrong about what is best

for users. Our work illustrates how automated systems have

the potential to optimize for the wrong metric. The risks of

AI optimizing arbitrary goals have also been raised by AI

theorists [5]; one thought experiment describes the dangers

of an AI seeking to maximize paperclip production.

Dangers of sequential experimentation in the real world

Our results also point to practical issues that must be

understood and resolved for bandit algorithms to transition

from computer science journals to the real world. For

instance, we found that time-based variations (e.g., average

player engagement was less during the night than during the

day) significantly affected the validity of our sequentially

experimenting bandits. These fluctuations due to contextual

time-of-day factors have a much bigger effect on

sequentially experimenting bandits than random

assignment. So, even though much more data was collected

about particular arms, it was not necessarily more

trustworthy than the data collected by random assignment.

While it is likely that conservative UCB-style bandits

would eventually identify the highest performing arm if

they were run forever, these time-based variation effects

can significantly reduce their performance. In contrast,

these effects may help explain the remarkable success of

simple bandit algorithms like Epsilon-Greedy, which

randomize a portion of the traffic and direct another portion

to the condition with the highest mean. Thompson

Sampling also randomly assigns players to all conditions,

but with lower probabilities for lower performing

conditions. While known factors (like time-of-day) can be

factored into bandit algorithms [26], any bandit that

involves randomization (like Thompson Sampling) is likely

to be more trustworthy in messy real-world scenarios.
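As a rough sketch of these two randomized allocation rules (the arm statistics are hypothetical, and this is not the implementation used in our study):

```python
# Sketch of the two randomized allocation rules discussed above
# (hypothetical arm statistics; not the study's implementation).
import random

def epsilon_greedy(means, epsilon=0.1):
    """With probability epsilon explore a random arm; otherwise exploit
    the arm with the highest observed mean."""
    if random.random() < epsilon:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])

def thompson_sampling(successes, failures):
    """Sample a plausible success rate for each arm from its Beta
    posterior and play the arm whose sample is highest. Lower-performing
    arms are still chosen, just with lower probability."""
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

# Example: three designs with observed engagement statistics.
print(epsilon_greedy([24.0, 26.5, 22.1]))
print(thompson_sampling(successes=[40, 55, 30], failures=[60, 45, 70]))
```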

Limitations

Our goal was to run an empirical study to illustrate how

bandits work, not to introduce better algorithms. Our work

might be viewed as specific to games; however, we view it

in the context of any situation where experimentation with

design factors (independent variables) might optimize

outcome measures (dependent variables).

Our algorithms, however, were simple. For instance, they could not take into account factors such as time of day, as contextual bandits do [26]. Additionally, our bandit

algorithms did not allow for the sharing of information

between arms or make use of previously collected data,

which would be especially useful for continuous game

design factors (such as the size of our targets) or larger

design spaces. To this end, we might have considered

approaches making use of fractional factorial designs, linear

regressions or a “multi-armed mafia” [36]. Recent work

combining Thompson Sampling and Gaussian Processes is

also promising [18].

Our UCL bandit was designed to help explain how bandits

work to a general audience, specifically, by illustrating the

conceptual relationship between the mechanism of the UCB

algorithm and the policy of “always choosing the design

with the highest error bar (upper confidence limit).” In our

experience, general audiences quickly grasp this idea as an

approach for balancing exploration and exploitation. In

contrast, they have far less familiarity with Chernoff-Hoeffding

bounds (the basis for the UCB bandit). This illustrative

value of the UCL algorithm is important because our goal is

to contribute an understanding of UI design optimization as

a multi-armed bandit problem to designers, not contribute a

new bandit algorithm to the machine learning community.
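To make the contrast concrete, here is a minimal sketch of the two selection indices (the arm statistics are hypothetical; the UCL z-value of 1.96 corresponds to a ~95% interval, 3.29 to ~99.9%):

```python
# Sketch contrasting the UCB1 index with the UCL policy of "always choose
# the design with the highest upper confidence limit". Arm statistics here
# are hypothetical.
import math

def ucb1_index(mean, n_arm, n_total):
    # Chernoff-Hoeffding-based exploration bonus; grows with total pulls,
    # so under-sampled arms are eventually revisited.
    return mean + math.sqrt(2 * math.log(n_total) / n_arm)

def ucl_index(mean, std, n_arm, z=1.96):
    # Upper limit of a normal confidence interval (z=1.96 -> ~95%,
    # z=3.29 -> ~99.9%); assumes approximately normal data.
    return mean + z * std / math.sqrt(n_arm)

# Example: one well-sampled design and one barely explored design.
arms = [  # (mean trials, std, times played)
    (24.0, 1.0, 900),
    (23.6, 1.0, 25),
]
n_total = sum(n for _, _, n in arms)

best_ucb = max(range(len(arms)),
               key=lambda i: ucb1_index(arms[i][0], arms[i][2], n_total))
best_ucl = max(range(len(arms)),
               key=lambda i: ucl_index(arms[i][0], arms[i][1], arms[i][2]))
# With low variance, the barely explored arm's UCL interval stays narrow,
# so the two rules can disagree: UCB1's log(n_total) bonus still favors it.
print(best_ucb, best_ucl)  # -> 1 0
```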

Indeed, there are fundamental problems that should be

expected from the UCL bandit. Constructing bandits that

operate using confidence intervals is conceptually similar to

running experiments and constantly testing for significance


and then running the condition that is, at the time,

significant. However, significance tests assume that sample

size was estimated in advance. While our bandits were

“riding the wave of significance,” they were susceptible to overconfidence in the apparent significance

of a particular condition. This is a major problem in

contemporary A/B testing, as well [11].

There are other fundamental issues with UCL. For instance,

unlike UCB1, both UCL bandits had a tendency to fall into

a local maximum for a long period of time, without

exploration. This is likely to be a property of UCL rather

than UCB simply being more conservative in its approach.

As N (total number of arms) increases, the confidence

bounds will decrease, whereas the confidence intervals have

no natural tendency to decrease. Finally, the data in our

sample are not normal; they follow a distribution that may

be negative binomial or beta. While the UCB1 algorithm

does not rely on an underlying assumption of normality,

both UCL algorithms do.

In support of the broader research community, we intend to

make the data from these bandit experiments available

online at pslcdatashop.org [20]. Given that seasonality

affected the performance of both bandit algorithms, having

real-world data may be useful for others who seek to

contribute more reliable bandit algorithms for UI

optimization.

Implications

Implications for Practical Implementation: For those

considering the use of bandits in the real world, we highly

recommend using approaches that involve some degree of

randomization (such as epsilon-greedy or Thompson

Sampling). Without any randomization for comparison,

there is no “ground truth” that would allow one to discover

issues with seasonality, etc. As of this writing, there are

now a variety of online resources and companies that can

guide the implementation of bandits [4,37].

Implications for Data-Driven Designers: This work is

intended to help designers understand the dangers of

automated experimentation, in particular, how easy it is to

optimize for the wrong metric. Indeed, it is not that

maximizing engagement (our metric) is wrong, per se;

however, when maximized to the extreme, it produces

designs that appear quite bad.

Bandits are very capable of optimizing for a metric – but if

this is not the best measure of optimal design, the bandit

can easily optimize for the wrong outcome. For example, it

is much easier to measure student participation than it is to

measure learning outcomes, but conditions that are most

highly engaging are often not the best for learning (e.g.,

students perform better during massed practice, but learn

more from distributed practice [19]). In our study, the

extremely large target was highly engaging, but was

unlikely to create the best learning environment [29].

Further work will continue to be necessary to refine our

outcome criterion.

With the general increase in data-driven design, we think it

is important for designers to develop a critical and dialectical

relationship to optimization metrics. To support the role of

human judgment in an AI-Human design process, we

recommend making it as easy as possible for designers to

personally experience design variations as they are

optimized. Metrics alone should not be the sole source of

judgment; human experience will remain critical. Human

designers should be trained to raise alternative hypotheses

to understand why designs might optimize metrics but be

otherwise objectionable [22].

When design becomes driven by metrics, designers must be

able to participate in value-based discussions about the

relationships between quantitative metrics and ultimate

organization value [9]. Designers must be prepared to

engage in an ongoing dialogue about what “good design”

truly means, within an organization’s value system. Design

education should support training to help students engage

purposefully with the meaning behind quantitative metrics.

Implications for AI-Assisted Design: In general, we wonder:

how might people and AI work together in a design

process? According to Norman’s “human-technology

teamwork framework” [35], human and computer

capabilities should be integrated on the basis of their unique

capabilities. For instance, humans can excel at creating

novel design alternatives and evaluating whether the

optimization is aligned with human values. AI can excel at

exploring very large design spaces and mining data for

patterns. Integrated Human-AI “design optimization teams”

are likely to be more effective at optimizing designs than

human or AI systems alone.

Importantly, both human judgment and AI-driven

optimization can be wrong—so, ideally systems should be

designed to support effective human-technology teamwork

[35]. For instance, the original design of Battleship

Numberline had a target size of 95% accuracy; while the

designer felt this was best, this level of difficulty turns out

to be significantly less engaging than other levels. At the

same time, when the automated system was permitted to

test a very broad range of options, it ended up generating

designs that deeply violated our notion of good.

Designers can support optimization systems by producing

“fuzzy designs” instead of final designs, where a fuzzy

design is the range of acceptable variations within each

design factor. More than a range, however, we recommend

that designers deliver an “assumed optimal” setting for each

design parameter along with a range of acceptable

alternatives. AI can learn which alternative produces the

best outcomes, but at the same time, designers can learn by

reflecting on discrepancies between assumed optimal

designs and actually optimal designs. This reflection has the


potential to support designer learning and create new

insights for design and theory.

Implications for AI-Assisted Science: We note that previous

work [31] showed that the effect of “surface-level” design

factors (e.g., time limits, tickmarks) may be mediated

by “theory-level” design factors (e.g., “difficulty” or

“novelty”). Thus, generalizable theoretical models might be

uncovered algorithmically, or, more likely, through AI-

human teams. Nevertheless, we anticipate significant

opportunities for AI to support the discovery of

generalizable theoretical models that can support both

product optimization and scientific inquiry.

To be clear, the explosion of experiments with consumer

software is driven by optimization needs, not science. That

is, the purpose is to improve outcomes, not to uncover

generalizable scientific theory. Yet, large optimization

experiments have the potential to lead to generalizable

insight (as with [29]). If the number of software

experiments continues to increase (particularly with bandit-

like automation), it would be wise to understand

opportunities for how these experiments can also inform

basic science. In areas like education, where millions of

students are now engaged in digital learning, there may be

many mutual benefits from a deeper integration of basic

science (i.e., improving theory) with applied research (i.e.,

improving outcomes).

Online studies can easily involve thousands of subjects in a

single day [28,38]. This is like having thousands of subjects

show up to a psychology lab every day. Clearly, scientists

don’t have enough graduate students to analyze the results

of dozens of experiments run every day of the year. This

suggests that, while there is significant “Big Science”

potential in conducting thousands of online experiments,

deeper AI-human collaboration may be required to realize

this potential. As others have suggested, bandits may help

support this scientific exploration [27].

Yet, AI-Assisted experimentation may present some degree

of existential risk. We have already discussed how runaway

AI might optimize the “wrong thing.” Keeping a human in

the loop can mitigate this risk. However, there is another

long-term risk: if AI-human systems are able to conduct

thousands of psychological experiments with the intent of

optimizing human engagement, might this eventually lead

to computer interactions that are so addictive that they

consume most people’s waking hours? (Oops, too late!)

Still, if online interfaces are highly engaging now, we can

only imagine what will come with AI-assisted design.

Ethical considerations of online research

There are significant ethical issues that accompany large-

scale online experiments involving humans. For instance,

the infamous “Facebook Mood Experiment” [23] prompted

a global uproar due to a perceived conflict between the

pursuit of basic scientific knowledge and the best interests

of unknowing users. Many online commenters bristled at

the idea that they were “lab rats” in a large experiment.

Online scientific research in education, although offering

enormous potential social value (e.g., advancing learning

science), faces the potential risk of crippling public

backlash. We suggest that multi-armed bandits in online

research could actually help assuage public fears.

First, bandit algorithms might help address the issue of

fairness in experimental assignment. A common concern

around education experiments is that they are unfair to the

half of students who receive the lower-performing

educational resource. Bandits offer an alternative where

each participant is most likely to be assigned to a resource

that brings better or equal outcomes. Indeed, bandit-based

experiments like ours are designed to optimize the value

delivered to users, unlike traditional experimentation. Thus,

we suggest that bandits may have a moral advantage over

random assignment if they can adequately support scientific

inference while also maximizing user value. Future work,

of course, should explore this further.

CONCLUSION

The purpose of this paper is to illustrate how user interface

design can be framed as a multi-armed bandit problem. As

such, we provide experimental evidence to illustrate the

promise and perils of automated design optimization. Our

two large-scale online meta-experiments showed how

multi-armed bandits can automatically test variations of an

educational game to maximize an outcome metric. In the

process, we showed that bandits maximize user value by

minimizing their exposure to low-value game design

configurations. By automatically balancing exploration and

exploitation, bandits can make design optimization easier

for designers and ensure that experimental subjects get

fewer low-performing experimental conditions. While the

future is promising, we illustrate several major challenges.

First, optimization methods lacking randomization have

serious potential to produce invalid results. Second,

automatic optimization is susceptible to optimizing for the

“wrong” thing. Human participation remained critical for

ensuring that bandits were optimizing the “right” metric.

Overall, bandits appear well positioned to improve upon

random assignment when the goal is to find "the best

design", rather than measuring exactly how bad the

alternatives are.

ACKNOWLEDGMENTS

Many thanks to Allisyn Levy & the Brainpop.com crew!

For intellectual input, thank you to Burr Settles, Mike

Mozer, John Stamper, Vincent Aleven, Erik Harpstead,

Alex Zook and Don Norman. The research reported here

was supported by the Institute of Education Sciences, U.S.

Department of Education, through Grant R305C100024 to

WestEd, and by Carnegie Mellon University’s Program in

Interdisciplinary Education Research (PIER) funded by

grant number R305B090023 from the US Department of

Education. Additional support came from DARPA contract

ONR N00014-12-C-0284.


REFERENCES

1. Agarwal, D., Chen, B.C., and Elango, P. (2009)

Explore/Exploit Schemes for Web Content

Optimization. Ninth IEEE International Conference on

Data Mining, 1–10.

2. Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002)

Finite-time Analysis of the Multiarmed Bandit Problem.

Machine Learning, 235–256.

3. Berry, D. (2011) Adaptive Clinical Trials: The Promise

and the Caution. Journal of Clinical Oncology 29, 6, 603–606.

4. Birkett, A. (2015) When to Run Bandit Tests Instead of

A/B/n Tests. http://conversionxl.com/bandit-tests/

5. Bostrom, N. (2003). Ethical issues in advanced artificial

intelligence. Science Fiction and Philosophy: From

Time Travel to Superintelligence, 277-284.

6. Brezzi, M. and Lai, T.L. (2002) Optimal learning and

experimentation in bandit problems. Journal of

Economic Dynamics and Control 27, 1. 87–108.

7. Card, S., Mackinlay, J., & Robertson, G. (1990). The

design space of input devices. ACM CHI

8. Chapelle, O., & Li, L. (2011). An empirical evaluation

of Thompson sampling. In Advances in Neural Information Processing Systems (pp. 2249-2257).

9. Crook, T., Frasca, B., Kohavi, R., & Longbotham, R.

(2009, June). Seven pitfalls to avoid when running

controlled experiments on the web. In Proceedings of

the 15th ACM SIGKDD international conference on

Knowledge discovery and data mining (pp. 1105-1114).

ACM.

10. Drachen, A. and Canossa, A. (2009) Towards Gameplay

Analysis via Gameplay Metrics. ACM MindTrek, 202–

209.

11. Fogarty, J., Forlizzi, J., and Hudson, S.E. (2001)

Aesthetic Information Collages: Generating Decorative

Displays that Contain Information. ACM CHI

12. Gajos, K., & Weld, D. S. (2005). Preference elicitation

for interface optimization. ACM UIST (pp. 173-182).

13. Gajos, K., Weld, D., and Wobbrock, J. Decision-

Theoretic User Interface Generation. AAAI, (2008),

1532–1536.

14. Gittins, J. (1979) Bandit Processes and Dynamic

Allocation Indices. Journal of the Royal Statistical

Society. Series B., 148–177.

15. Glaser, R. (1976). Components of a psychology of

instruction: Toward a science of design. Review of

Educational Research, 46(1), 1–24.

16. Hacker, S. (2014) Duolingo: Learning a Language

While Translating the Web. PhD Thesis, Carnegie

Mellon University School of Computer Science. May

2014

17. Hauser, J.R., Urban, G.L., Liberali, G., and Braun, M.

(2009) Website Morphing. Marketing Science. 28, 2,

202–223.

18. Khajah, M., Roads, B. D., Lindsey, R. V, Liu, Y., &

Mozer, M. C. (2016). Designing Engaging Games Using

Bayesian Optimization, ACM CHI.

19. Koedinger, K. R., Booth, J. L., and Klahr, D. (2013) Instructional Complexity and the Science to Constrain It. Science 342, 6161, 935-937.

20. Koedinger, K. R., Baker, R. S., Cunningham, K.,

Skogsholm, A., Leber, B., & Stamper, J. (2010). A data

repository for the EDM community: The PSLC

DataShop. Handbook of educational data mining, 43.

21. Kohavi, R., Longbotham, R., Sommerfield, D., and

Henne, R.M. (2008) Controlled experiments on the web:

survey and practical guide. Data Mining and Knowledge

Discovery 18, 1 140–181.

22. Kohavi, R., Deng, A., Frasca, B., Longbotham, R.,

Walker, T., and Xu, Y. (2012) Trustworthy Online

Controlled Experiments: Five Puzzling Outcomes

Explained. KDD

23. Kramer, A. D. I., Guillory, J. E., and Hancock, J. T. (2014) Experimental evidence of massive-scale emotional contagion through social networks. PNAS.

24. Lai, T. (1987) Adaptive treatment allocation and the

multi-armed bandit problem. The Annals of Statistics;

15(3):1091–1114.

25. Lai, T., & Robbins, H. (1985). Asymptotically efficient

adaptive allocation rules. Advances in Applied

Mathematics, 6, 4–22.

26. Li, L., Chu, W., Langford, J., & Schapire, R.E. (2010) A

Contextual-Bandit Approach to Personalized News

Article Recommendation. WWW

27. Liu, Y., Mandel, T., Brunskill, E., & Popovic, Z. (2014)

Trading Off Scientific Knowledge and User Learning

with Multi-Armed Bandits. Educational Data Mining

28. Liu, Y., Mandel, T., Brunskill, E., & Popovi, Z. (2014)

Towards Automatic Experimentation of Educational

Knowledge. ACM CHI

29. Lomas, D., Patel, K., Forlizzi, J. L., & Koedinger, K. R.

(2013) Optimizing challenge in an educational game

using large-scale design experiments. ACM CHI

30. Lomas, D. and Harpstead, E. (2012) Design Space

Sampling for the Optimization of Educational Games.

Game User Experience Workshop, ACM CHI

31. Lomas, D. (2014). Optimizing motivation and learning

with large-scale game design experiments (Unpublished


Doctoral Dissertation). HCI Institute, Carnegie Mellon

University. DOI: 10.13140/RG.2.1.5090.8645

32. Lomas, D., (2013). Digital Games for Improving

Number Sense. Retrieved from https://pslcdatashop.web.cmu.edu/Files?datasetId=445

33. Maclean, A., Young, R. M., Victoria, M. E., & Moran,

T. P. (1991). Questions, Options, and Criteria: Elements

of Design Space Analysis. Human Computer

Interaction, 6, 201–250.

34. Manzi, J. (2012). Uncontrolled: The surprising payoff of

trial-and-error for business, politics, and society. Basic

Books.

35. Norman, D. (in preparation) Technology or People:

Putting People Back in Charge. Jnd.org

36. Scott, S. (2010) A modern Bayesian look at the multi-

armed bandit. Applied Stochastic Models in Business

and Industry, 639–658.

37. Scott, S. (2014) Google Content Experiments

https://support.google.com/analytics/answer/2844870?hl=en&ref_topic=2844866

38. Simon, H. (1969). The Sciences of the Artificial. Cambridge, MA: MIT Press.

Scott, S. L. (2015). Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1), 37-45.

39. Stamper, J., Lomas, D., Ching, D., Ritter, S., Koedinger,

K., & Steinhart, J. (2012) The rise of the super

experiment. EDM p. 196–200

40. Stampfer, E., Long, Y., Aleven, V., & Koedinger, K. R.

(2011, January). Eliciting intelligent novice behaviors

with grounded feedback in a fraction addition tutor.

In Artificial Intelligence in Education (pp. 560-562).

Springer Berlin Heidelberg.

41. Vermorel, J. & Mohri, M. (2005) Multi-armed bandit

algorithms and empirical evaluation. Machine Learning:

ECML 2005, 437–448.

42. Yannakakis, G. N., & Hallam, J. (2007). Towards

optimizing entertainment in computer games. Applied

Artificial Intelligence, 21(10), 933-971.
