
Interface Design Optimization as a Multi-Armed Bandit

Problem

Derek Lomas, Jodi Forlizzi, Nikhil Poonwala, Nirmal Patel, Sharan Shodhan,

Kishan Patel, Ken Koedinger, Emma Brunskill

Carnegie Mellon University

School of Computer Science

5000 Forbes Ave

dereklomas@gmail.com

ABSTRACT

“Multi-armed bandits” offer a new paradigm for designing

user interfaces in collaboration with AI and user data. To

help designers understand the potential, we present the

results of two experimental comparisons between bandit

algorithms and random assignment. Our studies are

intended to show designers how bandit algorithms are able

to rapidly explore an experimental design space and then

automatically select the optimal design configuration.

Our experimental results show that bandits can make data-

driven design more efficient and accessible to interface

designers, but that human participation is essential to ensure

that AI systems optimize for the right metric. Based on our

results, we introduce several design lessons that help keep

human design judgment “in the loop”. As bandits expose

players to significantly fewer “low-performing” conditions,

we discuss ethical implications for large-scale online

experimentation in education. Finally, we consider the

future for “Human-Technology Teamwork” in AI-assisted

design and scientific inquiry.

Author Keywords

Educational games; optimization; data-driven design; multi-

armed bandits; continuous improvement

ACM Classification Keywords

H.5.m. Information interfaces and presentation (e.g., HCI): Miscellaneous.

INTRODUCTION

The purpose of this paper is to provide empirical evidence

that can help designers understand the promise and perils of

using multi-armed bandits for interface optimization.

Unlike the days of boxed software, online software has a

continuous stream of users that makes it possible to support

regular updates at a rapid pace. As part of this shift,

companies now run thousands of online controlled

experiments (i.e., A/B tests) to evaluate the efficacy of

different design decisions [18]. How is this changing

design?

Today’s designers must integrate their skills within larger

organizations that can measure the impact of their designs

with great precision. Who can argue with clear, empirical

data showing that a particular design is objectively better in

the eyes of users? Yet, designers must be prepared to

engage in an ongoing dialogue about what “good design”

really is. While one design may perform better in an A/B

test, it is not always the best design for the larger objectives

of the organization. That is, when design becomes driven

by metrics, there must be ongoing debate about whether the

metrics are actually measuring the ultimate goals. This

questioning of performance metrics becomes increasingly

important in an age of automated design optimization.

This paper addresses some of these concerns by outlining a

set of experiments with different design optimization

algorithms. We present empirical data showing how bandit

algorithms can increase the efficiency of experiments,

lower the costs of data-driven continuous improvement and

improve overall utility for subjects participating in online

experiments. However, our data also shows some of the

dangers of relying on artificial intelligence that operates

without human input. Together, our evidence can help

designers understand why design optimization can be seen

as a “multi-armed bandit” problem and how bandit

algorithms can be used to optimize user experience.

RELATED WORK

The optimization of interfaces based on individual user

characteristics is a significant sub-field within HCI. One

challenge in the field has been the availability of these

evaluation functions at scale. For instance, SUPPLE [9], a

system for algorithmically rendering interfaces, initially

required cost/utility weights that were generated manually

for each interface factor. Later, explicit user preferences

were used to generate these cost/utility weights. Other

algorithmic interface optimization approaches have used

cost functions based on estimated cognitive difficulty, user


reported undesirability, user satisfaction ratings and even

aesthetic principles [8, 35].

Online user data and A/B testing provide an alternative

source for UX evaluation. Hacker [13] contrasts UX

optimization for individual users (student adaptive) with the

optimization of interfaces for all users (system adaptive). In

Duolingo (a language learning app used by millions),

student adaptive UX is supported by knowledge models,

whereas Duolingo’s system adaptivity is a longer term

process that unfolds in response to large-scale A/B tests.

A/B testing is also widely used by Google, Amazon,

Yahoo, Facebook, Microsoft and Zynga to make data-

driven UX decisions on topics including surface aesthetics,

game balancing, search algorithms and display advertising

[19]. With that framing, A/B testing is indeed a widespread

approach for the optimization of interface design.

Design Space Optimization

A design space is the combination of potential design

factors [5]. Design spaces are used to describe the

parametric multiplication of design factors in many areas,

including microprocessor design and wireless signaling.

Herb Simon [32] described how it was possible to separate

the generation of design alternatives (a design space) from the

search within the space for an optimal or satisfactory

design, based upon some requirements. To identify these

optimal designs, Glaser [12] describes an optimization

process in terms of finding the set of values that maximizes

the desired outcome, based on a fitness function that

describes the design parameters, constraints and goals.

In this paper, we focus on the optimization of a game

design space, or the combination of all game design factors

within a particular game [27]. The total design space of a

game (which represents the total possibilities) can be

distinguished from the far smaller instantiated design

space, which consists of the variables that are actually

instantiated in the game at a given time. Games are a rich

area for design space exploration as they often have many

small variations to support diverse game level design.

During game balancing, designers make many small

modifications to different design parameters or design

factors. Thus, the instantiated design space of a game will

often include continuous variables like reward or

punishment variables (e.g., the amount of points per

successful action or the effect of an enemy’s weapon on a

player’s health) and categorical game design variables (e.g.,

different visual backgrounds, enemies or maps).

Increasingly, online games use data to drive the balancing

or optimization of these variables [18].

Even simple games have a massive design space: every

design factor produces exponential growth in the total

number of variations (“factorial explosion”). For instance,

previous experiments with Battleship Numberline [28], the

game presented in this paper, varied 14 different design

factors for a total instantiated design space of 1.2 billion

variations. Clearly, a game’s design space can quickly

surpass the number of players who might test it, eliminating

the possibility of fully testing the design space. However,

the objective of design is not to completely test a design

space but rather to find a satisfactory optimum within it.

Often, this is done through the subjective judgment of a

designer, but human judgment is often wrong [19]. Given

the opportunity of a large online population, multi-armed

bandit algorithms can be an efficient mechanism for

exploring and optimizing large design spaces.

Multi-Armed Bandits and Design

The Multi-Armed Bandit problem [22] refers to the

theoretical challenge of maximizing winnings if presented

with a row of slot machine arms, where each arm has a

different and unknown rate of payout. The solution to a

multi-armed bandit problem is a strategy for selecting the

optimal arm to maximize payoff, specifically, a policy for

which arm to select, given the prior history of pulls and

payoffs. The objective is normally to minimize regret, often

defined as the difference in reward from the arm with the

best payout – which, of course, is unknown a priori.
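This notion of regret can be sketched directly; the arm indices and payout rates below are illustrative, not taken from the paper:

```python
def cumulative_regret(true_means, pulls):
    """Regret after a sequence of arm pulls: the expected reward lost
    relative to always pulling the best arm (whose mean is, of course,
    unknown to the player)."""
    best = max(true_means)
    return sum(best - true_means[arm] for arm in pulls)

# Two illustrative arms with payout rates 0.5 and 0.8: each pull of the
# worse arm forgoes 0.3 of expected reward.
regret = cumulative_regret([0.5, 0.8], [0, 0, 1])
```

A policy that always picks the best arm accrues zero regret; every exploratory pull of an inferior arm adds the gap between that arm's mean and the best arm's mean.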

This paper frames UX optimization design decision-making

as a multi-armed bandit problem [22]: Each variation of a

UX design can be viewed as a slot machine “arm” where

the payoff (e.g., engagement) is unknown. In data-driven

design and continuous improvement, a designer’s goal will

be to balance the exploration of new designs (that may or

may not be effective) with the exploitation of designs that

are known to be effective. This framing allows us to draw

upon the extensive machine learning research on algorithms

for optimizing the selection of arms in multi-armed bandits

problems.

Solving Multi-Armed Bandit Problems

If one doesn’t know which slot machine has the highest

payout rate, what can be done? For instance, one could

adopt a policy of “explore first then exploit”. In this case,

one could test each machine n number of times until the

average payout of a particular arm is significantly higher

than the rest of the arms; then this arm could be pulled

indefinitely. This “explore first then exploit” approach is

similar to traditional A/B testing, where subjects are

randomly assigned to different conditions for a while and

then one design configuration is chosen as the optimal.

However, this approach is often “grossly inefficient” (p.

646, [30]): for instance, imagine the extreme example

where one arm produces a reward with every pull while all

other arms don’t pay out at all. In a typical experiment, the

worst arm will be pulled the same number of times as the optimal arm. Furthermore, the stopping conditions for

A/B tests are quite unclear, as few online experimenters

properly determine power in advance [18]. As a result, this

runs the risk of choosing the (wrong) condition too quickly.

Background on Multi-Armed Bandits

In the past few years, multi-armed bandits have become

widely used in industry for optimization, particularly in

advertising and interface optimization [19]. Google, for

instance, uses Thompson Sampling in their online analytics

platform [31]. Yahoo [23] uses online “contextual bandits”

to make personalized news article recommendations. Work

has also been done to apply bandits to basic research, as

well as applied optimization. Liu et al. [24], for instance,

explored the offline use of bandits in the context of online

educational games, introducing the UCB-EXPLORE

algorithm to balance scientific knowledge discovery with

user benefits.

There are many multi-armed bandit algorithms that

sequentially use the previous arm pulls and observations to

guide future selection in a way that balances exploration

and exploitation. Here we present a non-comprehensive

review of approaches to the bandit problem in the context

of online interface optimization. Longer reviews are available in the machine learning literature.

Gittins Index: This is the theoretically optimal strategy for

solving a multi-armed bandit problem. However, it is

computationally complex and hard to implement in

practice [11, 4].

Epsilon-First: This is the classic A/B test. Users are

randomly assigned to design conditions for a set time

period, until a certain number of users are assigned, or until

the confidence interval between conditions reaches a value

acceptable to the experimenters (e.g., test until p<0.05).

After this time, a winner is picked and then used forever.

Epsilon-Greedy: Here, the policy is to randomly assign x%

of incoming users, with the remainder assigned to the

highest performing design. A variation known as Epsilon-

Decreasing gradually reduces x over time; in this way,

exploration is emphasized at first, then exploitation [2].

These algorithms are simple but generally perform worse

than other bandit techniques.
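A minimal sketch of the Epsilon-Greedy policy described above (the function and variable names are our own, not from the paper's system):

```python
import random

def epsilon_greedy(observed_means, epsilon=0.1, rng=random):
    """With probability epsilon, assign the incoming user to a random
    design condition (explore); otherwise assign to the condition with
    the highest observed mean reward (exploit)."""
    if rng.random() < epsilon:
        return rng.randrange(len(observed_means))
    return max(range(len(observed_means)), key=lambda i: observed_means[i])
```

Epsilon-Decreasing would simply shrink `epsilon` as the number of assignments grows, shifting weight from exploration to exploitation.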

Probability Matching: Players are assigned to particular

conditions with a probability proportionate to the success

rate of the condition [30]. This includes Thompson

Sampling (which predates randomized experiments) and

other Bayesian sampling procedures; these are often used in

adaptive clinical trials in medicine [3].
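For binary rewards (e.g., a player returns or not), probability matching via Thompson Sampling can be sketched with Beta posteriors; this is a generic illustration, not the implementation used in any of the cited systems:

```python
import random

def thompson_sample(successes, failures, rng=random):
    """Draw one sample from each arm's Beta(successes+1, failures+1)
    posterior and assign the user to the arm with the highest draw, so
    arms are selected in proportion to the probability they are best."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])
```

An arm with a strong success record is sampled often, but an under-tested arm still occasionally wins a draw and gets explored.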

UCB1: After testing each arm once, users are assigned to

the condition with the highest upper confidence bound.

Upper confidence bounds are conceptually similar to upper

confidence limits, but they do not statistically assume that

the rewards have a normal distribution [22]. UCB was

chosen for the present study due to its strong theoretical

guarantees, simplicity, and computational efficiency.

UCL: In this paper, we introduce an illustrative approach to

bandit problems using upper confidence limits for

assignment. This algorithm is presented to help those

familiar with basic statistics to understand the logic

underlying the UCB algorithm. This algorithm operates by

calculating the upper confidence limit of a design condition

after first randomly assigning a condition for a period of

time (e.g., 25 game sessions); every new player thereafter is

assigned to the condition with the highest upper confidence

limit.

A Real-World Illustration

Consider a hypothetical game designer who wants to

maximize player engagement by sending new players to the

best design variation: A, B or C (Figure 1). How should she

allocate the next 1000 players to the game? She could adopt a policy of always sending players to the design condition with the highest observed mean (“C”). However, this “greedy” approach ignores the possibility that A’s true mean is higher than C’s; there simply isn’t enough information yet.

Figure 1: Example data comparing three different design

conditions. While “C” has the highest mean, “A” has the

highest upper confidence limit. The policy of picking the

condition with the highest upper confidence limit can

maximize returns by balancing exploration with exploitation.

So instead, our designer could have a simple policy of just

choosing the design that has the highest upper confidence

limit. This policy would result in testing the engagement of

A with additional players. Eventually, the additional

information gathered from A would shrink the confidence

interval to the point where the policy would begin to pick C

again.

Practically, this policy of picking the higher upper

confidence limit has the effect of choosing designs that

have either a higher mean (exploitation) or insufficient

information (exploration). This policy was instantiated as

the UCL-bandit in order to help communicate the multi-

armed bandit problem to a general audience. It is not

expected to be a practical algorithm for solving real-world

bandit problems.
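The designer's policy can be sketched as follows; the means, standard deviations, and sample sizes are hypothetical stand-ins for Figure 1, not the actual data:

```python
import math

def highest_ucl(stats, z=1.96):
    """Return the condition whose upper confidence limit
    (mean + z * standard error) is largest."""
    def ucl(mean, sd, n):
        return mean + z * sd / math.sqrt(n)
    return max(stats.items(), key=lambda kv: ucl(*kv[1]))[0]

# (mean, standard deviation, n) per condition. C has the highest mean,
# but A's small sample gives it the widest interval and the highest UCL.
stats = {"A": (9.0, 8.0, 10), "B": (8.0, 3.0, 100), "C": (10.0, 3.0, 100)}
```

With these numbers the policy explores A; as A accumulates data its interval shrinks, and the policy returns to C unless A's mean proves higher.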

System for Bandit-Based Optimization

We present a system for integrating bandit-based

optimization into a software design process with the

following goals: 1) Support the use of data in an ongoing

design process by communicating useful information to

game designers; 2) Automatically optimize the design space

identified by designers in order to reduce the cost of

experimentation and analysis; and 3) reduce the cost of

experimentation to the player community by minimizing

exposure to low-value game design configurations.

Our data-driven UX optimization method extends previous

optimization research involving user preferences [9],

aesthetic principles [8], and embedded assessments [10].

Moreover, we extend previous applications of bandits for

offline educational optimization [24] by demonstrating the

value of bandits as a tool for online optimization.

EXPERIMENT 1

Our first “meta-experiment” compares three different

algorithms for conducting online experimentation (we use

the term “meta-experiment” because it is experimenting

with different approaches to experimentation).

The players of an educational game were first randomly

assigned to one of these experiments. Then, according to

their assignment method, the players were assigned to a

game design within the experimental design space of

Battleship Numberline.

We hypothesize that, in comparison to random assignment,

one of the multi-armed bandit assignment approaches will

result in greater overall student engagement during the

course of the study. Specifically, we hypothesize that multi-

armed bandits can automatically optimize a UX design

space more efficiently (i.e., with less regret) than random

assignment and also produce benefits for subjects.

H1: Multi-Armed Bandits will automatically search

through a design space to find the best designs

H2: Automatic optimization algorithms will reduce the cost

of experimentation to players

To test our hypothesis, we simultaneously deployed three

experiments involving 1) random assignment 2) UCB1

bandit assignment or 3) UCL-95% bandit assignment. The

UCL-95% bandit had a randomization period of 25,

meaning that all design variations needed to have 25 data

points before it started assignment on the basis of upper

confidence limit.

Calculating UCB1: In this paper, we calculate the upper confidence bound of a design “D” as mean_D / 100 + sqrt(2 × log(n) / n_D), where mean_D (the mean engagement of D) is divided by the maximum engagement allowed (100) so that the value lies between 0 and 1, n is the total number of games played, and n_D is the total number of games played in design condition D.

Calculating UCL: The upper confidence limit is calculated

as the mean + standard error x 1.96 (for 95% confidence)

and mean + standard error x 3.3 (for 99.9% confidence).
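The two scores above can be expressed directly; the natural logarithm is assumed for UCB1, and the constants follow the text:

```python
import math

def ucb1_score(mean_engagement, n_condition, n_total, max_engagement=100):
    """UCB1 score: the condition's mean is rescaled to 0-1 by dividing by
    the maximum engagement allowed (100 trials), then the exploration
    bonus sqrt(2 * ln(n) / n_D) is added."""
    return mean_engagement / max_engagement + math.sqrt(
        2 * math.log(n_total) / n_condition)

def ucl_score(mean, std_error, z=1.96):
    """Upper confidence limit: mean + 1.96 * SE for 95% confidence, or
    mean + 3.3 * SE for 99.9% confidence."""
    return mean + z * std_error
```

Both scores grow when a condition has little data (a large bonus or wide interval), which is what drives exploration under either policy.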

The Design Space of Battleship Numberline

Battleship Numberline is a simple online Flash game that

involves estimating fractions, decimals and whole numbers

on a number line. Players attempt to blow up enemy ships

or submarines (the xVessel design factor) by successfully

estimating the location of different numbers (xItem_Sets)

on a number line. The “ship” design variant requires users

to type a number to estimate the location of a visible ship

on a line; “sub” requires users to click on the number line to

estimate the location of a hidden submarine target. Ship and

sub targets can be large (and easy to hit) or small (and

challenging); their size (xPerfectHitPercent) represents the

level of accuracy required for a successful hit. For example,

when 95% accuracy is required, the target is 10% of the

length of the number line—when 90% accuracy is required,

the target is 20% of the line.
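The relationship between required accuracy and target size implied by these examples can be written down; the formula is inferred from the examples rather than stated explicitly:

```python
def target_width(accuracy_required):
    """Fraction of the number line covered by the target. An estimate
    within +/-(1 - accuracy) of the true location scores a hit, so the
    target spans twice that tolerance: 95% accuracy -> 10% of the line,
    90% -> 20%."""
    return 2 * (1 - accuracy_required)
```

The same rule covers the larger targets in experiment 2, where requiring only 60% accuracy yields a target spanning 80% of the line.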

Figure 2: Samples showing variations of the xVessel and the

xPerfectHitPercent design factor. xVessel varies between

submarines (hidden until a player clicks on the line to estimate

a number) and ships (players type in a number to estimate

where it is on the line). The xPerfectHitPercent design factor

varies the size of the target, in terms of the accuracy required

to hit it. Above, we show 70%, 95%, 80% and 97% accuracy

required for a hit in Battleship Numberline

The combination of these, and other, design factors

constitute the design space of our Battleship Numberline.

The purpose of the meta-experiment is to evaluate different

methods for exploring and then selecting the optimal design

condition within this design space. In this paper, we only

explore the xVessel and xPerfectHitPercent design factors.

Procedure

Over the course of 10 days, 10,832 players of Battleship

Numberline on Brainpop.com were randomly assigned to

three different experimentation algorithms: random, upper

confidence bound (UCB), and upper confidence limit

(UCL). The UCL-95% bandit had a randomization period

of 25 trials before assigning by UCL. Within each

experimentation condition, players were assigned to 1 of 6

different conditions, a 2x3 factorial involving the xVessel

(clicking vs. typing) and the size of targets

(xErrorTolerance, where larger is easier to hit).

After players made a choice to play in the domain of whole

numbers, fractions or decimals, they were given a four-item

game-based pretest [26]. They were then randomly assigned

to an experimental method, which assigned them to one of

12 experimental conditions. If players clicked on the menu

and played again, they were reassigned to a new

experiment.

When players enter the game, the server receives a request.

It then allocates conditions to these incoming requests

based on a round-robin method of assignment that operates first at the experiment level and then at the condition level. For

instance, the player would be assigned to random

assignment and then assigned to one of the random

assignment conditions; the next player in the queue would

be assigned to UCB. The UCB algorithm would then

determine which game condition they would receive.
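This two-level assignment can be sketched as follows; the names and the condition-picking stubs are illustrative, as the actual server logic is not published here:

```python
from itertools import cycle

def make_allocator(experiment_policies):
    """experiment_policies maps an experiment name to a zero-argument
    function that picks a design condition (random choice, UCB1, UCL).
    Incoming game-session requests rotate round-robin across the
    experiments; the chosen experiment's own policy then picks the
    design condition."""
    order = cycle(experiment_policies)
    def assign():
        name = next(order)
        return name, experiment_policies[name]()
    return assign

# Illustrative stubs standing in for the real policies.
assign = make_allocator({
    "random": lambda: "Ship95",
    "ucb": lambda: "Sub90",
    "ucl": lambda: "Sub90",
})
```

Round-robin at the top level keeps the three experiments comparable in sample size, while each experiment remains free to allocate its own conditions however its policy dictates.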

Experimental System

Our experimental system was built on Heroku and

MongoDb. It was conceived with the intention of providing

a data analytics dashboard for designers to monitor the

experiment. The dashboard shows running totals of the

number of times each arm is pulled (i.e., the number of

times a design condition was served to a player), the mean

number of trials played in each condition (our measure of

engagement), the upper confidence limit and the upper

confidence bound of each condition. Our dashboard also

provided a mechanism for designers to click on a URL to

experience for themselves any particular design (i.e., any

arm in the experiment). These links made it easy to sample

the design space and gain human insight into the data. We

also provided a checkbox so that a particular arm could be

disabled, if necessary. Finally, our system made it easy to

set the randomization period and confidence limit for the

UCL algorithm.

Measures

Optimization relies on an outcome evaluation metric [9]

that drives the decisions made between different design

configurations. In this study, our key outcome metric of

engagement is the number of estimates made, on average,

within each design variation. Choosing an appropriate

outcome variable is essential, because this metric will be

used to determine which conditions are promoted.

RESULTS

In Figure 3 and Table 1, we confirm H2 by showing that

players in the bandit conditions were more engaged; thus,

bandits reduced the cost of experimentation to subjects by

deploying fewer low-engagement conditions.

Figure 3: The bandit-based experiments garnered more student engagement, as measured by the total number of trials played.

The UCL and UCB bandit experiments produced 52% and 24% greater engagement, respectively, than the experiment involving

experiments is shown in Table 1 as the percent difference in

our outcome metric between the different experiments and

the optimal policy (e.g., Sub90, which had the highest

average engagement). This shows that the UCL-25

experiment (one of the bandit algorithms) achieved the

lowest regret of all experiments.

Meta-Experimental         Total Games   Total Trials   % Difference
Condition                 Logged        Played         from Optimal
Random Assignment         2,818         42,835         36%
UCB Assignment            2,896         53,274         20%
UCL-25 Assignment         2,931         65,206         2%
Optimal Policy (Sub90)    2,931*        66,534*        0%

Table 1: Comparison of the meta-experiment conditions. * Assumes 22.7 average trials for Sub90 and the same number of games logged. 8,645 total games logged out of 10,832 total served.
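As a quick arithmetic check, Table 1's regret column can be reproduced from its own figures, using the asterisked assumption of 22.7 average trials for Sub90 across 2,931 logged games:

```python
# Optimal total assumes every logged game had been Sub90 at 22.7 trials each.
optimal_total = 2931 * 22.7  # ~66,534 trials

totals_played = {"Random": 42835, "UCB": 53274, "UCL-25": 65206}
pct_from_optimal = {
    name: round((optimal_total - total) / optimal_total * 100)
    for name, total in totals_played.items()
}
```

The computed percentages (36%, 20%, 2%) match the table's regret column.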

H1, the hypothesis that bandits can automatically optimize

a UX design space, was confirmed with evidence presented

in Figure 4. These data show that random assignment

allocated subjects equally across all 6 conditions whereas both the UCB

bandit and the UCL bandit allocated subjects to conditions

preferentially. The reason for this unequal allocation is the

difference in the efficacy of the conditions for producing

player engagement, as seen in Figure 5.

Figure 4: Random assignment experimentation equally

allocates subjects whereas both bandit-based experiments

allocate subjects based on the measured efficacy of the

conditions. Total Games Played are the number of players

assigned to each design condition.

Figure 5 shows the means and 95% confidence intervals of

each arm in the three experiments. The Y-axis is our

measure of engagement: the average number of trials

played in the game within that condition. All experiments

identify Sub90 as the most engaging condition. In this

variant of the game, players needed to click to estimate a

number on the number line and their estimates needed to be

90% accurate to hit the submarine. All bandit-based

experiments deployed this condition far more than other

conditions (as seen in Figure 4).


Figure 5: The means and 95% confidence intervals of the

experimental conditions. Note that the Upper Confidence

Limit experiment focused data collection on the condition with

the highest upper confidence limit. Y-Axis represents

Engagement, as the total number of trials played.

Note the wide confidence intervals on the right side of

Figure 5: these are a result of insufficient data. As can be

seen in Figure 4, these conditions each had less than 30 data

points. However, if any condition's upper confidence limit were to exceed Sub90's in the UCL experiment, then UCL would deploy that condition again.

Figure 6: This graph shows the allocation of game sessions to

the different target sizes and the variation in the average

number of trials played over time.

EXPERIMENT 2

After running the first meta-experiment, our results clearly

supported the value of bandits. However, UCL never tested

additional design variations after “deciding” during the

randomization period that Sub90 was the most engaging

condition (Figure 6). While the superiority of Sub90 was confirmed by the random assignment experiment, this behavior does not illustrate the explore-exploit dynamic of a bandit

algorithm. Therefore, to introduce greater variation and

bolster our discussion, we ran a second meta-experiment.

In this meta-experiment, we compared random assignment

with two variations on the Upper Confidence Limit bandit.

We retained our use of the greedy 95% confidence limit

bandit but also added a more conservative 99.9%

confidence limit bandit. The difference between these two

is the parameter that is multiplied by the standard error and

added to the mean: for 95% we multiply 1.96 times the

standard error while for 99.9% we multiply 3.3. We expect

this more conservative and less greedy version of UCL to

be more effective because it is less likely to get stuck in a

local optimum.

H3: UCL-99.9% will tend to deploy a more optimal design

condition than UCL-95% as a result of greater exploration.

In the second experiment, we focused on submarines, which

we found to be more engaging than ships (or, at least, to result in more trials played). We eliminated the smallest

size (which was the worst performing) and added a broader

sample: 95%, 90%, 80%, 70%, 60% accuracy required for a

hit. The largest of these, requiring only 60% accuracy for a

hit, was actually 80% of the length of the number line.

Although we felt the targets were grotesquely large, they

actually represented a good scenario for using online

bandits for scientific research. We’d like to collect online

data to understand the optimal size, but we’d want to

minimize how many students were exposed to the

suboptimal conditions. We were sure that this target size

was too large, but could the bandits detect this?

H4: Bandits will deploy “bad” conditions less often than

random assignment.

Figure 7: Four of the five variations in target size in

experiment 2. The largest submarines (at top, 60 and 70)

appeared to be far too big and too easy to hit. However, they

were generally more engaging than the smaller sized targets.

RESULTS

In short, we found very mixed evidence for both H3 and

H4. The more conservative bandit did not appear to explore

longer (Figure 10) nor did it deploy a more optimal design

(Figure 9), although it did achieve greater engagement than

UCL-95 (Table 2). Additionally, while the bandits did

achieve less regret than the random assignment condition

(Table 2), the conditions deployed were so bad that we got

phone calls from Brainpop inquiring whether there was a

bug in our game.

In Figure 9, we show that the least chosen arm of UCL-95%

was the most chosen arm of UCL-99.9% -- and vice versa!

Moreover, the second most-picked design of both bandits

was the very largest target, which seemed to us to be far too

large. Finally, the optimal arm, according to random


assignment, was hardly in the running inside the bandit

experiments. What accounts for this finding?

Meta-Experimental         Total Games   Total Trials   % Difference
Condition                 Logged        Played         from Optimal
Random Assignment         1,950         46,796         21%
UCL-95%                   1,961         47,312         20%
UCL-99.9%                 1,938         49,836         14%
Optimal Policy (Sub70)    1,950*        56,628*        0%

Table 2: Comparison of each experiment to the optimal policy (Sub70, according to the random assignment experiment). * Assumes 29.04 average trials for Sub70, as observed in the random condition.

Figure 8: The means and confidence intervals of each

condition within each experiment. The random assignment

experiment reveals an inverted U-shaped response, with the

largest performing less effectively than the second largest.

Note the major discrepancies in the order of the items between

the different experiments.

Figure 9: The “bad design” (Sub60) did surprisingly well in

the bandits, where it was assigned second most often – and far

less than the optimal design, according to random assignment.

First, all of the design variations in this experiment

performed reasonably well (Figure 8) on our engagement

metric, even the “bad” conditions. Without major

differences between the conditions, all experimental

methods performed relatively well. But, digging in, there

were other problems that have to do with the dangers of

sequential experimentation.

Figure 10 shows the allocation of designs over time. The X-axis represents game sessions over time (i.e., “arm pulls,” as each game session requests a design from the server). The smoother line shows the mean number of trials played over time (i.e., engagement over time). Note that the randomly assigned mean varies significantly over time, by nearly 50%! This reflects the “seasonality” of the data; for instance, the dip in engagement between sessions 2000 and 3000 represents data collected after school ended, into the evening, and the rise around 3000 represents the shift back to the morning school day.

Figure 10: For meta-experiment 2, this graph shows the

allocation of game sessions to the different target sizes and the

variation in the average number of trials played over time.

Therefore, the average player engagement in the bandits can be expected to be affected by the same seasonal factors as those affecting random assignment, yet also to vary for other reasons. In particular, the average engagement of these bandits will be affected by the different conditions they are deploying. Possible evidence of this phenomenon can be noted in the concurrent dip in engagement around session 4200 – a dip that is not present in random assignment. This dip occurs during the morning hours, so one plausible explanation is that the morning population is not as engaged by the designs deployed by the bandits as other populations are. It could be, for example, that students in the morning are more engaged by learning-related challenges, or that the lack of challenge presented by the large ships is unappealing. This might be because students in the morning have greater ability than those who play in the evening.

In both meta-experiments, all bandit algorithms tended to test different design conditions in bursts. Note, for instance, that the 70% submarine condition was explored briefly by UCL-95% around sessions 2500-3000 – yet this was the period when all conditions were faring worst. This condition thus had the bad luck of being sampled at a time when any condition would have had a low mean. This shows the limitations of sequential experimentation in the context of time-based variations.
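This hazard is easy to reproduce in simulation. The sketch below is a hypothetical two-arm setup (not our game data): both arms have identical true engagement, but each is sampled in a burst during a different “season,” and their estimates pull far apart.

```python
import random

random.seed(0)

# Hypothetical example: two arms with IDENTICAL true engagement (20 trials
# per session), plus a shared seasonal factor that boosts or halves
# engagement depending on the time block (day vs. night).
TRUE_MEAN = 20.0

def play_session(t):
    seasonal = 1.5 if (t // 1000) % 2 == 0 else 0.5
    return random.gauss(TRUE_MEAN * seasonal, 2.0)

samples = {"A": [], "B": []}
# Sequential bursts, as a bandit produces them: arm B is sampled only in
# the high-engagement block, arm A only in the low-engagement block.
for t in range(2000):
    arm = "B" if (t // 1000) % 2 == 0 else "A"
    samples[arm].append(play_session(t))

mean_a = sum(samples["A"]) / len(samples["A"])
mean_b = sum(samples["B"]) / len(samples["B"])
# Although the arms are truly identical, the burst-sampled estimates
# differ by roughly a factor of three.
print(f"arm A (night burst): {mean_a:.1f}")
print(f"arm B (day burst):   {mean_b:.1f}")
```

A randomized design interleaves both arms through every season, so the seasonal swing cancels out of the comparison; burst sampling does not.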

We cannot confirm our hypothesis that the UCL-99.9%

bandit actually explored more than UCL-95%. Visually, it

appears that UCL-99.9% was busy testing more conditions


more often than UCL-95%. However, both bandits fell into

a local optimum at roughly the same time, based on when

each began exclusively deploying one condition.

GENERAL DISCUSSION

Our studies explore the optimization of the design space of

an educational game using three different multi-armed

bandit algorithms. Supporting the potential advantage of

using bandit algorithms over standard random experimental

design, bandits achieved higher student engagement during

the course of the experiment.

However, we were surprised to find that the Sub60 condition (the absurdly large target shown in Figure 7) was one of the most engaging conditions. This condition was included specifically to test the ability of the bandits to identify it as a poor condition – yet Sub60 turned out to be one of the top-performing conditions! This discrepancy indicates that our metric for engagement (number of trials played) may be the wrong metric – or that we, as designers, are wrong about what the users want.

Our results also point to practical issues that must be

understood and resolved prior to adopting bandit-based

optimization over random assignment. While we expect that

over the long term, bandits will fluidly adjust to periodic

variations, the temporal non-stationarity of the data is likely

to reduce their efficacy.

The bandits collected more data about leading design

conditions than random assignment, resulting in tighter

confidence intervals around the means. Indeed, the

tightening of confidence intervals (or bounds) is the

primary mechanism for balancing exploration-exploitation

in this paper. However, these tightened confidence intervals

do not indicate that the bandit estimates are more accurate,

at least in the short term. The tightening of the confidence

intervals is deceptive, as the data are not collected

simultaneously – a violation of assumptions in the

underlying statistics. The next section illustrates why the

non-simultaneous collection of data impacts the ability of

bandits to make valid inferences from their exploration.

Dangers of sequential experimentation in the real world

Time-based variations appear to significantly affect bandit

exploration validity. During the randomization period

(roughly, from 500-1500 in Figure 10), the UCL-95%

bandit gave the 60% submarine the highest mean and the

90% submarine one of the lowest. After the randomization

period, the bandit continued to allocate most users to the

60% submarine. Average engagement was lower during the night than during the day, and because the 60% submarine was being deployed heavily at night, its estimates were pushed down. Then, because few players had played the 90% submarine, the bandit started allocating players to it during the day, when the average mean had increased. This pushed the mean for the 90% submarine higher than that of any other arm.

These fluctuations due to contextual time-of-day factors have a much bigger effect on the bandits than on random assignment. So, even though much more data was collected about particular arms, that data is not necessarily more trustworthy than the data collected by random assignment. While it is

likely that conservative UCB-style bandits would

eventually identify the highest performing arm if they were

run forever, these time-based variation effects can

significantly reduce their performance.

Constructing bandits that operate using confidence intervals

is conceptually similar to running experiments and

constantly testing for significance and then running the

condition that is, at the time, significant. However,

significance tests assume that the sample size was fixed in advance. While our bandits were “riding the wave of significance,” they were susceptible to becoming overconfident in the present significance of a particular

condition. This is a major problem in contemporary A/B

testing, as well [11].
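The “wave of significance” problem can also be reproduced directly. The A/A simulation below is an illustrative sketch (not our experimental pipeline): both conditions share the same true mean, a z-test is applied after every batch, and the experiment stops at the first “significant” result. The false positive rate lands far above the nominal 5%.

```python
import math
import random

random.seed(3)

def peeking_false_positive_rate(n_max=300, peek_every=10, trials=300):
    """A/A test: both conditions share the same true mean, so every
    'significant' stop is a false positive. Peeking after each batch and
    stopping at the first |z| > 1.96 inflates the error rate."""
    hits = 0
    for _ in range(trials):
        a, b = [], []
        for n in range(1, n_max + 1):
            a.append(random.gauss(0.0, 1.0))
            b.append(random.gauss(0.0, 1.0))
            if n >= 20 and n % peek_every == 0:
                ma, mb = sum(a) / n, sum(b) / n
                va = sum((x - ma) ** 2 for x in a) / (n - 1)
                vb = sum((x - mb) ** 2 for x in b) / (n - 1)
                z = (ma - mb) / math.sqrt((va + vb) / n)
                if abs(z) > 1.96:  # "ride the wave of significance"
                    hits += 1
                    break
    return hits / trials

rate = peeking_false_positive_rate()
print(f"false positive rate with peeking: {rate:.0%}")  # well above 5%
```

A bandit that deploys whichever arm is “currently significant” is making essentially this mistake continuously.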

Limitations

Our goal was to run an empirical study to illustrate how bandits work, not to advance the algorithms themselves; more powerful algorithms could certainly have been used. For

instance, our bandit algorithms did not allow for the sharing

of information between arms or make use of previously

collected data, which would be especially useful for

continuous game design factors (such as the size of our

targets) or larger design spaces. To this end, we might have

considered approaches making use of fractional factorial

designs, linear regressions or Scott’s suggestion of a “multi-

armed mafia” [30]. Recent work combining Thompson

Sampling and Gaussian Processes is promising [15].

Our bandit experiments ran for just a few days in total. All

A/B tests will be susceptible to “seasonality” effects, not

just bandits. Therefore it is always preferable to run a

smaller proportion of concurrent users for a longer period

of time than a large proportion of users for a shorter time.

Kohavi recommends running A/B tests for at least two

weeks to account for seasonality [18]. We did not model the

benefits of these algorithms over time. Seasonality will

differentially affect different arms in the bandit condition,

but not in the random condition. If the bandit is too greedy

(like the 95% confidence interval bandit in the first

experiment), then this could have serious effects on the

optimality of the arms chosen.

Unlike UCB1, both UCL bandits had a tendency to fall into a local maximum for a long period of time, without exploration. This is likely a property of UCL itself, rather than UCB simply being more conservative in its approach: as N (the total number of arm pulls) increases, UCB's confidence bounds on rarely played arms grow, eventually forcing exploration, whereas the UCL confidence intervals have no natural tendency to grow.
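The contrast between the two index rules can be sketched as follows (a minimal illustration with made-up numbers, not the exact parameterization of our deployed bandits):

```python
import math

def ucb1_index(mean, n_arm, n_total):
    # UCB1: exploration bonus sqrt(2*ln(n_total)/n_arm). Because it grows
    # with ln(n_total), a neglected arm's index keeps rising until the
    # bandit is forced to revisit that arm.
    return mean + math.sqrt(2.0 * math.log(n_total) / n_arm)

def ucl_index(mean, sd, n_arm, z=1.96):
    # Upper confidence limit of a z-based interval (z = 1.96 ~ 95%).
    # It depends only on that arm's own samples, so a neglected arm's
    # index never changes and nothing forces re-exploration.
    return mean + z * sd / math.sqrt(n_arm)

# A neglected arm (10 pulls, sample mean 15) as total pulls accumulate:
for n_total in (100, 10_000, 1_000_000):
    print(f"UCB1 index at n_total={n_total:>9}: "
          f"{ucb1_index(15.0, 10, n_total):.2f}")
print(f"UCL index (static): {ucl_index(15.0, 8.0, 10):.2f}")
```

This is why, as total plays grow, UCB1 eventually re-tests every arm, while a UCL bandit can keep exclusively deploying an early leader indefinitely.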

The data in our sample are not normal; they follow a

distribution that may be negative binomial or beta. While

the UCB1 algorithm does not rely on an underlying

assumption of normality, both UCL algorithms do.

However, by making stronger parametric assumptions about the payoff distributions, we achieved good online results without requiring additional online tuning. Indeed, if not for our desire to provide clear results for publication, we likely would have increased the confidence limit parameter to make the bandits even less conservative.

In support of the broader research community, we intend to

make the data from these bandit experiments available

online at pslcdatashop.org [17]. Given that seasonality

affected the performance of the bandit algorithms, having

real-world data may be useful for others who seek to

contribute more reliable bandit algorithms for UI

optimization.

Design Lessons

We present this work to help designers understand some of

the dangers of automated experimentation. Our results show

how easy it is to optimize for the wrong metric. Indeed, it is

not that maximizing engagement is wrong, per se; however,

when taken to the extreme, it produces designs that are

unquestionably bad. (How do we know that the large

designs are objectively bad, and not just a matter of taste?

Well, we got a call from Brainpop.com, telling us that they

were getting emails from teachers asking about a broken

game).

Bandits are very capable of optimizing for a metric – but if

this is not the best measure of optimal design, the bandit

can easily optimize for the wrong outcome. For example, it

is much easier to measure student participation than it is to

measure learning outcomes, but conditions that are most

highly engaging are often not the best for learning (e.g.,

students perform better during massed practice, but learn

more from distributed practice [16]). In our study, the super

large ship was engaging, but was unlikely to create the best

learning environment [26].

With an increase in data-driven design, it is important that

designers develop a critical and dialectical relationship to

optimization metrics. We recommend making it as easy as

possible for designers to personally experience design

variations as they are optimized so that metrics alone are

not the sole source of judgment. While this may interfere

with the desire for automated design, it is nevertheless

critical to ensure that the automation is designing for the

right outcome.

In the original design of Battleship Numberline, the

assumed optimal design of the target size was 95%

accuracy – as in, this was delivered as a final design.

However, this level of difficulty turned out to be significantly less engaging than other levels. This suggests that, during a design process, designers can also provide parameters for design space exploration, in addition to providing a single, assumed-optimal design. These “Fuzzy Designs” can then be explored and optimized using online data.

Finally, we recommend using bandits that make greater use of randomization. Even epsilon-greedy bandits, while inefficient, are likely to be far more robust in practice than bandits that do not continue to randomly sample to some extent.

Randomized Probability Matching and Thompson

Sampling are also likely to be less susceptible to seasonality

effects. In part, this is because they will tend to continue to

test all arms, but test the lower performing arms with a

lower probability. As a result, during “seasonal” variation,

more conditions will get a chance to shine or dim within

that variation.
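Both recommendations amount to never letting any arm's sampling probability reach zero. A minimal sketch of the two selection rules follows (illustrative parameters only; Thompson Sampling is shown for Bernoulli-style rewards rather than our trials-played metric):

```python
import random

def epsilon_greedy_choose(means, epsilon=0.1):
    # Exploit the current best arm, but keep a fixed epsilon chance of a
    # uniform random pull, so every arm keeps being sampled through
    # seasonal swings.
    if random.random() < epsilon:
        return random.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])

def thompson_choose(successes, failures):
    # Thompson Sampling (Bernoulli rewards): draw a plausible success
    # rate for each arm from its Beta posterior and play the argmax.
    # Weak arms keep getting tested, just with lower probability.
    draws = [random.betavariate(1 + s, 1 + f)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])

random.seed(1)
eg_picks = [epsilon_greedy_choose([10.0, 20.0]) for _ in range(1000)]
print("epsilon-greedy share of weaker arm:", eg_picks.count(0) / 1000)

ts_picks = [thompson_choose([8, 12], [12, 8]) for _ in range(1000)]
print("Thompson share of weaker arm:", ts_picks.count(0) / 1000)
```

In both rules the weaker arm retains a nonzero share of pulls, so a seasonal shift that favors it can still be detected.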

CONCLUSION

The purpose of this paper is to provide empirical evidence

that can help designers understand the promise and perils of

using “multi-armed bandits” for interface optimization. To

be clear, this paper does not intend to introduce a newer,

better, faster bandit algorithm!

In summary, we show how user interface design can be

framed as a multi-armed bandit problem. In two large-scale

online meta-experiments, we illustrate new methods for

optimizing game UI using crowd-sourced data collected

from players. We then empirically evaluate the online

application of multi-armed bandit optimization methods in

a real-world context involving the maximization of student

engagement in an educational game.

We demonstrate that bandit-based experimental design

systems can automatically optimize a design space based

upon an established outcome metric. In the process, we

show that bandits can maximize the utility of

experimentation to students by minimizing exposure to

low-value game design configurations. By balancing

exploration and exploitation, bandits help maximize the

benefits to both designers and users.

However, our data illustrate several major challenges for

data-driven designers. First of all, we show that

optimization methods that do not involve random

assignment have the potential to cause issues with the

validity of data. Secondly, we show that automatic

optimization is susceptible to optimizing the design for the

wrong thing.

We provide mixed evidence about the ability of bandits to

reduce the cost of experimentation in a data-driven design

process. Part of our goal was to eliminate the need to

constantly analyze data and make judgments about

statistical significance. While bandits address this need

nicely, we discovered that monitoring and human judgment

may be necessary to ensure that bandits are actually

optimizing for the right metric. While this is important in

A/B testing as well, bandits create a stronger potential for

an experiment to automatically optimize for the wrong

thing.

Ethical considerations of online research

There are significant ethical issues that accompany large-

scale online experiments. Recent online experiments (viz.

the infamous “Facebook Mood Experiment” [20]) caused

global distress, in large part due to a perceived conflict

between the pursuit of basic scientific knowledge and the

best interests of unknowing users.

To this end, this paper addresses one specific ethical issue

that is present with online and offline educational research:

the notion of “fairness” in experimentation. Experimental

education research necessarily requires that some subjects

be given access to certain educational resources that are not

available to other subjects. Even cross-over experimental

designs (where all subjects get access to the same resources

but in a different order) can be problematic for students

feeling “left out”. While experiments are fundamentally

required to determine which resources cause what

outcomes, it is not always necessary to continue

experiments to the end.

In a typical experiment involving random assignment, each

condition receives the same number of subjects, as the

“best” condition is not known a priori. In contrast, as it

becomes likely that some conditions are better than others,

bandit algorithms could selectively allocate subjects to

these higher performing conditions. As a result, these

algorithms may deliver greater benefits to participants in

large online studies.

Future Opportunities

Whereas random assignment optimizes for the precision of

measurement, bandit algorithms optimize for outcomes.

Bandit algorithms, therefore, have the potential to improve

upon random assignment within applied research settings

where the goal is to find "the best design", rather than

measuring precisely how bad the alternatives are. However,

bandits have also been shown to have applications in

scientific research [24].

For the future oriented, it may be helpful to view bandits as

simple machine learning algorithms that represent the early

stages of complex Artificially Intelligent systems. With this

view, one may assume that automated explorations of

design spaces will achieve greater and greater capabilities

with time. For instance, our bandit algorithms treated

continuous variables as discrete variables; if we sought to

test a slightly different target size, the algorithm would need

to start collecting data from scratch. Similarly, our

algorithms were unable to take advantage of daily

population changes in engagement, where players appeared

more engaged with difficult games earlier in the morning.

Eventually, optimization algorithms may be able to identify

such patterns.

While computer scientists will surely develop more

advanced algorithms, in other areas of AI system design, it

is often useful to think about designing the larger human-

computer system. Don Norman has described this as the

“human-technology teamwork framework.” According to

this framework, designers should aim to integrate human

and computer capabilities by first identifying the unique

capabilities of each. Therefore, future work should consider

specific ways in which user data and experimentation

systems can best leverage human contextual understanding

as well as AI-based search. Such human-AI systems are

predicted to be more capable of generating effective

interface optimizations and innovations than human or AI

systems alone.

While this paper focused on the use of online

experimentation to drive interface optimization (applied

research), there are also unique opportunities for scientific

discovery. In the learning sciences, for instance, with

millions of learners engaged in online digital learning

environments, scientists can now conduct learning science

experiments on a massive scale. Indeed, pure optimization

experiments may well lead to new generalizable insights by

accident [26]. This may lead to a deeper integration of

basic science (i.e., improving theory) and applied research

(i.e., improving user outcomes).

However, it is unclear how the scale of online research will

be adopted by the scientific community. While recruiting

100 students to participate in an educational experiment can

take weeks, online research can provide thousands of

subjects in a single day [25,33]. This massive scale poses

new challenges to scientists that wish to efficiently benefit

from this vast increase in experimental data. For instance,

how would a typical psychology research group change

their approach if they had a thousand subjects show up to

their laboratory every day, for a year?

Future researchers who build on existing work [24] with

bandits in scientific discovery may also find it useful to

consider how the emerging framework of “Human-

Technology Teamwork” can guide the integration of

artificial intelligence with scientific inquiry. Eventually,

such work may even contribute solutions to the problem of

writing good scientific papers. After all, even if a scientific

team could design and run a meaningful online experiment

every single day (perhaps with the aid of more complex

multi-armed bandits), how could they ever hire enough

graduate students to write the papers? Alas.

But, putting aside the issues of scientific contribution, it is

worth considering what it would mean to have AI-human

systems conducting thousands of experiments to determine

what interface designs are most engaging and attractive to

users. Wouldn’t this lead to computer interactions that are

so addictive and compelling that they consume most of the

average person’s waking hours? In some ways, of course,

Facebook, Google, and other major corporations have

already become these AI-human systems. So, if their

interfaces are compelling now, just imagine what is to

come.

ACKNOWLEDGMENTS

REFERENCES

1. Agarwal, D., Chen, B.C., and Elango, P. (2009)

Explore/Exploit Schemes for Web Content

Optimization. Ninth IEEE International Conference on

Data Mining, 1–10.

2. Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002)

Finite-time Analysis of the Multiarmed Bandit Problem.

Machine Learning, 235–256.

3. Berry, D. (2011) Adaptive Clinical Trials: The Promise and the Caution. Journal of Clinical Oncology 29, 6. 603–606.

4. Brezzi, M. and Lai, T.L. (2002) Optimal learning and

experimentation in bandit problems. Journal of

Economic Dynamics and Control 27, 1. 87–108.

5. Card, S., Mackinlay, J., & Robertson, G. (1990). The

design space of input devices. ACM CHI

6. Chapelle, O., & Li, L. (2011). An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (pp. 2249–2257).

7. Drachen, A. and Canossa, A. (2009) Towards Gameplay

Analysis via Gameplay Metrics. ACM MindTrek, 202–

209.

8. Fogarty, J., Forlizzi, J., and Hudson, S.E. (2001)

Aesthetic Information Collages: Generating Decorative

Displays that Contain Information. ACM CHI

9. Gajos, K., & Weld, D. S. (2005). Preference elicitation

for interface optimization. ACM UIST (pp. 173-182).

10. Gajos, K., Weld, D., and Wobbrock, J. Decision-

Theoretic User Interface Generation. AAAI, (2008),

1532–1536.

11. Gittins, J. (1979) Bandit Processes and Dynamic

Allocation Indices. Journal of the Royal Statistical

Society. Series B., 148–177.

12. Glaser, R. (1976). Components of a psychology of

instruction: Toward a science of design. Review of

Educational Research, 46(1), 1–24.

13. Hacker, S. (2014) Duolingo: Learning a Language

While Translating the Web. PhD Thesis, Carnegie

Mellon University School of Computer Science. May

2014

14. Hauser, J.R., Urban, G.L., Liberali, G., and Braun, M.

(2009) Website Morphing. Marketing Science. 28, 2,

202–223.

15. Khajah, M., Roads, B. D., Lindsey, R. V, Liu, Y., &

Mozer, M. C. (2014). Designing Cognitive-Training

Games to Maximize Engagement, 1–9.

16. Koedinger, K. R., Booth, J. L., Klahr, D. (2013)

Instructional Complexity and the Science to Constrain It

Science. 22 November 2013: Vol. 342 no. 6161 pp. 935-

937

17. Koedinger, K. R., Baker, R. S., Cunningham, K.,

Skogsholm, A., Leber, B., & Stamper, J. (2010). A data

repository for the EDM community: The PSLC

DataShop. Handbook of educational data mining, 43.

18. Kohavi, R., Longbotham, R., Sommerfield, D., and

Henne, R.M. (2008) Controlled experiments on the web:

survey and practical guide. Data Mining and Knowledge

Discovery 18, 1 140–181.

19. Kohavi, R., Deng, A., Frasca, B., Longbotham, R.,

Walker, T., and Xu, Y. (2012) Trustworthy Online

Controlled Experiments: Five Puzzling Outcomes

Explained. KDD

20. Kramer, Adam DI, Jamie E. Guillory, and Jeffrey T.

Hancock. (2014) Experimental evidence of massive-

scale emotional contagion through social networks.

PNAS

21. Lai, T. (1987) Adaptive treatment allocation and the

multi-armed bandit problem. The Annals of Statistics;

15(3):1091–1114.

22. Lai, T., & Robbins, H. (1985). Asymptotically efficient

adaptive allocation rules. Advances in Applied

Mathematics, 6, 4–22.

23. Li, L., Chu, W., Langford, J., & Schapire, R.E. (2010) A

Contextual-Bandit Approach to Personalized News

Article Recommendation. WWW

24. Liu, Y., Mandel, T., Brunskill, E., & Popovic, Z. (2014)

Trading Off Scientific Knowledge and User Learning

with Multi-Armed Bandits. Educational Data Mining

25. Liu, Y., Mandel, T., Brunskill, E., & Popovic, Z. (2014)

Towards Automatic Experimentation of Educational

Knowledge. ACM CHI

26. Lomas, D., Patel, K., Forlizzi, J. L., & Koedinger, K. R.

(2013) Optimizing challenge in an educational game

using large-scale design experiments. ACM CHI

27. Lomas, D. Harpstead, E., (2012) Design Space

Sampling for the Optimization of Educational Games.

Game User Experience Workshop, ACM CHI

28. Lomas, J. D. (2014). Optimizing motivation and

learning with large-scale game design experiments

(Unpublished Doctoral Dissertation). HCI Institute,

Carnegie Mellon University.

29. Norman, D. (in preparation) Technology or People:

Putting People Back in Charge. Jnd.org

30. Scott, S. (2010) A modern Bayesian look at the multi-

armed bandit. Applied Stochastic Models in Business

and Industry, 639–658.

31. Scott, S. (2014) Google Content Experiments. https://support.google.com/analytics/answer/2844870?hl=en&ref_topic=2844866

32. Simon, H. (1969). The Sciences of the Artificial. Cambridge, MA: MIT Press.

33. J. C. Stamper, D. Lomas, D. Ching, S. Ritter, K. R.

Koedinger, and J. Steinhart. (2012) The rise of the super

experiment. EDM p. 196–200

34. Vermorel, J. and Mohri, M. (2005) Multi-armed bandit

algorithms and empirical evaluation. Machine Learning:

ECML 2005, 437–448.

35. Yannakakis, G. N., & Hallam, J. (2007). Towards

optimizing entertainment in computer games. Applied

Artificial Intelligence, 21(10), 933-971.