Towards Automated A/B Testing
Giordano Tamburrelli and Alessandro Margara
Faculty of Informatics. University of Lugano, Switzerland.
{giordano.tamburrelli |alessandro.margara}@usi.ch
Abstract. User-intensive software, such as Web and mobile applications, heavily depends on the interactions with large and unknown populations of users. Knowing the preferences and behaviors of these populations is crucial for the success of this class of systems. A/B testing is an increasingly popular technique that supports the iterative development of user-intensive software based on controlled experiments performed on live users. However, as currently performed, A/B testing is a time-consuming, error-prone, and costly manual activity. In this paper, we investigate a novel approach to automate A/B testing. More specifically, we rephrase A/B testing as a search-based software engineering problem and we propose an initial approach that supports automated A/B testing through aspect-oriented programming and genetic algorithms.
1 Introduction
Modern software systems increasingly deal with large and evolving populations
of users that may issue up to millions of requests per day. These systems are
commonly referred to as user-intensive software systems (e.g., Web and mobile
applications). A key distinguishing feature of these systems is the heavy depen-
dence on the interactions with many users, who approach the applications with
different needs, attitudes, navigation profiles, and preferences (we collectively identify these factors under the term user preferences).
Designing applications that meet user preferences is a crucial factor that
may directly affect the success of user-intensive systems. Underestimating its
importance can lead to substantial economic losses. For example, an inadequate
or distorted knowledge of user preferences in a Web application can lead to an
unsatisfactory user experience with consequent loss of customers and revenues.
Domain experts typically provide valuable insights concerning user prefer-
ences that engineers can exploit to obtain an effective design of user-intensive
applications. Unfortunately, this information could be inaccurate, generic, and
obsolete. In practice, it is almost impossible to design applications that accu-
rately capture all possible and meaningful user preferences upfront.
As a consequence, engineers typically design a user-intensive application re-
lying on the initial available knowledge while, at run-time, they continuously
monitor, refine, and improve the system to meet newly discovered user prefer-
ences. In this context, engineers increasingly rely on A/B testing [11] (also known as randomized experiments, split tests, or control/treatment; in this paper we always use the term A/B testing) to evaluate
and improve their applications. In A/B testing, two distinct variants (i.e., vari-
ant A and B) of the same application are compared using live experiments. Live
users are randomly assigned to one of the two variants and some metrics of inter-
est (e.g., the likelihood for a user to buy in an e-commerce Web application) are
collected. The two variants are compared based on these metrics, and the best
one is selected, while the other is discarded. The iterative development of vari-
ants and their comparative evaluation through live experiments allow designers
to gradually evolve their applications maximizing a given metric of interest. For
example, an e-commerce application may be refactored adopting variants that
maximize sales, while a mobile application may be refactored adopting variants
that maximize the advertisements’ views.
A/B testing is being increasingly adopted by industry and has proved to be effective [3]. Still, it suffers from several limitations. Indeed, conceiving, running, and summarizing the results of A/B tests is a difficult, tedious, error-prone, and costly manual activity [5]. This paper tackles this issue by laying the foundations of an automated A/B testing framework in which the generation of application variants, their run-time evaluation, and the continuous evolution of the system are obtained automatically by casting the process of A/B testing as a Search-Based Software Engineering (SBSE) [7] problem. This novel viewpoint on A/B testing brings to the table several research challenges, defined and discussed in the paper.
The contribution of this paper is twofold. First, it lays the foundations and
explores the potential of automated A/B testing as an optimization problem.
Specifically, it proposes an initial approach based on aspect-oriented program-
ming [8] and genetic algorithms [13], which can be considered as a primer to
demonstrate the feasibility of the concepts introduced in the paper and a first
concrete step towards their practical application. Second, it provides the SBSE
community with a novel and crucial domain where its expertise can be applied.
The remainder of the paper is organized as follows. Section 2 provides a more
detailed introduction to A/B testing and discusses some open issues. Section 3
rephrases the process of A/B testing as an optimization problem. Next, Section 4
reifies the illustrated concepts in the context of user-intensive Web applications
with a solution based on aspect-oriented programming and genetic algorithms.
Section 5 presents some preliminary results. Finally, Section 6 surveys related
work and Section 7 draws some conclusions and discusses future work.
2 Background and Problem Statement
This section introduces A/B testing, partially recalling the definition reported
in [11]. Next, it points out some of the existing limitations of A/B testing and
discusses the need for automating it.
The diffusion and standardization of Web technologies and the increasing
importance of user-intensive software represent a perfect playground to evaluate
competing alternatives, ideas, and innovations by means of controlled experi-
ments, commonly referred to as A/B tests in this context. The overall process
of A/B testing is exemplified in Fig. 1.

Fig. 1. A/B testing iterative process: the programmer iteratively develops variants, compares them in experiments with live users, and selects the winning variant of each experiment for the next iteration.

Live users are randomly assigned to one of two variants of the system under analysis: variant A (i.e., the control vari-
ant), which is commonly the current version, and variant B (i.e., the treatment
variant), which is usually a newer version of the system being evaluated. The
two variants are compared on the basis of some metrics of interest related to the
user preferences. The variant that shows a statistically significant improvement
is retained, while the other is discarded. As previously mentioned, the iterative
development of variants and their comparative evaluation through live controlled
experiments allow designers to gradually evolve their applications maximizing
the metrics of interest.
Even if widely and successfully adopted in industry [3,10], A/B testing is still considered by the majority of developers a complex, handcrafted activity rather than a well-established software engineering practice. Indeed, conceiving, running, and summarizing the results of A/B tests is a difficult, tedious, and costly manual activity. More precisely, accurate and consistent A/B testing demands several complex engineering decisions and tasks. The most relevant ones are illustrated hereafter.
1. Development and deployment of multiple variants. A/B testing requires continuous modification and deployment of the application codebase to implement and evaluate variants. These variants are deployed and monitored concurrently, each serving a certain percentage of users at the same time.
2. What is a variant. Programs may be customized along different lines. Because
of this, a critical choice for developers is the selection of how many and which
aspects of the program to change when generating a new variant.
3. How many variants. We defined A/B testing as the process of concurrently deploying and evaluating two variants of the system. In the general case, developers may concurrently deploy more than two variants. However, they typically do not have evidence to select this number effectively and to tune it over time.
4. How to select variants. As previously mentioned, A/B testing works itera-
tively. At the beginning of each iteration, developers have to decide which vari-
ants to deploy and test. Prioritizing certain variants is critical for quickly finding
better program configurations. However, selecting the most promising variants
is also difficult, especially for large and complex programs.
5. How to evaluate variants. A sound and accurate comparison of variants in
live experiments with users requires mathematical skills (e.g., statistics) that
developers do not necessarily have. For example, sizing the duration of tests and
the number of users involved is a crucial factor that affects the quality of results.
6. When to stop. Usually, A/B testing enacts a continuous adaptation process
that focuses on certain specific aspects of a program. Understanding when a
nearly optimal solution has been reached for those aspects is critical to avoid
investing time and effort on changes that provide only a minimal impact on the
quality of the program.
Not only do the factors mentioned above represent concrete obstacles for developers, but they also characterize A/B testing as an error-prone process that may yield unexpected, counter-intuitive, and unsatisfactory results (see for example [9,12]). To facilitate the adoption of A/B testing and to avoid potential errors, we believe it is a crucial goal of software engineering research to provide developers with a set of conceptual foundations and tools aimed at increasing the degree of automation in A/B testing. So far, to the best of our knowledge, this research direction has received little attention.
3 A/B Testing as an Optimization Problem
In this section we take a different and novel perspective on A/B testing, rephras-
ing it as an optimization problem. The conceptual foundations and the notations
introduced hereafter are used in the remainder of this paper.
1. Features. From an abstract viewpoint, a program $p$ can be viewed as a finite set of features: $F_p = \{f_1, \ldots, f_n\}$. Each feature $f_i$ has an associated domain $D_i$ that specifies which values are valid/allowed for $f_i$. The concept of feature is very broad and may include entities of different nature and at different levels of abstraction. For example, a feature could be a primitive integer value that specifies how many results are displayed per page in an e-commerce Web application. Similarly, a feature could be a string that specifies the text applied to a certain button in a mobile application. However, a feature can also represent more abstract software entities, such as a component in charge of sorting some items displayed to the user. The features above are associated to the domains of integers, strings, and sorting algorithms, respectively.
2. Instantiation. An instantiation is a function $I_{p,f_i} : f_i \to D_i$ that associates a feature $f_i$ in $F_p$ with a specific value from its domain $D_i$. Two key concepts follow: (i) to obtain a concrete implementation for a program $p$, it is necessary to specify the instantiations for all the features in $p$; (ii) the specification of different instantiations yields different concrete implementations of the same abstract program $p$.
3. Variants. We call a concrete implementation of a program $p$ a variant of $p$. As a practical example, recalling the features exemplified above, three possible instantiations may assign: (i) 10 to the feature that specifies the number of items displayed per page, (ii) the label "Buy Now" to the button, (iii) an algorithm that sorts items by their name to the sorting component. These instantiations define a possible variant of the system.
4. Constraints. A constraint is a function $C_{i,j} : D_i \to \mathcal{P}(D_j)$ that, given a value $d_i \in D_i$ for a feature $f_i$, returns a subset of values in $D_j$ that are not allowed for the feature $f_j$. Intuitively, constraints can be used to inhibit combinations of features that are not valid in the application domain. For example, consider a Web application including two features: font color and background color. A developer can use constraints to express undesired combinations of colors. We say that a variant of a program $p$ satisfies a constraint $C_{i,j}$ if $I_{p,f_j} \notin C_{i,j}(I_{p,f_i})$. A variant is valid for $p$ if it satisfies all the constraints defined for $p$.
5. Assessment Function. An assessment function is a function $o : V_p \to \mathbb{R}$, where $V_p = \{v_1, \ldots, v_m\}$ is the set of all possible variants for program $p$. This function associates to each and every variant of a program a numeric value, which indicates the goodness of the variant with respect to the goal of the program. The assessment function depends on the preferences of users and can only be evaluated by monitoring variants at run-time. As previously mentioned, the likelihood for a user to buy is a valid assessment function for an e-commerce Web application. Indeed, this metric is evaluated at run-time for a specific variant of the application and concretely measures its goodness with respect to the ultimate goal of the program (i.e., selling goods): higher values indicate better variants.
Given these premises, we can rephrase A/B testing as a search problem as follows. Given a program $p$ characterised by a set of features $F_p$, a set of constraints $C_p$, and an assessment function $o$, find the variant $\hat{v} \in V_p$ such that $\hat{v}$ is valid and maximizes $o$:

$$\hat{v} = \arg\max_{v \in V_p} o(v)$$
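To make the notation concrete, the following block instantiates the definitions above on the running example (items per page, button label, sorting component, font and background colors); the specific domains and the color constraint are our own illustrative choices, not taken from the paper.

\[
\begin{aligned}
F_p &= \{\, f_{\mathit{perPage}},\; f_{\mathit{label}},\; f_{\mathit{sort}},\; f_{\mathit{font}},\; f_{\mathit{bg}} \,\}\\
D_{\mathit{label}} &= \{\text{``Check Out''},\, \text{``Buy''},\, \text{``Buy Now!''}\}, \qquad
D_{\mathit{sort}} = \{\text{SortByName},\, \text{SortByPrice}\}\\
I_{p,f_{\mathit{perPage}}} &= 10, \qquad I_{p,f_{\mathit{label}}} = \text{``Buy Now''}, \qquad I_{p,f_{\mathit{sort}}} = \text{SortByName} \qquad \text{(one variant $v$)}\\
C_{\mathit{font},\mathit{bg}}(\text{white}) &= \{\text{white}\} \qquad \text{(a white font inhibits a white background)}\\
o(v) &= \text{likelihood that a user buys when served variant } v
\end{aligned}
\]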
4 Towards Automated A/B Testing
Section 2 identified some difficulties and open issues in the usage of A/B testing,
mainly deriving from a limited degree of automation. We claim that, by for-
mulating A/B testing as an optimization problem as shown in Section 3, we
can effectively exploit automated search algorithms to investigate and enable
automated A/B testing. This section reifies this idea.
In our vision, automated A/B testing can be achieved by combining two ingredients: (i) an appropriate design-time declarative facility to specify program features, and (ii) a run-time framework in charge of automatically and iteratively exploring the solution space of possible concrete programs by generating, executing, and evaluating variants. We captured these ideas in a reference architecture (see Fig. 2) and we implemented it in a prototype tool.

Fig. 2. A reference architecture for automated A/B testing: the programmer specifies features in the program; the run-time framework (AOP + genetic algorithm) generates a population of variants, serves and evaluates each variant in an experiment with live users, and selects the best variants for the next iteration.
The architecture consists of two main steps, summarized below and detailed in
the following paragraphs.
Specifying Features. As stated in Section 2, A/B testing requires a significant effort in developing and deploying the conceived variants. To relieve the developer of this burden, our architecture provides ad-hoc annotations to specify the set of relevant features of a program and their associated domains. In other words, this allows the developer to write, only once, a parametric program that represents all possible variants, which are automatically instantiated later at run-time.
Selecting and Evaluating Variants. As stated in Section 2, developers need
to take several critical decisions to guide and control the iterative search for
better solutions through A/B testing. To overcome these difficulties, our archi-
tecture provides a run-time framework that automates the search process by
exploiting genetic algorithms. In particular, it creates and iteratively evolves
a population of variants by selecting, mutating, and evaluating its individuals.
At execution time, the framework instantiates concrete variants from the pa-
rameterized program by means of a dependency injection mechanism based on
aspect-oriented programming [8]. The run-time framework assesses the good-
ness of instantiated variants by means of an appropriate application-specific
assessment function (see Section 3) that measures the preferences of users. Such an assessment function is adopted as the fitness function of the genetic algorithm and drives the generation and evolution of new variants at each iteration.
4.1 Specifying Features
In our approach, developers can specify features by means of annotated variable declarations (our prototype targets the Java programming language; however, the illustrated concepts and techniques apply seamlessly to other languages and technologies). Variables declared as features cannot be initialized or modified, since their value is automatically assigned at execution time by the run-time framework.
@StringFeature(name = "checkOutButtonText",
               values = {"Check Out", "Buy", "Buy Now!"})
String buttonText;   // primitive feature specification

@IntegerFeature(name = "fontSize", range = "12:1:18")
int textSize;        // primitive feature specification

Button checkOutButton = new Button();
checkOutButton.setText(buttonText);       // primitive feature use
checkOutButton.setFontSize(textSize);     // primitive feature use

Listing 1.1. Primitive Type Feature Example.
As mentioned in Section 3, the concept of feature is very broad.
In particular, we distinguish two categories of features: (i) Primitive Type and (ii) Abstract Data Type features. The former refers to program features that can be modelled with primitive types (integers, doubles, booleans, and strings); the latter refers to features implemented through ad-hoc abstract data types. Hereafter we discuss both options in detail.
Primitive Type Features. Let us consider the example in Section 3 of a fea-
ture of string type that specifies the label of a button in a mobile application.
To represent this feature and its domain, developers can declare a string vari-
able annotated with the @StringFeature annotation (see Listing 1.1). In this
case, the feature can assume values in the entire domain of strings. Developers can restrict this domain by specifying a set of valid values in the values parameter. Analogous annotations are provided for other primitive types (e.g., @IntegerFeature, @BooleanFeature, etc.). Unlike string features,
the domain of numeric features can be specified as a range of valid values. For
example, reasonable values for an integer feature that represents a font size may
be in the range (12:18) with a step of 1.
The run-time framework instantiates and serves to users concrete variants of the program, which are automatically generated by injecting values into the features
declared by the developers. Thus, considering the code in Listing 1.1, a user
accessing the system may experience a variant of the program in which the
button is labeled with the text “Buy Now!” with font size equal to 12, while
a different user, at the same time, may experience a different variant of the
program with text “Check-out” and font size equal to 16.
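The paper does not show how the feature annotations themselves are declared. The following is a minimal sketch of what such declarations could look like, assuming the framework discovers annotated fields via reflection; the attribute names mirror Listing 1.1, everything else is our assumption.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical declaration of @StringFeature: retained at run-time so that the
// framework can discover annotated fields and inject one of the allowed values.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface StringFeature {
    String name();                    // feature identifier
    String[] values() default {};     // allowed values; empty means the whole string domain
}

// Hypothetical declaration of @IntegerFeature: the domain is encoded as a
// "min:step:max" range string, mirroring the "12:1:18" example of Listing 1.1.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface IntegerFeature {
    String name();
    String range();
}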
Generic Data Type Feature. Real-world programs may be characterized by
complex features that require ad-hoc abstract data types. We support them by
relying on two ingredients: (i) an interface that specifies the abstract behavior
of the feature and (ii) several implementations of the interface that contain the
concrete realizations of this feature.
Let us recall the example mentioned in Section 3 of a component in charge of
sorting some items displayed to the users. Possible realizations of this feature may
sort items by name, price, or rating. Thus, the generic aspect of sorting items can
be defined as a feature as shown in Listing 1.2. To do so, developers declare an
@GenericFeatureInterface(name = "sort",
    values = {"com.example.SortByPrice", "com.example.SortByName"})
public interface AbstractSortingInterface {
    public List<Item> sort();
}

public class SortByPrice implements AbstractSortingInterface {
    public List<Item> sort() {
        // sort by price implementation
    }
}

public class SortByName implements AbstractSortingInterface {
    public List<Item> sort() {
        // sort by name implementation
    }
}

// ...

@GenericFeature
AbstractSortingInterface sortingFeature;   // ADT feature specification

sortingFeature.sort();                     // ADT feature use

Listing 1.2. Generic Data Type Feature Example.
interface that includes all the required methods (i.e., sort(...) in this example)
and implement as many concrete realizations of this interface as needed. The
interface must be annotated with the @GenericFeatureInterface annotation
including the full class name of all its implementations.
Developers can then declare variables of the type specified by the interface and annotate them with @GenericFeature. Analogously to primitive type features, the
run-time framework serves to users concrete variants of the program that are
automatically generated by injecting an appropriate reference to one of the in-
terface implementations. For example, the invocation to the sort method in the
last row of Listing 1.2 may be dispatched to an instance that implements the
SortByName algorithm or to an instance that implements the SortByPrice al-
gorithm, depending on the type of the object injected at run-time.
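The prototype performs this injection through JBoss AOP [2]. As a much simpler, framework-agnostic illustration of the same idea (assigning the values chosen for a variant to the annotated fields of a program object before it serves a user), consider the reflection-based sketch below; the class and method names are ours, and the annotations are assumed to be runtime-retained (with @GenericFeature a plain marker annotation).

import java.lang.reflect.Field;
import java.util.Map;

// Hypothetical injector: given a variant, i.e., a mapping from feature names to the
// values chosen by the run-time framework, assign each annotated field of the target
// object. The real prototype obtains the same effect through aspect-oriented
// programming rather than explicit reflection.
final class FeatureInjector {

    void inject(Object target, Map<String, Object> variant) throws IllegalAccessException {
        for (Field field : target.getClass().getDeclaredFields()) {
            String featureName = featureNameOf(field);
            if (featureName != null && variant.containsKey(featureName)) {
                field.setAccessible(true);
                // For a @GenericFeature, the chosen value is an instance of one of the
                // implementations listed in @GenericFeatureInterface (e.g., SortByName).
                field.set(target, variant.get(featureName));
            }
        }
    }

    private String featureNameOf(Field field) {
        if (field.isAnnotationPresent(StringFeature.class)) {
            return field.getAnnotation(StringFeature.class).name();
        }
        if (field.isAnnotationPresent(IntegerFeature.class)) {
            return field.getAnnotation(IntegerFeature.class).name();
        }
        if (field.isAnnotationPresent(GenericFeature.class)) {
            return field.getName();   // @GenericFeature carries no name attribute in Listing 1.2
        }
        return null;
    }
}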
4.2 Selecting and Evaluating Variants
So far, we explained how developers can declare features in a program and we
delegated to the run-time framework the task of generating, executing, and evalu-
ating variants. Now, we explore how the run-time framework actually implements
these aspects.
In our prototype, the run-time framework relies on a genetic algorithm that
runs online while the system is operating. A genetic algorithm encodes every
possible solution of an optimization problem as a chromosome, composed of
several genes. It selects and iteratively evolves a population of chromosomes by
applying three main steps (discussed below): (i) selection, (ii) crossover, and
(iii) mutation. Each chromosome is evaluated according to a fitness function.
The algorithm terminates after a fixed number of iterations or when subsequent iterations do not generate new chromosomes with significantly improved fitness values. Solving the problem of searching for variants via genetic algorithms requires specifying an encoding and a concrete strategy for each of the three main steps mentioned above. Next, we provide these ingredients for the specific
case of A/B testing.
Encoding. Each feature declared by developers directly maps into a gene, while
each variant maps into a chromosome. Analogously, the assessment function,
which evaluates variants on live users, corresponds directly to the fitness func-
tion, which evaluates chromosomes. Two additional aspects are required to fully
specify a valid encoding: the number of chromosomes used in each iteration and
the termination condition.
Our framework enables developers to specify application-specific fitness
functions. In addition, it accepts a preferred population size, but adaptively
changes it at run-time based on the measured fitness values. Furthermore, the
framework is responsible for terminating the experiment when the newly gener-
ated variants do not provide improvements over a certain threshold.
Selection. Selection is the process of identifying, at each iteration, a finite num-
ber of chromosomes in the population that survive, i.e., that are considered in
the next iteration. Several possible strategies have been discussed in the liter-
ature. For example, the tournament strategy divides chromosomes into groups.
Only the best one in each group (i.e., the one with the highest fitness value)
wins the tournament and survives to the next iteration. Conversely, the thresh-
old selection strategy selects all and only the chromosomes whose fitness value is
higher than a given threshold. Traditional A/B testing, as described in Section 2,
corresponds to the tournament strategy when the population size is limited to
two chromosomes. Our framework supports several different strategies. This enables more complex decision schemes than in traditional A/B testing (e.g., by
comparing several variants concurrently). At the same time, selection strategies
relieve the developer from manually selecting variants during A/B testing.
Crossover & Mutation. Crossover and mutation contribute to the generation
of new variants. Crossover randomly selects two chromosomes from the popu-
lation and generates new ones by combining them. Mutation produces instead
a new chromosome (i.e., a new variant) starting from an existing one by ran-
domly mutating some of its genes. One of the key roles of mutation is to widen
the scope of exploration, thus trying to avoid converging to local minima. In
traditional A/B testing, the process of generating new variants of a program is
performed manually by the developers. Thanks to the crossover and mutation
steps of genetic algorithms, this critical activity is also completely automated.
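To make the mapping between A/B testing and genetic algorithms concrete, the following framework-agnostic sketch shows how the encoding and the three steps above fit together. The prototype delegates these steps to the JGAP library [14]; here the encoding (an int[] of domain indexes), the class, and the parameter handling are our own illustration, and in the real framework the fitness of a variant is measured by serving it to live users rather than by calling a local function.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Schematic evolution loop: a chromosome is an int[] whose i-th gene is an index
// into the domain D_i of feature f_i; the fitness function stands in for the
// assessment function o(v).
final class VariantSearch {
    private final Random rnd = new Random();
    private final int[] domainSizes;                   // |D_i| for each feature
    private final ToDoubleFunction<int[]> fitness;

    VariantSearch(int[] domainSizes, ToDoubleFunction<int[]> fitness) {
        this.domainSizes = domainSizes;
        this.fitness = fitness;
    }

    int[] evolve(int populationSize, int iterations, double crossoverRate, double mutationRate) {
        List<int[]> population = new ArrayList<>();
        for (int i = 0; i < populationSize; i++) population.add(randomVariant());
        for (int it = 0; it < iterations; it++) {
            // Selection (natural selection): keep the best 90% of the variants.
            population.sort((a, b) -> Double.compare(fitness.applyAsDouble(b), fitness.applyAsDouble(a)));
            int survivors = Math.max(1, (int) (populationSize * 0.9));
            List<int[]> next = new ArrayList<>(population.subList(0, survivors));
            // Crossover and mutation: refill the population with new variants.
            while (next.size() < populationSize) {
                int[] child = rnd.nextDouble() < crossoverRate
                        ? crossover(pick(next), pick(next))
                        : pick(next).clone();
                mutate(child, mutationRate);
                next.add(child);
            }
            population = next;
        }
        population.sort((a, b) -> Double.compare(fitness.applyAsDouble(b), fitness.applyAsDouble(a)));
        return population.get(0);                      // fittest variant found
    }

    private int[] randomVariant() {
        int[] genes = new int[domainSizes.length];
        for (int i = 0; i < genes.length; i++) genes[i] = rnd.nextInt(domainSizes[i]);
        return genes;
    }

    private int[] pick(List<int[]> pool) { return pool.get(rnd.nextInt(pool.size())); }

    private int[] crossover(int[] a, int[] b) {        // single-point crossover
        int[] child = a.clone();
        int cut = rnd.nextInt(child.length);
        System.arraycopy(b, cut, child, cut, child.length - cut);
        return child;
    }

    private void mutate(int[] genes, double rate) {    // re-draw each gene with a small probability
        for (int i = 0; i < genes.length; i++) {
            if (rnd.nextDouble() < rate) genes[i] = rnd.nextInt(domainSizes[i]);
        }
    }
}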
The architecture described so far represents a first concrete implementation of an automated solution to the A/B testing problem. This contributes to overcoming some of the burdens of manual A/B testing discussed in Section 2. However, it is worth mentioning that some issues remain open, as exemplified hereafter.
First, the behaviour of the architecture still needs to be configured by speci-
fying several parameters (e.g., the population size, selection strategy, and the
termination condition). Section 5 discusses some relevant scenarios that exem-
plify their role. Experimental campaigns on real users will be required to further
study and tune them appropriately. Second, further investigations are required
on the mapping between the automated A/B testing problem and our proposed
architecture. As an example, let us consider mutation. Its capabilities are highly dependent on the encoding of features. Currently, we support mutations among different values of primitive types and among different implementations of an interface. In the future we envision more complex forms of mutation that try to automatically modify the source code (e.g., by swapping the order of some statements). Finally, even if we did not discuss the role of constraints (see Section 3) in our architecture, they can be easily modelled and integrated in genetic algorithms, as demonstrated in the literature (e.g., [6]).
5 Preliminary Validation
In this section, we provide an initial empirical investigation of the feasibility of
automated A/B testing. To do so, we generate a sample program, we simulate
different user preferences, and we study to which extent an automated optimiza-
tion algorithm converges towards a “good” solution, i.e., a solution that maxi-
mizes the quality of the program, as measured through its assessment function.
In our evaluation, we consider different parameters to model the preferences of
users, the complexity of the program, and to configure the computational steps
performed by the genetic algorithm.
Experiment setup. For our experiments, we adopt the implementation de-
scribed in Section 4. Our prototype is entirely written in Java and relies on
JBoss AOP [2] to detect and instantiate features in programs and on the JGAP
library [14] to implement the genetic algorithm that selects, evolves, and vali-
dates variants of the program at run-time. In our experiments, we consider a
program with $n$ features. We assume that each feature has a finite, countable domain (e.g., integer numbers, concrete implementations of a function). Furthermore, for ease of modeling, we assume that each domain $D$ is a metric space, i.e., that we can compute a distance $d_{1,2}$ for each and every pair of elements $e_1, e_2 \in D$. To simulate the user preferences (and compute the value of the assessment function for a variant of the program), we perform the following steps.
1. We split the users into $g$ groups. Each user $u$ selects a "favourite" value $\mathit{best}_f$ for each feature $f$ in the program; users within the same group share the same favourite values for all the features.
2. We assume that the assessment function of a program ranges from 0 (worst case) to 1000 (best case). When a user $u$ interacts with a variant $v$ of a program, she evaluates $v$ as follows. She provides a score for each feature $f$ in $v$. The score of $f$ is maximum (1000) if the value of $f$ in $v$ is $\mathit{best}_f$ (the favourite value for $u$) and decreases linearly with the distance from $\mathit{best}_f$. The value of a variant is the average score of all its features.
3. We set a distance threshold $t$. If the distance between the value of a feature and the user's favourite value is higher than $t$, then the value of the variant is 0.
Intuitively, the presence of multiple groups models the differences in the users' profiles (e.g., differences in age, location, culture, etc.). The more the values of features differ from a user's favourite ones, the worse she will evaluate the variant. The threshold mimics the user tolerance: beyond a maximum distance, the program does not have any value for her. While this is clearly a simplified and abstract model, it is suitable to highlight some key aspects that contribute to the complexity of A/B testing: understanding how to select the variants of a program and how to iteratively modify them to satisfy a heterogeneous population of users may be extremely difficult. In addition to the issues listed in Section 2, a manual process is time-consuming and risks failing to converge towards a good solution, or it may require a higher number of iterations.

Table 1. Parameters used in the default scenario.
  Number of features in the program: 10
  Number of values per feature: 100
  Number of variants evaluated concurrently: 100
  Number of user groups: 4
  Distance threshold: 80% of maximum distance
  Number of evaluations for each variant: 1000
  Stopping condition: 10 repetitions with improvement < 0.1%
  Selection strategy: natural selection (90%)
  Crossover rate: 35%
  Mutation rate: 0.08%

Table 2. Results for the default scenario: average values with 95% confidence intervals.
  Value of the assessment function (best): 810.9 (± 11.7)
  Value of the assessment function (average): 779.2 (± 10.7)
  Number of iterations: 34.2 (± 5.9)
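The simulated assessment function of steps 1-3 can be summarized by the sketch below. The paper does not state the exact slope of the linear decay, so we assume the score reaches 0 at the maximum distance; representing feature values as integers and the distance as their absolute difference is also our simplification of the metric-space assumption.

// Sketch of the simulated assessment function: the score of each feature decays
// linearly with the distance from the user's favourite value, the whole variant is
// worth 0 as soon as one feature exceeds the tolerance threshold, and the value of
// the variant is the average score of its features (range 0 to 1000).
final class SimulatedAssessment {
    double value(int[] variant, int[] favourite, double threshold, double maxDistance) {
        double sum = 0.0;
        for (int i = 0; i < variant.length; i++) {
            double distance = Math.abs(variant[i] - favourite[i]);
            if (distance > threshold * maxDistance) {
                return 0.0;                               // beyond the user's tolerance
            }
            sum += 1000.0 * (1.0 - distance / maxDistance);
        }
        return sum / variant.length;                      // average score over all features
    }
}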
Default Scenario. To perform the experiments discussed in the remainder of
this section, we defined a default scenario with the parameters listed in Table 1.
Next, we investigate the impact of each parameter in the measured results. For
space reasons, we report here only the most significant findings.
Our default scenario considers a program with 10 different features, each one
selecting values from a finite domain of 100 elements. At each iteration, we con-
currently evaluate 100 program variants, submitting each variant to 1000 random
users. Users are clustered into 4 different groups. They do not tolerate features
whose distance from their favourite value is higher than 80% of the maximum
distance. At each iteration, the genetic algorithm keeps the number of chromo-
somes (i.e., program variants) fixed. We adopt a natural selection strategy that
selects the best 90% chromosomes to survive for the next generation. Selected
chromosomes are combined with a crossover rate of 35% and modified using a mutation rate of 0.08%. The process stops when the improvement in fitness value is lower than 0.1% for 10 subsequent iterations.

Fig. 3. Impact of the complexity of the program: (a) value of the assessment function (average and final) and (b) number of iterations, as the number of features grows from 10 to 1000.
In each experiment, we measure the number of iterations performed by the
genetic algorithm, the value of the assessment function for the selected variant,
and the average value of the assessment function for all the variants used in
the iterative step. The first value tells us how long the algorithm needs to run
before providing a result. The second value represents the quality of the final
solution. Finally, the third value represents the quality of the solutions proposed
to the users during the iterative process. In A/B testing, this value is extremely
important: consider for example an e-commerce Web application, in which the
developer wants to maximize the number of purchases performed by users. Not only is the final version of the application important, but also all the intermediate versions generated during the optimization process: indeed, they may run for a long time to collect feedback from the users and may impact the revenue of
the application. We repeated each experiment 10 times, with different random
seeds to generate the features of the program, the user preferences, and the
selection, crossover, and mutation steps. In the graph below, we show the average
value of each measure and the 95% confidence interval.
Table 2 shows the results we measured in our default scenario. Despite the
presence of 4 user groups with different requirements, the algorithm converges
towards a solution that provides a good average quality. We manually checked
the solutions proposed by the algorithm and verified that they always converged
to near-optimal values for the given user preferences. The average value of the
assessment function during the optimization process is particularly interesting:
it is very close to the final solution, meaning that the algorithm converges fast
towards good values. Although the average number of iterations is 34.2, the last
iterations only provide incremental advantages. As discussed above, this is a key
aspect for A/B testing, since developers want to maximize the revenue of an
application during the optimization process.
Complexity of the program. In this section, we analyze how the results
change with the complexity of the program, i.e., with the number of specified
features. Fig. 3 shows the results we obtained. By looking at the value of the
assessment function (Fig. 3(a)), we notice that both the final and the average quality decrease with the number of features.

Fig. 4. Impact of the profile of users: (a) value of the assessment function vs. the number of user groups; (b) value of the assessment function vs. the distance threshold.

This is expected, since a higher
number of features increases the probability that at least one is outside the
maximum distance from the user’s preferred value, thus producing a value of
0. Nevertheless, even in a complex program with 1000 features, both values
remain above 700. Moreover, the average value remains very close to the final
one. Fig. 3(b) shows that the number of iterations required to converge increases
with the number of features in the program. Indeed, a higher number of features
increases the size of the search space. Nevertheless, an exponential growth in
the number of features only produces a sub-linear increase in the number of
iterations. Even with 1000 features, less than 100 iterations are enough for the genetic algorithm to converge. These preliminary results are encouraging and suggest that automated A/B testing could be adopted for complex programs with several hundreds of features.
Profiles of Users. In this section, we analyze how the profile of users impacts
the performance of the optimization algorithm. Fig. 4 shows the results we mea-
sured by observing two main parameters: the number of user groups and the
maximum distance tolerated by users. For space reason, we show only the value
of the assessment function: the number of iterations did not change significantly
during these experiments. Fig. 4(a) shows that the value of the assessment func-
tion decreases with the number of user groups. Indeed, a higher number of groups
introduces heterogeneous preferences and constraints. Finding a suitable variant
that maximizes the user satisfaction becomes challenging. With one group, the
solution proposed by the genetic algorithm is optimal, i.e., it selects all the pre-
ferred features of the users in the group. This is not possible in the presence of more
than one group, due to differences in requirements. Nevertheless, the quality of
the solution remains almost stable when considering from 2 to 10 user groups.
Finally, also in this case, the average value of the assessment function remains
very close to the final one. Fig. 4(b) shows how the selectivity of users influences
the results. When reducing the maximum tolerated difference, it becomes more
and more difficult to find a solution that satisfies a high number of users. Because
of this, when considering a threshold of only 10% of the maximum distance, the
final solution can satisfy only a fraction of the users. Thus, the quality of the
solution drops below 600.
Discussion. Although based on a synthetic and simple model of the users’ pref-
erences, the analysis above highlights some important aspects of A/B testing.
First, our experiments confirmed and emphasized some key problems in performing manual A/B testing: in the presence of heterogeneous user groups, with different preferences and constraints, devising a good strategy for evolving and improving the program is extremely challenging. Most importantly for the goal of this paper, our analysis suggests that an automated solution is indeed possible and worth investigating. Indeed, in all the experiments we performed, the genetic algorithm was able to converge towards a good solution. Moreover, it always converged within a small number of steps (less than 100, even in the most challenging scenarios we tested). Furthermore, intermediate variants of the programs adopted during the optimization process were capable of providing good values for the assessment function. This is relevant when considering live experiments, in which intermediate programs are shown to the users: providing good satisfaction of users even in this intermediate phase may be crucial to avoid losing customers and revenues.
6 Related Work
The SBSE [7] community has focused its research efforts on several relevant software engineering challenges, covering the various steps of the software life-cycle, ranging from requirements engineering [17] and design [16] to testing [4] and even maintenance [15]. However, despite all these valuable research efforts, the problem of evolving and refining systems after their deployment (in particular in the domain of user-intensive systems) has received very little attention so far and, to the best of our knowledge, this is the very first attempt to introduce this problem to the SBSE community.
Concerning the research on A/B testing, we can mention many interesting related efforts. For example, Crook et al. [5] discuss seven pitfalls to
avoid in A/B testing on the Web. Analogously, Kohavi et al. [9,12] discuss the
complexity of conducting sound and effective A/B testing campaigns. To support
developers in this complex activity, Kohavi et al. [11] also provided a practical tu-
torial. Worth mentioning is also [10], which discusses online experiments on large
scale scenarios. Finally, worth mentioning is the Javascript project Genetify [1].
The project represents a preliminary effort to introduce genetic algorithms in
A/B testing and demonstrates how practitioners actually demand methods and tools for automated A/B testing, as claimed in the motivation of this paper. However, this project is quite immature and does not exploit the full potential of genetic algorithms as we propose in this paper: it only supports the evolution
of HTML pages in Web applications.
7 Conclusions
In this paper we tackled the problem of automating A/B testing. We formalized
A/B testing as a SBSE problem and we proposed an initial prototype that relies
on aspect-oriented programming and genetic algorithms. We provided two im-
portant contributions. On the one hand, we used our prototype to demonstrate
the practical feasibility of automated A/B testing through a set of synthetic ex-
periments. On the other hand, we provided the SBSE community with a novel
domain where its expertise can be applied. As future work, we plan to test our
approach on real users and to refine the proposed approach with customized
mutation operators (e.g., changes to the source code) and full support for con-
straints.
References
1. Genetify. https://github.com/gregdingle/genetify/wiki. [Accessed 25-02-
2014].
2. JBoss AOP. http://www.jboss.org/jbossaop. [Accessed 25-02-2014].
3. The A/B Test: Inside the Technology That's Changing the Rules of Business. http://www.wired.com/business/2012/04/ff_abtesting. [Accessed 25-02-2014].
4. Wasif Afzal, Richard Torkar, and Robert Feldt. A systematic review of search-
based testing for non-functional system properties. Inf. Softw. Technol., 2009.
5. Thomas Crook, Brian Frasca, Ron Kohavi, and Roger Longbotham. Seven pitfalls
to avoid when running controlled experiments on the web. In ACM SIGKDD,
KDD ’09, pages 1105–1114, New York, NY, USA, 2009. ACM.
6. Kalyanmoy Deb. An efficient constraint handling method for genetic algorithms.
Computer Methods in Applied Mechanics and Engineering, 186(2-4):311-338, 2000.
7. Mark Harman and Bryan F Jones. Search-based software engineering. Information
and Software Technology, 43(14):833–839, 2001.
8. Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Lopes, Jean-Marc Loingtier, and John Irwin. Aspect-oriented programming. In ECOOP'97: Object-Oriented Programming. Springer, 1997.
9. Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and
Ya Xu. Trustworthy online controlled experiments: Five puzzling outcomes ex-
plained. In ACM SIGKDD, pages 786–794. ACM, 2012.
10. Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann.
Online controlled experiments at large scale. In ACM SIGKDD, KDD ’13, pages
1168–1176, New York, NY, USA, 2013. ACM.
11. Ron Kohavi, Randal M Henne, and Dan Sommerfield. Practical guide to controlled
experiments on the web: listen to your customers not to the hippo. In ACM
SIGKDD, pages 959–967. ACM, 2007.
12. Ron Kohavi and Roger Longbotham. Unexpected results in online controlled ex-
periments. ACM SIGKDD Explorations Newsletter, 12(2):31-35, 2011.
13. John R Koza. Genetic programming: on the programming of computers by means
of natural selection, volume 1. MIT press, 1992.
14. K Meffert, N Rotstan, C Knowles, and U Sangiorgi. JGAP - Java Genetic Algo-
rithms and Genetic Programming Package, 2014.
15. M. O'Keeffe and M.O. Cinneide. Search-based software maintenance. In Proceedings of the 10th European Conference on Software Maintenance and Reengineering (CSMR 2006), March 2006.
16. Outi Räihä. A survey on search-based software design. Computer Science Review, 4(4):203-249, 2010.
17. Yuanyuan Zhang, Anthony Finkelstein, and Mark Harman. Search based require-
ments optimisation: Existing work and challenges. In REFSQ. Springer, 2008.
... The architecture framework proposed in Chapter 5 provides a general architecture framework for automated experimentation, that is not restricted only to sequential optimization. However, its instantiation as well as most of the work on automated experimentation focus on sequential optimization [29,30,[49][50][51]. ...
... We group these approaches as a single practice because of their common experiment type. However, these techniques are based on very different premises and theories, ranging from Bayesian analysis [53], response surface methodology [16], Taguchi optimal designs [2,145,235] to search-based heuristics [51,147]. These techniques are commonly used in network optimization and can be performed by both the mobile operators as well as Ericsson [30]. ...
... One area of active development in online optimization is the development of algorithms for searching the parameter space. When formulating the online optimization problem, as a search-based black-box problem, [29,51,53], practitioners are presented with a several algorithm families without any systematic procedure to guide their choice. This situation is complicated by the lack of statistical comparison between families of algorithms and by the low quality of the statistical analysis. ...
Thesis
Full-text available
Context: Delivering software that has value to customers is a primary concern of every software company. Prevalent in web-facing companies, controlled experiments are used to validate and deliver value in incremental deployments. At the same that web-facing companies are aiming to automate and reduce the cost of each experiment iteration, embedded systems companies are starting to adopt experimentation practices and leverage their activities on the automation developments made in the online domain. Objective: This thesis has two main objectives. The first objective is to analyze how software companies can run and optimize their systems through automated experiments. This objective is investigated from the perspectives of the software architecture, the algorithms for the experiment execution and the experimentation process. The second objective is to analyze how non web-facing companies can adopt experimentation as part of their development process to validate and deliver value to their customers continuously. This objective is investigated from the perspectives of the software development process and focuses on the experimentation aspects that are distinct from web-facing companies. Method: To achieve these objectives, we conducted research in close collab�oration with industry and used a combination of different empirical research methods: case studies, literature reviews, simulations, and empirical evalua�tions. Results: This thesis provides six main results. First, it proposes an architecture framework for automated experimentation that can be used with different types of experimental designs in both embedded systems and web-facing systems. Second, it proposes a new experimentation process to capture the details of a trustworthy experimentation process that can be used as the basis for an automated experimentation process. Third, it identifies the restrictions and pitfalls of different multi-armed bandit algorithms for automating experiments in industry. This thesis also proposes a set of guidelines to help practitioners select a technique that minimizes the occurrence of these pitfalls. Fourth, it proposes statistical models to analyze optimization algorithms that can be used in automated experimentation. Fifth, it identifies the key challenges faced by embedded systems companies when adopting controlled experimentation, and we propose a set of strategies to address these challenges. Sixth, it identifies experimentation techniques and proposes a new continuous experimentation model for mission-critical and business-to-business. Conclusion: The results presented in this thesis indicate that the trustwor�thiness in the experimentation process and the selection of algorithms still need to be addressed before automated experimentation can be used at scale in industry. The embedded systems industry faces challenges in adopting experimentation as part of its development process. In part, this is due to the low number of users and devices that can be used in experiments and the diversity of the required experimental designs for each new situation. This limitation increases both the complexity of the experimentation process and the number of techniques used to address this constraint.
... After a trial usage, the tester collects necessary data about execution time, events, user's behaviour and other data from the experiment. In the end, the user passes a feedback form to describe his opinion on satisfaction with the given version of the product [ [30]]. [30], shows, A/B testing is an iterative methodology. ...
... In the end, the user passes a feedback form to describe his opinion on satisfaction with the given version of the product [ [30]]. [30], shows, A/B testing is an iterative methodology. Each round, there is a winning prototype, and it can be compared with the all-new prototype in the next iteration. ...
Article
Full-text available
The most common reason for software product failure is misunderstanding user needs. Analysing and validating user needs before developing a product can allow to prevent such failures. This paper investigates several data-driven techniques for user research and product design through prototyping, customer validation, and usability testing. The authors implemented a case study software product using the proposed techniques, and analyses how the application of UX/UI research techniques affected the development process. The case study results indicate that preliminary UX/UI research before the development reduces the cost of product changes. Moreover, the paper proposes a set of metrics for testing the effectiveness of UX/UI design.
... To battle the issue, previous work has attempted to apply Artificial Intelligence algorithms to enable efficient search of a large design space [5,13,16,19,22]. Particularly, Salem [19] combined crowdsourcing and genetic programming [12] for the design of landing pages. ...
... To address the issue, Salem [19] combined crowdsourcing and genetic programming [12] for the design of landing pages. Tamburrelli and Margara [22] explored approaches for optimizing software designs specified in Java through GA, basing their fitness function on the distance from users' interaction position. Despite the adoption of the crowd, these interactive GA solutions rely on implicit information, such as click location that is difficult to generalize to other design tasks. ...
Preprint
User interface design is a complex task that involves designers examining a wide range of options. We present Spacewalker, a tool that allows designers to rapidly search a large design space for an optimal web UI with integrated support. Designers first annotate each attribute they want to explore in a typical HTML page, using a simple markup extension we designed. Spacewalker then parses the annotated HTML specification, and intelligently generates and distributes various configurations of the web UI to crowd workers for evaluation. We enhanced a genetic algorithm to accommodate crowd worker responses from pairwise comparison of UI designs, which is crucial for obtaining reliable feedback. Based on our experiments, Spacewalker allows designers to effectively search a large design space of a UI, using the language they are familiar with, and improve their design rapidly at a minimal cost.
... MVTs are cautioned against [86] because of their added complexity. In contrast, other researchers take an optimization approach using lots (see 14 J o u r n a l P r e -p r o o f Journal Pre-proof Section 4.2.5) of variables with multi-armed bandits [87,88,89,46] or searchbased methods [90,91,92]. Also mixed methods research is used to combine quantitative and qualitative data. ...
... Tamburelli and Margara [92] proposed search-based methods (i.e. genetic algorithms) for optimization of software, and Iitsuka and Matsuo [97] demonstrated a local search method with a proof of concept on web sites. ...
Article
Full-text available
Context Continuous experimentation and A/B testing is an established industry practice that has been researched for more than 10 years. Our aim is to synthesize the conducted research. Objective We wanted to find the core constituents of a framework for continuous experimentation and the solutions that are applied within the field. Finally, we were interested in the challenges and benefits reported of continuous experimentation. Methods We applied forward snowballing on a known set of papers and identified a total of 128 relevant papers. Based on this set of papers we performed two qualitative narrative syntheses and a thematic synthesis to answer the research questions. Results The framework constituents for continuous experimentation include experimentation processes as well as supportive technical and organizational infrastructure. The solutions found in the literature were synthesized to nine themes, e.g. experiment design, automated experiments, or metric specification. Concerning the challenges of continuous experimentation, the analysis identified cultural, organizational, business, technical, statistical, ethical, and domain-specific challenges. Further, the study concludes that the benefits of experimentation are mostly implicit in the studies. Conclusion The research on continuous experimentation has yielded a large body of knowledge on experimentation. The synthesis of published research presented within include recommended infrastructure and experimentation process models, guidelines to mitigate the identified challenges, and what problems the various published solutions solve.
... MVTs are cautioned against [86] because of their added complexity. In contrast, other researchers take an optimization approach using lots (see Section 4.2.5) of variables with multi-armed bandits [87,88,89,46] or searchbased methods [90,91,92]. Also mixed methods research is used to combine quantitative and qualitative data. ...
... Tamburelli and Margara [92] proposed search-based methods (i.e. genetic algorithms) for optimization of software, and Iitsuka and Matsuo [97] demonstrated a local search method with a proof of concept on web sites. ...
Preprint
Full-text available
Context: Continuous experimentation and A/B testing is an established industry practice that has been researched for more than 10 years. Our aim is to synthesize the conducted research. Objective: We wanted to find the core constituents of a framework for continuous experimentation and the solutions that are applied within the field. Finally, we were interested in the challenges and benefits reported of continuous experimentation. Method: We applied forward snowballing on a known set of papers and identified a total of 128 relevant papers. Based on this set of papers we performed two qualitative narrative syntheses and a thematic synthesis to answer the research questions. Results: The framework constituents for continuous experimentation include experimentation processes as well as supportive technical and organizational infrastructure. The solutions found in the literature were synthesized to nine themes, e.g. experiment design, automated experiments, or metric specification. Concerning the challenges of continuous experimentation, the analysis identified cultural, organizational, business, technical, statistical, ethical, and domain-specific challenges. Further, the study concludes that the benefits of experimentation are mostly implicit in the studies. Conclusions: The research on continuous experimentation has yielded a large body of knowledge on experimentation. The synthesis of published research presented within include recommended infrastructure and experimentation process models, guidelines to mitigate the identified challenges, and what problems the various published solutions solve.
... However, these techniques are based on very different premises and theories, ranging from Bayesian analysis, 45 response surface methodology, 41 Taguchi optimal designs 1,46,47 to search-based heuristics. 48,49 These techniques are commonly used in network optimization and can be performed by both the mobile operators as well as Ericsson. 40 ...
Article
Full-text available
Continuous experimentation (CE) refers to a set of practices used by software companies to rapidly assess the usage, value, and performance of deployed software using data collected from customers and systems in the field using an experimental methodology. However, despite its increasing popularity in developing web‐facing applications, CE has not been studied in the development process of business‐to‐business (B2B) mission‐critical systems. By observing the CE practices of different teams, with a case study methodology inside Ericsson, we were able to identify the different practices and techniques used in B2B mission‐critical systems and a description and classification of the four possible types of experiments. We present and analyze each of the four types of experiments with examples in the context of the mission‐critical long‐term evolution (4G) product. These examples show the general experimentation process followed by the teams and the use of the different CE practices and techniques. Based on these examples and the empirical data, we derived the HURRIER process to deliver high‐quality solutions that the customers value. Finally, we discuss the challenges, opportunities, and lessons learned from applying CE and the HURRIER process in B2B mission‐critical systems. The HURRIER process combines existing validation techniques together with experimentation practices to deliver high‐quality software that customers value.
... In many real-world machine learning tasks, the evaluation metric one seeks to optimize is not explicitly available in closed form. This is true for metrics that are evaluated through live experiments or by querying human users (Tamburrelli and Margara, 2014; Hiranandani et al., 2019a, 2020a), or that require access to private or legally-protected data (Awasthi et al., 2021), and hence cannot be written as an explicit training objective. This is also the case when the learner only has access to data with a skewed training distribution or labels with heteroscedastic noise (Huang et al., 2019; Jiang et al., 2020), and hence cannot directly optimize the metric on the training set despite knowing its mathematical form. ...
Preprint
We consider learning to optimize a classification metric defined by a black-box function of the confusion matrix. Such black-box learning settings are ubiquitous, for example, when the learner only has query access to the metric of interest, or in noisy-label and domain adaptation applications where the learner must evaluate the metric via performance evaluation using a small validation sample. Our approach is to adaptively learn example weights on the training dataset such that the resulting weighted objective best approximates the metric on the validation sample. We show how to model and estimate the example weights and use them to iteratively post-shift a pre-trained class probability estimator to construct a classifier. We also analyze the resulting procedure's statistical properties. Experiments on various label noise, domain shift, and fair classification setups confirm that our proposal is better than the individual state-of-the-art baselines for each application.
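As a loose illustration of optimizing a metric that is only available as a black-box function of the confusion matrix, the sketch below post-shifts a pre-trained probability estimator by grid-searching a single decision threshold, querying the metric on a validation sample at each step; the data, the metric (an F1 stand-in), and the threshold grid are synthetic, and the example-weighting procedure described in the abstract is deliberately not reproduced here.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic validation sample: scores from a (hypothetical) pre-trained
# class-probability estimator and the corresponding true labels.
scores = rng.random(1000)
labels = (scores + 0.2 * rng.standard_normal(1000) > 0.5).astype(int)

def black_box_metric(y_true, y_pred):
    # Stands in for a metric we can only query; here F1, computed from the
    # confusion matrix entries.
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return 2 * tp / max(2 * tp + fp + fn, 1)

# Post-shift: pick the decision threshold that maximizes the queried metric.
best_threshold, best_value = max(
    ((t, black_box_metric(labels, (scores >= t).astype(int)))
     for t in np.linspace(0.05, 0.95, 19)),
    key=lambda pair: pair[1],
)
print(best_threshold, best_value)

The point is only that every evaluation goes through black_box_metric, i.e. the metric is queried rather than differentiated or written out as a training objective.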
... Traceability. Experiment iterations make it possible to gradually improve the system under experimentation through iterative adjustments of its parameters in order to maximize a metric of interest [22]. Hence, experiments are commonly part of a series of iteratively evolving experiments. ...
Chapter
Full-text available
Online controlled experimentation is an established technique to assess ideas for software features. Current approaches to conduct experimentation are based on experimentation platforms. However, each experimentation platform has its own explicit properties and implicit assumptions about an experiment. As a result, experiments are incomplete, difficult to repeat, and not comparable across experimentation platforms or platform versions. Our approach separates the experiment definition from the experimentation platform. This makes the experimentation infrastructure less dependent on the experimentation platform. Requirements on the independent experiment definition are researched and an architecture to implement the approach is proposed. A proof-of-concept demonstrates the feasibility and achieved level of independence from the platform.
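To make the idea of an experiment definition that is separated from any concrete experimentation platform more tangible, here is a purely hypothetical schema expressed as plain Python data classes; all field names are invented for illustration and are not taken from the cited work.

from dataclasses import dataclass, field

@dataclass
class Variant:
    name: str
    traffic_share: float  # fraction of users assigned to this variant
    parameters: dict = field(default_factory=dict)

@dataclass
class ExperimentDefinition:
    # Platform-independent description of a single experiment
    # (hypothetical schema, for illustration only).
    hypothesis: str
    primary_metric: str
    minimum_detectable_effect: float
    variants: list = field(default_factory=list)

checkout_test = ExperimentDefinition(
    hypothesis="A one-page checkout increases completed purchases",
    primary_metric="checkout_conversion_rate",
    minimum_detectable_effect=0.01,
    variants=[
        Variant("control", 0.5, {"checkout_pages": 3}),
        Variant("treatment", 0.5, {"checkout_pages": 1}),
    ],
)
print(checkout_test)

A platform adapter would then translate such a definition into the configuration format of whatever experimentation platform is actually in use.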
Chapter
A significant amount of research effort is put into studying machine learning (ML) and deep learning (DL) technologies. Real-world ML applications help companies to improve products and automate tasks such as classification, image recognition and automation. However, a traditional “fixed” approach, where the system is frozen before deployment, leads to sub-optimal system performance. Systems that autonomously experiment with and improve their own behavior and performance could improve business outcomes, but we need to know how this could actually work in practice. While there is some research on autonomously improving systems, the focus is on concepts and theoretical algorithms. However, less research is focused on empirical industry validation of the proposed theory. Empirical validations are usually done through simulations or by using synthetic or manually altered datasets. The contribution of this paper is twofold. First, we conduct a systematic literature review in which we focus on papers describing industrial deployments of autonomously improving systems and their real-world applications. Secondly, we identify open research questions and derive a model that classifies the level of autonomy based on our findings in the literature review.
Conference Paper
Full-text available
Online controlled experiments are often utilized to make data-driven decisions at Amazon, Microsoft, eBay, Facebook, Google, Yahoo, Zynga, and at many other companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher's experiments at the Rothamsted Agricultural Experimental Station in England in the 1920s, the deployment and mining of online controlled experiments at scale—thousands of experiments now—has taught us many lessons. These exemplify the proverb that the difference between theory and practice is greater in practice than in theory. We present our learnings as they happened: puzzling outcomes of controlled experiments that we analyzed deeply to understand and explain. Each of these took multiple-person weeks to months to properly analyze and get to the often surprising root cause. The root causes behind these puzzling results are not isolated incidents; these issues generalized to multiple experiments. The heightened awareness should help readers increase the trustworthiness of the results coming out of controlled experiments. At Microsoft's Bing, it is not uncommon to see experiments that impact annual revenue by millions of dollars, thus getting trustworthy results is critical and investing in understanding anomalies has tremendous payoff: reversing a single incorrect decision based on the results of an experiment can fund a whole team of analysts. The topics we cover include: the OEC (Overall Evaluation Criterion), click tracking, effect trends, experiment length and power, and carryover effects.
Conference Paper
Full-text available
Web-facing companies, including Amazon, eBay, Etsy, Facebook, Google, Groupon, Intuit, LinkedIn, Microsoft, Netflix, Shop Direct, StumbleUpon, Yahoo, and Zynga use online controlled experiments to guide product development and accelerate innovation. At Microsoft’s Bing, the use of controlled experiments has grown exponentially over time, with over 200 concurrent experiments now running on any given day. Running experiments at large scale requires addressing multiple challenges in three areas: cultural/organizational, engineering, and trustworthiness. On the cultural and organizational front, the larger organization needs to learn the reasons for running controlled experiments and the tradeoffs between controlled experiments and other methods of evaluating ideas. We discuss why negative experiments, which degrade the user experience short term, should be run, given the learning value and long-term benefits. On the engineering side, we architected a highly scalable system, able to handle data at massive scale: hundreds of concurrent experiments, each containing millions of users. Classical testing and debugging techniques no longer apply when there are millions of live variants of the site, so alerts are used to identify issues rather than relying on heavy up-front testing. On the trustworthiness front, we have a high occurrence of false positives that we address, and we alert experimenters to statistical interactions between experiments. The Bing Experimentation System is credited with having accelerated innovation and increased annual revenues by hundreds of millions of dollars, by allowing us to find and focus on key ideas evaluated through thousands of controlled experiments. A 1% improvement to revenue equals $10M annually in the US, yet many ideas impact key metrics by 1% and are not well estimated a-priori. The system has also identified many negative features that we avoided deploying, despite key stakeholders’ early excitement, saving us similar large amounts.
Conference Paper
Full-text available
The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single-factor or factorial designs), A/B tests (and their generalizations), split tests, Control/Treatment tests, and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person's Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.
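The remark that randomization and hashing are "not as simple in practice as is often assumed" concerns how users are deterministically bucketed into variants. A common, but here only illustrative, scheme hashes a stable user identifier together with an experiment-specific salt so that assignments are sticky for a user and independent across experiments; the function below is a sketch under those assumptions, not any specific platform's implementation.

import hashlib

def assign_bucket(user_id, experiment, variants=("control", "treatment")):
    # Deterministically map a user to a variant: hashing the user identifier
    # together with an experiment-specific salt keeps assignments sticky for
    # a user and decorrelated across experiments (illustrative sketch only).
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same user always lands in the same bucket of a given experiment...
assert assign_bucket("user-42", "new-checkout") == assign_bucket("user-42", "new-checkout")
# ...but may land in a different bucket in another experiment.
print(assign_bucket("user-42", "new-checkout"), assign_bucket("user-42", "new-ranking"))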
Article
This paper claims that a new field of software engineering research and practice is emerging: search-based software engineering. The paper argues that software engineering is ideal for the application of metaheuristic search techniques, such as genetic algorithms, simulated annealing and tabu search. Such search-based techniques could provide solutions to the difficult problems of balancing competing (and sometimes inconsistent) constraints and may suggest ways of finding acceptable solutions in situations where perfect solutions are either theoretically impossible or practically infeasible. In order to develop the field of search-based software engineering, a reformulation of classic software engineering problems as search problems is required. The paper briefly sets out key ingredients for successful reformulation and evaluation criteria for search-based software engineering.
Conference Paper
We have found many programming problems for which neither procedural nor object-oriented programming techniques are sufficient to clearly capture some of the important design decisions the program must implement. This forces the implementation of those design decisions to be scattered throughout the code, resulting in “tangled” code that is excessively difficult to develop and maintain. We present an analysis of why certain design decisions have been so difficult to clearly capture in actual code. We call the properties these decisions address aspects, and show that the reason they have been hard to capture is that they cross-cut the system's basic functionality. We present the basis for a new programming technique, called aspect-oriented programming, that makes it possible to clearly express programs involving such aspects, including appropriate isolation, composition and reuse of the aspect code. The discussion is rooted in systems we have built using aspect-oriented programming.
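Since AspectJ-style examples would require a full aspect weaver, the sketch below uses a Python decorator as a rough analogue of applying advice at a join point: the cross-cutting concern (choosing and injecting an experimental variant) is kept out of the business logic. The decorator, variant names, and rendering function are hypothetical and only illustrate the separation of concerns, not an actual weaving mechanism.

import functools

def variant_advice(variants):
    # Decorator playing the role of advice at a join point: it intercepts the
    # call and injects the variant chosen for the current user, so the
    # business logic stays free of experiment plumbing (illustration only).
    def decorator(render):
        @functools.wraps(render)
        def wrapper(user_id, *args, **kwargs):
            # hash() is stable only within one process; a real system would
            # use a persistent hash of the user identifier.
            chosen = variants[hash(user_id) % len(variants)]
            return render(user_id, *args, variant=chosen, **kwargs)
        return wrapper
    return decorator

@variant_advice(variants=["blue_button", "green_button"])
def render_checkout(user_id, variant=None):
    return f"checkout page for {user_id} rendered with {variant}"

print(render_checkout("user-42"))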
Article
Many real-world search and optimization problems involve inequality and/or equality constraints and are thus posed as constrained optimization problems. In trying to solve constrained optimization problems using genetic algorithms (GAs) or classical optimization methods, penalty function methods have been the most popular approach, because of their simplicity and ease of implementation. However, since the penalty function approach is generic and applicable to any type of constraint (linear or nonlinear), its performance is not always satisfactory. Thus, researchers have developed sophisticated penalty functions specific to the problem at hand and the search algorithm used for optimization. However, the most difficult aspect of the penalty function approach is to find appropriate penalty parameters needed to guide the search towards the constrained optimum. In this paper, the population-based approach of GAs and their ability to make pair-wise comparisons in the tournament selection operator are exploited to devise a penalty function approach that does not require any penalty parameter. Careful comparisons among feasible and infeasible solutions are made so as to provide a search direction towards the feasible region. Once sufficient feasible solutions are found, a niching method (along with a controlled mutation operator) is used to maintain diversity among feasible solutions. This allows a real-parameter GA's crossover operator to continuously find better feasible solutions, gradually leading the search near the true optimum solution. GAs with this constraint handling approach have been tested on nine problems commonly used in the literature, including an engineering design problem. In all cases, the proposed approach has been able to repeatedly find solutions closer to the true optimum solution than that reported earlier.
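The comparison rule described in this abstract can be sketched in a few lines: candidates are compared by feasibility first, and by objective value only when both are feasible. The toy constrained minimization problem, the mutation scheme, and all constants below are invented solely to exercise the rule; they are not from the cited paper.

import random

def constraint_violation(x):
    # Toy constraint: x[0] + x[1] <= 1; violation is 0 when satisfied.
    return max(0.0, x[0] + x[1] - 1.0)

def objective(x):
    # Toy objective to minimize.
    return (x[0] - 0.3) ** 2 + (x[1] - 0.4) ** 2

def tournament_winner(a, b):
    # Parameter-free constraint handling: feasible beats infeasible, two
    # feasible solutions are compared by objective, two infeasible ones by
    # their amount of constraint violation.
    va, vb = constraint_violation(a), constraint_violation(b)
    if va == 0.0 and vb == 0.0:
        return a if objective(a) <= objective(b) else b
    if va == 0.0:
        return a
    if vb == 0.0:
        return b
    return a if va <= vb else b

population = [[random.uniform(0, 2), random.uniform(0, 2)] for _ in range(50)]
for _ in range(200):
    a, b = random.sample(population, 2)
    winner = tournament_winner(a, b)
    loser = b if winner is a else a
    # Replace the loser with a mutated copy of the winner (crude GA stand-in).
    population[population.index(loser)] = [
        min(2.0, max(0.0, w + random.gauss(0, 0.1))) for w in winner
    ]

print(min(population, key=lambda x: (constraint_violation(x), objective(x))))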
Article
Search-based software testing is the application of metaheuristic search techniques to generate software tests. The test adequacy criterion is transformed into a fitness function and a set of solutions in the search space are evaluated with respect to the fitness function using a metaheuristic search technique. The application of metaheuristic search techniques for testing is promising due to the fact that exhaustive testing is infeasible considering the size and complexity of software under test. Search-based software testing has been applied across the spectrum of test case design methods; this includes white-box (structural), black-box (functional) and grey-box (combination of structural and functional) testing. In addition, metaheuristic search techniques have also been applied to test non-functional properties. The overall objective of undertaking this systematic review is to examine existing work into non-functional search-based software testing (NFSBST). We are interested in types of non-functional testing targeted using metaheuristic search techniques, different fitness functions used in different types of search-based non-functional testing and challenges in the application of these techniques. The systematic review is based on a comprehensive set of 35 articles, obtained after a multi-stage selection process and published in the time span 1996–2007. The results of the review show that metaheuristic search techniques have been applied for non-functional testing of execution time, quality of service, security, usability and safety. A variety of metaheuristic search techniques are found to be applicable for non-functional testing including simulated annealing, tabu search, genetic algorithms, ant colony methods, grammatical evolution, genetic programming (and its variants including linear genetic programming) and swarm intelligence methods. The review reports on different fitness functions used to guide the search for each of the categories of execution time, safety, usability, quality of service and security; along with a discussion of possible challenges in the application of metaheuristic search techniques.
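To give a flavour of how a non-functional property such as execution time becomes a fitness function, the sketch below uses simulated annealing to search for an input that maximizes the measured running time of a deliberately trivial, invented function under test; the cooling schedule, neighbourhood, and bounds are arbitrary assumptions, not taken from the cited review.

import math
import random
import time

def function_under_test(n):
    # Invented stand-in for the system under test: cost grows with n.
    return sum(i * i for i in range(n))

def fitness(n):
    # Non-functional fitness: measured execution time of one call.
    start = time.perf_counter()
    function_under_test(n)
    return time.perf_counter() - start

current = random.randint(1, 10000)
current_fit = fitness(current)
temperature = 1e-3
for _ in range(200):
    neighbour = max(1, min(100000, current + random.randint(-500, 500)))
    neighbour_fit = fitness(neighbour)
    # Always accept better inputs; accept worse ones with a probability that
    # shrinks as the temperature cools (maximization variant of the rule).
    if (neighbour_fit > current_fit or
            random.random() < math.exp((neighbour_fit - current_fit) / temperature)):
        current, current_fit = neighbour, neighbour_fit
    temperature *= 0.95  # geometric cooling

print(current, current_fit)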
Article
This survey investigates search-based approaches to software design. The basics of the most popular meta-heuristic algorithms are presented as background to the search-based viewpoint. Software design is considered from a wide viewpoint, including topics that can also be categorized as software maintenance or re-engineering. Search-based approaches have been used in research from the high architecture design level to software clustering and finally software refactoring. Enhancing and predicting software quality with search-based methods is also taken into account as a part of the design process. The background for the underlying software engineering problems is discussed, after which search-based approaches are presented. Summarizing remarks and tables collecting the fundamental issues of approaches for each type of problem are given. The choices regarding critical decisions, such as representation and fitness function, when used in meta-heuristic search algorithms, are emphasized and discussed in detail. Ideas for future research directions are also given.