The Anatomy of a Large-Scale Online Experimentation Platform

Conference Paper · May 2018
DOI: 10.1109/ICSA.2018.00009
Conference: IEEE International Conference on Software Architecture (ICSA 2018), Seattle, USA
This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive version is published at ICSA 2018, April 30 – May 4, 2018, Seattle, USA.
The Anatomy of a Large-Scale Online
Experimentation Platform
Somit Gupta, Lucy Ulanova, Sumit Bhardwaj, Pavel
Dmitriev, Paul Raff
Analysis and Experimentation Team, Microsoft
Bellevue, WA, USA
sogupta,liulanov,subhardw,padmitri,paraff@microsoft.com
Aleksander Fabijan
Department of Computer Science and Media Technology
Malmö University
Malmö, Sweden
aleksander.fabijan@mau.se
Abstract— Online controlled experiments (e.g., A/B tests) are
an integral part of successful data-driven companies. At Microsoft,
supporting experimentation poses a unique challenge due to the
wide variety of products being developed, along with the fact that
experimentation capabilities had to be added to existing, mature
products with codebases that go back decades. This paper
describes the Microsoft ExP Platform (ExP for short) which
enables trustworthy A/B experimentation at scale for products
across Microsoft, from web properties (such as bing.com) to
mobile apps to device drivers within the Windows operating
system. The two core tenets of the platform are trustworthiness (an
experiment is meaningful only if its results can be trusted) and
scalability (we aspire to expose every single change in any product
through an A/B experiment). Currently, over ten thousand
experiments are run annually. In this paper, we describe the four
core components of an A/B experimentation system:
experimentation portal, experiment execution service, log
processing service and analysis service, and explain the reasoning
behind the design choices made. These four components work
together to provide a system where ideas can turn into experiments
within minutes and those experiments can provide initial
trustworthy results within hours.
Keywords— A/B testing; experimentation; metrics; hypothesis
testing; experimentation platform; scalability.
I. INTRODUCTION
Online controlled experiments (e.g., A/B tests) are
becoming the gold standard for evaluating improvements in
software systems [1]. From front-end user-interface changes to
backend algorithms, from search engines (e.g., Google, Bing,
Yahoo!) to retailers (e.g., Amazon, eBay, Etsy) to social
networking services (e.g., Facebook, LinkedIn, Twitter) to
travel services (e.g., Expedia, Airbnb, Booking.com) to many
startups, online controlled experiments are now utilized to make
data-driven decisions at a wide range of companies [2]. While
the theory of a controlled experiment is simple, and dates back
to Sir Ronald A. Fisher [3], the deployment of online controlled
experiments at scale (for example, hundreds of concurrently
running experiments) across a variety of web sites, mobile apps,
and desktop applications is a formidable challenge.
Consider the following experiment that ran in Skype mobile
and desktop apps. When the user attempts a call but the caller
does not answer, a dialog is shown prompting the user to leave
a video message.
Many steps need to happen to execute this experiment. First,
the code that shows the message needs to be deployed to the
clients in such a way that it can be turned on and off via the
experimentation system. Then, the audience, the experiment
design, the experiment steps, and the size and duration for each
step need to be determined. During the experiment execution,
correct configurations need to be delivered to the users' mobile
and desktop apps, ideally in a gradual manner, while verifying
that the experiment is not causing unintended harm (e.g. app
crashes). Such monitoring should continue throughout the
experiment, checking for a variety of issues, including
interactions with other concurrently running experiments.
During the experiment, various actions could be suggested to
the experiment owner (the person who executes the
experiment), such as stopping the experiment if harm is
detected, looking at surprising metric movements, or examining
a specific user segment that behaves differently from others.
After the experiment, results need to be analyzed to make a
ship/no-ship recommendation and learn about the impact of the
feature on users and business, to inform future experiments.
The role of the experimentation platform in the process
described above is critical. The platform needs to support the
steps above in a way that allows not only a data scientist, but
any engineer or product manager to successfully experiment.
The stakes are high. Recommending an incorrect experiment
size may result in not being able to achieve statistical
significance, therefore wasting release cycle time and
resources. Inability to detect harm and alert the experiment
owner may result in bad user experience, leading to lost revenue
and user abandonment. Inability to detect interactions with
other experiments may lead to wrong conclusions.
At Microsoft, the experimentation system (ExP) [4]
supports trustworthy experimentation in various products,
including Bing, MSN, Cortana, Skype, Office, Xbox, Edge,
Visual Studio, and Windows OS. Thousands use ExP every month
to execute and analyze experiments, totaling over ten thousand
experiments a year.
In this paper, we describe the architecture of ExP that
enables such a large volume of diverse experiments, explain the
reasoning behind the design choices made, and discuss the
alternatives where possible. While the statistical foundations of
online experimentation and their advancement is an active
research area [5], [6], and some specific aspects of experiment
execution such as enabling overlapping experiments [7] were
discussed elsewhere, to our knowledge this is the first work
focusing on the end-to-end architecture of a software system
supporting diverse experimentation scenarios at scale.
The two core tenets of ExP are trustworthiness and
scalability. “Getting numbers is easy; getting numbers you can
trust is hard” [8]. Producing trustworthy results requires solid
statistical foundation as well as seamless integration,
monitoring, and variety of quality checks of different
components of the system. Scalability refers to the ability of a
new product or team to easily onboard and start running
trustworthy experiments on ExP at low cost, and then enable
more and more ExP features as the experimentation volume
grows, until every single product feature or bug fix, as long as
it is ethical and technically feasible, is evaluated via an
experiment [9].
The four core components of ExP are experimentation
portal, experiment execution service, log processing service,
and analysis service. Experimentation portal is an interface
between the experiment owner and the experimentation system,
enabling the owner to configure, start, monitor and control the
experiment throughout its lifecycle (see Figure 1). Experiment
execution service is concerned with how different experiment
types are executed. Log processing service is responsible for
collecting and cooking the log data and executing various
analysis jobs. Analysis service is used throughout the whole
experiment lifecycle, informing experiment design,
determining parameters of experiment execution, and helping
experiment owners interpret the results.
The contribution of our paper is threefold:
1. We describe the architecture of ExP, Microsoft’s large
scale online experimentation platform, supporting experiments
across web sites, mobile and desktop apps, gaming consoles,
and operating systems.
2. We discuss choices and alternatives, which, in other
scenarios and for other companies, may potentially be better
options when building their own experimentation platform.
3. We illustrate the discussion with examples of real
experiments that were run at Microsoft.
With increasing use of experimentation in software
development, more and more companies are striving to expand
their experimentation capabilities. We hope that this paper can
serve as a guide, helping them build trustworthy and scalable
experimentation solutions, ultimately leading to more data
driven decisions benefitting the products and their users.
The rest of the paper is organized as follows. Section II
provides a brief overview of online controlled experiments,
related work, and our research approach. Section III contains
our main contributions, discussing the four core components of
ExP in sections III.A-III.D. Section IV concludes the paper.
II. BACKGROUND & RESEARCH METHOD
A. Online Controlled Experiments
In the simplest controlled experiment, or A/B test, users are
randomly assigned to one of the two variants: control (typically
denoted as A) or treatment (typically denoted as B). Elsewhere
in this paper, we will use V1, V2, V3… to represent variants,
and an experiment is defined as a set of variants, e.g. E = {V1,
V2}. Usually control is the existing system, and treatment is the
existing system with a change X (for example, a new feature).
While the experiment is running, user interactions with the
system are recorded, and metrics are computed. If the
experiment were designed and executed correctly, the only
thing consistently different between the two variants is the
change X. External factors are distributed evenly between
control and treatment and therefore do not impact the results of
the experiment. Hence any difference in metrics between the
two groups must be due to the change X (or a random chance,
that we rule out using statistical testing). This establishes a
causal relationship between the change made to the product and
changes in user behavior, which is the key reason for
widespread use of controlled experiments for evaluating new
features in software. For the discussion that follows, an
important aspect is where an experiment is executed. If the
experiment is executed by the user’s mobile/desktop app,
gaming console, OS, etc. we label it a client-side experiment. If
the experiment is executed by the web site, call routing
infrastructure, or other server component, we label it a server-
side experiment.
B. Related Work
Controlled experiments are an active research area, fueled
by the growing importance of online experimentation in the
software industry. Research has been focused on topics such as
new statistical methods to improve metric sensitivity [10],
metric design and interpretation [11], [12], projections of
results from a short-term experiment to the long term [13], [14],
the benefits of experimentation at scale [2], [15], experiments
in social networks [16], as well as examples, pitfalls, rules of
thumb and lessons learned from running controlled experiments
in practical settings [17], [18], [19]. These works provide good
context for our paper and highlight the importance of having a
solid engineering platform on top of which the new research
ideas, such as the ones referenced, can be implemented and
evaluated. Our work describes how to build such a platform.
High-level architecture of an experimentation system was
discussed in section 4.1 of [20] and partially in [9], [21]. The
discussion, however, is very brief, and, most importantly, only
concerns experimentation on web properties. The same is true
of [7]. Supporting client experimentation, such as
desktop or mobile apps, requires a different architecture.
Moreover, since [20] and [7] were published, new experiment
design and execution approaches were developed, e.g. [10],
which also require a different architecture to support them.
Therefore, to introduce a detailed architectural design of an
experimentation platform that supports the aforementioned
experimentation scenarios, we conducted an in-depth case
study at Microsoft, which we briefly describe next.
C. Research Method
The result of this paper builds on case study research [22]
conducted at Microsoft Corporation.
Case Company. Microsoft is a multinational organization
and employs over 120 000 people. Most of the authors of this
paper have been working at the case company for several years
with many product teams (e.g. Bing, Skype, Office, XBOX etc.)
specifically focusing on enabling and advancing their
experimentation capabilities.
Data collection. The authors of this paper have good access
to the meeting notes, design documents, notebooks and the
experimentation platform itself. We collected and placed all
existing documents that describe the ExP in a shared folder, and
complemented them with a joint notebook with personal notes
and insights about ExP that were previously not documented.
Data Analysis. We aggregated the collected data describing
the ExP platform, and jointly triangulated the historical data
describing the reasoning behind the various architectural
decisions of the ExP platform with our recent and extensive
experience. In principle, we applied a thematic coding
approach in which we grouped similar concepts under categories,
and aggregated these further until we agreed on a level of
abstraction. In this way, for example, we arrived at the
“Experiment Execution Service” label and the other components
that describe our platform. During the analysis process, several
people working at the case company reviewed our work and
provided feedback on it.
Threats to Validity. Construct validity: Our case company
has a long tradition of experimentation research and applying it
in practice. The authors of this paper and others that contributed
to it are very familiar with the architecture of the ExP platform,
and the needs and challenges that it has to address. External
Validity: The discussion in this paper and its main contribution
is based on a system implemented at a single company. This is
a limitation that we acknowledge. Despite this concern,
however, we believe that the diverse range of products that our
platform supports makes our contributions generalizable to
many other scenarios across the software industry, including to
companies developing web sites, mobile and desktop apps,
gaming consoles, services, and operating systems.
III. EXPERIMENTATION PLATFORM
ARCHITECTURE
The ExP platform aims to facilitate the experimentation
lifecycle (see Figure 1), which is the basis of data-driven
product development at Microsoft.
A. Experimentation Lifecycle
All product changes (e.g. new features) should start with a
hypothesis, which helps explain the reasoning for the change.
Good hypotheses originate from data analysis and should be
quantifiable. In the Skype experiment example introduced in
Section 1, the hypothesis could be that at least 1 out of 10 users
who had an unanswered call would leave a video message,
leading to an increase in the number of calls (due to more
returned calls) and an increase in the overall use of Skype.
The hypothesis then goes through the steps of
implementation and evaluation via an experiment. At the end of
the experiment, the decision is made to ship the change or abort.
Regardless of the outcome, the learnings from the experiment
become the basis of a new hypothesis, either an incremental
iteration on the current feature, or something radically different.
In the remainder of this section, we discuss the end-to-end
architecture of the experimentation system, which we also
depict on Figure 3 on the next page.
B. Experimentation Portal
Experimentation portal serves as an interface between the
experiment owner and the experimentation system. The goal of
the experimentation portal is to make it easy for the experiment
owner to create, configure, run, and analyze experiments. For
experimentation to scale, we need to keep making these tasks
easier.
The three components of the experimentation portal are
experiment management, metric definition and validation, and
result visualization and exploration.
1) EXPERIMENT MANAGEMENT: Consider the
following example of an experiment run in Bing, aimed at
understanding the impact of changing Bing’s logo, as shown in
Figure 2. We will use this example to illustrate the key choices
the experiment owner has to make when creating and managing
an experiment.
a) Audience often depends on the design of the feature.
For the example in Figure 2, since this is an English language
logo, only users in the United States-English market were
exposed to the experiment. It is a common practice to hone your
feature and learnings in one audience, before proceeding to
expand to other audiences. The experimentation portal needs to
provide flexible audience selection mechanisms, also known as
targeting or traffic filters. Examples of simple traffic filters are
market, browser, operating system, version of the
mobile/desktop app. More complex targeting criteria may
involve prior usage information, e.g. “new users” who started
using the app within the last month.
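Targeting of this kind can be thought of as a conjunction of predicates over the properties sent with an assignment request. A minimal sketch follows; the filter names, property names, and thresholds are hypothetical illustrations, not ExP's actual configuration schema:

```python
# Hypothetical traffic filters: each is a predicate over the
# properties sent with an assignment request.
FILTERS = {
    "market": lambda props: props.get("market") == "en-US",
    "browser": lambda props: props.get("browser") in {"Edge", "Chrome"},
    # "New users": started using the app within the last month.
    "new_user": lambda props: props.get("days_since_first_use", 999) <= 30,
}

def in_audience(props, required_filters):
    """A request is in the experiment's audience only if every
    configured filter accepts its properties."""
    return all(FILTERS[name](props) for name in required_filters)

print(in_audience({"market": "en-US", "browser": "Edge"}, ["market", "browser"]))
```

More complex criteria, such as the "new users" example above, only require that the corresponding properties (here, days since first use) be resolvable at assignment time.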
Figure 1. The experimentation lifecycle.
CONTROL (current logo)
TREATMENT (new logo)
Figure 2. An iteration of the Bing logo experiment.
b) Overall Evaluation Criteria (OEC) is, essentially,
the formalization of the success of experiment hypothesis in
terms of the expected impact on key product metrics. It is
important that the OEC is determined before the experiment
starts, to avoid the temptation of tweaking the ship criteria to
match the results, and because it serves as input for determining
the experiment size and duration. Ideally, most experiments
should optimize for a single product-wide OEC rather than
individual experiment-specific OECs [9]. For the example in
Figure 2, the OEC was the standard Bing OEC discussed in
[23]. The experimentation portal should require the experiment
owner to formally document the OEC when creating an
experiment.
c) Size and duration. The experiment owner needs to
specify what fraction of the targeted population will receive the
treatment and the control, and how long the experiment should
run. The key tradeoff here is that the experiment owner would
be risk averse and would want to expose as few users as
possible to the new feature but still be able to detect sizable
changes in key metrics. Power analysis [24] helps in making
this decision. Do note that running an experiment longer is not
always better in terms of detecting small changes in metrics
[23]. One key input in power analysis is the coverage of the new
feature: the proportion of users in treatment who will
experience the new feature. In cases of small coverage,
triggered analysis can help improve the power of metrics [18].
Other factors, such as an a priori expectation of novelty or
primacy effects, or a warm-up period needed for machine
learning models, may also influence the duration of an
experiment. It is important to stick to the planned duration of the
experiment. Stopping the experiment early or increasing the
duration after looking at the results can lead to wrong
conclusions [19].
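The size decision above follows the standard two-sample power calculation. A sketch is below; the formula is the textbook one, and the simple 1/c² adjustment for feature coverage is an illustrative simplification (treating the effect as diluted proportionally), not necessarily ExP's exact method:

```python
import math

def z_quantile(p):
    """Inverse standard-normal CDF via bisection on erf (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def users_per_variant(sigma, delta, coverage=1.0, alpha=0.05, power=0.8):
    """Users needed per variant to detect an absolute change `delta`
    in a metric with standard deviation `sigma`. With feature
    coverage c < 1, the diluted effect is c * delta, so the required
    sample grows by 1 / c**2 (illustrative simplification)."""
    z = z_quantile(1 - alpha / 2) + z_quantile(power)
    return math.ceil(2 * (z * sigma / (delta * coverage)) ** 2)

# Hypothetical numbers: metric std dev 3.0, want to detect a 0.1 change.
print(users_per_variant(sigma=3.0, delta=0.1))
print(users_per_variant(sigma=3.0, delta=0.1, coverage=0.2))  # ~25x larger
```

The coverage parameter shows why triggered analysis helps: restricting the analysis to users who actually experience the feature removes the dilution, shrinking the required sample size accordingly.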
d) Experiment template. While the basic experiment
setup described in Section II.A can be used, it may suffer from
random imbalance. When the users are randomized into
variants, just by chance there may be an imbalance resulting in
a statistically significant difference in a metric of interest
between the two populations. Disentangling the impact of such
random imbalance from the impact of the treatment may be
hard. Following the idea of [25], ExP supports several types of
experiments that use re-randomization to alleviate this
problem. The naïve way to detect random imbalance is to start
an experiment as an A/A (treatment and control are the same in
this stage), run it for several days, and if no imbalance is
detected, apply the treatment effect to one of the A’s. If random
imbalance is present, however, the experiment needs to be
restarted with a re-randomization, resulting in a delay. When
this experiment type was used in Bing, about 1 out of 4
experiments had to be restarted. Another option is to perform a
retrospective A/A analysis on historical data, the Pre-
Experiment Analysis step in Figure 3. For example, we can take
the historical data for the last one week and simulate the
randomization on that data using the hash seed selected for the
A/B experiment about to start, as if a real A/A experiment ran
during that time, and check for imbalance. Expanding on the
above idea, we can perform many (e.g. 1000) retrospective A/A
steps simultaneously and then select the most balanced
Figure 3. Architecture of the experimentation platform.
randomization hash seed to use for the A/B step. We call this
experiment type SeedFinder, referring to its ability to find a
good hash seed for randomization. To implement SeedFinder,
the experimentation portal needs to interact with the analysis
service even before the experiment has started, and the analysis
service needs to be able to precisely simulate offline the
targeting and randomization as if it happened in the online
system. Another aspect of experiment configuration is ramp-
up: the sequence of steps the experiment will take to reach its
final size. Ramp-up is particularly important for risky
experiments, where the experiment owner may want to start
them at a very small size and gradually move to the final desired
size. The ExP platform encapsulates different options for
experiment types and ramp-up schedules into experiment
templates. Each product has its own set of templates that may
include, in addition to the aspects described above, choices for
what randomization unit to use in the experiment and
alternative experiment designs such as interleaving [26]. Each
product has a default recommended template.
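The SeedFinder idea can be sketched as follows: re-run the A/A split retrospectively on historical data under many candidate hash seeds, and keep the most balanced one. This toy version checks balance on a single metric via a difference in means; the real system evaluates many metrics with statistical tests, and the hash function and data here are illustrative assumptions:

```python
import hashlib
import random
random.seed(1)

def bucket(user_id, seed, buckets=2):
    """Consistently hash a user into a bucket for a given seed."""
    h = hashlib.md5(f"{seed}:{user_id}".encode()).hexdigest()
    return int(h, 16) % buckets

def aa_imbalance(history, seed):
    """Simulate an A/A split of historical per-user metric values
    under `seed`; return the absolute difference in means."""
    a = [m for u, m in history if bucket(u, seed) == 0]
    b = [m for u, m in history if bucket(u, seed) == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

# Hypothetical history: (user_id, metric value) from last week's logs.
history = [(f"user{i}", random.gauss(10, 3)) for i in range(2000)]

# Try many candidate seeds retrospectively, keep the most balanced one.
best_seed = min(range(100), key=lambda s: aa_imbalance(history, s))
print(best_seed, aa_imbalance(history, best_seed))
```

Because the simulation is offline, no live traffic is spent on the A/A step, avoiding the restart delays of the naïve approach.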
e) Experiment Interactions. When many experiments
configured as discussed earlier in this section are running
concurrently, every user has a chance to be in any subset of
these experiments at the same time. While this often does not
create any issues, it is problematic in cases where interactions
occur between variants of different experiments. Consider a
hypothetical example where one experiment changes the font
color on a web page to blue, while another changes the
background color to blue. A mechanism to prevent the user
from being in both of these experiments at the same time is
needed. To address the need to execute multiple potentially
interacting experiments concurrently, we use the notion of
isolation group. Each variant V is tagged with one or more
isolation groups, denoted by IG(V). Isolation groups must
satisfy the Isolation Group Principle: if V1 and V2 are two
variants and IG(V1) ∩ IG(V2) ≠ ∅, then a user can never be
concurrently assigned to both V1 and V2. To avoid assignment
mismatches between the variants, we require all variants in the
experiment to share the same set of isolation groups. The
problem of taking a global set of variants – each with its own
set of isolation groups – and providing a set of assignments for
each user will be discussed in Section III.C.
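The Isolation Group Principle reduces to a disjointness check between the isolation-group sets of two variants. A minimal sketch, using the font-color/background-color example (variant and group names are hypothetical):

```python
def may_overlap(ig1, ig2):
    """Isolation Group Principle: two variants may be assigned to
    the same user concurrently only if their isolation-group sets
    are disjoint."""
    return not (set(ig1) & set(ig2))

# Hypothetical variants: both color experiments share an isolation
# group, so no user can be in both; the ranker experiment is free
# to overlap with either.
font_blue = {"name": "V1", "isolation_groups": ["page_color"]}
bg_blue = {"name": "V2", "isolation_groups": ["page_color"]}
new_ranker = {"name": "V3", "isolation_groups": ["ranking"]}

print(may_overlap(font_blue["isolation_groups"], bg_blue["isolation_groups"]))
print(may_overlap(font_blue["isolation_groups"], new_ranker["isolation_groups"]))
```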
f) Variant behavior. Experiment owner needs to
configure the behavior for each variant. For server-side
experiments, where the product code can be updated at any
time, a configuration management system such as the one
described in [27] can be used to manage a large number of
configurations. For client-side experiments, where the software
is only updated periodically, an online configuration system
needs to be used to turn the features of the product on and off.
Discussion of the design of configuration systems is beyond the
scope of this paper.
2) METRIC DEFINITION AND VALIDATION: To
obtain accurate and comprehensive understanding of
experiment results, a rich set of metrics is needed [19].
Therefore, to scale experimentation it is important to provide an
easy way to define and validate new metrics that anyone in
company is able to use. The process of adding new metrics
needs to be lightweight, so that it does not introduce delays into
the experiment life cycle. It is also important to ensure that
metric definitions are consistent across different experiments
and business reports.
To satisfy the above requirements, we developed a Metric
Definition Language (MDL) that provides a formal
programming language-independent way to define all common
types of metrics. The user-provided metric definition gets
translated into an internal representation, which can then be
compiled into a number of target languages, such as SQL.
Having a language-independent representation is particularly
useful, because different Microsoft products use different
backend data processing mechanisms, and sometimes the same
product may use several mechanisms. MDL ensures that the
definition of a metric is consistent across all such pipelines. See
Figure 4 for an example of MDL.
Experimentation portal provides a user interface enabling
everyone to easily create and edit MDL metric definitions, test
them on real data, and deploy to the Metric Repository. This
repository is then used by all experiment analysis and business
reporting jobs. On average, over 100 metrics are created or
edited every day using the metric definition service.
3) RESULT VISUALIZATION/EXPLORATION:
While the experiment is running, and after its completion,
various analysis results are generated: quick near-real time
MDL Definition: AVG(SUM<User>(ClickCount))
PageviewLevel =
SELECT VariantId, UserId, SessionId, ClickCount
FROM Data;
SessionLevel =
SELECT VariantId, UserId, SessionId,
SUM(ClickCount) AS SessionClickCount
FROM PageviewLevel
GROUP BY VariantId, UserId, SessionId;
UserLevel =
SELECT VariantId, UserId,
SUM(SessionClickCount) AS UserClickCount
FROM SessionLevel
GROUP BY VariantId, UserId;
Variant =
SELECT VariantId,
AVG(UserClickCount) AS AvgClicksPerUser
FROM UserLevel
GROUP BY VariantId;
Figure 4. MDL representation of the AvgClicksPerUser
metric: definition (top), internal representation (middle)
and auto-generated SQL (bottom).
checks are performed, large experiment scorecards with
thousands of metrics are computed, alerts are fired, deep-dive
analyses are automatically kicked off, and manually configured
custom analyses are created. The goal of the Results
Visualization and Exploration component is to serve as a single
place where experiment analysis information is collected,
providing visualizations and tools to help the experiment owner
explore the results, learning as much as possible, and make the
correct ship/no-ship decision. The portal also allows
configuring and submitting additional analysis requests.
The simplest form of experiment results is an experiment
scorecard – a table consisting of a set of metrics and their
movements in an experiment. In a typical scorecard comparing
two variants, control and treatment, the following minimal
information should be displayed for every metric:
• The observed metric values for control and treatment.
• The delta between these observed values, both in absolute
and relative terms.
• The results of a statistical test to convey the significance of
the delta, typically via a p-value from a t-test.
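A scorecard row can be sketched directly from those three items: observed values, absolute and relative deltas, and a t-test p-value. This is a minimal illustration using Welch's t-test with a normal approximation for the p-value (adequate at the large sample sizes typical of online experiments); the metric data is hypothetical:

```python
import math

def scorecard_row(metric, control, treatment):
    """One scorecard row: observed values, absolute and relative
    delta, and a two-sided t-test p-value (normal approximation)."""
    mc = sum(control) / len(control)
    mt = sum(treatment) / len(treatment)
    vc = sum((x - mc) ** 2 for x in control) / (len(control) - 1)
    vt = sum((x - mt) ** 2 for x in treatment) / (len(treatment) - 1)
    se = math.sqrt(vc / len(control) + vt / len(treatment))
    t = (mt - mc) / se if se else 0.0
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return {
        "metric": metric,
        "control": round(mc, 3),
        "treatment": round(mt, 3),
        "delta_abs": round(mt - mc, 3),
        "delta_rel": round((mt - mc) / mc * 100, 2),  # percent
        "p_value": round(p, 4),
    }

# Hypothetical per-user click counts for control and treatment.
print(scorecard_row("clicks", [10.0, 12.0, 11.0, 9.0] * 50,
                    [12.0, 14.0, 13.0, 11.0] * 50))
```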
At Microsoft, it is common for experiment scorecards to
compute thousands of metrics over dozens of population
segments, making manual scorecard analysis tedious and error
prone. Important insights can be missed, especially if they are
not expected by the experiment owner or heterogeneity is
present in the treatment effect [28], and a large number of
metrics gives rise to the multiple comparisons problem [29].
ExP provides a number of visualizations and analysis tools to
automatically identify and highlight interesting insights for the
experiment owner, and help avoid pitfalls such as the ones
mentioned above. Some of these solutions are discussed in [19].
In addition to helping configure, run, and analyze
experiments, Experimentation Portal also acts as a central
repository of all previously run experiments, collecting
metadata about each experiment and its results, and providing
anyone an opportunity to learn about the past successes and
failures, helping to disseminate learnings from experiments
across the company.
C. Experiment Execution Service
The goal of the experiment execution service is generating
variant assignments and delivering them to the product.
1) GENERATING VARIANT ASSIGNMENTS: At any
given point in time, the system has its global experimentation
state, which consists of the set of active variants, along with the
following information about each variant:
• The set of isolation groups.
• The traffic filters.
• The percentage of total traffic configured.
An assignment request comes with a certain randomization
unit (e.g. user id) as well as other properties that allow
resolving the traffic filters. Given this information, the goal of
the experiment assignment service is to compute the set of
variants for this request.
Let us first consider the scenario of one isolation group. The
isolation group has an associated numberline, which can be
represented as a sequence of numbers (called buckets), with
each bucket representing an equal portion of the population, and
the whole representing 100% of the population. For any
numberline, each user is consistently hashed into one specific
bucket based on their user id, using a hash function such as [30].
For example, Figure 5 shows a numberline with ten buckets,
where a user has been assigned to bucket 7.
An experiment that utilizes this isolation group will be
allotted a set of buckets, as determined by availability and
granularity of the numberline. If an experiment were to utilize
this isolation group and consisted of a 40% treatment and a 40%
control, then buckets 0-3 could be allotted to treatment and 4-7
to control. In this case, the user who hashes into bucket 7 would
be assigned to control for this experiment. This is demonstrated
in Figure 6.
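To make the bucketing concrete, here is a minimal Python sketch. The paper cites a hash function such as Pearson hashing [30]; this sketch substitutes a salted SHA-256 purely for illustration, and the 40%/40% split over ten buckets mirrors the Figure 6 example (the function and group names are hypothetical, not the platform's API):

```python
import hashlib

NUM_BUCKETS = 10  # granularity of the numberline in this example

def bucket_for_user(user_id: str, isolation_group: str) -> int:
    """Consistently hash a user into one bucket of the isolation group's
    numberline. The same user id always lands in the same bucket."""
    key = f"{isolation_group}:{user_id}".encode()
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def assign_variant(user_id: str) -> str:
    """40% treatment / 40% control split: buckets 0-3 are treatment,
    4-7 control, and 8-9 remain unallocated, as in Figure 6."""
    bucket = bucket_for_user(user_id, "layer1")
    if bucket <= 3:
        return "treatment"
    if bucket <= 7:
        return "control"
    return "unallocated"
```

Because the hash is keyed by both the isolation group and the user id, different isolation groups shuffle users independently while each group's assignment stays stable across requests.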
Now that assignment for an individual isolation group has
been established, we can provide assignment globally by
instituting a priority order on isolation groups. Each experiment
will utilize the numberline associated with the highest-priority
isolation group involved as long as all other isolation groups for
that experiment are still available. Then we can iterate through
the isolation groups in their priority order, assign on each
isolation group, and remove from consideration lower-priority
isolation groups that were part of the assigned variant.
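The priority-order iteration described above can be sketched as follows. The data model (experiments claiming bucket sets on an isolation group's numberline and listing the lower-priority groups their variants also occupy) is a hypothetical simplification, not the platform's actual schema:

```python
from typing import Callable

# Hypothetical state: each experiment is allotted buckets on the numberline
# of its highest-priority isolation group and lists the lower-priority
# groups its variants also occupy.
EXPERIMENTS = {
    "ux-layer": [
        {"name": "new-header", "variant": "T", "buckets": {0, 1, 2, 3},
         "also_occupies": {"ranking-layer"}},
        {"name": "new-header", "variant": "C", "buckets": {4, 5, 6, 7},
         "also_occupies": {"ranking-layer"}},
    ],
    "ranking-layer": [
        {"name": "new-ranker", "variant": "T", "buckets": {0, 1, 2, 3, 4},
         "also_occupies": set()},
    ],
}
PRIORITY = ["ux-layer", "ranking-layer"]  # highest priority first

def assign(bucket_of: Callable[[str], int]) -> list:
    """Walk isolation groups in priority order; once a variant is assigned,
    remove from consideration the lower-priority groups it occupies."""
    blocked, assignments = set(), []
    for group in PRIORITY:
        if group in blocked:
            continue
        bucket = bucket_of(group)  # user's bucket on this group's numberline
        for exp in EXPERIMENTS.get(group, []):
            if bucket in exp["buckets"]:
                assignments.append((exp["name"], exp["variant"]))
                blocked |= exp["also_occupies"]
                break
    return assignments
```

For instance, a user in bucket 7 on the ux-layer numberline is assigned to the control of "new-header", which blocks the ranking-layer group for that request; a user in an unallocated bucket falls through to the next group.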
2) DELIVERING VARIANT ASSIGNMENT: There are three ways in which a client can obtain a treatment assignment,
depicted and compared in Figure 7.
a) Via a service call. In this simple architecture, the
client/server is responsible for explicitly making a service call
to the experiment assignment service to obtain treatment
assignments. The advantage of this architecture is that it applies
equally to server-side and client-side experiments. The disadvantage is that the service call introduces a delay. Because of this, care should be taken to select the appropriate time to make the service call. The typical pattern is to make the call as
soon as the application starts (in order to allow experimenting
on first run experience), and then periodically, avoiding doing
it during important user interactions (e.g. in the middle of a
Skype call). Often the application may choose to delay applying
the assignment until it restarts, to avoid changing the user
experience in the middle of the session.
Figure 6. Diagram of a numberline where buckets 0-3 have been allocated to treatment and 4-7 to control.
Figure 5. Diagram of a numberline for a single isolation group. A user who hashes into bucket 7 will always hash into bucket 7.
b) Via a request annotation. In this architecture, the
service is situated behind an Edge network (the network of
servers responsible primarily for load balancing) and treatment
assignment is obtained as an enrichment when the user’s
request passes through the Edge network. The service then
obtains the treatment assignment from that enrichment, typically via added HTTP headers. This is a common architecture for web sites because it does not require making an explicit service call. This architecture, however, is not applicable to client experimentation scenarios, because user interactions with clients are often local and do not require remote requests (e.g. editing a document on a PC.)
c) Via a local library. In this architecture, the client
utilizes solely local resources for treatment assignment. The
local library relies on configuration that is periodically
refreshed by making a call to the central variant assignment
service. This architecture is the best for client experimentation
scenarios, and applicable to server experimentation scenarios as
well. An application does not need to deal with making service
calls explicitly, and also does not need to pass the treatment
assignments around the stack as every component in the stack
can call the library as often as needed without incurring the
delay penalty.
3) CONFIGURATION: In addition to obtaining its
variant assignment, the client also needs to know how to change
its behavior based on these assignments. For server-side
experiments, where the code can be updated any time, these
variant configurations can be shipped as part of the code. For
client-side experiments, updating the code may be a lengthy
process, and a typical approach is to ship features dark (turned-
off), and then utilize a configuration system to turn them on and
off in response to different variant assignments. While variant
assignment and configuration services have different goals, it is
convenient to collocate them so that configurations are
delivered to clients along with variant assignments.
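A minimal sketch of how a client might consume such a collocated payload. The response shape, flag names, and helper are hypothetical; the point is that a dark-shipped feature falls back to its shipped default when no variant configuration overrides it:

```python
# Hypothetical payload from a combined assignment/configuration service:
# the variant assignment plus the feature switches that variant implies,
# so a dark-shipped feature can be lit without a client code update.
response = {
    "experiment": "new-compose-ui",
    "variant": "treatment",
    "config": {"compose.redesign.enabled": True,
               "compose.redesign.max_attachments": 5},
}

# Defaults compiled into the client: features ship dark (turned off).
DEFAULTS = {"compose.redesign.enabled": False,
            "compose.redesign.max_attachments": 3,
            "compose.spellcheck.enabled": False}

def feature(name, assignment=response):
    """Return the variant's configured value, or the dark default when the
    variant configuration does not mention this flag."""
    return assignment["config"].get(name, DEFAULTS[name])
```

With this pattern, a code path guarded by `feature("compose.redesign.enabled")` stays off for users outside the experiment and turns on only for the treatment variant.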
D. Log Processing Service
The job of the log processing service is to collect the different types of log data (e.g. unique identifiers, timestamps, performance signals, context flags, etc.) and then merge, enrich, and join it to make it easier to consume by analysis jobs, a process we call cooking. The data then becomes available for consumption by different data processing systems. A near-real-time processing system typically operates on small amounts of recent raw or lightly cooked data and is used to perform time-sensitive computations such as real-time alerts. A batch processing system operates on large volumes of data and is used to perform expensive operations, such as computing a scorecard with thousands of metrics over the whole experiment period.
1) DATA COLLECTION: From the experiment analysis point of view, the goal of data collection is to be able to accurately reconstruct user behavior. During experiment analysis, we both measure different user behavior patterns and interpret them as successes and failures, to determine whether the change improves or degrades the user experience. In data-driven products that make extensive use of experimentation, logging the right data is an integral part of the development process, and the code that does it adheres to the same standards (design, unit testing, etc.) as the code that implements core product functionality.
An important type of telemetry data is counterfactual logs, which record whether a request would have triggered the feature under test, regardless of the variant the user is in. Having such a signal allows for triggered analysis, greatly improving the sensitivity of the analysis [18].
Telemetry data is usually collected from both the client and
the server. Whenever there is a choice of collecting the same
type of data from the client or from the server, it is preferable
to collect server-side data. Server-side telemetry is easier to
update, and the data is usually more complete and has less
delay.
Via a service call
  Architecture: the client/server app periodically calls the variant
  assignment service directly and receives variant assignments.
  Pros: simple and applicable to both server and client experimentation
  scenarios.
  Cons: the service call incurs a delay.

Via a request annotation
  Architecture: the user's request is enriched with variant assignments
  as it passes through an Edge node with a collocated variant assignment
  service, before reaching the server app (e.g. a web site).
  Pros: no need to make a service request and incur the delay penalty.
  Cons: the user request needs to pass through the Edge network, limiting
  applicability to mostly web sites.

Via a local library
  Architecture: an experiment assignment library inside the client/server
  app answers assignment requests locally, periodically refreshing
  assignments from the central variant assignment service.
  Pros: no need to make a service request. Applicable to both client and
  server scenarios.
  Cons: more complex implementation compared to the other approaches.

Figure 7. Comparison of different variant assignment delivery mechanisms.
Client logs can be lossy. For instance, if a web page takes too long to load, the user may close the browser or navigate away before the page loads, and the client cannot send a data point for it. Clients can also lose data if they are offline for a long time and their cache fills up. It is common for client data to
arrive late (some mobile apps only send telemetry when the user
is on Wi-Fi). It is important to measure the rate of client data
loss in each variant, as it is possible for the treatment variant to
affect the data loss rate, potentially leading to incorrect metric
calculations [23].
2) DATA COOKING: The goal of data cooking is to merge and transform the raw logs from all the different sources into a form that is easy and efficient for analysis jobs to consume. For example, when a web page is shown to the user, there is a record about it in the server logs. When the user clicks on a link on the page, there is a record for it in the client logs. During cooking, these two records are merged, making it much easier to compute metrics such as click-through rate.
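The server/client join described above can be sketched as follows. The record shapes and field names are hypothetical; the join keeps every impression row whether or not a click arrived for it, so unclicked impressions still count toward the denominator:

```python
# Toy raw logs: server records one row per page shown, the client records
# one row per click, linked by a shared impression id.
server_logs = [
    {"impression_id": "i1", "user": "u1", "page": "results"},
    {"impression_id": "i2", "user": "u2", "page": "results"},
]
client_logs = [
    {"impression_id": "i1", "link": "web-result-3"},
]

def cook(server, client):
    """Merge click records onto their impressions; impressions without
    clicks are kept with an empty click list."""
    clicks = {}
    for c in client:
        clicks.setdefault(c["impression_id"], []).append(c)
    return [dict(s, clicks=clicks.get(s["impression_id"], [])) for s in server]

cooked = cook(server_logs, client_logs)
# Click-through rate becomes a simple aggregate over the cooked rows.
ctr = sum(1 for row in cooked if row["clicks"]) / len(cooked)
```

In the toy data, one of two impressions received a click, so the computed click-through rate is 0.5.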
Since telemetry keeps evolving over time, the cooking
pipeline needs to be designed in a way that easily
accommodates new telemetry events and changes to the
existing events. Ideally, for most common telemetry changes,
no manual changes to the pipeline should be needed.
An important property of the cooking pipeline is the “No
data left behind” principle. It means that any event or event
property that existed in the raw logs should, in one form or
another, be present in the cooked log. This property is essential
both for ensuring the accuracy of analysis results, and for deep
dive investigations of the reasons for metric movements.
As discussed in section 4.3.1, data delay and loss are common in client experimentation, and it is the job of the cooking pipeline to correctly handle delayed and incomplete data, update the cooked data as delayed events come in, and enrich the cooked data with extra signals for measuring the rate of data loss.
3) DATA QUALITY MONITORING: A critical piece of the log processing service is data quality monitoring. Our experience is that intentional and unintentional changes to telemetry and the cooking process happen all the time, impacting
the correctness of the experimentation metrics. If such changes
go unnoticed by experiment owners, metrics may be computed
incorrectly, potentially resulting in wrong decisions on
experiments.
Typical properties of the data to monitor are volume,
fractions of outlier, empty, and null values, and various
invariants – the statements about the data that we expect to
always hold. An example of an invariant is “every Skype call
with the status Connected=true should have a non-zero
duration”, and conversely “every Skype call with the status
Connected=false should have zero duration”. Once the cooked data stream is produced, a separate job needs to run to compute various properties of the stream, evaluate the monitoring conditions, and trigger alerts if any issues are detected.
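The invariant checks above can be sketched as predicates evaluated over the cooked stream. The record fields and invariant names are illustrative, following the Skype call example in the text:

```python
# Each invariant is a predicate that every cooked record must satisfy.
INVARIANTS = {
    "connected implies nonzero duration":
        lambda r: not r["connected"] or r["duration_ms"] > 0,
    "not connected implies zero duration":
        lambda r: r["connected"] or r["duration_ms"] == 0,
}

def check_stream(records):
    """Count invariant violations over a cooked stream; a non-empty result
    would feed the alerting job."""
    violations = {name: 0 for name in INVARIANTS}
    for record in records:
        for name, holds in INVARIANTS.items():
            if not holds(record):
                violations[name] += 1
    return {name: count for name, count in violations.items() if count}

calls = [
    {"connected": True, "duration_ms": 8300},   # valid
    {"connected": False, "duration_ms": 120},   # violates the second invariant
]
```

A monitoring job would run such checks on every cooked batch alongside the volume and outlier-fraction statistics described above.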
E. Experiment Analysis
1) ANALYSIS LIFECYCLE: Experiment analysis is
conducted during the entire lifecycle of an experiment –
hypothesis generation, experiment design, experiment
execution, and post experiment during the decision-making
process.
a) Hypothesis Generation. Results of historical
experiments inform new hypotheses, help estimate how likely
the new hypothesis is to impact the OEC, and help prioritize
existing ideas. During this stage, the experiment owner
examines other historical experiments similar to the one he/she
is planning to run, as well as historical experiments that
improved the metrics he/she is hoping to improve.
b) Experiment Design. During experiment design,
analysis is performed to answer the following key questions:
- What kind of randomization scheme to use?
- How long should the experiment run?
- What percentage of traffic should be allotted?
- What randomization seed to use to minimize the imbalance?
c) Experiment Execution. While the experiment is
running, the analysis must answer two key questions:
- Is the experiment causing unacceptable harm to users?
- Are there any data quality issues yielding untrustworthy experiment results?
d) Post-Experiment. At the end of the experiment, to
decide on the next steps, analysis should answer the following
questions:
- Is the data from the experiment trustworthy?
- Did the treatment perform better than control?
- Did the treatment cause unacceptable harm to other key product metrics?
- Why did the treatment do better/worse than control?
- Are there any heterogeneous effects, with treatment impacting different populations differently?
To answer these and other analysis questions, the analysis service runs a number of automated and manual analysis jobs, generates notifications and alerts for the experiment owners, and provides tools for deep-dive analysis and debugging. We
2) METRIC COMPUTATION: Care needs to be taken to
compute the variance correctly during the metric computation.
While the variance of count and sum metrics computed at the randomization-unit level is straightforward, percentile metrics and metrics computed at other levels where units are not independent (e.g. average Page Load Time computed as the sum of page load times over the number of pages loaded across all users) require the use of the delta method and the bootstrap [5].
Wrong variance calculation may lead to incorrect p-values and
potentially incorrect conclusions about the experiment
outcome. In [10] a technique was introduced that uses pre-experiment data to reduce the variance in experimentation metrics. We found that this technique is very useful in practice and is well worth the extra computation time it requires, because in large products very small changes in a metric can have an impact of millions of dollars on annual revenue.
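As a sketch of the variance point above, the delta method for a ratio metric (e.g. average page load time computed as the sum of page load times over the number of page loads, with per-user totals as the independent units) can be written as follows. This is a textbook formulation, not the platform's implementation:

```python
import statistics

def delta_method_ratio_var(x, y):
    """Approximate Var(mean(x)/mean(y)) via the delta method, where x[i]
    and y[i] are per-user totals (e.g. sum of page load times and number
    of page loads for user i). Naively treating pages as independent
    units would understate this variance."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    ratio = mx / my
    return (ratio ** 2) * (vx / mx**2 - 2 * cov / (mx * my) + vy / my**2) / n
```

As a sanity check, when every user's page load total is exactly proportional to their page count, the ratio is constant and the approximated variance is zero.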
Another challenge with metric computation is computing a large number of metrics efficiently at scale. With thousands of scorecards computed every day, each often requiring processing terabytes of data and performing sophisticated computations such as the ones mentioned above, this is a non-trivial challenge. We have found that caching computations that are common across multiple experiments is helpful for reducing data size and improving performance.
3) SCORECARD ORGANIZATION: While the naïve view of the scorecard as an arbitrarily organized set of metrics may work for products that are just starting to run experiments,
the number of metrics grows quickly, with most products using
hundreds to thousands of metrics. This makes examining the
metrics table manually very time consuming. Several scorecard
organization principles and analysis components can help make
it easier for users to understand the results and highlight
important information they should not miss.
Another important aspect of scorecard organization is providing breakdowns of key metrics. For example, Figure 8 shows that a success metric, Overall Click Rate, has moved negatively in the treatment group. Such a movement is much easier to interpret using the breakdown shown in the figure, which shows that the negative movement happened in “Web Results”, and that its other components were in fact positive.
4) VERIFYING DATA QUALITY: Data quality checks
are the first step in analyzing experiment results [31]. One of
the most effective data quality checks for experiment analysis
is the Sample Ratio Mismatch (SRM) test, which utilizes the
Chi-Squared Test to compare the ratio of the observed user
counts in the variants against the configured ratio. When an
SRM is detected, the results are deemed invalid and the
experimenter is blocked from viewing the results to prevent
incorrect conclusions.
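A minimal sketch of the SRM check, using the chi-squared goodness-of-fit statistic with one degree of freedom for a two-variant experiment. The 10.83 cutoff corresponds to p ≈ 0.001; the actual threshold used by the platform is not stated in the paper:

```python
def srm_statistic(observed_t, observed_c, expected_ratio=1.0):
    """Chi-squared goodness-of-fit statistic comparing observed
    treatment/control user counts against the configured split
    (expected_ratio = treatment share / control share)."""
    total = observed_t + observed_c
    expected_t = total * expected_ratio / (1 + expected_ratio)
    expected_c = total - expected_t
    return ((observed_t - expected_t) ** 2 / expected_t
            + (observed_c - expected_c) ** 2 / expected_c)

# Chi-squared with 1 degree of freedom exceeds ~10.83 with probability
# 0.001 when there is no mismatch; larger values flag an SRM.
SRM_THRESHOLD = 10.83

def has_srm(observed_t, observed_c, expected_ratio=1.0):
    return srm_statistic(observed_t, observed_c, expected_ratio) > SRM_THRESHOLD
```

For a configured 50/50 split, counts of 505,000 vs 500,000 users yield a statistic near 25 and would flag an SRM, while 501,000 vs 500,000 yields a statistic near 1 and would not.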
Another useful data quality verification mechanism is
running A/A experiments. In an A/A experiment there is no
treatment effect, so if all the computations are correct, the p-
values are expected to be distributed uniformly. If p-values are
not uniform, it indicates an issue, for example an incorrect
variance computation.
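One simple way to operationalize the uniformity expectation is to compare the fraction of A/A p-values below a threshold against the threshold itself; a full implementation might use a formal goodness-of-fit test instead, and the tolerance here is an arbitrary illustration:

```python
def aa_pvalue_check(p_values, alpha=0.05, tolerance=0.02):
    """Under a correct pipeline, roughly an alpha fraction of A/A
    p-values should fall below alpha. A large deviation suggests an
    issue such as an incorrect variance computation."""
    fraction = sum(p < alpha for p in p_values) / len(p_values)
    return abs(fraction - alpha) <= tolerance
```

For example, p-values spread evenly over (0, 1] pass the check, while a batch of uniformly tiny p-values from A/A experiments fails it.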
Given that data quality issues put in question the validity of
OEC and other key experiment metrics, the platform needs to
guarantee that these metrics are easily discoverable on the
scorecard, and experiment owners are alerted of any
degradations.
5) ALERTING: In the early stages of growing
experimentation in a product, when only a few experiments are
run, alerting is an optional feature of the platform as every
experiment is carefully designed and monitored by the
experiment owner. As experimentation scales, however, there is less scrutiny of every proposed experiment, and less monitoring. The experimentation platform needs to become more intelligent, automatically analyzing experiment results and alerting experiment owners, providing an important layer of live-site security.
The experimentation portal needs to provide functionality to configure alerts for every product running experiments. Typically, alerts would be configured for data quality metrics,
OEC metrics, performance metrics such as page load time,
errors and crashes, and a few other important product
characteristics.
To make alerts actionable, statistical analysis needs to be
done to determine when to trigger an alert. Note that the goal of
analysis here is different from that of understanding experiment
results – alerts should only notify experiment owners of serious
issues that may require experiment shutdown. It is important
that alerts take into account not only the level of statistical
significance, but also the effect size. For example, an experimentation platform should not alert on a 1-millisecond degradation to the page load time metric, even if the confidence that the degradation is a real effect is high (e.g., the p-value is less than 1e-10).
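The rule above, requiring both statistical significance and a material effect size before alerting, can be sketched as follows (the alpha and minimum-effect thresholds are illustrative, not the platform's values):

```python
def should_alert(p_value, relative_delta, alpha=0.001, min_effect=0.01):
    """Trigger an alert only when the movement is both statistically
    significant (p < alpha) and practically large (|relative change|
    at least min_effect). A 1 ms regression on a multi-second page load
    time stays quiet even at p < 1e-10."""
    return p_value < alpha and abs(relative_delta) >= min_effect
```

For example, a 0.03% page load time regression does not alert regardless of its p-value, while a 5% regression at p = 1e-10 does.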
Based on the severity of the impact, the system can simply notify the experiment owner, schedule experiment termination and let the experiment owner cancel it, or, in the most severe cases, terminate the experiment immediately. These configurations need to balance live-site security with allowing experiment owners to be the final authority on shutting down an experiment.
6) DEEP DIVE ANALYSIS: While the overall scorecard provides the average treatment effect on a set of metrics, experiment owners should also have the ability to better understand metric movements. The experiment scorecard should provide an option to investigate the data by population segment (e.g. browser type) or time period (e.g. daily/weekly) and evaluate the impact on each segment. Using this breakdown,
experiment owners might discover heterogeneous effects in
various segments (e.g. an old browser version might be causing
a high number of errors) and detect novelty effects (e.g. the
treatment group that received a new feature engages with it in
the first day and stops using it after that). This type of analysis,
however, needs to be done with care, as it is subject to the
multiple comparisons issue. Automatically and reliably
identifying such heterogeneous movements is an active
research area. The ExP platform employs one such method [28], notifying experiment owners if interesting or unusual movements are detected in any segment.

Metric               T        C        Delta (%)   p-value
Overall Click Rate   0.9206   0.9219   -0.14%      8e-11
  Web Results        0.5743   0.5800   -0.98%      0
  Answers            0.1913   0.1901   +0.63%      5e-24
  Image              0.0262   0.0261   +0.38%      0.1112
  Video              0.0280   0.0278   +0.72%      0.0004
  News               0.0190   0.0190   +0.10%      0.8244
  Related Search     0.0211   0.0207   +1.93%      7e-26
  Pagination         0.0226   0.0227   -0.44%      0.0114
  Other              0.0518   0.0515   +0.58%      0.0048

Figure 8. Example set of metrics between a treatment (T) and control (C). A full breakdown is shown, indicating and explaining the overall movement.
To understand the metric change in a specific segment more deeply, the system should provide tools to easily obtain examples of user sessions containing the events that are used to compute the metric of interest. For example, if an increase in JavaScript errors is observed on a Bing search results page, it may be useful for the experimenter to see examples of the error types and the queries they triggered on, and conduct simple analysis to see if a specific error type, or a specific query, is
causing the increase. While such analyses are conceptually
simple, implementing tools and automation to support them is
important. Doing such deep-dive analyses manually requires
deep understanding of how the data is structured, how a specific
metric is computed, and how experiment information is stored
in the data, which is beyond the level of knowledge of a typical
experiment owner.
IV. CONCLUSION
Online controlled experiments are becoming the standard
operating procedure in data-driven companies [2], [20].
However, for these companies to experiment with high velocity
and trust in experiment results, having an experimentation
platform is critical [9]. In this paper, we described the architecture of a large-scale online experimentation platform. Such a platform is used by thousands of users at Microsoft, supporting hundreds of concurrent experiments and totaling over 10,000 experiments per year. This platform enables trustworthy
experimentation at scale, adding hundreds of millions of dollars
of additional yearly revenue for Microsoft. We hope that the
architectural guidance provided in this paper will help companies implement their experimentation systems in an easier, more cost-effective way, helping to further grow experimentation practices in the software industry.
REFERENCES
[1] R. Kohavi and R. Longbotham, “Online Controlled Experiments and A/B Tests,” in Encyclopedia of Machine Learning and Data Mining, 2015, pp. 1–11.
[2] R. Kohavi and S. Thomke, “The Surprising Power of Online Experiments,” Harvard Business Review, Oct. 2017.
[3] J. F. Box, “R.A. Fisher and the Design of Experiments, 1922–1926,” Am. Stat., vol. 34, no. 1, pp. 1–7, Feb. 1980.
[4] ExP, “Microsoft Experimentation Platform.” [Online].
[5] A. Deng, J. Lu, and J. Litz, “Trustworthy Analysis of Online A/B Tests: Pitfalls, Challenges and Solutions,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017, pp. 641–649.
[6] A. Deng, J. Lu, and S. Chen, “Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing,” in Proceedings - 3rd IEEE International Conference on Data Science and Advanced Analytics, DSAA 2016, 2016, pp. 243–252.
[7] D. Tang, A. Agarwal, D. O’Brien, and M. Meyer, “Overlapping experiment infrastructure,” in Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’10, 2010, p. 17.
[8] R. Kohavi and R. Longbotham, “Online experiments: Lessons learned,” Computer, vol. 40, no. 9, pp. 103–105, 2007.
[9] A. Fabijan, P. Dmitriev, H. H. Olsson, and J. Bosch, “The Evolution of Continuous Experimentation in Software Product Development: From Data to a Data-Driven Organization at Scale,” in 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), 2017, pp. 770–780.
[10] A. Deng, Y. Xu, R. Kohavi, and T. Walker, “Improving the sensitivity of online controlled experiments by utilizing pre-experiment data,” in Proceedings of the sixth ACM international conference on Web search and data mining - WSDM ’13, 2013, p. 123.
[11] P. Dmitriev and X. Wu, “Measuring Metrics,” in Proc. 25th ACM Int. Conf. Inf. Knowl. Manag. - CIKM ’16, 2016, pp. 429–437.
[12] A. Deng and X. Shi, “Data-Driven Metric Development for Online Controlled Experiments,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16, 2016, pp. 77–86.
[13] P. Dmitriev, B. Frasca, S. Gupta, R. Kohavi, and G. Vaz, “Pitfalls of long-term online controlled experiments,” in Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016, 2016, pp. 1367–1376.
[14] H. Hohnhold, D. O’Brien, and D. Tang, “Focusing on the Long-term,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’15, 2015, pp. 1849–1858.
[15] A. Fabijan, P. Dmitriev, H. H. Olsson, and J. Bosch, “The Benefits of Controlled Experimentation at Scale,” in 2017 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2017, pp. 18–26.
[16] Y. Xu, N. Chen, A. Fernandez, O. Sinno, and A. Bhasin, “From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks,” in Proc. 21th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2015, pp. 2227–2236.
[17] R. Kohavi, A. Deng, R. Longbotham, and Y. Xu, “Seven rules of thumb for web site experimenters,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’14, 2014, pp. 1857–1866.
[18] R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne, “Controlled experiments on the web: Survey and practical guide,” Data Min. Knowl. Discov., vol. 18, no. 1, pp. 140–181, 2009.
[19] P. Dmitriev, S. Gupta, K. Dong Woo, and G. Vaz, “A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments,” in Proc. 23rd ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. - KDD ’17, 2017, pp. 1427–1436.
[20] R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu, and N. Pohlmann, “Online controlled experiments at large scale,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’13, 2013, p. 1168.
[21] R. L. Kaufman, J. Pitchforth, and L. Vermeer, “Democratizing online controlled experiments at Booking.com,” arXiv Prepr. arXiv:1710.08217, pp. 1–7, 2017.
[22] P. Runeson and M. Höst, “Guidelines for conducting and reporting case study research in software engineering,” Empir. Softw. Eng., vol. 14, no. 2, pp. 131–164, 2008.
[23] R. Kohavi, A. Deng, B. Frasca, R. Longbotham, T. Walker, and Y. Xu, “Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained,” in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2012, pp. 786–794.
[24] R. B. Bausell and Y.-F. Li, Power analysis for experimental research: a practical guide for the biological, medical and social sciences. Cambridge University Press, 2002.
[25] K. L. Morgan and D. B. Rubin, “Rerandomization to improve covariate balance in experiments,” Ann. Stat., vol. 40, no. 2, pp. 1263–1282, 2012.
[26] F. Radlinski and N. Craswell, “Optimized interleaving for online retrieval evaluation,” in Proceedings of the sixth ACM international conference on Web search and data mining - WSDM ’13, 2013, p. 245.
[27] R. Conradi and B. Westfechtel, “Version Models for Software Configuration Management,” ACM Comput. Surv., vol. 30, no. 2, pp. 232–282, Jun. 1998.
[28] A. Deng, P. Zhang, S. Chen, D. W. Kim, and J. Lu, “Concise Summarization of Heterogeneous Treatment Effect Using Total Variation Regularized Regression,” in submission.
[29] “Multiple Comparisons Problem,” Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Multiple_comparisons_problem.
[30] P. K. Pearson, “Fast hashing of variable-length text strings,” Commun. ACM, vol. 33, no. 6, pp. 677–680, Jun. 1990.
[31] P. Dmitriev, A. Deng, R. Kohavi, and P. Raff, “A/B Testing at Scale: Accelerating Software Innovation,” in SIGIR, 2017, pp. 1395–1397.
  • Conference Paper
    Full-text available
    Accurately learning what delivers value to customers is difficult. Online Controlled Experiments (OCEs), aka A/B tests, are becoming a standard operating procedure in software companies to address this challenge as they can detect small causal changes in user behavior due to product modifications (e.g. new features). However, like any data analysis method, OCEs are sensitive to trustworthiness and data quality issues which, if go unaddressed or unnoticed, may result in making wrong decisions. One of the most useful indicators of a variety of data quality issues is a Sample Ratio Mismatch (SRM) ? the situation when the observed sample ratio in the experiment is different from the expected. Just like fever is a symptom for multiple types of illness, an SRM is a symptom for a variety of data quality issues. While a simple statistical check is used to detect an SRM, correctly identifying the root cause and preventing it from happening in the future is often extremely challenging and time consuming. Ignoring the SRM without knowing the root cause may result in a bad product modification appearing to be good and getting shipped to users, or vice versa. The goal of this paper is to make diagnosing, fixing, and preventing SRMs easier. Based on our experience of running OCEs in four different software companies in over 25 different products used by hundreds of millions of users worldwide, we have derived a taxonomy for different types of SRMs. We share examples, detection guidelines, and best practices for preventing SRMs of each type. We hope that the lessons and practical tips we describe in this paper will speed up SRM investigations and prevent some of them. Ultimately, this should lead to improved decision making based on trustworthy experiment analysis.
  • Conference Paper
    A/B Testing is the gold standard to estimate the causal relationship between a change in a product and its impact on key outcome measures. It is widely used in the industry to test changes ranging from simple copy change or UI change to more complex changes like using machine learning models to personalize user experience. The key aspect of A/B testing is evaluation of experiment results. Designing the right set of metrics - correct outcome measures, data quality indicators, guardrails that prevent harm to business, and a comprehensive set of supporting metrics to understand the "why" behind the key movements is the #1 challenge practitioners face when trying to scale their experimentation program [18, 22]. On the technical side, improving sensitivity of experiment metrics is a hard problem and an active research area, with large practical implications as more and more small and medium size businesses are trying to adopt A/B testing and suffer from insufficient power. In this tutorial we will discuss challenges, best practices, and pitfalls in evaluating experiment results, focusing on both lessons learned and practical guidelines as well as open research questions.
  • Conference Paper
    Full-text available
    Online Controlled Experiments (OCEs) are transforming the decision-making process of data-driven companies into an experimental laboratory. Despite their great power in identifying what customers actually value, experimentation is very sensitive to data loss, skipped checks, wrong designs, and many other 'hiccups' in the analysis process. For this purpose, experiment analysis has traditionally been done by experienced data analysts and scientists that closely monitored experiments throughout their lifecycle. Depending solely on scarce experts, however, is neither scalable nor bulletproof. To democratize experimentation, analysis should be streamlined and meticulously performed by engineers, managers, or others responsible for the development of a product. In this paper, based on synthesized experience of companies that run thousands of OCEs per year, we examined how experts inspect online experiments. We reveal that most of the experiment analysis happens before OCEs are even started, and we summarize the key analysis steps in three checklists. The value of the checklists is threefold. First, they can increase the accuracy of experiment setup and decision-making process. Second, checklists can enable novice data scientists and software engineers to become more autonomous in setting-up and analyzing experiments. Finally, they can serve as a base to develop trustworthy platforms and tools for OCE setup and analysis.
  • Conference Paper
    A/B/n testing has been adopted by many technology companies as a data-driven approach to product design and optimization. These tests are often run on their websites without explicit consent from users. In this paper, we investigate such online A/B/n tests by using Optimizely as a lens. First, we provide measurement results of 575 websites that use Optimizely drawn from the Alexa Top-1M, and analyze the distributions of their audiences and experiments. Then, we use three case studies to discuss potential ethical pitfalls of such experiments, including involvement of political content, price discrimination, and advertising campaigns. We conclude with a suggestion for greater awareness of ethical concerns inherent in human experimentation and a call for increased transparency among A/B/n test operators.
  • Article
    Full-text available
    Online Controlled Experiments (OCEs) enable an accurate understanding of customer value and generate millions of dollars of additional revenue at Microsoft. Unlike other techniques for learning from customers, OCEs establish an accurate and causal relationship between a change and the impact observed. Although previous research describes technical and statistical dimensions, the key phases of online experimentation are not widely known, their impact and importance are obscure, and how to establish OCEs in an organization is underexplored. In this paper, using a longitudinal in-depth case study, we address this gap by (1) presenting the Experiment Lifecycle, and (2) demonstrating its profound impact with four example experiments. We show that OCEs help optimize infrastructure needs and aid in project planning and measuring team efforts, in addition to their primary goal of accurately identifying what customers value. We conclude that product development should fully integrate the Experiment Lifecycle to benefit from OCEs.
  • Conference Paper
    Full-text available
    Online Controlled Experiments (OCEs) are the norm in data-driven software companies because of the benefits they provide for building and deploying software. Product teams experiment to accurately learn whether the changes that they make to their products (e.g. adding new features) cause any impact (e.g. customers use them more frequently). Experiments also help reduce the risk from deploying software by minimizing the magnitude and duration of harm caused by software bugs, allowing software to be shipped more frequently. To make informed decisions in product development, experiment analysis needs to be granular, with a large number of metrics over heterogeneous devices and audiences. Discovering experiment insights by hand, however, can be cumbersome. In this paper, based on case study research at a large-scale software development company with a long tradition of experimentation, we (1) describe the standard process of experiment analysis, and (2) introduce an artifact to improve the effectiveness and comprehensiveness of this process.
  • Conference Paper
    Full-text available
    Online Controlled Experiments (OCEs, aka A/B tests) are one of the most powerful methods for measuring how much value new features and changes deployed to software products bring to users. Companies like Microsoft, Amazon, and Booking.com report the ability to conduct thousands of OCEs every year. However, the competences of the remainder of the online software industry remain unknown. The main objective of this paper is to reveal the current state of A/B testing maturity in the software industry based on a maturity model from our previous research. We base our findings on 44 responses to an online empirical survey. The main contribution of this paper is the current state of experimentation maturity, as operationalized by the ExG model, for a convenience sample of companies doing online controlled experiments. Our findings show that, among other things, companies typically develop in-house experimentation platforms, that these platforms are of various levels of maturity, and that designing key metrics (Overall Evaluation Criteria) remains the key challenge for successful experimentation.
  • Thesis
    Full-text available
    Accurately learning what customers value is critical for the success of every company. Despite the extensive research on identifying customer preferences, only a handful of software companies succeed in becoming truly data-driven at scale. Benefiting from novel approaches such as experimentation in addition to the traditional feedback collection is challenging, yet tremendously impactful when performed correctly. In this thesis, we explore how software companies evolve from data-collectors with ad-hoc benefits, to trustworthy data-driven decision makers at scale. We base our work on a 3.5-year longitudinal multiple-case study research with companies working in both the embedded systems domain (e.g. engineering connected vehicles, surveillance systems, etc.) as well as in the online domain (e.g. developing search engines, mobile applications, etc.). The contribution of this thesis is three-fold. First, we present how software companies use data to learn from customers. Second, we show how to adopt and evolve controlled experimentation to become more accurate in learning what customers value. Finally, we provide detailed guidelines that can be used by companies to improve their experimentation capabilities. With our work, we aim to empower software companies to become truly data-driven at scale through trustworthy experimentation. Ultimately this should lead to better software products and services.
  • Article
    Full-text available
    There is an extensive literature about online controlled experiments: on the statistical methods available to analyze experiment results, on the infrastructure built by several large-scale Internet companies, and on the organizational challenges of embracing online experiments to inform product development. At Booking.com we have been conducting evidence-based product development using online experiments for more than ten years. Our methods and infrastructure were designed from their inception to reflect Booking.com culture, that is, with democratization and decentralization of experimentation and decision making in mind. In this paper we explain how building a central repository of successes and failures to allow for knowledge sharing; having a generic and extensible code library which enforces a loose coupling between experimentation and business logic; monitoring closely and transparently the quality and the reliability of the data-gathering pipelines to build trust in the experimentation infrastructure; and putting in place safeguards to enable anyone to have end-to-end ownership of their experiments have allowed an organization as large as Booking.com to truly and successfully democratize experimentation.
  • Conference Paper
    Full-text available
    Online controlled experiments (for example A/B tests) are increasingly being performed to guide product development and accelerate innovation in online software product companies. The benefits of controlled experiments have been shown in many cases with incremental product improvement as the objective. In this paper, we demonstrate that the value of controlled experimentation at scale extends beyond this recognized scenario. Based on an exhaustive and collaborative case study in a large software-intensive company with highly developed experimentation culture, we inductively derive the benefits of controlled experimentation. The contribution of our paper is twofold. First, we present a comprehensive list of benefits and illustrate our findings with five case examples of controlled experiments conducted at Microsoft. Second, we provide guidance on how to achieve each of the benefits. With our work, we aim to provide practitioners in the online domain with knowledge on how to use controlled experimentation to maximize the benefits on the portfolio, product and team level.
  • Conference Paper
    Online controlled experiments (e.g., A/B tests) are now regularly used to guide product development and accelerate innovation in software. Product ideas are evaluated as scientific hypotheses, and tested on web sites, mobile applications, desktop applications, services, and operating systems. One of the key challenges for organizations that run controlled experiments is to come up with the right set of metrics [1] [2] [3]. Having good metrics, however, is not enough. In our experience of running thousands of experiments with many teams across Microsoft, we observed again and again how incorrect interpretations of metric movements may lead to wrong conclusions about the experiment's outcome, which, if deployed, could hurt the business by millions of dollars. Inspired by Steven Goodman's twelve p-value misconceptions [4], in this paper we share twelve common metric interpretation pitfalls which we observed repeatedly in our experiments. We illustrate each pitfall with a puzzling example from a real experiment, and describe processes, metric design principles, and guidelines that can be used to detect and avoid the pitfall. With this paper, we aim to increase experimenters' awareness of metric interpretation issues, leading to improved quality and trustworthiness of experiment results and better data-driven decisions.
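    The basic computation behind the metric movements discussed above is a two-sample test on a per-user metric. The sketch below (function name and the numbers are illustrative, not from the paper) shows a plain z-test, and also hints at one of the classic misconceptions: at large scale, a tiny and practically unimportant delta can still be highly "significant".

    ```python
    # Illustrative sketch: two-sample z-test for a treatment-vs-control
    # comparison of a per-user metric, as used in a typical A/B scorecard.
    import math

    def ab_z_test(mean_c, var_c, n_c, mean_t, var_t, n_t):
        """Return (delta, z, two-sided p-value) for treatment vs. control."""
        delta = mean_t - mean_c
        se = math.sqrt(var_c / n_c + var_t / n_t)  # standard error of the delta
        z = delta / se
        p = math.erfc(abs(z) / math.sqrt(2))       # two-sided normal p-value
        return delta, z, p

    delta, z, p = ab_z_test(mean_c=10.0, var_c=25.0, n_c=1_000_000,
                            mean_t=10.02, var_t=25.0, n_t=1_000_000)
    # With a million users per arm, even a 0.2% relative move is detectable;
    # "statistically significant" must not be read as "practically important".
    ```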
  • Conference Paper
    The Internet provides developers of connected software, including web sites, applications, and devices, an unprecedented opportunity to accelerate innovation by evaluating ideas quickly and accurately using controlled experiments, also known as A/B tests. From front-end user-interface changes to backend algorithms, from search engines (e.g., Google, Bing, Yahoo!) to retailers (e.g., Amazon, eBay, Etsy) to social networking services (e.g., Facebook, LinkedIn, Twitter) to travel services (e.g., Expedia, Airbnb, Booking.com) to many startups, online controlled experiments are now utilized to make data-driven decisions at a wide range of companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher's experiments at the Rothamsted Agricultural Experimental Station in England in the 1920s, the deployment and evaluation of online controlled experiments at scale (hundreds of concurrently running experiments) across a variety of web sites, mobile apps, and desktop applications presents many pitfalls and new research challenges. In this tutorial we will give an introduction to A/B testing, share key lessons learned from scaling experimentation at Bing to thousands of experiments per year, present real examples, and outline promising directions for future work. The tutorial will go beyond applications of A/B testing in information retrieval and will also discuss practical and research challenges arising in experimentation on web sites and mobile and desktop apps. Our goal in this tutorial is to teach attendees how to scale experimentation for their teams, products, and companies, leading to better data-driven decisions. We also want to inspire more academic research in the relatively new and rapidly evolving field of online controlled experimentation.
  • Conference Paper
    Full-text available
    Software development companies are increasingly aiming to become data-driven by trying to continuously experiment with the products used by their customers. Although familiar with the competitive edge that the A/B testing technology delivers, they seldom succeed in evolving and adopting the methodology. In this paper, based on an exhaustive and collaborative case study research in a large software-intensive company with highly developed experimentation culture, we present the evolution process of moving from ad-hoc customer data analysis towards continuous controlled experimentation at scale. Our main contribution is the "Experimentation Evolution Model" in which we detail three phases of evolution: technical, organizational and business evolution. With our contribution, we aim to provide guidance to practitioners on how to develop and scale continuous experimentation in software organizations with the purpose of becoming data-driven at scale.
  • Conference Paper
    Full-text available
    Online controlled experiments (e.g., A/B tests) are now regularly used to guide product development and accelerate innovation in software. Product ideas are evaluated as scientific hypotheses, and tested on web sites, mobile applications, desktop applications, services, and operating system features. One of the key challenges for organizations that run controlled experiments is to select an Overall Evaluation Criterion (OEC), i.e., the criterion by which to evaluate the different variants. The difficulty is that short-term changes to metrics may not predict the long-term impact of a change. For example, raising prices likely increases short-term revenue but also likely reduces long-term revenue (customer lifetime value) as users abandon. Degrading search results in a Search Engine causes users to search more, thus increasing query share short-term, but increasing abandonment and thus reducing long-term customer lifetime value. Ideally, an OEC is based on metrics in a short-term experiment that are good predictors of long-term value. To assess long-term impact, one approach is to run long-term controlled experiments and assume that long-term effects are represented by observed metrics. In this paper we share several examples of long-term experiments and the pitfalls associated with running them. We discuss cookie stability, survivorship bias, selection bias, and perceived trends, and share methodologies that can be used to partially address some of these issues. While there is clearly value in evaluating long-term trends, experimenters running long-term experiments must be cautious, as results may be due to the above pitfalls more than the true delta between the Treatment and Control. We hope our real examples and analyses will sensitize readers to the issues and encourage the development of new methodologies for this important problem.
  • Conference Paper
    Full-text available
    A/B tests (or randomized controlled experiments) play an integral role in the research and development cycles of technology companies. As in classic randomized experiments (e.g., clinical trials), the underlying statistical analysis of A/B tests is based on assuming the randomization unit is independent and identically distributed (i.i.d.). However, the randomization mechanisms utilized in online A/B tests can be quite complex and may render this assumption invalid. Analysis that unjustifiably relies on this assumption can yield untrustworthy results and lead to incorrect conclusions. Motivated by challenging problems arising from actual online experiments, we propose a new method of variance estimation that relies only on practically plausible assumptions, is directly applicable to a wide range of randomization mechanisms, and can be implemented easily. We examine its performance and illustrate its advantages over two commonly used methods of variance estimation on both simulated and empirical datasets. Our results lead to a deeper understanding of the conditions under which the randomization unit can be treated as i.i.d. In particular, we show that for purposes of variance estimation, the randomization unit can be approximated as i.i.d. when the individual treatment effect variation is small; however, this approximation can lead to variance underestimation when the individual treatment effect variation is large.
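    The failure mode described above can be made concrete with a toy simulation (illustrative only, not the paper's proposed estimator): when users are the randomization unit but the metric is computed per page-view, page-views of the same user share that user's propensity, so the naive i.i.d. variance of the per-page-view mean understates its true sampling variance.

    ```python
    # Toy simulation: per-page-view CTR with user-level heterogeneity.
    # All parameters (propensity distribution, page-view counts) are invented
    # for illustration.
    import random
    import statistics

    def one_replication(n_users, rng):
        """Simulate one experiment arm; return (CTR, total page-views)."""
        clicks = views = 0
        for _ in range(n_users):
            p = min(0.99, max(0.01, rng.gauss(0.1, 0.1)))  # user's click propensity
            n = rng.randint(1, 20)                          # this user's page-views
            views += n
            clicks += sum(rng.random() < p for _ in range(n))
        return clicks / views, views

    rng = random.Random(7)
    ctrs, naive_ses = [], []
    for _ in range(100):
        ctr, views = one_replication(1000, rng)
        ctrs.append(ctr)
        naive_ses.append((ctr * (1 - ctr) / views) ** 0.5)  # i.i.d. formula

    empirical_se = statistics.stdev(ctrs)   # true replication-to-replication SD
    avg_naive_se = statistics.mean(naive_ses)
    # empirical_se exceeds avg_naive_se: the i.i.d. assumption underestimates
    # the variance, the failure mode the abstract warns about.
    ```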
  • Article
    Full-text available
    Randomized controlled experiments have long been accepted as the gold standard for establishing causal links and estimating causal effects in various scientific fields. The average treatment effect is often used to summarize the effect estimation, even though treatment effects are commonly believed to vary among individuals. In the recent decade, with the availability of "big data", more and more experiments have large sample sizes and increasingly rich side information that enable and require experimenters to discover and understand heterogeneous treatment effect (HTE). There are two aspects to HTE understanding: one is to predict the effect conditioned on a given set of side information or a given individual; the other is to interpret the HTE structure and summarize it in a memorable way. The former aspect can be treated as a regression problem, and the latter aspect focuses on concise summarization and interpretation. In this paper we propose a method that can achieve both at the same time. This method can be formulated as a convex optimization problem, for which we provide a stable and scalable implementation.
  • Conference Paper
    Online controlled experiments, also called A/B testing, have been established as the mantra for data-driven decision making in many web-facing companies. In recent years, there are emerging research works focusing on building the platform and scaling it up, best practices and lessons learned to obtain trustworthy results, and experiment design techniques and various issues related to statistical inference and testing. However, despite playing a central role in online controlled experiments, there is little published work on treating metric development itself as a data-driven process. In this paper, we focus on the topic of how to develop meaningful and useful metrics for online services in their online experiments, and show how data-driven techniques and criteria can be applied in metric development process. In particular, we emphasize two fundamental qualities for the goal metrics (or Overall Evaluation Criteria) of any online service: directionality and sensitivity. We share lessons on why these two qualities are critical, how to measure these two qualities of metrics of interest, how to develop metrics with clear directionality and high sensitivity by using approaches based on user behavior models and data-driven calibration, and how to choose the right goal metrics for the entire online services.
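    The two metric qualities emphasized above can be sketched as simple computations over a labeled corpus of past experiments. Everything in this sketch is an illustrative assumption (the function name, the labeled-corpus setup, and the example numbers are hypothetical, not the paper's method): sensitivity is estimated as how often the metric detects a statistically significant movement, and directionality as how often a detected movement agrees with the known label.

    ```python
    # Hypothetical evaluation of a metric over a corpus of past experiments.
    def evaluate_metric(results):
        """results: list of (t_stat, label), label is +1 (good) or -1 (bad)."""
        detected = [(t, label) for t, label in results if abs(t) > 1.96]
        sensitivity = len(detected) / len(results)          # how often it moves
        aligned = sum(1 for t, label in detected if (t > 0) == (label > 0))
        directionality = aligned / len(detected) if detected else float("nan")
        return sensitivity, directionality

    # Invented corpus: t-statistics of one metric in six labeled experiments.
    corpus = [(2.5, +1), (-3.1, -1), (0.4, +1), (2.2, -1), (-2.8, -1), (1.1, -1)]
    sens, direc = evaluate_metric(corpus)
    # 4 of 6 experiments show |t| > 1.96; 3 of those 4 move in the labeled direction.
    ```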
  • Conference Paper
    Over the past 10+ years, online companies large and small have adopted widespread A/B testing as a robust data-based method for evaluating potential product improvements. In online experimentation, it is straightforward to measure the short-term effect, i.e., the impact observed during the experiment. However, the short-term effect is not always predictive of the long-term effect, i.e., the final impact once the product has fully launched and users have changed their behavior in response. Thus, the challenge is how to determine the long-term user impact while still being able to make decisions in a timely manner. We tackle that challenge in this paper by first developing experiment methodology for quantifying long-term user learning. We then apply this methodology to ads shown on Google search, more specifically, to determine and quantify the drivers of ads blindness and sightedness, the phenomenon of users changing their inherent propensity to click on or interact with ads. We use these results to create a model that uses metrics measurable in the short-term to predict the long-term. We learn that user satisfaction is paramount: ads blindness and sightedness are driven by the quality of previously viewed or clicked ads, as measured by both ad relevance and landing page quality. Focusing on user satisfaction not only ensures happier users but also makes business sense, as our results illustrate. We describe two major applications of our findings: a conceptual change to our search ads auction that further increased the importance of ads quality, and a 50% reduction of the ad load on Google's mobile search interface. The results presented in this paper are generalizable in two major ways. First, the methodology may be used to quantify user learning effects and to evaluate online experiments in contexts other than ads. Second, the ads blindness/sightedness results indicate that a focus on user satisfaction could help to reduce the ad load on the internet at large with long-term neutral, or even positive, business impact.