Democratizing online controlled experiments at Booking.com
Raphael Lopez Kaufman
raphael.lopez@booking.com
Jegar Pitchforth
jegar.pitchforth@booking.com
Lukas Vermeer
lukas.vermeer@booking.com
Abstract
There is an extensive literature on online controlled experiments: on the statistical methods available to analyze experiment results [1, 2, 3], on the infrastructure built by several large-scale Internet companies [4, 5, 6, 7], and on the organizational challenges of embracing online experiments to inform product development [6, 8]. At Booking.com we have been conducting evidence-based product development using online experiments for more than ten years. Our methods and infrastructure were designed from their inception to reflect Booking.com culture, that is, with democratization and decentralization of experimentation and decision making in mind.
In this paper we explain how four practices have allowed an organization as large as Booking.com to truly and successfully democratize experimentation: building a central repository of successes and failures to allow for knowledge sharing; having a generic and extensible code library which enforces a loose coupling between experimentation and business logic; monitoring closely and transparently the quality and reliability of the data-gathering pipelines to build trust in the experimentation infrastructure; and putting in place safeguards that enable anyone to have end-to-end ownership of their experiments.
Introduction
At Booking.com we have been using online controlled experiments for more than ten years to conduct evidence-based product development. On a daily basis, members of all our departments run and analyse more than a thousand concurrent experiments to quickly validate new ideas. These experiments run across all our products, from mobile apps and tools used by hoteliers to customer service phone lines and internal systems.
Experimentation has become so ingrained in Booking.com culture that every change, from entire redesigns and infrastructure changes to bug fixes, is wrapped in an experiment. Moreover, experiments are used for asynchronous feature release and as a safety net, increasing the overall velocity of the product organization. Finally, they are also a way of gathering learnings about customer behaviour, for example by revisiting previous successes and failures.
It is with the aim of allowing such ubiquitous use cases, and of truly democratizing experimentation, that we built an in-house experiment infrastructure trustworthy and flexible enough to lead all the products in the right direction. Indeed, not only does each department have its own dedicated team to provide support and to work on improving the experiment platform, but all the steps required to validate new product ideas are fully decentralized: gathering learnings from previous experiments to inform new hypotheses, implementing and running new tests, creating metrics to support new areas of the business and, most importantly, analysing the results and making decisions. Such democratization is only possible if running experiments is cheap, safe and easy enough that anyone can go ahead with testing new ideas, which in turn means that the experiment infrastructure must be generic, flexible and extensible enough to support all current and future use cases.
In many respects the components of the Booking.com experiment infrastructure are similar to what Microsoft, LinkedIn or Facebook have described in previous contributions [5, 6, 7]. However, in this paper, we share some key features of our infrastructure which have enabled us to truly democratize and decentralize experimentation across all the departments and products at Booking.com.
Central repository of successes and failures
Enabling everyone to come up with new hypotheses is key to democratizing experimentation and to moving away from a product organization where only product managers decide which features to test next. Therefore, the Booking.com experiment platform acts as a searchable repository of all previous successes and failures, dating back to the very first experiment, which everyone can consult and audit. Experiments can be grouped by teams, by the areas of the Booking.com products they ran on, by the segments of visitors they targeted and much more. The data of these previous experiments is shown in the exact same state as it was to their owners, along with the hypotheses that were put to the test. Moreover, descriptions of all the iterations and of the final decision, whether the experiment was a success or not, are available.
Having such a consistent history implies the extra work of making backwards-compatible changes to the reporting platform and of keeping all the experiment data, but it has proven immensely worthwhile: cross-pollination between teams and products, by iterating on or revisiting past failures and by disseminating successes, has led to many improvements for our customers. However, one of the challenges we are still facing is being able to answer questions such as “what were the findings related to improving the clarity of the cancellation policies in the past year” in the central repository, to go beyond short-term cross-team and cross-department learnings, without resorting to a set of tags to label experiments. Indeed, this approach is proving unsustainable, both because it involves making sure the tags are normalized and kept up to date and because it entails poor discoverability and usability of the central repository.
Genericity and extensibility
To support experimentation across several departments and several products, the platform needs to be generic enough to allow the setup, implementation and analysis of an experiment with minimal ad hoc work, apart from the implementation of new business logic. That means recruitment, randomization and the recording of visitors’ behaviour are abstracted away behind a set of APIs made available for all products. Reporting is also automatically handled by the infrastructure and is department and product agnostic. However, both the API and the reporting need to be easily extensible to support as many use cases as possible.
Firstly, we provide an extensible metric framework. Anyone can create new metrics, whether they are aggregated in real time or via batch updates, depending on what is more suitable. One can choose to have these metrics automatically aggregated for every experiment, because of their general business scope, or to make them available on demand. Finally, the framework accommodates both binary metrics (e.g. is this visitor a booker?) and real-valued metrics (e.g. what is the euro commission of a booking?).
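To make this concrete, the following is a minimal, hypothetical sketch of what such an extensible metric definition could look like; the names (`Metric`, `ValueType`, `Aggregation`, `is_booker`) are illustrative assumptions and do not correspond to our actual API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class ValueType(Enum):
    BINARY = "binary"      # e.g. "is this visitor a booker"
    REAL = "real"          # e.g. euro commission of a booking


class Aggregation(Enum):
    REALTIME = "realtime"  # streamed, low-latency aggregation
    BATCH = "batch"        # daily batch updates


@dataclass
class Metric:
    """Illustrative metric definition for an extensible metric framework."""
    name: str
    value_type: ValueType
    aggregation: Aggregation
    # Extracts the metric value for one visitor from their raw tracked events.
    extractor: Callable[[list[dict]], float]
    # Metrics of general business scope are computed for every experiment;
    # others are only aggregated on demand.
    default_for_all_experiments: bool = False


# Example: a binary conversion metric aggregated in near real time.
is_booker = Metric(
    name="is_booker",
    value_type=ValueType.BINARY,
    aggregation=Aggregation.REALTIME,
    extractor=lambda events: float(any(e["type"] == "booking" for e in events)),
    default_for_all_experiments=True,
)
```

A real-valued metric such as the euro commission of a booking would, in this sketch, use `ValueType.REAL` and, for instance, a batch aggregation.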
Secondly, to support cross-product experimentation and experiments on new products, we also provide an extensible framework to identify visitors, based on so-called tracking methods. Examples of such tracking methods are user-account-based and email-address-based tracking. Indeed, oftentimes the effects of an experiment are not limited to one product: changing the way customers interact with our mobile apps is likely to also change the way they use our main website.
As an example, consider even the simplest experiment which aims at increasing bookings by changing the way hotel pictures are displayed in the Android app. It is likely that many people will consult the Booking.com website on their phone during their commute to decide on a hotel but will eventually book the trip on their desktop at home. Therefore, if we are to truly assess the impact of this new idea on our customers, we cannot identify visitors using only their HTTP cookie, which is tied to a specific browser on a single device; we must find a way to identify visitors that is consistent across their journey on our different products. Adding a new tracking method can be done by any developer, and we now have more than a dozen ways to identify visitors across all our products.
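As an illustration only, a tracking method can be thought of as a small, pluggable strategy that maps request context to a stable visitor identifier; the sketch below is a hypothetical rendering of that idea, not our production interface.

```python
import hashlib
from typing import Optional, Protocol


class TrackingMethod(Protocol):
    """A pluggable way to identify the same visitor across products."""

    def visitor_id(self, context: dict) -> Optional[str]:
        """Return a stable identifier, or None if this method cannot identify the visitor."""


class CookieTracking:
    def visitor_id(self, context: dict) -> Optional[str]:
        # Tied to a single browser on a single device.
        return context.get("http_cookie_id")


class UserAccountTracking:
    def visitor_id(self, context: dict) -> Optional[str]:
        # Consistent across devices once the customer is logged in.
        return context.get("user_account_id")


class EmailHashTracking:
    def visitor_id(self, context: dict) -> Optional[str]:
        email = context.get("email")
        # Hash the address so no raw personal data is used as a tracking key.
        return hashlib.sha256(email.encode()).hexdigest() if email else None


def identify(context: dict, methods: list) -> Optional[str]:
    # Use the first tracking method able to identify this visitor.
    for method in methods:
        vid = method.visitor_id(context)
        if vid is not None:
            return vid
    return None
```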
Data which can be trusted
New hires coming from more traditional product organizations often find themselves humbled and frustrated when their ideas are invalidated by experiments. As such, building trust in our infrastructure is key to democratizing experimentation. It means that experiment results must draw as accurate a picture as possible of the true effect of a new feature on our customers. It is also a necessary condition to ensure that decentralized decision making leads to customer experience improvements and, eventually, business growth.
To address this, the Booking.com experiment infrastructure teams have made data quality a priority. We monitor the validity of the data used for decision making by computing a set of common metrics in two entirely separate data pipelines maintained by different engineers who do not review each other’s code, one doing near real-time aggregations (with less than five minutes of delay) and one doing daily batch updates. This allows us to quickly detect bugs in the aggregation code when we ship new features to these pipelines, and to strengthen our trust in the quality of the data they generate by alerting on the discrepancies we find for these metrics on real experiments [9].
We rely on monitoring real experiment data rather than on a simple test suite because discrepancies can arise not only from code changes in the aggregation pipelines but also from code introduced by other teams. As an example, consider how cancellations are recorded. We need to keep track not only of cancellations made by customers themselves, but also of those made by customer support agents and even by property owners and hotel managers. The two pipelines use different data sources for cancellations, corresponding to the different needs of real-time and batch processing. Any change (e.g. adding a new endpoint for hoteliers to do bulk cancellations) which leads to cancellation events being fed to one of the sources but not to the other will result in a discrepancy in the cancellation count between the two pipelines.
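As a simplified illustration of this kind of monitoring (not our actual alerting code), one can compare the aggregates produced by the two pipelines for a shared set of metrics and alert whenever they diverge beyond a tolerance:

```python
def relative_discrepancy(realtime_value: float, batch_value: float) -> float:
    """Relative difference between the two independently maintained pipelines."""
    if batch_value == 0:
        return float("inf") if realtime_value != 0 else 0.0
    return abs(realtime_value - batch_value) / abs(batch_value)


def check_pipelines(realtime: dict, batch: dict, tolerance: float = 0.01) -> list:
    """Return alert messages for metrics whose pipelines diverge beyond tolerance."""
    alerts = []
    for metric in realtime.keys() & batch.keys():
        d = relative_discrepancy(realtime[metric], batch[metric])
        if d > tolerance:
            alerts.append(f"{metric}: realtime={realtime[metric]} batch={batch[metric]} "
                          f"diverge by {d:.1%}")
    return alerts


# Example: cancellation events fed to one source but not the other
# show up as a divergence between the two pipelines.
print(check_pipelines({"cancellations": 1040, "bookings": 5000},
                      {"cancellations": 1000, "bookings": 5003}))
```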
Moreover, to make sure that no piece of information is lost between the moment it is first generated on our products and the moment it is received by our aggregation pipelines, we maintain a set of experiments for which we control the input (e.g. the number of visits). Finally, we also monitor the statistical framework used for decision making by maintaining a pool of A/A experiments (experiments whose treatment does not introduce any change), which allow us to validate its theoretical properties (e.g. false positive rates or the statistical distribution of metrics).
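The value of such an A/A pool can be illustrated with a toy simulation (this is not our production statistical framework; the large-sample z-test below is a stand-in): when the treatment introduces no change, roughly a fraction alpha of experiments should still come out "significant".

```python
import random
from math import erf, sqrt
from statistics import mean, stdev


def z_test_p_value(a: list, b: list) -> float:
    """Two-sided p-value using a normal approximation (fine for large samples)."""
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))


def aa_false_positive_rate(n_experiments: int = 1000, n_visitors: int = 2000,
                           alpha: float = 0.05) -> float:
    """Share of A/A experiments flagged as significant; should be close to alpha."""
    false_positives = 0
    for _ in range(n_experiments):
        # Both groups draw from the same distribution: any 'effect' is pure noise.
        control = [random.gauss(0, 1) for _ in range(n_visitors)]
        treatment = [random.gauss(0, 1) for _ in range(n_visitors)]
        if z_test_p_value(control, treatment) < alpha:
            false_positives += 1
    return false_positives / n_experiments


print(aa_false_positive_rate())  # expected to hover around 0.05
```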
This complex monitoring infrastructure is also maintained to foster a critical attitude towards data among experiment owners. Indeed, all too often, data is taken as the absolute source of truth. Therefore, in the reporting itself it is possible to see by how much the two aggregation pipelines diverge, for example because of inevitable data loss. Similarly, the pool of A/A experiments exemplifies the concept of a false positive, which is at first hard to grasp.
Loose coupling between the experiment infrastructure and the business logic
Making sure that the pipelines consistently aggregate data is, however, not sufficient. Indeed, for experimentation to enable successful evidence-based product development, we must ensure that the results reported to experiment owners are an accurate measurement of the true impact of their new features on the customers who would be exposed to the changes were they deemed successful enough to be permanently shipped. Therefore, it was decided very early on to keep the experiment infrastructure as loosely coupled to the business logic as possible.
In our infrastructure, the target of an experiment (e.g. logged-in, English-speaking customers looking for a family trip) and where the experiment runs (e.g. in the hotel descriptions on the search results page) are implemented in the code by experiment owners and are not addressed by the experiment platform. This may seem like a decrease in velocity, since new code needs to be written, tested and rolled out. However, this decrease is offset by the fact that exposing the new feature to all the types of customers that were targeted during the experiment runtime, and only those, is just a matter of removing the API call that was used to decide treatment assignment. Indeed, imagine that the infrastructure responsible for recruiting visitors into experiments had to be aware of segments (e.g. leisure or business). Once an experiment is successful, either the line of code which does visitor recruitment would need to stay forever, adding to the complexity of the codebase (indeed, static analysers doing cleanups would also have to be aware of experiment statuses), or cleaning up the experiment logic would involve adding a line of code which was never tested and is not guaranteed to use the same definition of segments.
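A hypothetical sketch of this separation of concerns follows; the function names (`in_treatment`, `hotel_description`) are illustrative and the randomization is stubbed, but it shows that targeting lives in business code while the platform only answers "which variant for this visitor?".

```python
from dataclasses import dataclass


@dataclass
class Visitor:
    id: str
    logged_in: bool
    language: str
    travelling_with_family: bool


def in_treatment(experiment_name: str, visitor_id: str) -> bool:
    # Illustrative stand-in for the experiment API: recruitment, randomization
    # and exposure tracking are abstracted away here. A real implementation
    # would use a seeded hash so assignment is stable across requests.
    return hash((experiment_name, visitor_id)) % 2 == 1


def render_default_description(visitor: Visitor) -> str:
    return "Standard hotel description"


def render_family_friendly_description(visitor: Visitor) -> str:
    return "Hotel description highlighting family-friendly amenities"


def hotel_description(visitor: Visitor) -> str:
    # The target (logged in, English speaking, family trip) is business logic
    # owned by the experiment owner, not by the experiment platform.
    is_target = (visitor.logged_in and visitor.language == "en"
                 and visitor.travelling_with_family)

    if is_target and in_treatment("family_friendly_description", visitor.id):
        return render_family_friendly_description(visitor)
    return render_default_description(visitor)
```

Shipping a successful variant then amounts to deleting the `in_treatment` call while keeping the targeting condition, so that exactly the audience that was tested, and only that audience, receives the feature.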
Moreover, experiment state data is purposefully not exposed via an API. For example, it is not possible to know, in the business code, whether a given experiment is running or in which experiments a visitor was recruited. Indeed, this would make experiment results context dependent (e.g. on the set of experiments running concurrently) and therefore a poor predictor of the future business impact of new features.
Building safeguards
When anyone is empowered to run new experiments, it is crucial to have a very tight integration between the experiment and monitoring platforms. Sometimes experiments introduce such severe bugs, for example removing the ability for certain customers to book, that the overall business is immediately impacted. Therefore, it must be possible to attribute, in real time (in our current system that means less than a minute), the impact of every experiment on both the overall business and the health of the systems.
However, having experiment owners closely monitor their experiments is not enough. Everyone in the company must be able, and feel empowered, to stop harmful experiments of their own accord. It is therefore as much a technical as a cultural issue. It is worth noting that, given our infrastructure, we could easily automate such stops. However, we feel this runs contrary to democratizing experimentation, which fosters end-to-end ownership. Moreover, deciding whether an experiment is more harmful than beneficial, given that some experiments aim at gaining a better understanding of visitors’ behaviour rather than bringing a “win”, is a judgement call which is best made by those who own the change.
Safeguards must also be established when it comes to decision making. Indeed, experiment results must accurately predict the future impact of new features on customers. Given the frequentist statistical framework we use at Booking.com, that means we need to enforce good practices as much as possible to make sure successes are not statistical ghosts.
One such important practice is pre-registration. Experiment owners need to specify up front which customer behaviour they want to impact and how, the set of metrics which is going to support their hypothesis, and how these metrics are expected to change. Coupled with a culture which encourages peer review of successes, this enforcement considerably reduces p-hacking [10] and dubious decisions.
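What gets pre-registered can be as simple as a structured record written before the experiment starts; the fields below are a hypothetical example, not our exact schema.

```python
# A hypothetical pre-registration record, written before the experiment starts.
preregistration = {
    "hypothesis": "Showing the cancellation policy earlier increases bookings "
                  "by reducing uncertainty at checkout.",
    "primary_metric": "is_booker",
    "expected_direction": "increase",
    "supporting_metrics": ["cancellation_rate", "customer_service_contacts"],
    "minimum_detectable_effect_pct": 0.5,  # used to size the experiment
    "planned_runtime_days": 14,
}
```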
We have also implemented safeguards regarding missing data. For experiments relying on asynchronous or lossy tracking channels, such as events generated from Javascript code, statistical checks are in place to detect selective attrition (a phenomenon which renders all experiment results void [11]). In that case, not only is a warning visible on the experiment report, but the comparative statistics used for decision making are hidden, making sure no oblivious experiment owner can disregard the issue.
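One common statistical check of this kind is a sample ratio mismatch test on recruitment counts; the following sketch (a normal approximation to the binomial, with an illustrative threshold) shows the idea, though it is not our exact implementation.

```python
from math import erf, sqrt


def srm_p_value(control_count: int, treatment_count: int,
                expected_ratio: float = 0.5) -> float:
    """Two-sided p-value that the observed split matches the configured split.

    Uses a normal approximation to the binomial, which is fine for the large
    counts typical of online experiments.
    """
    n = control_count + treatment_count
    expected = n * expected_ratio
    se = sqrt(n * expected_ratio * (1 - expected_ratio))
    z = (control_count - expected) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))


def hide_comparative_statistics(control_count: int, treatment_count: int) -> bool:
    """Selective attrition check: if the split is too unlikely, suppress results."""
    return srm_p_value(control_count, treatment_count) < 0.001


# A 50/50 experiment where slow-connection visitors drop out of treatment
# before any data is sent back: the imbalance is flagged.
print(hide_comparative_statistics(100_000, 98_400))  # True: results are hidden
```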
As an example, consider an experiment which synchronously loads additional images. Visitors with slow internet connections will drop off at a higher rate in treatment than in control before any data can be sent back to our servers, leading to selective attrition. Similarly, whenever the experiment API is called on a visitor whose identifier is unknown (e.g. when an experiment targeting logged-in users tries to recruit a visitor who is not logged in), the visitor is shown control, is not recruited into the experiment, and their behaviour is not recorded. In such cases a message is displayed on the report itself to warn experiment owners that the data they are looking at is incomplete. Indeed, the data shown in the report might no longer be an accurate representation of the true business impact of the change, as the visitors whom we could not identify would be exposed to the change were it to be permanently shipped. However, in this case, contrary to selective attrition (where data is missing not at random), full ownership of the decision is left to experiment owners rather than relying on automation.
Indeed, many factors have to be taken into account when assessing the impact of data missing at random on the validity of experiment results. For example, the amount of missing identifiers can be insignificant compared to the magnitude of the improvement brought by the treatment, or the missing identifiers may be caused by malicious traffic to our products and therefore have no impact on the accuracy of the reported business impact of the experiment.
Conclusion
Enabling anyone to form new hypotheses by maintaining a repository of past failures and successes, building trust in experimentation by accurately measuring customer behaviour, and making experimentation accessible and safe by designing extensible frameworks with built-in safeguards: these are the key features of our experiment infrastructure that we hope will help other companies truly democratize experimentation in their product organizations. Moreover, we showed how much emphasis is put on giving as much ownership as possible to experiment owners. This entails that almost everyone at Booking.com needs to be aware of many of the inner workings of experimentation, regarding, for example, hypothesis testing, data collection and metric implementation, in order to make the best informed judgement calls on their experiments. Therefore, dedicating considerable time to training, both online and in classrooms, and to in-person support is necessary for decentralizing experimentation at scale.
However, we still face many open challenges in making sure anyone at Booking.com can leverage online controlled experiments to improve the customer experience: designing cross-department metrics, exploring the trade-offs between velocity and accurate business impact measurement, and meaningfully assessing server-side performance-related experiments.
References
[1] Rubin, Donald B. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66.5 (1974): 688.
[2] Box, George E. P., J. Stuart Hunter, and William Gordon Hunter. Statistics for Experimenters: Design, Innovation, and Discovery. Vol. 2. New York: Wiley-Interscience, 2005.
[3] Xie, Huizhi, and Juliette Aurisset. Improving the sensitivity of online controlled experiments: Case studies at Netflix. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
[4] Tang, Diane, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. Overlapping experiment infrastructure: More, better, faster experimentation. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.
[5] Kohavi, Ron, et al. Online controlled experiments at large scale. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013.
[6] Xu, Ya, et al. From infrastructure to culture: A/B testing challenges in large scale social networks. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
[7] Bakshy, Eytan, Dean Eckles, and Michael S. Bernstein. Designing and deploying online field experiments. Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014.
[8] Fabijan, Aleksander, et al. The evolution of continuous experimentation in software product development. International Conference on Software Engineering (ICSE). 2017.
[9] Silberzahn, Raphael, et al. Many analysts, one dataset: Making transparent how variations in analytical choices affect results. (2017).
[10] Munafò, Marcus R., et al. A manifesto for reproducible science. Nature Human Behaviour 1 (2017): 0021.
[11] Zhou, Haotian, and Ayelet Fishbach. The pitfall of experimenting on the web: How unattended selective attrition leads to surprising (yet false) research conclusions. (2016).