Trustworthy Online Controlled Experiments
A Practical Guide to A/B Testing
Getting numbers is easy; getting numbers you can trust is hard. This practical guide by
experimentation leaders at Google, LinkedIn, and Microsoft will teach you how to
accelerate innovation using trustworthy online controlled experiments, or A/B tests.
Based on practical experiences at companies that each run more than 20,000 controlled
experiments a year, the authors share examples, pitfalls, and advice for students and
industry professionals getting started with experiments, plus deeper dives into advanced
topics for experienced practitioners who want to improve the way they and their
organizations make data-driven decisions.
Learn how to:
• Use the scientific method to evaluate hypotheses using controlled experiments
• Define key metrics and ideally an Overall Evaluation Criterion
• Test for trustworthiness of the results and alert experimenters to violated assumptions
• Interpret and iterate quickly based on the results
• Implement guardrails to protect key business goals
• Build a scalable platform that lowers the marginal cost of experiments close to zero
• Avoid pitfalls such as carryover effects, Twyman's law, Simpson's paradox, and network interactions
• Understand how statistical issues play out in practice, including common violations of assumptions
Ron Kohavi is a vice president and technical fellow at Airbnb. This book was written
while he was a technical fellow and corporate vice president at Microsoft. He was
previously director of data mining and personalization at Amazon. He received his PhD
in Computer Science from Stanford University. His papers have more than 40,000
citations and three of them are in the top 1,000 most-cited papers in Computer Science.
Diane Tang is a Google Fellow, with expertise in large-scale data analysis and
infrastructure, online controlled experiments, and ads systems. She has an AB from
Harvard and an MS/PhD from Stanford, with patents and publications in mobile
networking, information visualization, experiment methodology, data infrastructure,
data mining, and large data.
Ya Xu heads Data Science and Experimentation at LinkedIn. She has published several
papers on experimentation and is a frequent speaker at top-tier conferences and
universities. She previously worked at Microsoft and received her PhD in Statistics
from Stanford University.
At the core of the Lean Methodology is the scientific method: Creating hypotheses,
running experiments, gathering data, extracting insight and validation or
modification of the hypothesis. A/B testing is the gold standard of creating
verifiable and repeatable experiments, and this book is its definitive text.
Steve Blank, Adjunct professor at Stanford University, father of modern
entrepreneurship, author of The Startup Owner's Manual and
The Four Steps to the Epiphany
This book is a great resource for executives, leaders, researchers or engineers
looking to use online controlled experiments to optimize product features, project
efficiency or revenue. I know firsthand the impact that Kohavi's work had on Bing
and Microsoft, and I'm excited that these learnings can now reach a wider audience.
Harry Shum, EVP, Microsoft Artificial Intelligence and Research Group
A great book that is both rigorous and accessible. Readers will learn how to bring
trustworthy controlled experiments, which have revolutionized internet product
development, to their organizations.
Adam D'Angelo, Co-founder and CEO of Quora and
former CTO of Facebook
This book is a great overview of how several companies use online experimentation
and A/B testing to improve their products. Kohavi, Tang and Xu have a wealth of
experience and excellent advice to convey, so the book has lots of practical real world
examples and lessons learned over many years of the application of these techniques
at scale.
Jeff Dean, Google Senior Fellow and SVP Google Research
Do you want your organization to make consistently better decisions? This is the new
bible of how to get from data to decisions in the digital age. Reading this book is like
sitting in meetings inside Amazon, Google, LinkedIn, Microsoft. The authors expose
for the first time the way the world's most successful companies make decisions.
Beyond the admonitions and anecdotes of normal business books, this book shows
what to do and how to do it well. It's the how-to manual for decision-making in the
digital world, with dedicated sections for business leaders, engineers, and data analysts.
Scott Cook, Intuit Co-founder & Chairman of the Executive Committee
Online controlled experiments are powerful tools. Understanding how they work,
what their strengths are, and how they can be optimized can illuminate both
specialists and a wider audience. This book is the rare combination of technically
authoritative, enjoyable to read, and dealing with highly important matters.
John P.A. Ioannidis, Professor of Medicine, Health Research and Policy,
Biomedical Data Science, and Statistics at Stanford University
Which online option will be better? We frequently need to make such choices, and
frequently err. To determine what will actually work better, we need rigorous
controlled experiments, aka A/B testing. This excellent and lively book by experts
from Microsoft, Google, and LinkedIn presents the theory and best practices of A/B
testing. A must read for anyone who does anything online!
Gregory Piatetsky-Shapiro, Ph.D., president of KDnuggets,
co-founder of SIGKDD, and LinkedIn Top Voice on
Data Science & Analytics.
Ron Kohavi, Diane Tang and Ya Xu are the world's top experts on online
experiments. I've been using their work for years and I'm delighted they have
now teamed up to write the definitive guide. I recommend this book to all my
students and everyone involved in online products and services.
Erik Brynjolfsson, Professor at MIT and Co-Author of
The Second Machine Age
A modern software-supported business cannot compete successfully without online
controlled experimentation. Written by three of the most experienced leaders in the
field, this book presents the fundamental principles, illustrates them with compelling
examples, and digs deeper to present a wealth of practical advice. It's a must read!
Foster Provost, Professor at NYU Stern School of Business & co-author of the
best-selling Data Science for Business
In the past two decades the technology industry has learned what scientists have
known for centuries: that controlled experiments are among the best tools to
understand complex phenomena and to solve very challenging problems. The
ability to design controlled experiments, run them at scale, and interpret their
results is the foundation of how modern high tech businesses operate. Between
them the authors have designed and implemented several of the world's most
powerful experimentation platforms. This book is a great opportunity to learn
from their experiences about how to use these tools and techniques.
Kevin Scott, EVP and CTO of Microsoft
Online experiments have fueled the success of Amazon, Microsoft, LinkedIn and
other leading digital companies. This practical book gives the reader rare access to
decades of experimentation experience at these companies and should be on the
bookshelf of every data scientist, software engineer and product manager.
Stefan Thomke, William Barclay Harding Professor, Harvard Business School,
Author of Experimentation Works: The Surprising Power of Business Experiments
The secret sauce for a successful online business is experimentation. But it is a secret
no longer. Here three masters of the art describe the ABCs of A/B testing so that you
too can continuously improve your online services.
Hal Varian, Chief Economist, Google, and author of
Intermediate Microeconomics: A Modern Approach
Experiments are the best tool for online products and services. This book is full of
practical knowledge derived from years of successful testing at Microsoft, Google
and LinkedIn. Insights and best practices are explained with real examples and
pitfalls, their markers and solutions identified. I strongly recommend this book!
Preston McAfee, former Chief Economist and VP of Microsoft
Experimentation is the future of digital strategy and Trustworthy Experiments will
be its Bible. Kohavi, Tang and Xu are three of the most noteworthy experts on
experimentation working today and their book delivers a truly practical roadmap
for digital experimentation that is useful right out of the box. The revealing case
studies they conducted over many decades at Microsoft, Amazon, Google and
LinkedIn are organized into easy to understand practical lessons with tremendous
depth and clarity. It should be required reading for any manager of a digital business.
Sinan Aral, David Austin Professor of Management,
MIT and author of The Hype Machine
Indispensable for any serious experimentation practitioner, this book is highly
practical and goes in-depth like I've never seen before. It's so useful it feels like
you get a superpower. From statistical nuances to evaluating outcomes to measuring
long term impact, this book has got you covered. Must-read.
Peep Laja, top conversion rate expert, Founder and Principal of CXL
Online experimentation was critical to changing the culture at Microsoft. When
Satya talks about "Growth Mindset," experimentation is the best way to try new
ideas and learn from them. Learning to quickly iterate controlled experiments drove
Bing to profitability, and rapidly spread across Microsoft through Office, Windows,
and Azure.
Eric Boyd, Corporate VP, AI Platform, Microsoft
As an entrepreneur, scientist, and executive I've learned (the hard way) that an
ounce of data is worth a pound of my intuition. But how to get good data? This book
compiles decades of experience at Amazon, Google, LinkedIn, and Microsoft into
an accessible, well-organized guide. It is the bible of online experiments.
Oren Etzioni, CEO of Allen Institute of AI and
Professor of Computer Science at University of Washington
Internet companies have taken experimentation to an unprecedented scale, pace,
and sophistication. These authors have played key roles in these developments and
readers are fortunate to be able to learn from their combined experiences.
Dean Eckles, KDD Career Development Professor in Communications and
Technology at MIT and former scientist at Facebook
A wonderfully rich resource for a critical but under-appreciated area. Real case
studies in every chapter show the inner workings and learnings of successful
businesses. The focus on developing and optimizing an Overall Evaluation
Criterion (OEC) is a particularly important lesson.
Jeremy Howard, Singularity University, founder of fast.ai,
and former president and chief scientist of Kaggle
There are many guides to A/B Testing, but few with the pedigree of Trustworthy
Online Controlled Experiments. I've been following Ronny Kohavi for eighteen
years and find his advice to be steeped in practice, honed by experience, and
tempered by doing laboratory work in real world environments. When you add
Diane Tang, and Ya Xu to the mix, the breadth of comprehension is unparalleled.
I challenge you to compare this tome to any other - in a controlled manner, of
course.
Jim Sterne, Founder of Marketing Analytics Summit and
Director Emeritus of the Digital Analytics Association
An extremely useful how-to book for running online experiments that combines
analytical sophistication, clear exposition and the hard-won lessons of practical
experience.
Jim Manzi, Founder of Foundry.ai, Founder and former CEO and
Chairman of Applied Predictive Technologies, and author of Uncontrolled:
The Surprising Payoff of Trial-and-Error for Business, Politics, and Society
Experimental design advances each time it is applied to a new domain: agriculture,
chemistry, medicine and now online electronic commerce. This book by three top
experts is rich in practical advice and examples covering both how and why to
experiment online and not get fooled. Experiments can be expensive; not knowing
what works can cost even more.
Art Owen, Professor of Statistics, Stanford University
This is a must read book for business executives and operating managers. Just as
operations, finance, accounting and strategy form the basic building blocks for
business, today in the age of AI, understanding and executing online controlled
experiments will be a required knowledge set. Kohavi, Tang and Xu have laid out
the essentials of this new and important knowledge domain that is practically
accessible.
Karim R. Lakhani, Professor and Director of Laboratory for
Innovation Science at Harvard, Board Member, Mozilla Corp.
Serious "data-driven" organizations understand that analytics aren't enough; they
must commit to experiment. Remarkably accessible and accessibly remarkable, this
book is a manual and manifesto for high-impact experimental design. I found its
pragmatism inspirational. Most importantly, it clarifies how culture rivals technical
competence as a critical success factor.
Michael Schrage, research fellow at MIT's Initiative on the
Digital Economy and author of The Innovator's Hypothesis:
How Cheap Experiments Are Worth More than Good Ideas
This important book on experimentation distills the wisdom of three distinguished
leaders from some of the world's biggest technology companies. If you are a software
engineer, data scientist, or product manager trying to implement a data-driven culture
within your organization, this is an excellent and practical book for you.
Daniel Tunkelang, Chief Scientist at Endeca and former Director of
Data Science and Engineering at LinkedIn
With every industry becoming digitized and data-driven, conducting and benefiting
from controlled online experiments becomes a required skill. Kohavi, Tang and Xu
provide a complete and well-researched guide that will become necessary reading
for data practitioners and executives alike.
Evangelos Simoudis, Co-founder and Managing Director Synapse Partners;
author of The Big Data Opportunity in Our Driverless Future
The authors offer over 10 years of hard-fought lessons in experimentation, in the
most strategic book for the discipline yet.
Colin McFarland, Director, Experimentation Platform at Netflix
The practical guide to A/B testing distills the experiences from three of the top
minds in experimentation practice into easy and digestible chunks of valuable and
practical concepts. Each chapter walks you through some of the most important
considerations when running experiments - from choosing the right metric to the
benefits of institutional memory. If you are looking for an experimentation coach
that balances science and practicality, then this book is for you.
Dylan Lewis, Experimentation Leader, Intuit
The only thing worse than no experiment is a misleading one, because it gives you
false confidence! This book details the technical aspects of testing based on insights
from some of the world's largest testing programs. If you're involved in online
experimentation in any capacity, read it now to avoid mistakes and gain confidence
in your results.
- Chris Goward, Author of You Should Test That!,
Founder and CEO of Widerfunnel
This is a phenomenal book. The authors draw on a wealth of experience and have
produced a readable reference that is somehow both comprehensive and detailed at
the same time. Highly recommended reading for anyone who wants to run serious
digital experiments.
- Pete Koomen, Co-founder, Optimizely
The authors are pioneers of online experimentation. The platforms they've built
and the experiments they've enabled have transformed some of the largest internet
brands. Their research and talks have inspired teams across the industry to adopt
experimentation. This book is the authoritative yet practical text that the industry has
been waiting for.
Adil Aijaz, Co-founder and CEO, Split Software
Trustworthy Online Controlled
Experiments
A Practical Guide to A/B Testing
RON KOHAVI
Microsoft
DIANE TANG
Google
YA XU
LinkedIn
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314-321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre,
New Delhi 110025, India
79 Anson Road, #06-04/06, Singapore 079906
Cambridge University Press is part of the University of Cambridge.
It furthers the University's mission by disseminating knowledge in the pursuit of
education, learning, and research at the highest international levels of excellence.
www.cambridge.org
Information on this title: www.cambridge.org/9781108724265
DOI: 10.1017/9781108653985
© Ron Kohavi, Diane Tang, and Ya Xu 2020
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2020
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Names: Kohavi, Ron, author. | Tang, Diane, 1974– author. | Xu, Ya, 1982– author.
Title: Trustworthy online controlled experiments : a practical guide to A/B testing /
Ron Kohavi, Diane Tang, Ya Xu.
Description: Cambridge, United Kingdom ; New York, NY : Cambridge University Press,
2020. | Includes bibliographical references and index.
Identifiers: LCCN 2019042021 (print) | LCCN 2019042022 (ebook) | ISBN 9781108724265
(paperback) | ISBN 9781108653985 (epub)
Subjects: LCSH: Social media. | User-generated content--Social aspects.
Classification: LCC HM741 .K68 2020 (print) | LCC HM741 (ebook) | DDC 302.23/1--dc23
LC record available at https://lccn.loc.gov/2019042021
LC ebook record available at https://lccn.loc.gov/2019042022
ISBN 978-1-108-72426-5 Paperback
Cambridge University Press has no responsibility for the persistence or accuracy
of URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Contents
Preface: How to Read This Book
Acknowledgments
Part I: Introductory Topics for Everyone
1 Introduction and Motivation
Online Controlled Experiments Terminology
Why Experiment? Correlations, Causality, and Trustworthiness
Necessary Ingredients for Running Useful Controlled Experiments
Tenets
Improvements over Time
Examples of Interesting Online Controlled Experiments
Strategy, Tactics, and Their Relationship to Experiments
Additional Reading
2 Running and Analyzing Experiments: An End-to-End Example
Setting up the Example
Hypothesis Testing: Establishing Statistical Significance
Designing the Experiment
Running the Experiment and Getting Data
Interpreting the Results
From Results to Decisions
3 Twyman's Law and Experimentation Trustworthiness
Misinterpretation of the Statistical Results
Confidence Intervals
Threats to Internal Validity
Threats to External Validity
Segment Differences
Simpson's Paradox
Encourage Healthy Skepticism
4 Experimentation Platform and Culture
Experimentation Maturity Models
Infrastructure and Tools
Part II: Selected Topics for Everyone
5 Speed Matters: An End-to-End Case Study
Key Assumption: Local Linear Approximation
How to Measure Website Performance
The Slowdown Experiment Design
Impact of Different Page Elements Differs
Extreme Results
6 Organizational Metrics
Metrics Taxonomy
Formulating Metrics: Principles and Techniques
Evaluating Metrics
Evolving Metrics
Additional Resources
SIDEBAR: Guardrail Metrics
SIDEBAR: Gameability
7 Metrics for Experimentation and the Overall Evaluation Criterion
From Business Metrics to Metrics Appropriate for Experimentation
Combining Key Metrics into an OEC
Example: OEC for E-mail at Amazon
Example: OEC for Bing's Search Engine
Goodhart's Law, Campbell's Law, and the Lucas Critique
8 Institutional Memory and Meta-Analysis
What Is Institutional Memory?
Why Is Institutional Memory Useful?
9 Ethics in Controlled Experiments
Background
Data Collection
Culture and Processes
SIDEBAR: User Identifiers
Part III: Complementary and Alternative Techniques to Controlled Experiments
10 Complementary Techniques
The Space of Complementary Techniques
Logs-based Analysis
Human Evaluation
User Experience Research (UER)
Focus Groups
Surveys
External Data
Putting It All Together
11 Observational Causal Studies
When Controlled Experiments Are Not Possible
Designs for Observational Causal Studies
Pitfalls
SIDEBAR: Refuted Observational Causal Studies
Part IV: Advanced Topics for Building an Experimentation Platform
12 Client-Side Experiments
Differences between Server and Client Side
Implications for Experiments
Conclusions
13 Instrumentation
Client-Side vs. Server-Side Instrumentation
Processing Logs from Multiple Sources
Culture of Instrumentation
14 Choosing a Randomization Unit
Randomization Unit and Analysis Unit
User-level Randomization
15 Ramping Experiment Exposure: Trading Off Speed, Quality, and Risk
What Is Ramping?
SQR Ramping Framework
Four Ramp Phases
Post Final Ramp
16 Scaling Experiment Analyses
Data Processing
Data Computation
Results Summary and Visualization
Part V: Advanced Topics for Analyzing Experiments
17 The Statistics behind Online Controlled Experiments
Two-Sample t-Test
p-Value and Confidence Interval
Normality Assumption
Type I/II Errors and Power
Bias
Multiple Testing
Fisher's Meta-analysis
18 Variance Estimation and Improved Sensitivity: Pitfalls and Solutions
Common Pitfalls
Improving Sensitivity
Variance of Other Statistics
19 The A/A Test
Why A/A Tests?
How to Run A/A Tests
When the A/A Test Fails
20 Triggering for Improved Sensitivity
Examples of Triggering
A Numerical Example (Kohavi, Longbotham et al. 2009)
Optimal and Conservative Triggering
Overall Treatment Effect
Trustworthy Triggering
Common Pitfalls
Open Questions
21 Sample Ratio Mismatch and Other Trust-Related Guardrail Metrics
Sample Ratio Mismatch
Debugging SRMs
22 Leakage and Interference between Variants
Examples
Some Practical Solutions
Detecting and Monitoring Interference
23 Measuring Long-Term Treatment Effects
What Are Long-Term Effects?
Reasons the Treatment Effect May Differ between Short-Term and Long-Term
Why Measure Long-Term Effects?
Long-Running Experiments
Alternative Methods for Long-Running Experiments
References
Index
Preface
How to Read This Book
If we have data, let's look at data.
If all we have are opinions, let's go with mine.
Jim Barksdale, Former CEO of Netscape
Our goal in writing this book is to share practical lessons from decades of
experience running online controlled experiments at scale at Amazon and
Microsoft (Ron), Google (Diane), and Microsoft and LinkedIn (Ya). While
we are writing this book in our capacity as individuals and not as representa-
tives of Google, LinkedIn, or Microsoft, we have distilled key lessons and
pitfalls encountered over the years and provide guidance for both software
platforms and the corporate cultural aspects of using online controlled experi-
ments to establish a data-driven culture that informs rather than relies on the
HiPPO (Highest Paid Person's Opinion) (R. Kohavi, HiPPO FAQ 2019). We
believe many of these lessons apply in the online setting, to large or small
companies, or even teams and organizations within a company. A concern we
share is the need to evaluate the trustworthiness of experiment results. We
believe in the skepticism implied by Twyman's Law: Any figure that looks
interesting or different is usually wrong; we encourage readers to double-
check results and run validity tests, especially for breakthrough positive
results. Getting numbers is easy; getting numbers you can trust is hard!
Part I is designed to be read by everyone, regardless of background, and
consists of four chapters.
Chapter 1 is an overview of the benefits of running online controlled
experiments and introduces experiment terminology.
Chapter 2 uses an example to run through the process of running an
experiment end-to-end.
Chapter 3 describes common pitfalls and how to build experimentation
trustworthiness, and
Chapter 4 overviews what it takes to build an experiment platform and scale
online experimentation.
Parts II through V can be consumed by everyone as needed but are written with
a focus on a specific audience. Part II contains five chapters on fundamentals,
such as Organizational Metrics. The topics in Part II are recommended for
everyone, especially leaders and executives. Part III contains two chapters that
introduce techniques to complement online controlled experiments that
leaders, data scientists, engineers, analysts, product managers, and others
would find useful for guiding resources and time investment. Part IV focuses
on building an experimentation platform and is aimed toward engineers.
Finally, Part V digs into advanced analysis topics and is geared toward data
scientists.
Our website, https://experimentguide.com, is a companion to this book. It
contains additional material, errata, and provides an area for open discussion.
The authors intend to donate all proceeds from this book to charity.
Acknowledgments
We would like to thank our colleagues who have worked with us throughout
the years. While too numerous to name individually, this book is based on our
combined work, as well as others throughout the industry and beyond
researching and conducting online controlled experiments. We learned a great
deal from you all, thank you.
On writing the book, we'd like to call out Lauren Cowles, our editor, for
partnering with us throughout this process. Cherie Woodward provided great
line editing and style guidance to help mesh our three voices. Stephanie Grey
worked with us on all diagrams and figures, improving them in the process.
Kim Vernon provided final copy-editing and bibliography checks.
Most importantly, we owe a deep debt of gratitude to our families, as we
missed time with them to work on this book. Thank you to Ronny's family:
Yael, Oren, Ittai, and Noga, to Diane's family: Ben, Emma, and Leah, and to
Ya's family: Thomas, Leray, and Tavis. We could not have written this book
without your support and enthusiasm!
Google: Hal Varian, Dan Russell, Carrie Grimes, Niall Cardin, Deirdre
O'Brien, Henning Hohnhold, Mukund Sundararajan, Amir Najmi, Patrick
Riley, Eric Tassone, Jen Gennai, Shannon Vallor, Eric Miraglia, David Price,
Crystal Dahlen, Tammy Jih Murray, Lanah Donnelly and all who work on
experiments at Google.
LinkedIn: Stephen Lynch, Yav Bojinov, Jiada Liu, Weitao Duan, Nanyu
Chen, Guillaume Saint-Jacques, Elaine Call, Min Liu, Arun Swami, Kiran
Prasad, Igor Perisic, and the entire Experimentation team.
Microsoft: Omar Alonso, Benjamin Arai, Jordan Atlas, Richa Bhayani, Eric
Boyd, Johnny Chan, Alex Deng, Andy Drake, Aleksander Fabijan, Brian
Frasca, Scott Gude, Somit Gupta, Adam Gustafson, Tommy Guy, Randy
Henne, Edward Jezierski, Jing Jin, Dongwoo Kim, Waldo Kuipers, Jonathan
Litz, Sophia Liu, Jiannan Lu, Qi Lu, Daniel Miller, Carl Mitchell, Nils
Pohlmann, Wen Qin, Thomas Schreiter, Harry Shum, Dan Sommerfield, Garnet
Vaz, Toby Walker, Michele Zunker, and the Analysis & Experimentation team.
Special thanks to Maria Stone and Marcus Persson for feedback throughout
the book, and Michelle N. Meyer for expert feedback on the ethics chapter.
Others who have given feedback include: Adil Aijaz, Jonas Alves, Alon
Amit, Kevin Anderson, Joel Barajas, Houman Bedayat, Beau Bender, Bahador
Biglari, Stuart Buck, Jike Chong, Jed Chou, Pavel Dmitriev, Yurong Fan,
Georgi Georgiev, Ilias Gerostathopoulos, Matt Gershoff, William Grosso,
Aditya Gupta, Rajesh Gupta, Shilpa Gupta, Kris Jack, Jacob Jarnvall, Dave
Karow, Slawek Kierner, Pete Koomen, Dylan Lewis, Bryan Liu, David Man-
heim, Colin McFarland, Tanapol Nearunchron, Dheeraj Ravindranath, Aaditya
Ramdas, Andre Richter, Jianhong Shen, Gang Su, Anthony Tang, Lukas
Vermeer, Rowel Willems, Yu Yang, and Yufeng Wang.
Thank you to the many who helped who are not named explicitly.
PART I
Introductory Topics for Everyone
1
Introduction and Motivation
One accurate measurement is worth more than a thousand expert
opinions
Admiral Grace Hopper
In 2012, an employee working on Bing, Microsoft's search engine, suggested
changing how ad headlines display (Kohavi and Thomke 2017). The idea was
to lengthen the title line of ads by combining it with the text from the first line
below the title, as shown in Figure 1.1.
Nobody thought this simple change, among the hundreds suggested, would
be the best revenue-generating idea in Bing's history!
The feature was prioritized low and languished in the backlog for more than
six months until a software developer decided to try the change, given how
easy it was to code. He implemented the idea and began evaluating the idea on
real users, randomly showing some of them the new title layout and others the
old one. User interactions with the website were recorded, including ad clicks
and the revenue generated from them. This is an example of an A/B test, the
simplest type of controlled experiment that compares two variants: A and B, or
a Control and a Treatment.
A few hours after starting the test, a revenue-too-high alert triggered,
indicating that something was wrong with the experiment. The Treatment, that
is, the new title layout, was generating too much money from ads. Such "too
good to be true" alerts are very useful, as they usually indicate a serious bug,
such as cases where revenue was logged twice (double billing) or where only
ads displayed, and the rest of the web page was broken.
For this experiment, however, the revenue increase was valid. Bing's revenue
increased by a whopping 12%, which at the time translated to over $100M
annually in the US alone, without significantly hurting key user-experience
metrics. The experiment was replicated multiple times over a long period.
The example typifies several key themes in online controlled experiments:
• It is hard to assess the value of an idea. In this case, a simple change worth
over $100M/year was delayed for months.
• Small changes can have a big impact. A $100M/year return-on-investment
(ROI) on a few days' work for one engineer is about as extreme as it gets.
Figure 1.1 An experiment changing the way ads display on Bing
• Experiments with big impact are rare. Bing runs over 10,000 experiments a
year, but simple features resulting in such a big improvement happen only
once every few years.
• The overhead of running an experiment must be small. Bing's engineers had
access to ExP, Microsoft's experimentation system, which made it easy to
scientifically evaluate the idea.
• The overall evaluation criterion (OEC, described more later in this chapter)
must be clear. In this case, revenue was a key component of the OEC, but
revenue alone is insufficient as an OEC. It could lead to plastering the web
site with ads, which is known to hurt the user experience. Bing uses an OEC
that weighs revenue against user-experience metrics, including Sessions per
user (are users abandoning or increasing engagement) and several other
components. The key point is that user-experience metrics did not
significantly degrade even though revenue increased dramatically.
The next section introduces the terminology of controlled experiments.
Online Controlled Experiments Terminology
Controlled experiments have a long and fascinating history, which we share
online (Kohavi, Tang and Xu 2019). They are sometimes called A/B tests,
A/B/n tests (to emphasize multiple variants), field experiments, randomized
controlled experiments, split tests, bucket tests, and flights. In this book, we
use the terms controlled experiments and A/B tests interchangeably, regardless
of the number of variants.
Online controlled experiments are used heavily at companies like Airbnb,
Amazon, Booking.com, eBay, Facebook, Google, LinkedIn, Lyft, Microsoft,
Netflix, Twitter, Uber, Yahoo!/Oath, and Yandex (Gupta et al. 2019). These
companies run thousands to tens of thousands of experiments every year,
sometimes involving millions of users and testing everything, including
changes to the user interface (UI), relevance algorithms (search, ads, personal-
ization, recommendations, and so on), latency/performance, content manage-
ment systems, customer support systems, and more. Experiments are run on
multiple channels: websites, desktop applications, mobile applications, and
e-mail.
In the most common online controlled experiments, users are randomly split
between variants in a persistent manner (a user receives the same variant in
multiple visits). In our opening example from Bing, the Control was the
original display of ads and the Treatment was the display of ads with longer
titles. The users' interactions with the Bing web site were instrumented, that is,
monitored and logged. From the logged data, metrics are computed, which
allowed us to assess the difference between the variants for each metric.
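As a minimal illustration of that last step, the sketch below computes a per-metric difference between Treatment and Control with a rough 95% confidence interval. The per-user revenue numbers are invented, the samples are far smaller than any real experiment, and this is not the analysis code of any platform discussed in the book; Chapter 17 covers the underlying statistics properly.

```python
import math

def compare(treatment: list[float], control: list[float], z: float = 1.96):
    """Difference in means (Treatment minus Control) with an approximate
    95% confidence interval, using a simple two-sample z-interval."""
    def mean_and_var(xs: list[float]):
        m = sum(xs) / len(xs)
        v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance
        return m, v

    mean_t, var_t = mean_and_var(treatment)
    mean_c, var_c = mean_and_var(control)
    delta = mean_t - mean_c
    std_err = math.sqrt(var_t / len(treatment) + var_c / len(control))
    return delta, (delta - z * std_err, delta + z * std_err)

# Hypothetical per-user revenue for each variant (real experiments use
# thousands to millions of users, not five):
delta, (low, high) = compare([1.2, 0.0, 3.4, 0.0, 2.2], [0.9, 0.0, 2.8, 0.0, 1.9])
print(f"delta = {delta:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```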
In the simplest controlled experiments, there are two variants: Control (A)
and Treatment (B), as shown in Figure 1.2.
We follow the terminology of Kohavi and Longbotham (2017), and Kohavi,
Longbotham et al. (2009) and provide related terms from other fields below.
You can find many other resources on experimentation and A/B testing at the
end of this chapter under Additional Reading.
Figure 1.2 A simple controlled experiment (an A/B test): 100% of users are randomly split, with 50% assigned to the Control (the existing system) and 50% to the Treatment (the existing system with feature X); users' interactions are instrumented, then analyzed and compared at the end of the experiment.
Overall Evaluation Criterion (OEC): A quantitative measure of the experiment's objective.
For example, your OEC might be active days per user, indicating the number of days
during the experiment that users were active (i.e., they visited and took some action).
Increasing this OEC implies that users are visiting your site more often, which is a great
outcome. The OEC must be measurable in the short term (the duration of an experiment)
yet believed to causally drive long-term strategic objectives (see Strategy, Tactics, and
Their Relationship to Experiments later in this chapter and Chapter 7). In the case of a
search engine, the OEC can be a combination of usage (e.g., sessions-per-user),
relevance (e.g., successful sessions, time to success), and advertisement revenue
(not all search engines use all of these metrics or only these metrics).
In statistics, this is often called the Response or Dependent variable (Mason,
Gunst and Hess 1989, Box, Hunter and Hunter 2005); other synonyms are
Outcome, Evaluation, and Fitness Function (Quarto-vonTivadar 2006). Experiments
can have multiple objectives and analysis can use a balanced scorecard
approach (Kaplan and Norton 1996), although selecting a single metric,
possibly as a weighted combination of such objectives, is highly desired and
recommended (Roy 2001, 50, 405–429).
We take a deeper dive into determining the OEC for experiments in
Chapter 7.
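To make the idea of a single OEC built from several metrics concrete, here is a minimal sketch of one way a weighted combination could be computed. The metric names, weights, and the choice to weight relative changes against Control are illustrative assumptions, not the OEC of Bing, Amazon, or any other company discussed in the book; Chapter 7 discusses how to design a real one.

```python
# Hypothetical OEC: a weighted sum of relative changes versus Control.
# Negative weights penalize metrics we want to drive down.
WEIGHTS = {
    "sessions_per_user": 0.6,      # usage: more is better
    "time_to_success_sec": -0.2,   # relevance: less is better
    "revenue_per_user": 0.2,       # monetization
}

def oec_score(treatment: dict[str, float], control: dict[str, float]) -> float:
    """Positive scores mean the Treatment moved the weighted bundle of
    metrics in the desired direction overall."""
    score = 0.0
    for metric, weight in WEIGHTS.items():
        relative_change = (treatment[metric] - control[metric]) / control[metric]
        score += weight * relative_change
    return score

treatment = {"sessions_per_user": 4.1, "time_to_success_sec": 61.0, "revenue_per_user": 1.05}
control = {"sessions_per_user": 4.0, "time_to_success_sec": 64.0, "revenue_per_user": 1.00}
print(f"OEC score: {oec_score(treatment, control):+.4f}")
```

The weights are where the real work lies: they encode the trade-off the organization is willing to make between usage, relevance, and revenue.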
Parameter: A controllable experimental variable that is thought to influence
the OEC or other metrics of interest. Parameters are sometimes called
factors or variables. Parameters are assigned values, also called levels. In
simple A/B tests, there is commonly a single parameter with two values. In
the online world, it is common to use univariable designs with multiple values
(such as, A/B/C/D). Multivariable tests, also called Multivariate Tests (MVTs),
evaluate multiple parameters (variables) together, such as font color and font
size, allowing experimenters to discover a global optimum when parameters
interact (see Chapter 4).
Variant: A user experience being tested, typically by assigning values to
parameters. In a simple A/B test, A and B are the two variants, usually called
Control and Treatment. In some literature, a variant only means a Treatment;
we consider the Control to be a special variant: the existing version on which
to run the comparison. For example, in case of a bug discovered in the
experiment, you would abort the experiment and ensure that all users are
assigned to the Control variant.
Randomization Unit: A pseudo-randomization (e.g., hashing) process is
applied to units (e.g., users or pages) to map them to variants. Proper random-
ization is important to ensure that the populations assigned to the different
variants are similar statistically, allowing causal effects to be determined with
high probability. You must map units to variants in a persistent and independ-
ent manner (i.e., if user is the randomization unit, a user should consistently
see the same experience, and the assignment of a user to a variant should not
tell you anything about the assignment of a different user to its variant). It is
very common, and we highly recommend, to use users as a randomization unit
when running controlled experiments for online audiences. Some experimental
designs choose to randomize by pages, sessions, or user-day (i.e., the experi-
ment remains consistent for the user for each 24-hour window determined by
the server). See Chapter 14 for more information.
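As a concrete illustration of persistent, independent assignment, the sketch below hashes a user ID together with an experiment ID to pick a variant. Hash-based bucketing is a common pattern, but the function names and the 50/50 split here are assumptions for the example, not the implementation of any specific experimentation platform.

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants: tuple[str, ...] = ("control", "treatment")) -> str:
    """Deterministically map a user to a variant for one experiment.

    Hashing user_id together with experiment_id makes the assignment
    persistent (the same user always gets the same variant within an
    experiment) and independent across experiments (one experiment's
    assignment reveals nothing about another's)."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 1000         # 1,000 fine-grained buckets
    per_variant = 1000 // len(variants)     # equal split across variants
    return variants[min(bucket // per_variant, len(variants) - 1)]

# The same user gets the same answer every time for a given experiment:
print(assign_variant("user-42", "longer-ad-titles"))
print(assign_variant("user-42", "longer-ad-titles"))
```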
Proper randomization is critical! If the experimental design assigns an equal
percentage of users to each variant, then each user should have an equal chance
of being assigned to each variant. Do not take randomization lightly. The
examples below demonstrate the challenge and importance of proper
randomization.
The RAND corporation needed random numbers for Monte Carlo methods
in the 1940s, so they created a book of a million random digits generated
using a pulse machine. However, due to skews in the hardware, the original
table was found to have significant biases and the digits had to be
re-randomized in a new edition of the book (RAND 1955).
Controlled experiments were initially used in medical domains. The US
Veterans Administration (VA) conducted an experiment (drug trial) of
streptomycin for tuberculosis, but the trials failed because physicians intro-
duced biases and influenced the selection process (Marks 1997). Similar
trials in Great Britain were done with blind protocols and were successful,
creating what is now called a watershed moment in controlled trials (Doll
1998).
No factor should be allowed to influence variant assignment. Users (units)
cannot be distributed "any old which way" (Weiss 1997). It is important to
note that "random does not mean haphazard or unplanned, but a deliberate
choice based on probabilities" (Mosteller, Gilbert and McPeek 1983). Senn
(2012) discusses some myths of randomization.
Why Experiment? Correlations, Causality,
and Trustworthiness
Let's say you're working for a subscription business like Netflix, where X% of
users churn (end their subscription) every month. You decide to introduce a
new feature and observe that churn rate for users using that feature is X%/2,
that is, half. You might be tempted to claim causality; the feature is reducing
churn by half. This leads to the conclusion that if we make the feature more
discoverable and used more often, subscriptions will soar. Wrong! Given the
data, no conclusion can be drawn about whether the feature reduces or
increases user churn, and both are possible.
An example demonstrating this fallacy comes from Microsoft Office 365,
another subscription business. Office 365 users that see error messages and
experience crashes have lower churn rates, but that does not mean that Office
365 should show more error messages or that Microsoft should lower code
quality, causing more crashes. It turns out that all three events are caused by a
single factor: usage. Heavy users of the product see more error messages, experi-
ence more crashes, and have lower churn rates. Correlation does not imply
causality and overly relying on these observations leads to faulty decisions.
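To see how a common cause like usage can manufacture exactly this kind of correlation, here is a small, purely synthetic simulation; the probabilities are invented and do not describe Office 365 or Netflix. The feature has no causal effect on churn, yet feature users appear to churn far less because heavy users both adopt the feature more and churn less.

```python
import random

random.seed(0)

def simulate_user():
    heavy = random.random() < 0.5                              # half the users are heavy users
    uses_feature = random.random() < (0.6 if heavy else 0.1)   # heavy users discover the feature more
    churns = random.random() < (0.02 if heavy else 0.10)       # heavy users churn less
    return uses_feature, churns                                # note: churn never depends on the feature

users = [simulate_user() for _ in range(100_000)]
with_feature = [churned for used, churned in users if used]
without_feature = [churned for used, churned in users if not used]
print(f"churn among feature users:     {sum(with_feature) / len(with_feature):.3f}")
print(f"churn among non-feature users: {sum(without_feature) / len(without_feature):.3f}")
```

A controlled experiment that randomized the feature would show no difference here, because randomization balances heavy and light users across the variants.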
In 1995, Guyatt et al. (1995) introduced the hierarchy of evidence as a way to
grade recommendations in medical literature, which Greenhalgh expanded on in
her discussions on practicing evidence-based medicine (1997, 2014). Figure 1.3
shows a simple hierarchy of evidence, translated to our terminology, based on
Bailar (1983, 1). Randomized controlled experiments are the gold standard for
establishing causality. Systematic reviews, that is, meta-analysis, of controlled
experiments provide more evidence and generalizability.
More complex models, such as the Levels of Evidence by the Oxford Centre
for Evidence-based Medicine are also available (2009).
The experimentation platforms used by our companies allow experimenters
at Google, LinkedIn, and Microsoft to run tens of thousands of online con-
trolled experiments a year with a high degree of trust in the results. We believe
online controlled experiments are:
• The best scientific way to establish causality with high probability.
• Able to detect small changes that are harder to detect with other techniques,
such as changes over time (sensitivity).
Figure 1.3 A simple hierarchy of evidence for assessing the quality of trial design (Greenhalgh 2014). From strongest to weakest: systematic reviews (meta-analysis) of randomized controlled experiments; randomized controlled experiments; other controlled experiments (e.g., natural, non-randomized); observational studies (cohort and case control); case studies (analysis of a person or group), anecdotes, and personal (often expert) opinion, a.k.a. HiPPO.
• Able to detect unexpected changes. Often underappreciated, but many
experiments uncover surprising impacts on other metrics, be it performance
degradation, increased crashes/errors, or cannibalizing clicks from other
features.
A key focus of this book is highlighting potential pitfalls in experiments and
suggesting methods that improve trust in results. Online controlled experi-
ments provide an unparalleled ability to electronically collect reliable data at
scale, randomize well, and avoid or detect pitfalls (see Chapter 11). We
recommend using other, less trustworthy, methods, including observational
studies, when online controlled experiments are not possible.
Necessary Ingredients for Running Useful
Controlled Experiments
Not every decision can be made with the scientific rigor of a controlled
experiment. For example, you cannot run a controlled experiment on mergers
and acquisitions (M&A), as we cannot have both the merger/acquisition and its
counterfactual (no such event) happening concurrently. We now review the
necessary technical ingredients for running useful controlled experiments
(Kohavi, Crook and Longbotham 2009), followed by organizational tenets.
In Chapter 4, we cover the experimentation maturity model.
1. There are experimental units (e.g., users) that can be assigned to different
variants with no interference (or little interference); for example, users in
Treatment do not impact users in Control (see Chapter 22).
2. There are enough experimental units (e.g., users). For controlled experiments
to be useful, we recommend thousands of experimental units: the larger the
number, the smaller the effects that can be detected (see the sample-size
sketch after this list). The good news is that even small software startups
typically get enough users quickly and can start to run controlled experiments,
initially looking for big effects. As the business grows, it becomes more
important to detect smaller changes (e.g., large web sites must be able to
detect small changes to key metrics impacting user experience and fractions
of a percent change to revenue), and the sensitivity improves with a growing
user base.
3. Key metrics, ideally an OEC, are agreed upon and can be practically
evaluated. If the goals are too hard to measure, it is important to agree on
surrogates (see Chapter 7). Reliable data can be collected, ideally cheaply
and broadly. In software, it is usually easy to log system events and user
actions (see Chapter 13).
4. Changes are easy to make. Software is typically easier to change than
hardware; but even in software, some domains require a certain level of
quality assurance. Changes to a recommendation algorithm are easy to
make and evaluate; changes to software in airplane flight control systems
require a whole different approval process by the Federal Aviation Admin-
istration (FAA). Server-side software is much easier to change than client-
side (see Chapter 12), which is why calling services from client software is
becoming more common, enabling upgrades and changes to the services to
be done more quickly and using controlled experiments.
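To give a feel for ingredient 2, the sample-size sketch referenced above is a back-of-the-envelope calculation using the common rule of thumb of roughly 16 * variance / delta^2 users per variant for about 80% power at a 5% two-sided significance level. The 5% baseline conversion rate and the lifts are made-up numbers for illustration.

```python
def users_per_variant(baseline_rate: float, relative_lift: float) -> int:
    """Approximate users needed per variant to detect a relative lift in a
    conversion rate, at roughly 80% power and 5% two-sided significance.

    Uses the rule of thumb n ~= 16 * variance / delta^2, with variance
    p * (1 - p) for a Bernoulli (conversion) metric."""
    delta = baseline_rate * relative_lift            # absolute change to detect
    variance = baseline_rate * (1 - baseline_rate)
    return int(16 * variance / delta ** 2)

# Smaller effects need far more users (5% baseline conversion rate):
for lift in (0.10, 0.05, 0.01):                      # 10%, 5%, 1% relative lifts
    print(f"{lift:.0%} lift -> ~{users_per_variant(0.05, lift):,} users per variant")
```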
Most non-trivial online services meet, or could meet, the necessary ingredients
for running an agile development process based on controlled experiments.
Many implementations of software+services could also meet the requirements
relatively easily. Thomke wrote that organizations will recognize maximal
benefits from experimentation when it is used in conjunction with an innov-
ation system (Thomke 2003). Agile software development is such an innov-
ation system.
When controlled experiments are not possible, modeling could be done, and
other experimental techniques might be used (see Chapter 10). The key is that
if controlled experiments can be run, they provide the most reliable and
sensitive mechanism to evaluate changes.
Tenets
There are three key tenets for organizations that wish to run online controlled
experiments (Kohavi et al. 2013):
1. The organization wants to make data-driven decisions and has formalized
an OEC.
2. The organization is willing to invest in the infrastructure and tests to run
controlled experiments and ensure that the results are trustworthy.
3. The organization recognizes that it is poor at assessing the value of ideas.
Tenet 1: The Organization Wants to Make Data-Driven
Decisions and Has Formalized an OEC
You will rarely hear someone at the head of an organization say that they don't
want to be data-driven (with the notable exception of Apple under Steve Jobs,
where Ken Segall claimed that "we didn't test a single ad. Not for print, TV,
billboards, the web, retail, or anything" (Segall 2012, 42). But measuring the
incremental benet to users from new features has cost, and objective meas-
urements typically show that progress is not as rosy as initially envisioned.
Many organizations will not spend the resources required to dene and
measure progress. It is often easier to generate a plan, execute against it, and
declare success, with the key metric being: "percent of plan delivered," ignor-
ing whether the feature has any positive impact to key metrics.
To be data-driven, an organization should define an OEC that can be easily
measured over relatively short durations (e.g., one to two weeks). Large
organizations may have multiple OECs or several key metrics that are shared
with refinements for different areas. The hard part is finding metrics measurable
in a short period, sensitive enough to show differences, and that are predictive
of long-term goals. For example, Profit is not a good OEC, as short-term
theatrics (e.g., raising prices) can increase short-term profit, but may hurt it in
the long run. Customer lifetime value is a strategically powerful OEC (Kohavi,
Longbotham et al. 2009). We cannot overemphasize the importance of agreeing on a
good OEC that your organization can align behind; see Chapter 6.
The terms "data-informed" or "data-aware" are sometimes used to avoid the
implication that a single source of data (e.g., a controlled experiment) drives
the decisions (King, Churchill and Tan 2017, Knapp et al. 2006). We use data-
driven and data-informed as synonyms in this book. Ultimately, a decision
should be made with many sources of data, including controlled experiments,
surveys, estimates of maintenance costs for the new code, and so on. A data-
driven or a data-informed organization gathers relevant data to drive a decision
and inform the HiPPO (Highest Paid Person's Opinion) rather than relying on
intuition (Kohavi 2019).
Tenet 2: The Organization Is Willing to Invest in the
Infrastructure and Tests to Run Controlled Experiments and
Ensure That Their Results Are Trustworthy
In the online software domain (websites, mobile, desktop applications, and
services) the necessary conditions for controlled experiments can be met
through software engineering work (see Necessary Ingredients for Running
Useful Controlled Experiments): it is possible to reliably randomize users; it is
possible to collect telemetry; and it is relatively easy to introduce software
changes, such as new features (see Chapter 4). Even relatively small websites
have enough users to run the necessary statistical tests (Kohavi, Crook and
Longbotham 2009).
Controlled experiments are especially useful in combination with Agile
software development (Martin 2008, K. S. Rubin 2012), Customer Develop-
ment process (Blank 2005), and MVPs (Minimum Viable Products), as popu-
larized by Eric Ries in The Lean Startup (Ries 2011).
In other domains, it may be hard or impossible to reliably run controlled
experiments. Some interventions required for controlled experiments in med-
ical domains may be unethical or illegal. Hardware devices may have long lead
times for manufacturing and modifications are difficult, so controlled experi-
ments with users are rarely run on new hardware devices (e.g., new mobile
phones). In these situations, other techniques, such as Complementary Tech-
niques (see Chapter 10), may be required when controlled experiments cannot
be run.
Assuming you can run controlled experiments, it is important to ensure their
trustworthiness. When running online experiments, getting numbers is easy;
getting numbers you can trust is hard. Chapter 3 is dedicated to trustworthy
results.
Tenet 3: The Organization Recognizes That It Is Poor
at Assessing the Value of Ideas
Features are built because teams believe they are useful, yet in many domains
most ideas fail to improve key metrics. Only one third of the ideas tested at
Microsoft improved the metric(s) they were designed to improve (Kohavi,
Crook and Longbotham 2009). Success is even harder to find in well-
optimized domains like Bing and Google, whereby some measures' success
rate is about 10–20% (Manzi 2012).
Fareed Mosavat, Slack's Director of Product and Lifecycle, tweeted that with
all of Slack's experience, only about 30% of monetization experiments show
positive results; if you are on an experiment-driven team, get used to, at best,
70% of your work being thrown away. Build your processes accordingly
(Mosavat 2019).
Avinash Kaushik wrote in his Experimentation and Testing primer (Kaushik
2006) that "80% of the time you/we are wrong about what a customer wants."
Mike Moran (Moran 2007, 240) wrote that Netflix considers 90% of what they
try to be wrong. Regis Hadiaris from Quicken Loans wrote that "in the five
years I've been running tests, I'm only about as correct in guessing the results
as a major league baseball player is in hitting the ball. That's right: I've been
doing this for 5 years, and I can only 'guess' the outcome of a test about 33%
of the time!" (Moran 2008). Dan McKinley at Etsy (McKinley 2013) wrote
"nearly everything fails" and, for features, he wrote "it's been humbling to
realize how rare it is for them to succeed on the first attempt. I strongly suspect
that this experience is universal, but it is not universally recognized or
acknowledged." Finally, Colin McFarland wrote in the book Experiment!
(McFarland 2012, 20): "No matter how much you think it's a no-brainer,
how much research you've done, or how many competitors are doing it,
sometimes, more often than you might think, experiment ideas simply fail."
Not every domain has such poor statistics, but most who have run controlled
experiments in customer-facing websites and applications have experienced
this humbling reality: we are poor at assessing the value of ideas.
Improvements over Time
In practice, improvements to key metrics are achieved by many small changes:
0.1% to 2%. Many experiments only impact a segment of users, so you must
dilute the impact: a 5% improvement for 10% of your users translates into
a much smaller overall impact (e.g., 0.5% if the triggered population is similar to the
rest of the users); see Chapter 3. As Al Pacino says in the movie Any Given
Sunday, "winning is done inch by inch."
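As a quick illustration of that dilution arithmetic, here is a minimal sketch; the 5%/10% numbers come from the example above, and the simplifying assumption that triggered users behave like everyone else is the same one made in the text.

```python
# Back-of-the-envelope dilution of a triggered Treatment effect to the whole
# user base, assuming triggered users are similar to the rest (a simplifying
# assumption; Chapter 3 discusses triggering and dilution more carefully).
def diluted_effect(effect_on_triggered: float, triggered_fraction: float) -> float:
    return effect_on_triggered * triggered_fraction

# A 5% improvement for 10% of users is roughly a 0.5% overall improvement.
print(diluted_effect(0.05, 0.10))  # 0.005
```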
Google Ads Example
In 2011, Google launched an improved ad ranking mechanism after over a year
of development and incremental experiments (Google 2011). Engineers
developed and experimented with new and improved models for measuring
the quality score of ads within the existing ad ranking mechanism, as well as
with changes to the ad auction itself. They ran hundreds of controlled experiments
and multiple iterations; some across all markets, and some long term in
specific markets to understand the impact on advertisers in more depth. This
large backend change, validated through controlled experiments, showed how
planning multiple changes and layering them together improved the user
experience by providing higher-quality ads, and improved the advertiser
experience by moving towards lower average prices for those higher-quality ads.
Bing Relevance Example
The Relevance team at Bing consists of several hundred people tasked with
improving a single OEC metric by 2% every year. The 2% is the sum of the
Treatment effects (i.e., the delta of the OEC) in all controlled experiments that
shipped to users over the year, assuming the effects are additive. Because the team
runs thousands of experiment Treatments, and some may appear positive by
chance (Lee and Shen 2018), credit towards the 2% is assigned based on a
replication experiment: once the implementation of an idea is successful,
possibly after multiple iterations and refinements, a certification experiment
is run with a single Treatment. The Treatment effect of this certification
experiment determines the credit towards the 2% goal. Recent work
suggests shrinking the Treatment effect estimates to improve precision (Coey and
Cunningham 2019).
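To make the accounting concrete, here is a minimal sketch that sums certified Treatment effects toward an annual goal and applies a simple shrinkage toward zero for noisy estimates. The shrinkage weighting shown (an empirical-Bayes-style rule) and all of the numbers are illustrative assumptions, not the specific method of Coey and Cunningham (2019) or the process used at Bing.

```python
# Illustrative sketch: credit toward an annual OEC goal as the sum of certified
# Treatment effects, optionally shrunk toward zero when estimates are noisy.
# The shrinkage weighting and the numbers below are assumptions for illustration.

def shrunk_effect(estimate: float, std_error: float, prior_var: float) -> float:
    """Shrink a noisy estimate toward zero; noisier estimates shrink more."""
    weight = prior_var / (prior_var + std_error ** 2)
    return weight * estimate

# (delta of the OEC in %, standard error) for each certified, shipped experiment
certified = [(0.40, 0.10), (0.30, 0.20), (0.10, 0.15)]

raw_credit = sum(delta for delta, _ in certified)
shrunk_credit = sum(shrunk_effect(delta, se, prior_var=0.05) for delta, se in certified)
print(f"raw: {raw_credit:.2f}%, shrunk: {shrunk_credit:.2f}%")
```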
Bing Ads Example
The Ads team at Bing has consistently grown revenue 15% to 25% per year
(eMarketer 2016), but most improvements were done inch-by-inch. Every
month a "package" was shipped, the result of many experiments, as shown
in Figure 1.4. Most improvements were small, and some monthly packages were
even known to be negative, as a result of space constraints or legal
requirements.
Figure 1.4 Bing Ad Revenue over Time (y-axis represents about 20% growth/
year). The specific numbers are not important
It is informative to see the seasonality spikes around December when
purchase intent by users rises dramatically, so ad space is increased, and
revenue per thousand searches increases.
Examples of Interesting Online Controlled Experiments
Interesting experiments are ones where the absolute difference between the
expected outcome and the actual result is large. If you thought something was
going to happen and it happened, then you haven't learned much. If you
thought something was going to happen and it didn't, then you've learned
something important. And if you thought something minor was going to
happen, and the results are a major surprise and lead to a breakthrough, you've
learned something highly valuable.
The Bing example at the beginning of this chapter and those in this section
are uncommon successes with surprising, highly positive results. Bing's
attempt to integrate with social networks, such as Facebook and Twitter, is
an example of expecting a strong result and not seeing it: the effort was
abandoned after many experiments over two years showed no value.
While sustained progress is a matter of continued experimentation and many
small improvements, as shown in the section Bing Ads Example, here are
several examples highlighting large, surprising effects that stress how poorly
we assess the value of ideas.
UI Example: 41 Shades of Blue
Small design decisions can have significant impact, as both Google and Microsoft
have consistently shown. Google tested 41 gradations of blue on Google
search results pages (Holson 2009), frustrating the visual design lead at the time.
However, Google's tweaks to the color scheme ended up being substantially
positive on user engagement (note that Google does not report on the results of
individual changes) and led to a strong partnership between design and experimentation
moving forward. Microsoft's Bing color tweaks similarly showed
that users were more successful at completing tasks, their time-to-success
improved, and monetization improved to the tune of over $10 million annually in
the United States (Kohavi et al. 2014, Kohavi and Thomke 2017).
While these are great examples of tiny changes causing massive
impact, given that a wide sweep of colors was done, it is unlikely that playing
around with colors in additional experiments will yield more significant
improvements.
Making an Offer at the Right Time
In 2004, Amazon placed a credit-card offer on the home page. It was highly
profitable but had a very low click-through rate (CTR). The team ran an
experiment to move the offer to the shopping cart page that the user sees after
adding an item, showing simple math highlighting the savings the user would
receive, as shown in Figure 1.5 (Kohavi et al. 2014).
Since users adding an item to the shopping cart have clear purchase intent,
this offer displays at the right time. The controlled experiment demonstrated
that this simple change increased Amazon's annual profit by tens of millions of
dollars.
Figure 1.5 Amazon's credit card offer with savings on cart total
Personalized Recommendations
Greg Linden at Amazon created a prototype to display personalized recommendations
based on items in the user's shopping cart (Linden 2006, Kohavi,
Longbottom et al. 2009). When you add an item, recommendations come up;
add another item, new recommendations show up. Linden notes that while the
prototype looked promising, a marketing senior vice-president was "dead set
against it," claiming it would distract people from checking out. Greg was
"forbidden to work on this any further." Nonetheless, he ran a controlled
experiment, and "the feature won by such a wide margin that not having it
live was costing Amazon a noticeable chunk of change. With new urgency,
shopping cart recommendations launched." Now multiple sites use cart
recommendations.
Speed Matters a LOT
In 2012, an engineer at Microsoft's Bing made a change to the way JavaScript
was generated, which shortened the HTML sent to clients significantly,
resulting in improved performance. The controlled experiment showed a
surprising number of improved metrics. They conducted a follow-on
experiment to estimate the impact on server performance. The result showed
that performance improvements also significantly improve key user metrics,
such as success rate and time-to-success, and each 10 millisecond performance
improvement (1/30th of the speed of an eye blink) pays for the fully loaded
annual cost of an engineer (Kohavi et al. 2013).
By 2015, as Bing's performance improved, there were questions about
whether there was still value in performance improvements when the server
was returning results in under a second at the 95th percentile (i.e., for 95% of
the queries). The team at Bing conducted a follow-on study, and key user
metrics still improved significantly. While the relative impact on revenue was
somewhat reduced, Bing's revenue had improved so much over that time that
each millisecond of improved performance was worth more than in the past;
every four milliseconds of improvement funded an engineer for a year! See
Chapter 5 for an in-depth review of this experiment and the criticality of
performance.
Performance experiments were done at multiple companies with results
indicating how critical performance is. At Amazon, a 100-millisecond slowdown
experiment decreased sales by 1% (Linden 2006b, 10). A joint talk by
speakers from Bing and Google (Schurman and Brutlag 2009) showed the
significant impact of performance on key metrics, including distinct queries,
revenue, clicks, satisfaction, and time-to-click.
Malware Reduction
Ads are a lucrative business, and "freeware" installed by users often contains
malware that pollutes pages with ads. Figure 1.6 shows what a resulting page
from Bing looked like to a user with malware. Note that multiple ads (highlighted
in red) were added to the page (Kohavi et al. 2014).
Not only were Bing ads removed, depriving Microsoft of revenue, but low-
quality and often irrelevant ads were displayed, providing a poor experience
for users who might not have realized why they were seeing so many ads.
Microsoft ran a controlled experiment with 3.8 million users potentially
impacted, where basic routines that modify the DOM (Document Object
Model) were overridden to allow only limited modifications from trusted
sources (Kohavi et al. 2014). The results showed improvements to all of Bing's
key metrics, including Sessions per user, indicating that users visited more
often or churned less. In addition, users were more successful in their searches,
quicker to click on useful links, and annual revenue improved by several
million dollars. Also, page-load time, a key performance metric we previously
discussed, improved by hundreds of milliseconds for the impacted pages.
Figure 1.6 Bing page when the user has malware shows multiple ads
Backend Changes
Backend algorithmic changes are often overlooked as an area to use controlled
experiments (Kohavi, Longbottom et al. 2009), but they can yield significant
results. We can see this both from how teams at Google, LinkedIn, and
Microsoft work on many incremental small changes, as we described above,
and in this example involving Amazon.
Back in 2004, there already existed a good algorithm for making
recommendations based on two sets. The signature feature for Amazon's
recommendation was "People who bought item X bought item Y," but this was
generalized to "People who viewed item X bought item Y" and "People who
viewed item X viewed item Y." A proposal was made to use the same algorithm
for "People who searched for X bought item Y." Proponents of the algorithm
gave examples of underspecified searches, such as "24," which most people
associate with the TV show starring Kiefer Sutherland. Amazon's search was
returning poor results (left in Figure 1.7), such as CDs with 24 Italian Songs,
clothing for 24-month-old toddlers, a 24-inch towel bar, and so on. The new
algorithm gave top-notch results (right in Figure 1.7), returning DVDs of the
show and related books, based on what items people actually purchased after
searching for "24." One weakness of the algorithm was that some items surfaced
that did not contain the words in the search phrase; however, Amazon ran a
controlled experiment, and despite this weakness, the change increased
Amazon's overall revenue by 3%, or hundreds of millions of dollars.
Figure 1.7 Amazon search for "24" with and without BBS
Strategy, Tactics, and Their Relationship to Experiments
When the necessary ingredients for running online controlled experiments are
met, we strongly believe they should be run to inform organizational decisions
at all levels from strategy to tactics.
Strategy (Porter 1996, 1998) and controlled experiments are synergistic.
In Lean Strategy, David Collis wrote that "rather than suppressing
entrepreneurial behavior, effective strategy encourages it by identifying the
bounds within which innovation and experimentation should take place"
(Collis 2016). He defines a lean strategy process, which guards against the
extremes of both rigid planning and unrestrained experimentation.
Well-run experiments with appropriate metrics complement business strategy
and product design, and improve operational effectiveness by making the
organization more data-driven. By encapsulating strategy into an OEC, controlled
experiments can provide a great feedback loop for the strategy. Are the
ideas evaluated with experiments improving the OEC sufficiently? Alternatively,
surprising results from experiments can shine a light on alternative
strategic opportunities, leading to pivots in those directions (Ries 2011).
Product design decisions are important for coherency, and trying multiple
design variants provides a useful feedback loop to the designers. Finally, many
tactical changes can improve operational effectiveness, defined by Porter as
"performing similar activities better than rivals perform them" (Porter 1996).
We now review two key scenarios.
Scenario 1: You Have a Business Strategy and You
Have a Product with Enough Users to Experiment
In this scenario, experiments can help hill-climb to a local optimum based on
your current strategy and product:
Experiments can help identify areas with high ROI: those that improve the
OEC the most, relative to the effort. Trying different areas with MVPs can
help explore a broader set of areas more quickly, before committing significant
resources.
Experiments can also help with optimizations that may not be obvious to
designers but can make a large difference (e.g., color, spacing, performance).
Experiments can help continuously iterate to better site redesigns, rather
than having teams work on complete site redesigns that subject users to
primacy effects (users are primed in the old feature, i.e., used to the way it
works) and commonly fail not only to achieve their goals, but even fail to
achieve parity with the old site on key metrics (Goward 2015, slides 22-24,
Rawat 2018, Wolf 2018, Laja 2019).
Experiments can be critical in optimizing backend algorithms and infra-
structure, such as recommendation and ranking algorithms.
Having a strategy is critical for running experiments: the strategy is what
drives the choice of OEC. Once defined, controlled experiments help
accelerate innovation by empowering teams to optimize and improve the OEC.
Where we have seen experiments misused is when the OEC is not properly
chosen. The metrics chosen should meet key characteristics and not be
gameable (see Chapter 7).
At our companies, not only do we have teams focused on how to run
experiments properly, but we also have teams focused on metrics: choosing
metrics, validating metrics, and evolving metrics over time. Metric evolution
happens both because your strategy evolves over time and because you learn
more about the limitations of your existing metrics, such as CTR being too
gameable and needing to evolve. Metric teams also work on determining
which metrics measurable in the short term drive long-term objectives, since
experiments usually run over a shorter time frame. Hauser and Katz (1998)
wrote that "the firm must identify metrics that the team can affect today, but
which, ultimately, will affect the firm's long-term goals" (see Chapter 7).
Tying the strategy to the OEC also creates Strategic Integrity (Sinofsky and
Iansiti 2009). The authors point out that "Strategic integrity is not about
crafting brilliant strategy or about having the perfect organization: It is about
getting the right strategies done by an organization that is aligned and knows
how to get them done. It is about matching top-down-directed perspectives
with bottom-up tasks." The OEC is the perfect mechanism to make the strategy
explicit and to align what features ship with the strategy.
Ultimately, without a good OEC, you are wasting resources: think of experimenting
to improve the food or lighting on a sinking cruise ship. The weight of
the passenger-safety term in the OEC for those experiments should be extremely
high; in fact, so high that we are not willing to degrade safety. This can be
captured either via a high weight in the OEC or, equivalently, by using passenger
safety as a guardrail metric (see Chapter 21). In software, the analogy to the cruise
ship passenger safety is software crashes: if a feature increases crashes for the
product, the experience is considered so bad that other factors pale in comparison.
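A minimal sketch of that ship/no-ship logic follows; the metric names, weights, and tolerance are illustrative assumptions, and real OECs and guardrails are designed far more carefully (see Chapters 7 and 21).

```python
# Illustrative sketch: combine metric deltas into an OEC, but let a guardrail
# metric veto the launch regardless of the OEC. Metric names, weights, and the
# tolerance are assumptions for illustration only.

def oec(deltas: dict, weights: dict) -> float:
    """Weighted combination of per-metric deltas (percent changes)."""
    return sum(weights[m] * deltas[m] for m in weights)

def ship_decision(deltas: dict, weights: dict,
                  guardrail: str = "crash_rate", max_increase: float = 0.0) -> bool:
    # Crashes are a "lower is better" guardrail: any increase beyond the
    # tolerance blocks the ship, no matter how good the OEC looks.
    if deltas[guardrail] > max_increase:
        return False
    return oec(deltas, weights) > 0

deltas = {"sessions_per_user": 0.3, "revenue": 0.5, "crash_rate": 0.8}
weights = {"sessions_per_user": 0.7, "revenue": 0.3}
print(ship_decision(deltas, weights))  # False: the crash-rate guardrail vetoes the launch
```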
Defining guardrail metrics for experiments is important for identifying what
the organization is not willing to change, since a strategy also requires you to
make tradeoffs in competing: "to choose what not to do" (Porter 1996). The ill-
fated Eastern Air Lines flight 401 crashed because the crew was focused on a
burned-out landing gear indicator light and failed to notice that the autopilot
was accidentally disengaged; altitude, a key guardrail metric, gradually
decreased and the plane crashed in the Florida Everglades in 1972, resulting
in 101 fatalities (Wikipedia contributors, Eastern Air Lines Flight 401 2019).
Improvements in operational efficiencies can provide long-term differentiated
advantage, as Porter noted in a section titled "Japanese Companies Rarely
have Strategies" (1996) and Varian noted in his article on Kaizen (2007).
Scenario 2: You Have a Product, You Have a Strategy,
but the Results Suggest That You Need to Consider a Pivot
In Scenario 1, controlled experiments are a great tool for hill climbing. If you
think of the multi-dimensional space of ideas, with the OEC as the height
that is being optimized, then you may be making steps towards a peak. But
sometimes, either based on internal data about the rate of change or external
data about growth rates or other benchmarks, you need to consider a pivot:
jumping to a different location in the space, which may be on a bigger hill, or
changing the strategy and the OEC (and hence the shape of the terrain).
In general, we recommend always having a portfolio of ideas: most should
be investments in attempting to optimize "near" the current location, but a few
radical ideas should be tried to see whether those jumps lead to a bigger hill.
Our experience is that most big jumps fail (e.g., big site redesigns), yet there is
a risk/reward tradeoff: the rare successes may lead to large rewards that
compensate for many failures.
When testing radical ideas, how you run and evaluate experiments changes
somewhat. Specifically, you need to consider:
The duration of experiments. For example, when testing a major UI
redesign, experimental changes measured in the short term may be influenced
by primacy effects or change aversion. The direct comparison of
Treatment to Control may not measure the true long-term effect. In a two-
sided marketplace, testing a change, unless sufficiently large, may not
induce an effect on the marketplace. A good analogy is an ice cube in a
very cold room: small increases to room temperature may not be noticeable,
but once you go over the melting point (e.g., 32° Fahrenheit), the ice cube
melts. Longer and larger experiments, or alternative designs, such as the
country-level experiments used in the Google Ads Quality example above,
may be necessary in these scenarios (see also Chapter 23).
The number of ideas tested. You may need many different experiments
because each experiment is only testing a specific tactic, which is a component
of the overall strategy. A single experiment failing to improve the
OEC may be due to the specific tactic being poor, not necessarily indicating
that the overall strategy is bad. Experiments, by design, are testing specific
hypotheses, while strategies are broader. That said, controlled experiments
help refine the strategy, or show its ineffectiveness and encourage a pivot
(Ries 2011). If many tactics evaluated through controlled experiments fail, it
may be time to think about Winston Churchill's saying: "However beautiful
the strategy, you should occasionally look at the results." For about two
years, Bing had a strategy of integrating with social media, particularly
Facebook and Twitter, opening a third pane with social search results. After
spending over $25 million on the strategy with no significant impact on key
metrics, the strategy was abandoned (Kohavi and Thomke 2017). It may be
hard to give up on a big bet, but economic theory tells us that failed bets are
sunk costs, and we should make a forward-looking decision based on the
available data, which is gathered as we run more experiments.
Eric Ries uses the term "achieved failure" for companies that successfully,
faithfully, and rigorously execute a plan that turned out to have been utterly
flawed (Ries 2011). Instead, he recommends:
The Lean Startup methodology reconceives a startup's efforts as experiments that
test its strategy to see which parts are brilliant and which are crazy. A true
experiment follows the scientific method. It begins with a clear hypothesis that
makes predictions about what is supposed to happen. It then tests those predictions
empirically.
Due to the time and challenge of running experiments to evaluate strategy,
some, like Sinofsky and Iansiti (2009), write:
...product development process as one fraught with risk and uncertainty. These are
two very different concepts ... We cannot reduce the uncertainty: you don't
know what you don't know.
We disagree: the ability to run controlled experiments allows you to significantly
reduce uncertainty by trying a Minimum Viable Product (Ries 2011),
getting data, and iterating. That said, not everyone may have a few years to
invest in testing a new strategy, in which case you may need to make decisions
in the face of uncertainty.
One useful concept to keep in mind is EVI: Expected Value of Information,
from Douglas Hubbard (2014), which captures how additional information can
help you in decision making. The ability to run controlled experiments allows
you to significantly reduce uncertainty by trying a Minimum Viable Product
(Ries 2011), gathering data, and iterating.
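As a toy illustration of value-of-information reasoning, the sketch below computes the expected value of perfect information (EVPI) for a ship/no-ship decision; the payoffs and probabilities are invented for illustration, and EVPI only upper-bounds what a (noisy) experiment is worth.

```python
# Toy sketch of the Expected Value of (perfect) Information for a ship/no-ship
# decision, in the spirit of Hubbard (2014). All payoffs and probabilities
# below are invented for illustration.

scenarios = [           # (probability, profit impact if we ship)
    (0.3, +5.0),        # the idea works well
    (0.7, -1.0),        # the idea slightly hurts
]
NO_SHIP_VALUE = 0.0     # doing nothing changes nothing

# Decide now, without more information: pick the action with the best expected value.
ev_ship = sum(p * v for p, v in scenarios)
value_now = max(ev_ship, NO_SHIP_VALUE)

# With perfect information, we would pick the best action in each scenario.
value_with_info = sum(p * max(v, NO_SHIP_VALUE) for p, v in scenarios)

evpi = value_with_info - value_now  # an upper bound on what an experiment is worth
print(ev_ship, value_with_info, evpi)  # 0.8, 1.5, 0.7
```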
Additional Reading
There are several books directly related to online experiments and A/B tests
(Siroker and Koomen 2013, Goward 2012, Schrage 2014, McFarland 2012,
King et al. 2017). Most have great motivational stories but are inaccurate on
the statistics. Georgi Georgiev's recent book includes comprehensive statistical
explanations (Georgiev 2019).
The literature related to controlled experiments is vast (Mason et al. 1989,
Box et al. 2005, Keppel, Saufley and Tokunaga 1992, Rossi, Lipsey and
Freeman 2004, Imbens and Rubin 2015, Pearl 2009, Angrist and Pischke
2014, Gerber and Green 2012).
There are several primers on running controlled experiments on the web
(Peterson 2004, 76-78, Eisenberg 2005, 283-286, Chatham, Temkin and
Amato 2004, Eisenberg 2005, Eisenberg 2004; Peterson 2005, 248-253,
Tyler and Ledford 2006, 213-219, Sterne 2002, 116-119, Kaushik 2006).
A multi-armed bandit is a type of experiment where the experiment traffic
allocation can be dynamically updated as the experiment progresses (Li et al.
2010, Scott 2010). For example, we can take a fresh look at the experiment
every hour to see how each of the variants has performed, and we can adjust
the fraction of traffic that each variant receives. A variant that appears to be
doing well gets more traffic, and a variant that is underperforming gets less.
Experiments based on multi-armed bandits are usually more efficient than
"classical" A/B experiments, because they gradually move traffic towards
winning variants, instead of waiting for the end of an experiment. While there
is a broad range of problems they are suitable for tackling (Bakshy, Balandat
and Kashin 2019), some major limitations are that the evaluation objective
needs to be a single OEC (e.g., a tradeoff among multiple metrics can be simply
formulated), and that the OEC can be measured reasonably well between re-
allocations, for example, click-through rate vs. sessions. There can also be
potential bias created by taking users exposed to a bad variant and distributing
them unequally to other winning variants.
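To make the re-allocation idea concrete, here is a minimal Thompson-sampling sketch for a single binary OEC (e.g., click-through). The hourly cadence, Beta priors, and counts are illustrative assumptions rather than a production design; real systems must also handle the measurement and bias caveats just described.

```python
# Minimal Thompson-sampling sketch: re-allocate traffic between variants based
# on the posterior probability that each one is best on a single binary OEC.
# Priors, counts, and the re-allocation cadence are illustrative assumptions.
import random

class Variant:
    def __init__(self, name: str, successes: int = 0, failures: int = 0):
        self.name, self.successes, self.failures = name, successes, failures

    def sample_rate(self) -> float:
        # Draw from the Beta posterior over this variant's success rate
        # (uniform Beta(1, 1) prior).
        return random.betavariate(self.successes + 1, self.failures + 1)

def traffic_split(variants, draws: int = 10_000) -> dict:
    """Approximate P(variant is best); use it as next period's traffic share."""
    wins = {v.name: 0 for v in variants}
    for _ in range(draws):
        best = max(variants, key=lambda v: v.sample_rate())
        wins[best.name] += 1
    return {name: count / draws for name, count in wins.items()}

# Counts observed so far (e.g., from the last hour of traffic); made up here.
control = Variant("control", successes=120, failures=880)
treatment = Variant("treatment", successes=150, failures=850)
print(traffic_split([control, treatment]))  # most traffic shifts to 'treatment'
```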
In December 2018, the three co-authors of this book organized the First
Practical Online Controlled Experiments Summit. Thirteen organizations,
including Airbnb, Amazon, Booking.com, Facebook, Google, LinkedIn, Lyft,
Microsoft, Netflix, Twitter, Uber, Yandex, and Stanford University, sent a total
of 34 experts, who presented an overview and the challenges from breakout
sessions (Gupta et al. 2019). Readers interested in challenges will benefit from
reading that paper.