Conference Paper

Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained

Abstract and Figures

Online controlled experiments are often utilized to make data-driven decisions at Amazon, Microsoft, eBay, Facebook, Google, Yahoo, Zynga, and at many other companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher's experiments at the Rothamsted Agricultural Experimental Station in England in the 1920s, the deployment and mining of online controlled experiments at scale (thousands of experiments now) has taught us many lessons. These exemplify the proverb that the difference between theory and practice is greater in practice than in theory. We present our learnings as they happened: puzzling outcomes of controlled experiments that we analyzed deeply to understand and explain. Each of these took multiple person-weeks to months to properly analyze and get to the often surprising root cause. The root causes behind these puzzling results are not isolated incidents; these issues generalized to multiple experiments. The heightened awareness should help readers increase the trustworthiness of the results coming out of controlled experiments. At Microsoft's Bing, it is not uncommon to see experiments that impact annual revenue by millions of dollars, so getting trustworthy results is critical and investing in understanding anomalies has tremendous payoff: reversing a single incorrect decision based on the results of an experiment can fund a whole team of analysts. The topics we cover include: the OEC (Overall Evaluation Criterion), click tracking, effect trends, experiment length and power, and carryover effects.
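The abstract lists experiment length and power among its topics. As a rough illustration that is not taken from the paper, the sketch below applies the standard normal-approximation formula for a two-sided, two-sample comparison of means, n = 2 (z_{1-alpha/2} + z_{power})^2 sigma^2 / delta^2 users per variant. The function name `samples_per_variant`, its defaults (5% significance, 80% power), and the example numbers are assumptions made for illustration only.

```python
from math import ceil
from statistics import NormalDist


def samples_per_variant(sigma, delta, alpha=0.05, power=0.80):
    """Approximate users needed in EACH variant to detect a difference
    `delta` in a metric with standard deviation `sigma`.

    Hypothetical helper: uses the textbook normal approximation for a
    two-sided, two-sample test with equal-sized variants.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2
    return ceil(n)


# Illustrative numbers only: detect a shift of 0.1 standard deviations.
n = samples_per_variant(sigma=1.0, delta=0.1)
print(n)
```

Because the required sample size grows with the inverse square of the detectable difference, halving `delta` roughly quadruples the number of users needed, which is one reason small effects call for long or large experiments.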
... Experimentation forms the foundation of scientific decision-making, from natural and social sciences to industry. Yet even for online platforms with hundreds of millions of users, achieving sufficient statistical power is costly [43,44,45,13,24]. Adaptive experimentation can significantly improve efficiency by focusing resources on promising treatments. ...
Preprint
Adaptive experimentation can significantly improve statistical power, but standard algorithms overlook important practical issues including batched and delayed feedback, personalization, non-stationarity, multiple objectives, and constraints. To address these issues, the current algorithm design paradigm crafts tailored methods for each problem instance. Since it is infeasible to devise novel algorithms for every real-world instance, practitioners often have to resort to suboptimal approximations that do not address all of their challenges. Moving away from developing bespoke algorithms for each setting, we present a mathematical programming view of adaptive experimentation that can flexibly incorporate a wide range of objectives, constraints, and statistical procedures. By formulating a dynamic program in the batched limit, our modeling framework enables the use of scalable optimization methods (e.g., SGD and auto-differentiation) to solve for treatment allocations. We evaluate our framework on benchmarks modeled after practical challenges such as non-stationarity, personalization, multi-objectives, and constraints. Unlike bespoke algorithms such as modified variants of Thompson sampling, our mathematical programming approach provides remarkably robust performance across instances.
Article
We study the identification and estimation of long-term treatment effects by combining short-term experimental data and long-term observational data subject to unobserved confounding. This problem arises often when concerned with long-term treatment effects since experiments are often short-term due to operational necessity while observational data can be more easily collected over longer time frames but may be subject to confounding. In this paper, we tackle the challenge of persistent confounding: unobserved confounders that can simultaneously affect the treatment, short-term outcomes, and long-term outcome. In particular, persistent confounding invalidates identification strategies in previous approaches to this problem. To address this challenge, we exploit the sequential structure of multiple short-term outcomes and develop several novel identification strategies for the average long-term treatment effect. Based on these, we develop estimation and inference methods with asymptotic guarantees. To demonstrate the importance of handling persistent confounders, we apply our methods to estimate the effect of a job training program on long-term employment using semi-synthetic data.
Chapter
This chapter examines digital marketing with special reference to wholesalers and retailers, in particular the move from website and email marketing to advanced marketing using artificial intelligence. It also discusses how new technologies such as AI and data analytics can be applied in a community. Electronic commerce has also impacted customer relations and business activities in a remarkable manner. From the analyzed text, one can conclude that the current trends are omnichannel marketing, personalization, and augmented reality and virtual reality technologies. It ensures the transfer of information and skills on matters such as data security, privacy concerns, and organizational cultural transformation, among others. The chapter also illustrates how the following can be applied to improve competitiveness: predictive analytics, artificial intelligence in inventory and supply chain, and digital identity. The curriculum also emphasizes that marketing cannot be unethical and that sustainability has to be discussed in the context of the Internet age.
Thesis
This doctoral thesis deals with digital marketplaces that focus on electronic commerce between different companies. These electronic marketplaces (EMs) usually act as intermediaries or platforms between the supply and demand side in business-to-business (B2B) markets. Electronic commerce's relevance is increasing in numerous commercial and industrial contexts, which makes digital marketplaces increasingly important. In the private consumer sector, successful marketplaces have been in existence for several years now. They have already become part of the daily routine of many people. In business-to-business trading, digital marketplaces continue to be perceived as innovative or new, which makes them an interesting phenomenon and object of investigation. This phenomenon is examined in depth in this thesis, which should serve both science and practice.
Article
This research delves into the "seasonal fixed bonus" phenomenon in Mexico, Spain, the United States, and Canada, examining how employees prefer receiving payments during the Christmas season or evenly distributed throughout the year. Two hypotheses explore biases arising from bounded rationality: Hypothesis 1 suggests Mexican/Spanish workers may resist receiving the bonus dispersed throughout the year, while Hypothesis 2 posits American/Canadian workers may resist a reduction in monthly payments for a seasonal bonus. Using a utopic international competition, the study reveals that Mexican/Spanish participants exhibit a preference for end-of-year rewards, partially supporting Hypothesis 1, whereas American/Canadian participants lean towards immediate rewards, partially supporting Hypothesis 2. Statistical significance is found in Mexico and Spain, aligning with mental accounting principles, while the U.S. and Canada show similar trends but lack significance. This implies a potential status quo bias among American and Canadian workers regarding seasonal bonuses.
Article
Tracking users' online clicks and form submits (e.g., searches) is critical for web analytics, controlled experiments, and business intelligence. Most sites use web beacons to track user actions, but waiting for the beacon to return on clicks and submits slows the next action (e.g., showing search results or the destination page). One possibility is to use a short timeout, and common wisdom is that the more time given to the tracking mechanism (suspending the user action), the lower the data loss. Research from Amazon, Google, and Microsoft showed that small delays of a few hundred milliseconds have a dramatic negative impact on revenue and user experience (Kohavi, et al., 2009 p. 173), yet we found that many websites allow long delays in order to collect clicks. For example, until March 2010, multiple Microsoft sites waited for click beacons to return with a 2-second timeout, introducing a delay of about 400 msec on user clicks. To the best of our knowledge, this is the first published empirical study of the subject under a controlled environment. While we confirm the common wisdom about the tradeoff in general, a surprising result is that the tradeoff does not exist for the most common browser family, Microsoft Internet Explorer (IE), where no delay suffices. This finding has significant implications for tracking users, since no wait is required to prevent data loss for IE browsers, and it could significantly improve revenue and user experience. The recommendations here have been implemented by the MSN US home page and Hotmail.
Article
Confirmation bias, as the term is typically used in the psychological literature, connotes the seeking or interpreting of evidence in ways that are partial to existing beliefs, expectations, or a hypothesis in hand. The author reviews evidence of such a bias in a variety of guises and gives examples of its operation in several practical contexts. Possible explanations are considered, and the question of its utility or disutility is discussed. When men wish to construct or support a theory, how they torture facts into their service! (Mackay, 1852/1932, p. 552) Confirmation bias is perhaps the best known and most widely accepted notion of inferential error to come out of the literature on human reasoning. (Evans, 1989, p. 41) If one were to attempt to identify a single problematic aspect of human reasoning that deserves attention above all others, the confirmation bias would have to be among the candidates for consideration. Many have written about this bias, and it appears to be sufficiently strong and pervasive that one is led to wonder whether the bias, by itself, might account for a significant fraction of the disputes, altercations, and misunderstandings that occur among individuals, groups, and nations.
Article
The most straightforward statistical designs to implement are those for which the sequencing of test runs or the assignment of factor combinations to experimental units can be entirely randomized. In this chapter we introduce completely randomized designs for factorial experiments. Included in this discussion are the following topics: completely randomized designs, factorial experiments, and the calculation of factor effects as measures of the individual and joint influences of factor levels on a response. A problem-solving section appears at the end of the chapter. The appendix of this chapter briefly outlines an extension to factors whose numbers of levels are greater than two, with special emphasis on factors whose numbers of levels are powers of two.
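To make the calculation of factor effects concrete, here is a minimal sketch for a two-factor, two-level factorial experiment. It uses the usual textbook definition: an effect is the mean response at the high level minus the mean response at the low level, and the joint (interaction) effect uses the product of the coded levels. The function name `factor_effects`, the -1/+1 coding, and the response values are hypothetical and are not taken from the chapter.

```python
from statistics import mean


def factor_effects(runs):
    """Estimate main and interaction effects for a 2x2 factorial design.

    `runs` is a list of (a, b, y) tuples, where a and b are coded factor
    levels (-1 for low, +1 for high) and y is the observed response.
    """
    def contrast(sign):
        # Mean response where the contrast is +1 minus mean where it is -1.
        high = [y for a, b, y in runs if sign(a, b) > 0]
        low = [y for a, b, y in runs if sign(a, b) < 0]
        return mean(high) - mean(low)

    return {
        "A": contrast(lambda a, b: a),       # individual influence of factor A
        "B": contrast(lambda a, b: b),       # individual influence of factor B
        "AB": contrast(lambda a, b: a * b),  # joint influence of A and B
    }


# Made-up responses, one run per factor combination.
runs = [(-1, -1, 20), (+1, -1, 30), (-1, +1, 40), (+1, +1, 52)]
effects = factor_effects(runs)
print(effects)  # {'A': 11, 'B': 21, 'AB': 1}
```

In a completely randomized design the order in which these four runs are executed would itself be randomized (for example with `random.shuffle`) before any responses are collected.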