Detecting such interesting segments automatically is an active research area. The ExP platform employs one such method, notifying experiment owners if interesting or unusual movements are detected in any segment.
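To illustrate one simple form such a method could take (a sketch under assumed inputs, not ExP's actual algorithm), the Python fragment below flags segments whose treatment-versus-control delta is statistically significant after a Bonferroni correction for the many segments tested simultaneously; the function name, data layout, and field names are hypothetical.

    # A minimal sketch, assuming per-segment summary statistics are available;
    # names and thresholding are hypothetical, not ExP's actual method.
    from scipy import stats

    def find_unusual_segments(segment_stats, alpha=0.05):
        # Bonferroni correction: many segments are tested at once, so the
        # per-test significance level is tightened accordingly.
        threshold = alpha / len(segment_stats)
        alerts = []
        for seg in segment_stats:
            # Welch's two-sample t-test computed from summary statistics.
            _, p_value = stats.ttest_ind_from_stats(
                seg["treat_mean"], seg["treat_std"], seg["treat_n"],
                seg["ctrl_mean"], seg["ctrl_std"], seg["ctrl_n"],
                equal_var=False)
            if p_value < threshold:
                alerts.append(
                    (seg["name"], seg["treat_mean"] - seg["ctrl_mean"], p_value))
        # Report the most statistically extreme segment movements first.
        return sorted(alerts, key=lambda a: a[2])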
To understand a metric change in a specific segment more deeply, the system should provide tools to easily obtain examples of user sessions containing the events used to compute the metric of interest. For example, when investigating an increase in an error metric, it may be useful for the experimenter to see examples of the error types and the queries they were triggered on, and to run a simple analysis to determine whether a specific error type, or a specific query, is causing the increase. While such analyses are conceptually simple, implementing tools and automation to support them is important: doing such deep-dive analyses manually requires a deep understanding of how the data is structured, how a specific metric is computed, and how experiment information is stored in the data, which is beyond the level of knowledge of a typical experiment owner.
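As a concrete illustration of this kind of breakdown analysis, the minimal sketch below counts error events per error type (or per query) in each variant; the DataFrame schema and the variant labels are illustrative assumptions, not ExP's actual data layout.

    # A hypothetical deep-dive sketch, assuming session events are exposed as
    # a pandas DataFrame with 'event', 'variant', 'error_type', and 'query'
    # columns, and variants labeled 'treatment' and 'control'.
    import pandas as pd

    def error_breakdown(events: pd.DataFrame,
                        key: str = "error_type") -> pd.DataFrame:
        # Count error events per key (error type or query) in each variant.
        counts = (events[events["event"] == "error"]
                  .groupby([key, "variant"])
                  .size()
                  .unstack("variant", fill_value=0))
        # The largest treatment-minus-control deltas point at the likely culprit.
        counts["delta"] = counts["treatment"] - counts["control"]
        return counts.sort_values("delta", ascending=False)

    # error_breakdown(events, "error_type") surfaces the dominant error type;
    # rerunning with key="query" shows which queries it triggers on.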
Online controlled experiments are becoming the standard operating procedure in data-driven companies. However, for these companies to experiment with high velocity and trust the results of their experiments, having an experimentation platform is critical. In this paper, we described the architecture of a large-scale online experimentation platform. The platform is used by thousands of users at Microsoft, supporting hundreds of concurrent experiments and totaling over 10,000 experiments per year. It enables trustworthy experimentation at scale, adding hundreds of millions of dollars of additional yearly revenue for Microsoft. We hope that the architectural guidance provided in this paper will help companies implement their experimentation systems in an easier, more cost-effective way, helping to further grow experimentation practices in the software industry.
REFERENCES
[1] R. Kohavi and R. Longbotham, "Online Controlled Experiments and A/B Tests," Encycl. Mach. Learn. Data Min., pp. 1–11, 2015.
[2] R. Kohavi and S. Thomke, "The Surprising Power of Online Experiments," Harvard Business Review, Oct. 2017.
[3] J. F. Box, "R.A. Fisher and the Design of Experiments, 1922–1926," Am. Stat., vol. 34, no. 1, pp. 1–7, Feb. 1980.
[4] ExP, "Microsoft Experimentation Platform."
[5] A. Deng, J. Lu, and J. Litz, "Trustworthy Analysis of Online A/B Tests: Pitfalls, Challenges and Solutions," in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17), 2017.
[6] A. Deng, J. Lu, and S. Chen, "Continuous monitoring of A/B tests without pain: Optional stopping in Bayesian testing," in Proceedings of the 3rd IEEE International Conference on Data Science and Advanced Analytics (DSAA 2016), 2016, pp. 243–252.
[7] D. Tang, A. Agarwal, D. O'Brien, and M. Meyer, "Overlapping experiment infrastructure," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '10), 2010, p. 17.
[8] R. Kohavi and R. Longbotham, "Online experiments: Lessons learned," Computer, vol. 40, no. 9, pp. 103–105, 2007.
[9] A. Fabijan, P. Dmitriev, H. H. Olsson, and J. Bosch, "The Evolution of Continuous Experimentation in Software Product Development: From Data to a Data-Driven Organization at Scale," in 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), 2017.
[10] A. Deng, Y. Xu, R. Kohavi, and T. Walker, "Improving the sensitivity of online controlled experiments by utilizing pre-experiment data," in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM '13), 2013, p. 123.
[11] P. Dmitriev and X. Wu, "Measuring Metrics," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16), 2016, pp. 429–437.
[12] A. Deng and X. Shi, "Data-Driven Metric Development for Online Controlled Experiments," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), 2016, pp. 77–86.
[13] P. Dmitriev, B. Frasca, S. Gupta, R. Kohavi, and G. Vaz, "Pitfalls of long-term online controlled experiments," in Proceedings of the 2016 IEEE International Conference on Big Data (Big Data 2016), 2016, pp. 1367–
[14] H. Hohnhold, D. O'Brien, and D. Tang, "Focusing on the Long-term," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15), 2015, pp. 1849–1858.
[15] A. Fabijan, P. Dmitriev, H. H. Olsson, and J. Bosch, "The Benefits of Controlled Experimentation at Scale," in 2017 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA), 2017, pp. 18–26.
[16] Y. Xu, N. Chen, A. Fernandez, O. Sinno, and A. Bhasin, "From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15), 2015, pp. 2227–2236.
[17] R. Kohavi, A. Deng, R. Longbotham, and Y. Xu, "Seven rules of thumb for web site experimenters," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), 2014, pp. 1857–1866.
[18] R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne, "Controlled experiments on the web: Survey and practical guide," Data Min. Knowl. Discov., vol. 18, no. 1, pp. 140–181, 2009.
[19] P. Dmitriev, S. Gupta, D. W. Kim, and G. Vaz, "A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17), 2017, pp. 1427–1436.
[20] R. Kohavi, A. Deng, B. Frasca, T. Walker, Y. Xu, and N. Pohlmann, "Online controlled experiments at large scale," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '13), 2013, p. 1168.
[21] R. L. Kaufman, J. Pitchforth, and L. Vermeer, "Democratizing online controlled experiments at Booking.com," arXiv preprint arXiv:1710.08217, pp. 1–7, 2017.
[22] P. Runeson and M. Höst, "Guidelines for conducting and reporting case study research in software engineering," Empir. Softw. Eng., vol. 14, no. 2, pp. 131–164, 2008.
[23] R. Kohavi, A. Deng, B. Frasca, R. Longbotham, T. Walker, and Y. Xu, "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained," in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '12), 2012, pp. 786–794.
[24] R. B. Bausell and Y.-F. Li, Power Analysis for Experimental Research: A Practical Guide for the Biological, Medical and Social Sciences. Cambridge University Press, 2002.
[25] K. L. Morgan and D. B. Rubin, "Rerandomization to improve covariate balance in experiments," Ann. Stat., vol. 40, no. 2, pp. 1263–1282, 2012.
[26] F. Radlinski and N. Craswell, "Optimized interleaving for online retrieval evaluation," in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM '13), 2013, p. 245.
[27] R. Conradi and B. Westfechtel, "Version Models for Software Configuration Management," ACM Comput. Surv., vol. 30, no. 2, pp. 232–282, Jun. 1998.
[28] A. Deng, P. Zhang, S. Chen, D. W. Kim, and J. Lu, "Concise Summarization of Heterogeneous Treatment Effect Using Total Variation Regularized Regression," in submission.
[29] "Multiple Comparisons Problem," Wikipedia. [Online].
[30] P. K. Pearson, "Fast hashing of variable-length text strings," Commun. ACM, vol. 33, no. 6, pp. 677–680, Jun. 1990.
[31] P. Dmitriev, A. Deng, R. Kohavi, and P. Raff, "A/B Testing at Scale: Accelerating Software Innovation," in SIGIR, 2017, pp. 1395–1397.