About
Publications: 131
Reads: 360,530
Citations: 52,701
Introduction
Current institution: Kohavi
Current position: Owner
Additional affiliations
November 2019 - November 2021: Airbnb
Position: Vice President and Technical Fellow
Description: Led Relevance and Experimentation, a team of world-class engineers and data scientists that included two principals and ten senior staff members. The team delivered 6%+ improvements to booking conversion from about 20 successful product changes out of over 250 ideas that were tested in controlled experiments. Surfaces impacted include search, product detail page, home page, post-booking cross-sells, and email. See Improving Deep Learning for Ranking Stays at Airbnb (https://medium.com/airbnb-e
August 1991 - July 1995
Publications (131)
The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, Control/Treatment tests, MultiVariable Tests (MVT) and parallel flights. Controlled experiments embody the best scientific design for establishing a causal rela...
Web-facing companies, including Amazon, eBay, Etsy, Facebook, Google, Groupon, Intuit, LinkedIn, Microsoft, Netflix, Shop Direct, StumbleUpon, Yahoo, and Zynga use online controlled experiments to guide product development and accelerate innovation. At Microsoft’s Bing, the use of controlled experiments has grown exponentially over time, with over...
Many good resources are available with motivation and explanations about online controlled experiments (Kohavi et al. 2009a, 2020; Thomke 2020; Luca and Bazerman 2020; Georgiev 2018, 2019; Kohavi and Thomke 2017; Siroker and Koomen 2013; Goward 2012; Schrage 2014; King et al. 2017; McFarland 2012; Manzi 2012; Tang et al. 2010). For organizations ru...
The rise of internet-based services and products in the late 1990s brought about an unprecedented opportunity for online businesses to engage in large-scale data-driven decision making. Over the past two decades, organizations such as Airbnb, Alibaba, Amazon, Baidu, Booking.com, Alphabet's Google, LinkedIn, Lyft, Meta's Facebook, Microsoft, Netfli...
A/B tests, or online controlled experiments, are heavily used in industry to evaluate implementations of ideas. While the statistics behind controlled experiments are well documented and some basic pitfalls known, we have observed some seemingly intuitive concepts being touted, including by A/B tool vendors and agencies, which are misleading, often...
Getting numbers is easy; getting numbers you can trust is hard. This practical guide by experimentation leaders at Google, LinkedIn, and Microsoft will teach you how to accelerate innovation using trustworthy online controlled experiments, or A/B tests. Based on practical experiences at companies that each run more than 20,000 controlled experimen...
Getting numbers is easy; getting numbers you can trust is hard. This practical guide by experimentation leaders at Google, LinkedIn, and Microsoft will teach you how to accelerate innovation using trustworthy online controlled experiments, or A/B tests. Based on practical experiences at companies that each run more than 20,000 controlled experiment...
Background: Many technology companies, including Airbnb, Amazon, Booking.com, eBay, Facebook, Google, LinkedIn, Lyft, Microsoft, Netflix, Twitter, Uber, and Yahoo!/Oath, run online randomized controlled experiments at scale, namely hundreds of concurrent controlled experiments on millions of users each, commonly referred to as A/B tests. Originall...
For the digital parts of businesses in Society 5.0, such as web sites and mobile applications, manual testing is impractical and slow. Instead, implementation of ideas can now be evaluated with scientific rigor using online controlled experiments (A/B tests), which provide trustworthy, reliable assessments of the impact of the implementations to key...
The Internet and the general digitalization of products and operations provide an unprecedented opportunity to accelerate innovation while applying a rigorous and trustworthy methodology for supporting key product decisions. Developers of connected software, including web sites, applications, and devices, can now evaluate ideas quickly and accurat...
Online controlled experiments (OCEs), also known as A/B tests, have become ubiquitous in evaluating the impact of changes made to software products and services. While the concept of online controlled experiments is simple, there are many practical challenges in running OCEs at scale. To understand the top practical challenges in running OCEs at sc...
The Internet provides developers of connected software, including web sites, applications, and devices, an unprecedented opportunity to accelerate innovation by evaluating ideas quickly and accurately using controlled experiments, also known as A/B tests. From front-end user-interface changes to backend algorithms, from search engines (e.g., Google...
The Internet connectivity of client software (e.g., apps running on phones and PCs), websites, and online services provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called A/B tests, split tests, randomized experiments, control/treatment tests, and online field experiments. Unlike most data mining tec...
Online controlled experiments (e.g., A/B tests) are now regularly used to guide product development and accelerate innovation in software. Product ideas are evaluated as scientific hypotheses, and tested on web sites, mobile applications, desktop applications, services, and operating system features. One of the key challenges for organizations tha...
It’s easy to run a controlled experiment and compute a p-value with five digits after the decimal point. While getting such precise numbers is easy, getting numbers you can trust is much harder. We share practical pitfalls from online controlled experiments across multiple groups at Microsoft.
The internet connectivity of client software (e.g., apps running on phones and PCs), web sites, and online services provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called A/B tests, split tests, randomized experiments, control/treatment tests, and online field experiments. Unlike most data mining te...
Online controlled experiments are now widely run in the software industry. I share several challenging problems and motivate their importance. These include high-variance metrics, issues with p-values, metric-driven vs. design-driven decisions, novelty effects, and leaks.
The Internet provides developers of connected software, including web sites, applications, and devices, an unprecedented opportunity to accelerate innovation by evaluating ideas quickly and accurately using trustworthy controlled experiments (e.g., A/B tests and their generalizations). From front-end user-interface changes to backend recommendation...
Web site owners, from small web sites to the largest properties that include Amazon, Facebook, Google, LinkedIn, Microsoft, and Yahoo, attempt to improve their web sites, optimizing for criteria ranging from repeat usage, time on site, to revenue. Having been involved in running thousands of controlled experiments at Amazon, Booking.com, LinkedIn,...
The web provides an unprecedented opportunity to accelerate innovation by evaluating ideas quickly and accurately using controlled experiments (e.g., A/B tests and their generalizations). From front-end user-interface changes to backend algorithms, online controlled experiments are now utilized to make data-driven decisions at many other companies....
A system and method are disclosed for automatically detecting associations between particular sets of search criteria, such as particular search strings, and particular items. Actions of users of an interactive system, such as a web site, are monitored over time to generate event histories reflective of searches, item selection actions, and possibl...
Computing services that unwanted entities may wish to access for improper, and potentially illegal, use can be more effectively protected by using Active HIP systems and methodologies. An Active HIP involves dynamically swapping one random HIP challenge, e.g., but not limited to, image, for a second random HIP challenge, e.g., but not limited to, i...
Online controlled experiments are at the heart of making data-driven decisions at a diverse set of companies, including Amazon, eBay, Facebook, Google, Microsoft, Yahoo, and Zynga. Small differences in key metrics, on the order of fractions of a percent, may have very significant business implications. At Bing it is not uncommon to see experiments th...
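To make the statistics concrete: detecting a fraction-of-a-percent difference in a key metric comes down to a standard two-sample test at very large sample sizes. What follows is a minimal sketch (not code from the paper), with hypothetical conversion counts, of the two-proportion z-test commonly used for such comparisons:

    from math import sqrt
    from statistics import NormalDist

    def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
        """Return (difference, z, two-sided p-value) for two conversion rates."""
        p_c, p_t = conv_c / n_c, conv_t / n_t
        p_pool = (conv_c + conv_t) / (n_c + n_t)            # pooled rate under H0
        se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
        z = (p_t - p_c) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        return p_t - p_c, z, p_value

    # Hypothetical experiment: a ~0.03% absolute lift on millions of users
    # is detectable even though the difference is a fraction of a percent.
    diff, z, p = two_proportion_z_test(conv_c=180_000, n_c=5_000_000,
                                       conv_t=181_500, n_t=5_000_000)
    print(f"delta={diff:.5f}, z={z:.2f}, p={p:.4f}")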
The web provides an unprecedented opportunity to accelerate innovation by evaluating ideas quickly and accurately using controlled experiments (e.g., A/B tests and their generalizations). Whether for front-end user-interface changes, or backend recommendation systems and relevance algorithms, online controlled experiments are now utilized to make d...
Online controlled experiments are often utilized to make data-driven decisions at Amazon, Microsoft, eBay, Facebook, Google, Yahoo, Zynga, and at many other companies. While the theory of a controlled experiment is simple, and dates back to Sir Ronald A. Fisher's experiments at the Rothamsted Agricultural Experimental Station in England in the 1920...
From ancient times through the 19th century, physicians used bloodletting to treat acne, cancer, diabetes, jaundice, plague, and hundreds of other diseases and ailments (D. Wooton, Doctors Doing Harm since Hippocrates, Oxford Univ. Press, 2006). It was judged most effective to bleed patients while they were sitting upright or standing erect, and bl...
Tracking users' online clicks and form submits (e.g., searches) is critical for web analytics, controlled experiments, and business intelligence. Most sites use web beacons to track user actions, but waiting for the beacon to return on clicks and submits slows the next action (e.g., showing search results or the destination page). One possibility i...
Controlled experiments, also called randomized experiments and A/B tests, have had a profound influence on multiple fields, including medicine, agriculture, manufacturing, and advertising. Offline controlled experiments have been well studied and documented since Sir Ronald A. Fisher led the development of statistical experimental design while work...
Controlled experiments, also called randomized experiments and A/B tests, have had a profound influence on multiple fields, including medicine, agriculture, manufacturing, and advertising. Through randomization and proper design, experiments allow establishing causality scientifically, which is why they are the gold standard in drug tests. In softw...
Controlled experiments, also called randomized experiments and A/B tests, have had a profound influence on multiple fields, including medicine, agriculture, manufacturing, and advertising. While the theoretical aspects of offline controlled experiments have been well studied and documented, the practical aspects of running them in online settings,...
While at Amazon.com from 1997 to 2002, Greg Linden created a prototype system that made personalized recommendations to customers when they placed items in their shopping cart (http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html). The prototype looked promising, but “a marketing senior vice-president was dead set against it,” claim...
The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single-factor or factorial designs), A/B tests (and their generalizations), split tests, Control/Treatment tests, and parallel flights. Controlled experiments embody the best scientific design for establishing a...
We describe an experimental study of pruning methods for decision tree classifiers when the goal is minimizing loss rather than error. In addition to two common methods for error minimization, CART's cost-complexity pruning and C4.5's error-based pruning, we study the extension of cost-complexity pruning to loss and one pruning variant based on the...
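As a rough illustration of cost-complexity pruning (the CART-style method this study extends to loss), here is a sketch using scikit-learn's ccp_alpha parameter; the dataset and alpha values are arbitrary choices, not ones from the paper:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    # Larger ccp_alpha prunes more aggressively, trading fit for simplicity.
    for alpha in [0.0, 0.005, 0.02]:
        tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
        score = cross_val_score(tree, X, y, cv=5).mean()
        print(f"ccp_alpha={alpha}: accuracy={score:.3f}")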
Electronic Commerce is now entering its second decade, with Amazon.com and eBay now in existence for ten years. With massive amounts of data, an actionable domain, and measurable ROI, multiple companies use data mining and knowledge discovery to understand their customers and improve interactions. We present important lessons and challenges using e...
The architecture of Blue Martini Software's e-commerce suite has supported data collection, data transformation, and data mining since its inception. With clickstreams being collected at the application-server layer, high-level events being logged, and data automatically transformed into a data warehouse using meta-data, common problems plaguing da...
Segmentation based on RFM (Recency, Frequency, and Monetary) has been used for over 50 years by direct marketers to target a subset of their customers, save mailing costs, and improve profits. RFM analysis is commonly performed using the Arthur Hughes method, which bins each of the three RFM attributes independently into five equal frequency bins....
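To illustrate the Hughes-style binning described above, a minimal sketch (hypothetical data, not the paper's code) that scores each RFM attribute independently into five equal-frequency bins with pandas:

    import pandas as pd

    # Hypothetical customers; in Hughes-style RFM each attribute is binned
    # independently into five equal-frequency quintiles scored 1 (worst) to 5 (best).
    customers = pd.DataFrame({
        "recency_days": [5, 40, 200, 12, 90, 365, 30, 7, 150, 60],
        "frequency":    [12, 3, 1, 8, 2, 1, 5, 20, 2, 4],
        "monetary":     [500, 80, 20, 310, 45, 15, 120, 900, 60, 95],
    })
    # Lower recency is better, so its labels are reversed; ranking first
    # breaks ties so qcut can form five bins for the skewed frequency column.
    customers["R"] = pd.qcut(customers["recency_days"], 5, labels=[5, 4, 3, 2, 1])
    customers["F"] = pd.qcut(customers["frequency"].rank(method="first"), 5,
                             labels=[1, 2, 3, 4, 5])
    customers["M"] = pd.qcut(customers["monetary"], 5, labels=[1, 2, 3, 4, 5])
    print(customers[["R", "F", "M"]])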
Typical web analytic packages provide basic key performance indicators and standard reports to help assess traffic patterns on the website, evaluate site performance, and identify potential problems such as bad links resulting in page not found errors. Based on our experience in mining data for multiple retail e-commerce sites, we offer several rec...
Bayesian classification addresses the classification problem by learning the distribution of instances given different class values. We review the basic notion of Bayesian classification, describe in some detail the naive Bayesian classifier, and briefly discuss some extensions.
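A minimal sketch of the naive Bayesian classifier under the stated independence assumption: the predicted class is the argmax over c of P(c) times the product over attributes of P(x_i | c). The toy data and the add-one smoothing choice are illustrative, not from the chapter:

    from collections import Counter, defaultdict

    def train(rows, labels):
        """Count class priors and per-attribute value frequencies per class."""
        class_counts = Counter(labels)
        cond_counts = defaultdict(Counter)      # (attr_index, class) -> value counts
        for row, c in zip(rows, labels):
            for i, value in enumerate(row):
                cond_counts[(i, c)][value] += 1
        return class_counts, cond_counts

    def predict(row, class_counts, cond_counts):
        """Pick argmax_c P(c) * prod_i P(x_i | c), with add-one smoothing."""
        total = sum(class_counts.values())
        best_class, best_score = None, -1.0
        for c, n_c in class_counts.items():
            score = n_c / total
            for i, value in enumerate(row):
                counts = cond_counts[(i, c)]
                score *= (counts[value] + 1) / (n_c + len(counts) + 1)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    # Hypothetical toy data: (outlook, windy) -> play?
    rows = [("sunny", "no"), ("rain", "yes"), ("sunny", "yes"), ("rain", "no")]
    labels = ["yes", "no", "yes", "no"]
    model = train(rows, labels)
    print(predict(("sunny", "no"), *model))     # -> "yes"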
In this paper we discuss the technology and enterprise-adoption trends in the area of business analytics. The key consumer of these analytics is the business user, a person whose job is not directly related to analytics per se (e.g., a merchandiser, marketer, salesperson), but who typically must use analytical tools to improve the results of a busines...
This study compares five well-known association rule algorithms using three real-world datasets and an artificial dataset. The experimental results confirm the performance improvements previously claimed by the authors on the artificial data, but some of these gains do not carry over to the real datasets, indicating overfitting of the algorithms to...
We show that the e-commerce domain can provide all the right ingredients for successful data mining. We describe an integrated architecture for supporting this integration. The architecture can dramatically reduce the pre-processing, cleaning, and data understanding effort often documented to take 80% of the time in knowledge discovery projects. We...
Web robots are software programs that automatically traverse the hyperlink structure of the World Wide Web in order to locate and retrieve information. There are many reasons why it is important to identify visits by Web robots and distinguish them from ...
Workshop Theme: The ease and speed with which business transactions can be carried out over the Web has been a key driving force in the rapid growth of electronic commerce. In addition, customer interactions, including personalized content, e-mail campaigns, and online feedback provide new channels of communication that were not previously available...
This study compares five well-known association rule algorithms using three real-world datasets and an artificial dataset. The experimental results confirm the performance improvements previously claimed by the authors on the artificial data, but some of these gains do not carry over to the real datasets, indicating overfitting of the algorithms to...
Organizations conducting Electronic Commerce (e-commerce) can greatly benefit from the insight that data mining of transactional and clickstream data provides. Such insight helps not only to improve the electronic channel (e.g., a web site), but it is also a learning vehicle for the bigger organization conducting business at brick-and-mortar stores...
We analyze critically the use of classification accuracy to compare classifiers on natural data sets, providing a thorough investigation using ROC analysis, standard machine learning algorithms, and standard benchmark data sets. The results raise serious concerns about the use of accuracy for comparing classifiers and draw into question the conclusi...
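The core concern is easy to reproduce. As a hedged sketch (the dataset and scores are synthetic, not the paper's): on a 95%-negative sample, always predicting the majority class earns 95% accuracy yet a chance-level 0.5 ROC AUC, while a score that actually separates the classes reaches an AUC of 1.0:

    import numpy as np
    from sklearn.metrics import accuracy_score, roc_auc_score

    rng = np.random.default_rng(0)
    y_true = np.array([0] * 95 + [1] * 5)           # heavily imbalanced labels

    majority_scores = np.zeros(100)                 # "never positive" classifier
    separating_scores = y_true * 0.6 + rng.uniform(0, 0.4, 100)

    print(accuracy_score(y_true, (majority_scores > 0.5).astype(int)))  # 0.95
    print(roc_auc_score(y_true, majority_scores))                       # 0.5
    print(roc_auc_score(y_true, separating_scores))                     # 1.0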
Electronic commerce provides all the right ingredients for successful data mining (the Good). Web logs, however, are at a very low granularity level, and attempts to mine e-commerce data using only web logs often result in little interesting insight (the Bad). Getting the data into minable formats requires significant pre-processing and data transf...
this document is to provide coding standards for writing C++ code in MLC++. The description here can be used as a general guideline for programming in C++, independent of MLC++, but it is a low-level guide that does not discuss important issues of design. The MLC++ coding standards define higher-level concepts used in MLC++, including erro...
The simple Bayesian classifier (SBC), sometimes called Naive-Bayes, is built based on a conditional independence model of each attribute given the class. The model was previously shown to be surprisingly robust to obvious violations of this independence assumption, yielding accurate classification models even when there are clear conditional depe...
We present a comparison of error-based and entropy-based methods for discretization of continuous features. Our study includes both an extensive empirical comparison as well as an analysis of scenarios where error minimization may be an inappropriate discretization criterion. We present a discretization method based on the C4.5 decision tree algorit...
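As a sketch of the entropy-minimization criterion referenced here (illustrative data, not the paper's implementation), a binary cut point can be chosen to minimize the weighted class entropy of the two resulting intervals:

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def best_cut(values, labels):
        """Return (weighted entropy, cut point) minimizing entropy of the halves."""
        pairs = sorted(zip(values, labels))
        best = (float("inf"), None)
        for i in range(1, len(pairs)):
            left = [c for _, c in pairs[:i]]
            right = [c for _, c in pairs[i:]]
            w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint between neighbors
            if w < best[0]:
                best = (w, cut)
        return best

    print(best_cut([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))
    # -> (0.0, 6.5): the clean split leaves zero entropy on both sides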
We review accuracy estimation methods and compare the two most common methods: cross-validation and bootstrap. Recent experimental results on artificial data and theoretical results in restricted settings have shown that for selecting a good classifier from a set of classifiers (model selection), ten-fold cross-validation may be better than the more...
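As a quick sketch of the recommended procedure (the dataset and learner are arbitrary stand-ins), stratified ten-fold cross-validation with scikit-learn:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    # cv=10 on a classifier defaults to stratified ten-fold splitting.
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
    print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")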
In the wrapper approach to feature subset selection, a search for an optimal set of features is made using the induction algorithm as a black box. The estimated future performance of the algorithm is the heuristic guiding the search. Statistical methods for feature subset selection including forward selection, backward elimination, and their stepwis...
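A minimal sketch of the wrapper idea with greedy forward selection (the dataset, learner, and stopping rule are illustrative assumptions, not the paper's setup): the induction algorithm stays a black box, and cross-validated accuracy of each candidate subset guides the search:

    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_wine(return_X_y=True)
    selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
    while remaining:
        # Estimate each one-feature extension with the black-box learner.
        gains = [(cross_val_score(KNeighborsClassifier(), X[:, selected + [f]],
                                  y, cv=5).mean(), f) for f in remaining]
        score, f = max(gains)
        if score <= best_score:     # stop when no extension improves the estimate
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    print("selected features:", selected, "estimated accuracy:", round(best_score, 3))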
We show that the e-commerce domain can provide all the right ingredients for successful data mining. We describe an integrated architecture for supporting this integration. The architecture can dramatically reduce the pre-processing, cleaning, and data understanding effort often documented to take 80% of the time in knowledge discovery projects. We...
We show that the e-commerce domain can provide all the right ingredients for successful data mining and claim that it is a killer domain for data mining. We describe an integrated architecture, based on our experience at Blue Martini Software, for supporting this integration. The architecture can dramatically reduce the pre-processing, cleaning, and data...
Applications of Data Mining to Electronic Commerce brings together in one place important contributions and up-to-date research results in this fast moving area.
Applications of Data Mining to Electronic Commerce serves as an excellent reference, providing insight into some of the most challenging research issues in the field.
We describe KDD-Cup 2000, the yearly competition in data mining. For the first time the Cup included insight problems in addition to prediction problems, thus posing new challenges in both the knowledge discovery and the evaluation criteria, and highlighting the need to "peel the onion" and drill deeper into the reasons for the initial patterns fou...
a data mining project. Unfortunately, the other 80% contains several substantial hurdles that without heroic effort may block the successful completion of the project. The following are five desiderata for success. Seldom are they all present in one data mining application. 1. Data with rich descriptions. For example, wide customer records with ma...
We show that the e-commerce domain can provide all the right ingredients for successful data mining. We describe an integrated architecture for supporting this integration. The architecture can dramatically reduce the pre-processing, cleaning, and data understanding effort often documented to take 80% of the time in knowledge discovery projects. We e...
Electronic commerce is emerging as the killer domain for data mining technology. The following are five desiderata for success. Seldom are they all present in one data mining application. 1. Data with rich descriptions. For example, wide customer records with many potentially useful fields allow data mining algorithms to search beyond obvious...
This paper presents a case study of a machine-aided knowledge discovery process within the general area of drug design. Within drug design, the particular problem of pharmacophore discovery is isolated, and the Inductive Logic Programming (ILP) system progol is applied to the problem of identifying potential pharmacophores for ACE inhibition. The...
Data Mining is the process of identifying new patterns and insights in data. As the volume of data collected and stored in databases grows, there is a growing need to provide data summarization (e.g., through visualization), identify important patterns and trends, and act upon the findings. Insight derived from data mining can provide tremendous ec...
At KDD-99, the panel on Integrating Data Mining into Vertical Solutions addressed a series of questions regarding future trends in industrial applications. Panelists were chosen to represent different viewpoints from a variety of industry segments, including data providers (Jim Bozik), horizontal and vertical tool providers (Ken Ono and Steve Belch...
Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (three varia...
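As a sketch of the Bagging variant described (scikit-learn's current API is used as a stand-in implementation; the dataset and estimator count are arbitrary): bootstrap replicates of the training set, one tree per replicate, majority vote at prediction time:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    single = DecisionTreeClassifier(random_state=0)
    bagged = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50,
                               random_state=0)   # 50 bootstrap-trained trees that vote
    for name, clf in [("single tree", single), ("bagged trees", bagged)]:
        print(name, round(cross_val_score(clf, X, y, cv=5).mean(), 3))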
In this doctoral dissertation, we study three basic problems in machine learning and two new hypothesis spaces with corresponding learning algorithms. The problems we investigate are: accuracy estimation, feature subset selection, and parameter tuning. The latter two problems are related and are studied under the wrapper approach. The hypothesis sp...
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider...
We describe an experimental study of Option Decision Trees with majority votes. Option Decision Trees generalize regular decision trees by allowing option nodes in addition to decision nodes; such nodes allow for several possible tests to be conducted instead of the commonly used single test. Our goal was to explore when option nodes are most usefu...
In this paper we detail some things that worked well, some things that did not work as well as we hoped, and some thoughts about the future.
Loan level modeling of prepayment is an important aspect of hedging, risk assessment, and retention efforts of the hundreds of companies in the US that trade and initiate Mortgage Backed Securities (MBS). In this paper we review and investigate different aspects of modeling customers who have taken jumbo loans in the US using MineSet. We show how r...
The simple Bayesian classifier (SBC), sometimes called Naive-Bayes, is built based on a conditional independence model of each attribute given the class. The model was previously shown to be surprisingly robust to obvious violations of this independence assumption, yielding accurate classification models even when there are clear conditional depend...
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider h...
We present a comparison of error-based and entropy-based methods for discretization of continuous features. Our study includes both an extensive empirical comparison as well as an analysis of scenarios where error minimization may be an inappropriate discretization criterion. We present a discretization method based on the C4.5 decision tree algorit...
We address the problem of finding a subset of features that allows a supervised induction algorithm to induce small high-accuracy concepts. We examine notions of relevance and irrelevance, and show that the definitions used in the machine learning literature do not adequately partition the features into useful categories of relevance. We present de...
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular domain, a feature subset selection method should consider how the...
We address the problem of finding the parameter settings that will result in optimal performance of a given learning algorithm using a particular dataset as training data. We describe a “wrapper” method, considering determination of the best parameters as a discrete function optimization problem. The method uses best-first search and cross-validatio...
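A sketch of the same idea with a simpler search: grid search over a small parameter space, scored by cross-validation, stands in here for the paper's best-first search (the learner and grid values are arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    grid = {"max_depth": [2, 3, 4, 5, None], "min_samples_leaf": [1, 2, 5, 10]}
    # Each parameter setting is evaluated by ten-fold cross-validation.
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=10)
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))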
The reasons for including applications papers in the machine learning literature are discussed. Application papers in the machine learning literature are often included because these papers have success stories which act as an advertisement and boost morale. However, there is another reason why such papers are of value to the field, which is even more vital....
Business users and analysts commonly use spreadsheets and 2D plots to analyze and understand their data. On-line Analytical Processing (OLAP) provides these users with added flexibility in pivoting data around different attributes and drilling up and down the multi-dimensional cube of aggregations. Machine learning researchers, however, have concen...
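For a flavor of the OLAP-style pivoting discussed (a sketch with hypothetical sales data; pandas pivot_table is a stand-in, not the system described):

    import pandas as pd

    # Hypothetical fact table: each row is one sale.
    sales = pd.DataFrame({
        "region":  ["east", "east", "west", "west", "west"],
        "product": ["book", "toy", "book", "toy", "toy"],
        "revenue": [120, 80, 200, 50, 70],
    })
    # Pivot revenue around region x product; margins add roll-up totals.
    cube = sales.pivot_table(values="revenue", index="region",
                             columns="product", aggfunc="sum", margins=True)
    print(cube)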
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider h...
Data mining algorithms including machine learning, statistical analysis, and pattern recognition techniques can greatly improve our understanding of data warehouses that are now becoming more widespread. In this paper, we focus on classification algorithms and review the need for multiple classification algorithms. We describe a system called MLC++...
Nearest-neighbor algorithms are known to depend heavily on their distance metric. In this paper, we investigate the use of a weighted Euclidean metric in which the weight for each feature comes from a small set of options. We describe Diet, an algorithm that directs search through a space of discrete weights using cross-validation error as its ev...
We evaluate the power of decision tables as a hypothesis space for supervised learning algorithms. Decision tables are one of the simplest hypothesis spaces possible, and usually they are easy to understand. Experimental results show that on artificial and real-world domains containing only discrete features, IDTM, an algorithm inducing decision...
Only 8% of the articles presented results for more than one problem using real world data. While Prechelt (1996) only looked at whether comparisons were done, Cohen went a step further and described how to design good experiments. He wrote that "books like this one encourage well-designed experiments, which, if one isn't careful, can be utter...
MineSet™, Silicon Graphics' interactive system for data mining, integrates three powerful technologies: database access, analytical data mining, and data visualization. It supports the knowledge discovery process from data access and preparation through iterative analysis and visualization to deployment. MineSet is based on a client-server archi...
Lazy learning algorithms, exemplified by nearest-neighbor algorithms, do not induce a concise hypothesis from a given training set; the inductive process is delayed until a test instance is given. Algorithms for constructing decision trees, such as C4.5, ID3, and CART create a single "best" decision tree during the training phase, and this tree is t...
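A tiny illustrative contrast (stand-in learners from scikit-learn, not the paper's code): the lazy learner's fit step merely stores the training set, while the eager tree inducer commits to a single tree up front:

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    lazy = KNeighborsClassifier(n_neighbors=3).fit(X, y)      # fit just stores the data
    eager = DecisionTreeClassifier(random_state=0).fit(X, y)  # fit builds one "best" tree
    print(lazy.predict(X[:1]), eager.predict(X[:1]))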