
Estimating The Effect Of Online Display Advertising On Browser Conversion In Observational Data

Ori Stitelman, media6degrees, 37 East 18th Street, New York, NY 10003, ori@media6degrees.com
Brian Dalessandro, media6degrees, 37 East 18th Street, New York, NY 10003, briand@media6degrees.com
Claudia Perlich, media6degrees, 37 East 18th Street, New York, NY 10003, claudia@media6degrees.com
Foster Provost, NYU Stern School of Business, New York, NY 10012, fprovost@stern.nyu.edu

ABSTRACT

This paper examines ways to estimate the causal effect of display advertising on browser post-view conversion (i.e., visiting the site after viewing the ad, rather than clicking on the ad to get to the site). The effectiveness of online display ads beyond simple click-through evaluation is not well established in the literature. Are the high conversion rates seen for subsets of browsers the result of choosing to display ads to a group that has a naturally higher tendency to convert, or does the advertisement itself cause an additional lift? How does showing an ad to different segments of the population affect their tendencies to take a specific action, or convert? We present an approach for assessing the effect of display advertising on customer conversion that does not require the cumbersome and expensive setup of a controlled experiment, but rather uses the events observed in a regular campaign setting. Our general approach can be applied to many additional types of causal questions in display advertising. In this paper we show in depth the results for one campaign of interest (a major fast food chain) and measure the effect of advertising to particular sub-populations. We show that advertising to individuals who were identified (using machine learning methods) as good prospective new customers resulted in an additional 280 browsers visiting the site per 100,000 advertisements shown, a highly significant result. In contrast, displaying ads to the general population, excluding those who visited the site in the past, resulted in an additional 200 browsers visiting the site per 100,000 advertisements shown (not significant at the ten percent level). We also show that advertising to past converters resulted in a borderline significant increase of an additional 400 browsers visiting the site for every 100,000 online display ads shown.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ADKDD '11 San Diego, California USA
Copyright 2011 ACM X-XXXXX-XX-X/XX/XX ...$10.00.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications—data mining; I.2.6 [Artificial Intelligence]: Learning—induction; I.5.1 [Pattern Recognition]: Models—statistics; J.4 [Computer Applications]: Social and Behavioral Sciences

General Terms

Algorithms, Design, Experimentation

Keywords

online display advertising, causal effects, predictive modeling, social networks, user-generated content, privacy

1. INTRODUCTION

Controlled experiments, often called randomized tests or A/B tests, are commonly used online to assess the effects of any kind of change to the browsing experience (formally, an intervention) on browser behavior (see, e.g., [6], [8], [7]). Some of those efforts have been devoted to extending these experiments to evaluating the causal effect of online display advertising on browser conversion ([8], [7]). However, the cost and difficulty of implementing A/B testing successfully in the display ad environment are very high, as will be explained below. This makes appealing any analytical approach that can estimate the effect of ads while running the campaign regularly (an observational setting, without creating any special testing controls). Despite the appeal of estimating effects based on observational data, many practical considerations make this a difficult task. A number of analytical methods have been developed in a wide range of fields to estimate the causal effect of a binary treatment on a binary outcome of interest. Much of the relevant causal literature has been developed in epidemiology and biostatistics (see, e.g., [15] and [23]). Though the reasons for not using randomized tests in the medical literature are often different than in the advertising community, the lessons learned from the use of causal methods there are directly applicable to our setting. For our purposes, the treatment of interest is an advertisement and the outcome of interest is a browser conversion, such as visiting a webpage of interest, providing an email address, or making an online purchase. Ultimately, we are concerned with answering the question: "What is the causal effect of online display advertising?"

For the remainder of this article we will refer to A/B tests as randomized tests. We will also interchangeably refer to treatment as the act of showing an advertisement, and to a conversion or outcome of interest as taking a specific action on the brand's website. Furthermore, when we refer to display advertising we mean online display advertising. In our experiments we will primarily focus on whether or not a browser visits the website (site visit) of the advertising brand. We will also use the common convention that capital letters represent random variables and lower case letters represent realizations of those variables.

There are many difficulties in implementing A/B tests online in general [5]. Kohavi (2010) discusses the difficulties of implementing a randomized test in the online setting and provides examples of how the implementation of the randomized test can result in unforeseen artifacts that make estimating the intended effects difficult or impossible. In fact, the authors suggest using a form of testing called A/A/B, where there are three possible treatment scenarios: one for the current treatment, another for the current treatment using the A/B test implementation, and a third for the new treatment using the A/B test implementation. The A/A/B test allows one to test whether the observed effect is due to the new treatment or due to the implementation of the test. This type of set-up for assessing the implementation of the test, though sensible, adds new costs that must be considered. A paper presented by Google at the last KDD briefly addresses another major drawback of implementing A/B tests for assessing the effect of display advertising on browser conversion [1]. The major concern expressed was that advertisers would not want to pay to present advertisements, such as public service announcements (PSAs), that did not promote their product. The fact that A/B tests are prone to unforeseen error and that marketers don't want to pay to present PSAs, coupled with the fact that there are additional overhead costs associated with A/B testing, suggests that alternative methods for estimating causal effects without intervening on the observed system would be preferable.

A common misconception is that randomized tests are the only study design in which causal effects may be estimated (see, e.g., [6], [9]). In fact, there is a long history of literature devoted to estimating causal effects in observational, or non-randomized, settings (see, e.g., [18], [23], [20]). Rubin [18] established a counterfactual framework that defines the effect of all levels of possible treatment for each observed subject, and allows for the consideration and estimation of what would have happened when individuals received a specific level of treatment, possibly contrary to what is actually observed. The outcomes for the unobserved levels of treatment are referred to as potential outcomes in the literature. This framework also allows one to define summary measures that quantify the effect of the treatment on average for the population or sub-population of interest. Two commonly reported summary measures of interest are the difference in the outcome probabilities and the ratio of the outcome probabilities when treated versus not treated.

We will refer to the difference in conversion probabilities as the additive impact of the advertisement, and to the ratio as the relative impact of the advertisement. (These summary measures are typically referred to as additive or attributable risk and relative risk, respectively, in the epidemiology and biostatistics literature, because they were commonly used to quantify the increase in the risk of a disease or death when exposed to a possibly harmful substance.)

[Figure 1: a comparison of a randomized A/B test with an observational "WHAT IF" analysis. In the A/B test (top), browsers are randomly split into group A, which is shown the ad, and group B, which is not, and the two groups' conversion rates are compared directly. In the "WHAT IF" analysis (bottom), a model/estimator is applied to the observed browsers (those who saw the ad and those who did not) to predict the conversion rate under "what if everyone saw the ad" and "what if no one saw the ad," and these predicted conversion rates are compared.]

Chan et al., presented at last year's KDD, proposed the use of several related methods that are able to estimate the effect of advertising in observational data [1]. In particular, their paper focused on estimating the causal effect of advertising among those shown the advertisement, and highlighted the benefits of applying their methods in a pipeline, which they define as "an automated pipeline that retrieves data, computes estimates, and decides whether to release results, suppress results, or send them to an expert data analyst for review." Several variants of the methods they surveyed, as well as others, will be discussed below. In addition to implementing variants of the methods they proposed, we will explore the benefits of estimating several other parameters of interest, explore another method for estimating causal effects (targeted maximum likelihood estimation, TMLE [24]), and discuss some other advantages of estimating causal effects beyond implementation in a pipeline. Furthermore, we display the advantages of estimating causal effects within different segments of the population.

Consider the purpose of implementing an A/B, or randomized, test. The entire point of running the test is to easily identify the effect of the treatment, in our case the display advertisement, on the outcome of interest. The first step in a randomized study is to randomly assign each subject, or browser, to one of two groups, A or B. The top box in figure 1 shows this approach. The motivation for random assignment is to ensure that both groups are similar with respect to the distribution of all relevant variables that can potentially affect the probability of taking the desired action (e.g., gender, browsing activity, past purchase activity, etc.). If the two groups were not similar, the difference in the observed effect might be due to those variables and not to the treatment. Variables that can affect both the probability of treatment and the probability of conversion make it difficult to estimate the effect of the treatment and are known as confounders. Group A is then shown the advertisement and group B is not. Next, the conversion rate is calculated for both groups and compared to assess the effect of showing the advertisement. If the randomization is successful it allows one to directly compare the outcomes of the two groups; the practice of relying on it to draw conclusions about the effect is known as the randomization assumption.

The bottom box in figure 1 outlines an approach for a counterfactual analysis that allows the estimation of the effect directly in the observed data, even in the presence of confounders. In observational data one cannot directly compare the group that is shown the advertisement to the group that is not shown the advertisement, because of the confounding of targeting: the group that was shown the ad was selected specifically because they are assumed to have a higher conversion rate. Therefore, adjustment must be made for all variables that affect the conversion rate. This is done by analytically conducting the counterfactual, or "what if," analysis. In general, the approach is to use a model/estimator to estimate the conversion rate had the entire population of interest been shown the advertisement. Subsequently, the conversion rate is estimated had the entire population not been shown the advertisement, and the two conversion rates are then compared. A number of observational methods exist that provide a way to adjust for the fact that the treated and untreated groups are not the same with respect to the confounders in the observed data. Several of these methods are discussed and implemented in the subsequent sections.

In this paper we describe a practical approach for estimating the causal effect of advertising. We follow a unified approach that may be extended to other interesting causal questions in the display advertising environment. This approach relies on (1) posing the question of interest, (2) making assumptions about the observed data, (3) clearly identifying a parameter that answers the question given the assumptions, and (4) estimating the parameter. This approach loosely follows the roadmap for constructing a TMLE presented in [21]. However, rather than just presenting a TMLE, we will discuss several ways that the parameter of interest may be estimated and expound on the advantages and shortcomings of each method. Finally, we will present an analysis of the effect of advertising for a major fast food chain and use the presented methods to assess the effect within different sub-populations of interest. In summary, our results show that advertising to past converters, "re-targeting," results in an additional 400 browsers visiting the site for every 100,000 online display ads shown (1.1 times more site visitors when shown the ad). Advertising to individuals who were identified (using machine learning methods discussed in [14]) as good prospective new customers results in a 1.5 times increase in conversions, which equates to an additional 280 browsers visiting the site per 100,000 advertisements shown (p-value < 10^-16). Finally, displaying ads to the general population, excluding those who visited the site in the past, resulted in 2.4 times more site visits, or an additional 200 browsers visiting the site per 100,000 advertisements shown (p-value = .13).

2. POSING THE QUESTION OF INTEREST

Primarily we will focus on answering the question: "What is the effect of display advertising on customer conversion?" In particular, we are not just interested in the immediate response of clicking. Increasingly, clicks are perceived as a notoriously random and unreliable measure of effectiveness (see, e.g., [2]). The much more relevant question is whether seeing an ad (without necessarily clicking on it) affects the probability of conversion (e.g., visiting the brand's website) within a reasonable timeframe (this is also known as a "post-view" conversion). Answering this question for the universe of all browsers may not be of great interest, because that population includes a large portion of people who are highly unlikely to convert whether or not they are shown a particular advertisement. Moreover, the estimation problem is potentially much more difficult when looking at the entire population rather than segments of it, and may require larger data sets because of the low overall conversion rates. Fortunately, the questions with the most financial relevance from the perspective of an advertiser concern the effect of advertising on appropriate populations that have higher baseline conversion rates (even absent advertising) than the overall population. We will focus on estimating how advertising affects conversion, where conversion is measured in terms of future site visits, and we will answer this question for particular sub-populations of browsers. In particular, we will focus on the following three questions:

1. What is the effect of display advertising on conversions for individuals that have visited the site in the past?

2. What is the effect of display advertising on conversions for potential new customers (browsers with no past site-visits) that were targeted based on their natural expected tendency to convert at a higher rate than the general population?

3. What is the effect of display advertising on conversions for potential new customers in the general population?

For illustrative purposes, we have chosen a campaign with high conversion rates relative to other campaigns we have analyzed. By doing this we are able to estimate parameters that reliably answer all three of these questions of interest. In cases where the conversion rates are low, it may be difficult to answer the third question regardless of the estimation method employed.

3. THE OBSERVED DATA STRUCTURE AND LIKELIHOOD

In this section we define the data structure we use, and introduce some notation as well as some overall causal assumptions about the observed system. We then use those causal assumptions to define parameters of interest in the following section.

Our data structure is a common one in the causal literature. For each subject i (i.e., browser) we observe O_i = (W_i, A_i, Y_i), where O_i is an observation from the true, and unknown, data generating distribution, P_0. (The subscript 0 denotes the truth, whereas a subscript n will denote an estimate of the distribution.) A is the binary random variable of intervention/treatment (i.e., showing an ad), where A equals one if an individual is treated and zero otherwise. W is a vector of baseline covariates that records information specific to a browser prior to intervention. In our case, W should include the browser's past web-activity, past actions taken, past advertisements viewed, as well as any other relevant information that affects the outcome of interest and the treatment level the browser is exposed to. The random variable Y is the binary outcome of interest (i.e., it is equal to one when a browser takes the action of interest and zero otherwise).

[Figure 2: Causal Graph, with nodes W, A, and Y.]

Now we can define a time ordering of the observed variables that allows us to factorize the data into an observed data likelihood. The time ordering of the observed random variables in this simple data structure corresponds to a causal graph that implies a particular factorization of the likelihood. The causal graph is shown in figure 2. More complicated observed data structures require careful definition of the time ordering of the variables, and both time cues and subject matter knowledge should be used in constructing a causal graph. This causal graph lays out the assumptions one is making that allow for the construction of a parameter of interest that directly answers the scientific or business question of interest. Specifically, this causal structure states that the baseline covariates are not caused by targeting or conversion, and that targeting is not caused by conversion. Furthermore, it states that there are no other unobserved variables that cause both A and Y (unobserved confounders). Unobserved variables that cause any single node in the graph are permissible. The factorization of the likelihood that corresponds to our causal assumptions is:

P_0(O) = P(W) P(A | W) P(Y | A, W), (1)

where the three factors are denoted Q_0^W = P(W), g_0^A(A, W) = P(A | W), and Q_0^Y(A, W) = P(Y | A, W). Thus, the likelihood factorizes into a part associated with the non-intervention nodes, Q_0 = (Q_0^W, Q_0^Y), and a factor associated with the intervention node, g_0^A. We refer to the node for A as an intervention node because we are interested in what happens to the outcome when we intervene on A (showing an ad) for each browser.
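As a numerical sanity check on the factorization in equation (1), the identity P(W) P(A | W) P(Y | A, W) = P(W, A, Y) can be verified with empirical frequencies on simulated discrete data (a minimal sketch; the simulated distribution below is ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Simulate (W, A, Y) in the time ordering of the causal graph:
# W first, then A given W, then Y given (A, W).
W = rng.binomial(1, 0.4, size=n)
A = rng.binomial(1, np.where(W == 1, 0.7, 0.2))
Y = rng.binomial(1, 0.05 + 0.10 * W + 0.05 * A)

# Empirical versions of the three factors for one cell (w, a, y) = (1, 1, 1).
w, a, y = 1, 1, 1
joint = np.mean((W == w) & (A == a) & (Y == y))   # P(W=w, A=a, Y=y)
qw = np.mean(W == w)                              # Q_W: P(W=w)
ga = np.mean(A[W == w] == a)                      # g_A: P(A=a | W=w)
qy = np.mean(Y[(W == w) & (A == a)] == y)         # Q_Y: P(Y=y | A=a, W=w)

# For empirical frequencies the factorization is an exact counting identity.
assert abs(joint - qw * ga * qy) < 1e-12
```

The same identity holds cell by cell for every (w, a, y), which is why the factorization itself imposes no restriction; the causal content lies in the graph's exclusion of unobserved common causes of A and Y.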

4. DEFINING A PARAMETER OF INTEREST

The parameter of interest is specifically chosen to answer our primary question, "What is the effect of display advertising?" It is common in the machine learning community to refer to "tuning parameters," such as "k" in the k-nearest neighbor algorithm or the number of leaves in a regression tree, as parameters. This is not how we use the term parameter here. When we define a parameter of interest, we define a quantity of interest that directly answers our question, and thus one for which we would like to obtain an estimate. This is a common use of the terms "parameter" and "parameter estimation" in the statistics community.

Now that the likelihood of the observed data is factorized according to a causal graph, we can consider interventions (showing an ad) on the observed system and how one may "observe" the outcome under the intervention A = a. The distribution of the data structure under intervention is known in the causal inference community as the G-computation formula [16], which is very similar to the do-calculus proposed by Pearl for causal analysis [11]. The A node upon which we are intervening is set to the intervention level, a, in the likelihood, and the conditional distribution of A given W is removed from the likelihood, since A is no longer a random variable (it is now deterministically set by the intervention). The resulting G-computation formula, or distribution of the data under intervention A = a, is:

P_{0,A=a}(O) = P(W) P(Y | A = a, W). (2)

The G-computation formula may now be used to guide the choice of causal parameter that will be useful for answering a particular business question of interest.
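To build intuition for equation (2), the following sketch simulates a confounded campaign with a known data-generating distribution (the numbers are ours, purely illustrative) and evaluates the counterfactual mean by averaging P(Y | A = a, W) over the empirical distribution of W:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Confounded campaign: high-intent browsers (W=1) are targeted more often
# AND convert more, so the naive treated-vs-untreated comparison is biased.
W = rng.binomial(1, 0.3, size=n)
A = rng.binomial(1, np.where(W == 1, 0.8, 0.2))

def q_y(a, w):                                # the (here known) true P(Y=1 | A=a, W=w)
    return 0.002 + 0.010 * w + 0.003 * a      # true additive effect of the ad: 0.003

Y = rng.binomial(1, q_y(A, W))

naive = Y[A == 1].mean() - Y[A == 0].mean()   # confounded: overstates the effect

# G-computation (eq. 2): fix A = a in the likelihood and average
# P(Y | A = a, W) over the empirical distribution of W.
psi_1 = np.mean(q_y(1, W))   # E_W[Y_{A=1}]
psi_0 = np.mean(q_y(0, W))   # E_W[Y_{A=0}]
print(naive, psi_1 - psi_0)  # the G-computation difference recovers 0.003
```

In practice the true q_y is unknown and must be estimated; the estimators in Section 5 differ precisely in how they replace the true conditional distributions with estimates.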

We are interested in the size of the effect of display advertising. If we knew the true distributions Q_0^W and Q_0^Y, how would we answer this question? One straightforward approach would be to evaluate the conditional distribution of Y | A, W for A = 1 and then again for A = 0 at all possible levels of the baseline variables W, and then take the mean weighted by P(W) for each group. This would result in the two quantities E_W[Y_{A=1}] and E_W[Y_{A=0}], where E_W[Y_{A=a}] is the mean of Y assuming everyone in the population is treated at level a. We can now combine E_W[Y_{A=1}] and E_W[Y_{A=0}] in useful ways to assess the effect of different levels of the treatment variable A. Two commonly used parameters of this type are the following:

Additive Impact = Ψ_AI(P_0) = E_W[Y_{A=1}] - E_W[Y_{A=0}] (3)

Relative Impact = Ψ_RI(P_0) = E_W[Y_{A=1}] / E_W[Y_{A=0}] (4)

The additive impact quantifies the additive effect of showing the advertisement to everyone in the population versus not showing the ad to anybody in the population. This value can be interpreted as the average number of additional conversions per 100 people had everyone been shown the ad versus had no one been shown the ad. Thus, if the additive impact were 3 percent, the following statement would be appropriate: "Showing the ad versus not showing the ad results in 3 additional conversions per 100 browsers." The relative impact quantifies the multiplicative effect of the advertisement: the ratio of the probability of the outcome had everyone been shown the ad to the probability of the outcome had nobody been shown the ad. Thus, a relative impact of 3 would correspond to the following statement: "Showing the ad makes browsers on average three times more likely to convert." The choice of parameter to estimate should be driven by the business question one is trying to answer, and in many situations it may be useful to estimate both parameters. The additive impact directly addresses the return on investment, while the relative impact is highly influenced by the level of the untreated conversion rate: if the untreated conversion rate is low, a small additive impact will manifest itself as a large relative impact, even though the advertisement may have little effect on the number of additional customers converting. One particular advantage of using these parameters to answer our question, rather than, say, log odds in a logistic regression, is that they are numbers that may be interpreted by statisticians and non-statisticians alike. Thus, the estimates may be handed off to individuals with business knowledge to make actionable decisions based on them.
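As a worked example of equations (3) and (4), consider illustrative counterfactual conversion rates chosen to mirror the prospecting-segment numbers reported in the abstract (280 additional site visitors per 100,000 ads; a 1.5 times relative impact); the two input rates themselves are hypothetical:

```python
# Illustrative counterfactual conversion-rate estimates (hypothetical inputs,
# chosen so the outputs mirror the prospecting segment in the abstract).
ey1 = 0.0084   # estimated E_W[Y_{A=1}]: conversion rate had everyone seen the ad
ey0 = 0.0056   # estimated E_W[Y_{A=0}]: conversion rate had no one seen the ad

additive_impact = ey1 - ey0   # eq. (3)
relative_impact = ey1 / ey0   # eq. (4)

# The additive impact converts directly into browsers per 100,000 ads shown.
extra_per_100k = additive_impact * 100_000

print(f"additive impact: {additive_impact:.4f}")
print(f"relative impact: {relative_impact:.2f}")
print(f"about {extra_per_100k:.0f} additional site visitors per 100,000 ads")
```

Note how the same additive impact would yield a much larger relative impact if the untreated rate ey0 were, say, ten times smaller, which is exactly the caveat raised above.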

Note that the parameters described above quantify the effect of the advertisement over the entire population of interest. Thus, any conclusions, or inferences, made from the estimates generalize to the entire population being examined. For example, as we do below, we estimate the additive impact of advertising among past site visitors for which a medium size ad serving company received a bid request. (A large percentage of the display advertising examined flows through ad-exchanges with real-time bidding. When a browser visits some site on the web, the site may send a request to such an exchange including relevant information on the browser. At that point, an auction is run in real time and the highest bidder gets to show the browser an ad. A bid request is an event where the ad-exchange solicits potential bidders.) Any inferences made based on such an estimate are generalizable to this group. This is a different parameter than the ones estimated by Google in their paper, where the effect was estimated only among the treated; inferences made by the estimators and methodology proposed there are generalizable only to a population that is treated [1]. Thus, the parameters we propose to estimate here are valid for estimating the effect of advertising on a group of people to which one could potentially advertise, and who may or may not have been advertised to in the past. This type of parameter is directly relevant for answering business questions about the effect of potential interventions on the entire group of browsers that may be intervened upon.

5. ESTIMATION METHODS

The process described in the preceding sections has been primarily concerned with defining a parameter of interest that directly answers the question: "What, and how big, is the effect of display advertising?" The presented approach loosely follows the first few steps of the unified approach presented in [21] for estimating causal effects. In this section we explore alternative estimators that might be considered for estimating the parameters defined in the previous sections. Each of the estimators we will present is a function of an estimate of the Q factor of the likelihood, the g factor of the likelihood, or both. Up until now we have not presented a model, or collection of possible data-generating distributions, for Q_0^W, Q_0^Y, and g_0^A. For Q_0^W we will use the empirical distribution, as this is the efficient non-parametric maximum likelihood estimator for this distribution (how this is done in practice will be explained later). For the two conditional distributions, Q_0^Y and g_0^A, ideally a non-parametric model that imposes no rigid assumptions on the functional form of the relationship would be used. However, practical concerns given the size of the data and the dimensionality of the problem make this computationally difficult. So for the time being we employed a logistic regression model that performed cross-validated variable selection for both main terms and polynomials of order 2 to estimate the conditional distributions, Q_n^Y and g_n^A. Note, however, that we are not interested in the estimated parameters of this logistic model, but use it only to estimate a functional dependence. One advantage of estimating the functional dependence of the underlying distributions, rather than one specific beta of a linear model, is that as computational power increases, better methods with less bias can come to bear that allow for searching over larger spaces. Those methods can be incorporated directly into our process for estimating causal effects. Thus, the better we get at estimating conditional distributions, the better we get at estimating causal effects. This is not the case for approaches that pre-specify the causal effect as a specific parameter of a linear or logistic regression model.

We will now present the different estimators ψ_{n,*} of the parameters of interest Ψ shown above. (The subscript n indicates an estimate, rather than the truth, based on n observations.) We refer the reader to external sources for a more in-depth understanding of each estimator, as that is outside the scope of this paper. For each of them, we will define the estimator, describe some of its characteristics, and develop intuition about when the estimator behaves well and when it may break down in practice. The estimators under consideration are the unadjusted estimator (UNADJ), a maximum-likelihood based evaluation of the G-computation parameter (MLE), the inverse probability of treatment weighted estimator (IPTW), the augmented IPTW estimator (A-IPTW), and the targeted maximum likelihood estimator (TMLE). For a more in-depth treatment of these estimators see [4].

For each estimator, or method, we will estimate the average conversion rate as though everyone were shown the advertisement, ψ_{n,METHOD}^{a=1}, and as though everyone were not shown the advertisement, ψ_{n,METHOD}^{a=0}. These estimates may then be combined to estimate the additive impact (AI) and relative impact (RI) in the following way:

ψ_{n,METHOD}^{AI} = ψ_{n,METHOD}^{a=1} - ψ_{n,METHOD}^{a=0} (5)

ψ_{n,METHOD}^{RI} = ψ_{n,METHOD}^{a=1} / ψ_{n,METHOD}^{a=0} (6)

We will now describe the different methods of estimation. The unadjusted estimate (UNADJ) is a biased estimate of the causal effect because it does not account for the fact that individuals who are more likely to be advertised to are also more likely to convert. In other words, the estimator does not account for confounding. The UNADJ estimator for the relative impact is the conversion rate of the treated divided by the conversion rate of the untreated; similarly, the UNADJ estimator for the additive impact is the conversion rate of the treated minus the conversion rate of the untreated:

ψ_{n,UNADJ}^a = [ Σ_{i=1}^n I(A_i = a) Y_i ] / [ Σ_{i=1}^n I(A_i = a) ]. (7)
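A minimal sketch of the UNADJ estimator of equation (7) on toy data (the data are hypothetical):

```python
import numpy as np

def psi_unadj(A, Y, a):
    """UNADJ estimate of the conversion rate among browsers with A_i = a (eq. 7)."""
    return Y[A == a].mean()

# Hypothetical toy data (1 = shown the ad / converted).
A = np.array([1, 1, 1, 0, 0, 0, 1, 0])
Y = np.array([1, 0, 1, 0, 0, 1, 1, 0])

ai_unadj = psi_unadj(A, Y, 1) - psi_unadj(A, Y, 0)   # eq. (5): 0.75 - 0.25 = 0.5
ri_unadj = psi_unadj(A, Y, 1) / psi_unadj(A, Y, 0)   # eq. (6): 0.75 / 0.25 = 3.0
```

Because targeting selects browsers who would convert anyway, these unadjusted quantities conflate the effect of the ad with the effect of the targeting.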

An MLE based estimator is a substitution estimator that relies on a consistent estimate of the conditional distribution Q_0^Y. By a substitution estimator, we are specifically referring to the fact that the estimator obeys the proper bounds of the model (i.e., estimates probabilities between 0 and 1) by evaluating the parameter at a particular P_n. It takes the following form:

ψ_{n,MLE}^a = (1/n) Σ_{i=1}^n Q_n^Y(a, W_i). (8)

Some drawbacks of this method are that there is no available theory for the construction of variance estimates, and therefore confidence intervals. Furthermore, it is not robust to mis-specification of the outcome regression model.
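The MLE substitution estimator of equation (8) can be sketched as follows; for simplicity the confounder is discrete, so Q_n^Y is fit by stratified empirical means as a stand-in for the cross-validated logistic regression described above, and the simulated data are our own:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# Discrete confounder, so Q_Y_n can be fit by simple stratified means
# (a stand-in for the cross-validated variable-selection model used above).
W = rng.integers(0, 3, size=n)                     # three browser segments
A = rng.binomial(1, np.array([0.1, 0.4, 0.7])[W])  # targeting rises with segment
pY = np.array([0.01, 0.03, 0.06])[W] + 0.02 * A    # true additive impact: 0.02
Y = rng.binomial(1, pY)

# Q_Y_n(a, w): empirical conversion rate in each (a, w) stratum.
q_n = {(a, w): Y[(A == a) & (W == w)].mean() for a in (0, 1) for w in (0, 1, 2)}

def psi_mle(a):
    """Substitution estimate (eq. 8): average Q_Y_n(a, W_i) over the empirical W."""
    return sum(q_n[(a, w)] * np.mean(W == w) for w in (0, 1, 2))

additive_impact = psi_mle(1) - psi_mle(0)
print(additive_impact)   # close to the true value 0.02
```

Averaging Q_n^Y(a, W_i) over all i implements the "use the empirical distribution for Q_0^W" choice made at the start of this section.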

The IPTW estimator is an estimating equation-based estimator that adjusts for confounding through g. Thus, it is a consistent estimate of the causal effect when g_n is a consistent estimate of g_0. Unlike a substitution estimator, estimating equation-based estimators do not obey the bounds of the proper model (i.e., they may return estimates of probabilities that are not between 0 and 1). The estimator takes the following form:

    ψ^{a}_{n,IPTW} = (1/n) Σ_{i=1}^{n} [ I(A_i = a) Y_i / g_{A,n}(A_i, W_i) ].   (9)
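Eq. 9 can be sketched directly; the array g_n below holds hypothetical, already-estimated treatment probabilities g_{A,n}(A_i, W_i), not fitted values from the paper's data:

```python
import numpy as np

def iptw_estimate(A, Y, g_n, a):
    """IPTW estimate of Eq. 9: weight each browser with A_i = a by the
    inverse of its estimated treatment probability g_{A,n}(A_i, W_i)."""
    A, Y, g_n = map(np.asarray, (A, Y, g_n))
    return float(np.mean((A == a) * Y / g_n))

# Toy data; g_n[i] is the estimated probability of the treatment browser i
# actually received.
A   = np.array([1,   1,   0,   0])
Y   = np.array([1,   0,   1,   0])
g_n = np.array([0.8, 0.5, 0.4, 0.9])

psi_1 = iptw_estimate(A, Y, g_n, 1)  # (1/0.8)/4 = 0.3125
psi_0 = iptw_estimate(A, Y, g_n, 0)  # (1/0.4)/4 = 0.625
```

Note how a single small g_n value dominates the sum, which is exactly the instability under positivity violations discussed below.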

The A-IPTW estimator is an estimating equation-based estimator that is doubly robust. Double robustness means that it is a consistent estimate of the causal effect when either Q_n is a consistent estimate of Q_0 or g_n is a consistent estimate of g_0. Furthermore, the A-IPTW estimator is locally efficient. The A-IPTW estimator is:

    ψ^{a}_{n,A-IPTW} = (1/n) Σ_{i=1}^{n} [ I(A_i = a) / g_{A,n}(A_i, W_i) ] (Y_i − Q_{Y,n}(a, W_i))
                     + (1/n) Σ_{i=1}^{n} Q_{Y,n}(a, W_i).   (10)
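A sketch of Eq. 10, combining the plug-in term with the inverse-probability-weighted residual correction; both fitted pieces (g_n and Q_n) are hypothetical stand-ins:

```python
import numpy as np

def aiptw_estimate(A, Y, W, g_n, Q_n, a):
    """A-IPTW estimate of Eq. 10: the plug-in average of Q_n(a, W_i) plus an
    inverse-probability-weighted correction of the outcome-model residuals."""
    A, Y, g_n = map(np.asarray, (A, Y, g_n))
    Q_a = np.array([Q_n(a, w) for w in W])
    correction = np.mean((A == a) / g_n * (Y - Q_a))
    return float(np.mean(Q_a) + correction)

# Hypothetical fitted outcome regression (same illustrative form as before).
def Q_n(a, w):
    return 1.0 / (1.0 + np.exp(-(-2.0 + 1.2 * a + 0.8 * w)))

A   = np.array([1,   1,   0,   0])
Y   = np.array([1,   0,   1,   0])
W   = np.array([1.0, 0.0, 0.5, 1.5])
g_n = np.array([0.8, 0.5, 0.4, 0.9])  # estimated P(A_i | W_i), assumed

psi_1 = aiptw_estimate(A, Y, W, g_n, Q_n, 1)
```

If the outcome model is exactly right the residual correction averages toward zero, which is one intuition for double robustness; the correction term still carries 1/g_n in it, however, so the estimate is unbounded when g_n is near zero.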

Both the IPTW estimator and the A-IPTW estimator may blow up when g_{A,n}(A_i, W_i) is not well-bounded. In situations where g_{A,n}(A_i, W_i) is close to zero for some of the observed browsers, these estimators may be unstable. A quick examination of the estimators reveals why: since g_{A,n}(A_i, W_i) appears in the denominator, a value close to zero produces a very large contribution to the estimating equation from that particular browser. This contribution is unbounded when g_{A,n}(A_i, W_i) is not bounded away from zero, which can result in a very poor estimate. The lack of boundedness of g_{A,n}(A_i, W_i) has been called a violation of the positivity assumption, or of the ETA (experimental treatment assignment) assumption. For a more thorough discussion of these types of violations see [12], [15], [17].

The last estimator we will explore is the Targeted Maximum Likelihood Estimator (TMLE). TMLE is a substitution estimator, like the MLE above. Moreover, TMLE is double robust and locally efficient [10], like A-IPTW. The TMLE, Ψ(Q*_{Y,n}), is of the following form:

    ψ^{a}_{n,TMLE} = (1/n) Σ_{i=1}^{n} Q*_{Y,n}(a, W_i),   (11)

where Q*_{Y,n}(a, W_i) is an update of Q_{Y,n}(a, W_i) specifically chosen to target the parameter of interest. This update is done by fluctuating Q_{Y,n}(a, W_i) with a parametric sub-model of the following form:

    logit(Q*_{Y,n}(a, W_i)) = logit(Q_{Y,n}(a, W_i)) + ε_n h_{a,n}(A_i, W_i),   (12)

where h_{a,n}(A_i, W_i) = I(A_i = a)/g_{A,n}(a, W_i) and ε_n is fit by maximum likelihood. Using the logit here is just a computational trick to arrive at the TMLE and is not based on the fact that logistic regression was used to estimate the initial Q_{Y,0} or g_{A,0}. Implementing this sub-model fluctuation can easily be done using the standard glm function in most statistical packages, with an offset equal to the logit of Q_{Y,n}(a, W_i). The theoretical basis for the choice of h(A, W) is explained in van der Laan's seminal paper on TMLE [25]. A more in-depth explanation of the implementation, as well as code for implementing this TMLE, may be found in Gruber and van der Laan's gentle introduction to TMLE [3]. The TMLE is not as sensitive to violations of the positivity assumption as the A-IPTW and IPTW estimators presented above, because the contribution of each browser to the estimator is bounded between 0 and 1. Thus, the estimator follows a proper model, and the final estimates are guaranteed to be proper probabilities that fall in the expected range. However, under more extreme violations of the positivity assumption, the TMLE may also lose some stability, and extensions of TMLE that are more robust in these situations have been proposed and implemented [22, 4, 19]. These methods are outside the scope of the current paper.
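The targeting step in Eqs. 11-12 can be sketched as follows. The Newton iteration below stands in for the glm-with-offset fit described in the text, and all inputs (g_a, Q_a) are hypothetical already-fitted values, not the paper's models:

```python
import numpy as np

def logit(p):
    return np.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def tmle_estimate(A, Y, g_a, Q_a, a):
    """Sketch of the TMLE targeting step. g_a[i] and Q_a[i] are assumed
    fitted values of g_{A,n}(a, W_i) and Q_{Y,n}(a, W_i). epsilon is fit by
    maximum likelihood via Newton steps on the one-parameter sub-model."""
    A, Y, g_a, Q_a = map(np.asarray, (A, Y, g_a, Q_a))
    h_fit = (A == a) / g_a   # clever covariate at the observed treatment
    h_all = 1.0 / g_a        # clever covariate with treatment set to a
    offset = logit(Q_a)
    eps = 0.0
    for _ in range(100):
        p = expit(offset + eps * h_fit)
        score = np.sum(h_fit * (Y - p))           # d loglik / d eps
        info = np.sum(h_fit ** 2 * p * (1.0 - p)) # -d^2 loglik / d eps^2
        if info == 0.0 or abs(score) < 1e-10:
            break
        eps += score / info
    # Update every browser's prediction and average, as in Eq. 11.
    return float(np.mean(expit(offset + eps * h_all)))

# Toy inputs.
A   = np.array([1,    1,    0,    0,    1])
Y   = np.array([1,    0,    1,    0,    1])
g_a = np.array([0.8,  0.5,  0.4,  0.9,  0.7])
Q_a = np.array([0.45, 0.30, 0.40, 0.55, 0.60])

psi_tmle = tmle_estimate(A, Y, g_a, Q_a, 1)  # always lands in [0, 1]
```

Because the update happens on the logit scale and the final step averages probabilities, the estimate cannot leave [0, 1], which is the stability property the text attributes to TMLE.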

Confidence intervals and p-values for the A-IPTW and TMLE may be constructed using the variance of the influence curve, as described in [3]. They can be similarly constructed for the IPTW estimator; however, for IPTW they should be conservative. Alternatively, bootstrap methods may be used, and these have been shown to produce better estimates of confidence intervals in finite samples (see, e.g., [19]).

6. ANALYSIS

In this paper we focus on estimating the effect of display advertising for one marketer of interest. The marketing campaign analyzed was for a major fast food chain. Each of the above methods was implemented to estimate the effect of advertising in each of the following three sub-populations:

1. Individuals who have visited the chain's website in the past, or Action Takers (AT).

2. Individuals who have not visited the site in the past but were targeted based on their natural tendency to convert at a higher rate than the general population, or Network Neighbors (NN). These individuals were targeted using machine learning algorithms.

3. Individuals who have not visited the site, or Run-of-Network (RON).

In each case the data was sampled in the following way:

1. A day t_0 was defined.

2. On t_0, all individuals within the specified sub-population for whom a medium-sized ad serving company received a bid request were sampled.

3. W, the vector of potential confounders at baseline, was recorded. These potential confounders included past browsing content, past browsing intensity, IP type (.com, .org, .gov, etc.), Internet connection type, browser used, number of times an ad network has seen the browser, and days since the browser was first seen, as well as whether the individual visited the site in the past two days.

4. On the following day, t_0 + 1, it was recorded whether each sampled browser saw an advertisement for the marketer of interest (A = 1 for individuals who saw such an advertisement). It was also recorded whether the individual took the desired action between t_0 and the time of the impression; if an individual took an action prior to seeing the impression, this action was recorded in their vector of baseline variables, W.

5. The action window was set to five days.

6. For those shown the advertisement, the action window begins the second the first advertisement is shown on t_0 + 1. For those not shown the advertisement, the action window starts at the same time of day on t_0 + 1 that the first bid request was observed on t_0.

7. Each browser was then observed for the following five days, and at the second the five-day window closed it was recorded whether the action was observed within the window: Y = 1 if an action was observed and Y = 0 otherwise.

Because the data was sampled in this way, the estimated causal effects may be used to make inferences about how advertising affects conversions in the sub-population (e.g., past site visitors) for whom the ad company received bid requests. Choosing the time to start the action window for the untreated, A = 0, requires some assumptions, since there is no event, such as an advertisement, at which to start the window. Fortunately, for the unadvertised group there is no reason to believe that the probability of converting should jump right at the start of the window. Thus, we chose to use the time of day on t_0 + 1 at which the first bid request was observed on t_0. We performed a sensitivity analysis, choosing other points in time to start the action window, and no difference was seen in the results. It should be made completely clear that the action window was still exactly five days, to the millisecond, both for those shown the advertisement and for those not shown it.

7. RESULTS

Table 1 shows the results of our experiments using the approaches described in the previous sections. In particular, for each of the five methods we show the estimates of the conversion rates (C-Rates) with and without the ad, as well as the measures of impact. The p-values of 0.000 in the table indicate that the p-value was less than 10^{-16}. The p-value is not shown for the UNADJ estimates since those estimates are known to be biased, and thus the p-values do not provide relevant information. The p-values for the MLE estimates are not provided since there is no theoretical basis for their construction, as discussed above. The p-values displayed are for the additive impact; the p-values for relative impact are similar.

In order to obtain reasonable results for the IPTW and A-IPTW estimators, the estimated treatment probabilities had to be bounded at 0.98: all predictions g_n(a = 1, W_i) greater than 0.98 were set to 0.98. The results without this truncation are shown in Table 2. The violation of the positivity assumption has a drastic effect on the estimate of the conversion probability for the untreated. This is because there are levels of the baseline variables that are almost perfectly predictive of being shown an ad, and the IPTW and A-IPTW contributions for those browsers are extremely large, as discussed above, driving the estimate far outside the range of a probability, between 0% and 100%. While these observations have sufficient influence in the situation presented in Table 2 to drive the estimate out of the proper range, it is also possible that less severe violations can bias the results even if the estimates are within the proper range. For this reason the TMLE is preferable. Situations where the IPTW and A-IPTW estimators blow up while the TMLEs remain stable are fairly common and not specific to the situation observed here; in fact, many articles address this issue (see, e.g., [13], [22], [19]). Notice in Table 1 that even after bounding the denominator g_{A,n}, the resulting IPTW and A-IPTW point estimates for the ATs, though they lie in the appropriate range, are still unreasonable (biased). In fact, the IPTW point estimate suggests that displaying the advertisement results in 11,400 fewer conversions per every 100,000 times the display advertisement is shown, and the A-IPTW estimate suggests 5,500 fewer conversions per 100,000. The fact that certain individuals are targeted for ads based on their characteristics suggests that violations of the positivity assumption are common in the display advertising environment, making the instability of the IPTW and A-IPTW estimators a valid concern.

AT               UNADJ    MLE      IPTW     A-IPTW   TMLE
No Ad C-Rate     3.6%     3.6%     15.4%    9.6%     3.7%
Ad C-Rate        4.4%     4.1%     4.1%     4.2%     4.1%
Relative Impact  1.2      1.1      0.3      0.4      1.1
Additive Impact  0.8%     0.5%     -11.4%   -5.5%    0.4%
p-value          --       --       0.17     0.48     0.05

NN               UNADJ    MLE      IPTW     A-IPTW   TMLE
No Ad C-Rate     0.51%    0.52%    0.52%    0.51%    0.52%
Ad C-Rate        1.03%    0.73%    0.81%    0.83%    0.80%
Relative Impact  2.0      1.4      1.5      1.6      1.5
Additive Impact  0.52%    0.21%    0.29%    0.33%    0.28%
p-value          --       --       0.000    0.000    0.000

RON              UNADJ    MLE      IPTW     A-IPTW   TMLE
No Ad C-Rate     0.15%    0.15%    0.15%    0.15%    0.15%
Ad C-Rate        0.37%    0.37%    0.35%    0.37%    0.35%
Relative Impact  2.5      2.5      2.4      2.5      2.4
Additive Impact  0.23%    0.23%    0.20%    0.22%    0.20%
p-value          --       --       0.126    0.097    0.125

Table 1: Conversion rates and impact of advertising for the three different subpopulations.

               IPTW     A-IPTW       TMLE
No Ad C-Rate   136.3%   -119901.8%   3.6%
Ad C-Rate      4.1%     4.2%         4.2%

Table 2: Impact of advertising to past converters: with unbounded g_{A,n}, the IPTW and A-IPTW estimates blow up.
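The truncation used to produce Table 1 amounts to a one-line bounding of the estimated treatment probabilities; the values below are hypothetical stand-ins for the fitted g_n(a = 1, W_i):

```python
import numpy as np

# Hypothetical estimated treatment probabilities; two are nearly 1, the kind
# of practical positivity violation discussed in the text.
g_n = np.array([0.35, 0.999, 0.64, 0.9999])

# Bound at 0.98, as in the analysis, capping any single browser's influence
# on the IPTW and A-IPTW estimating equations.
g_bounded = np.clip(g_n, None, 0.98)
print(g_bounded.tolist())  # [0.35, 0.98, 0.64, 0.98]
```

(A symmetric lower bound would be applied in the same way when probabilities near zero are the concern.)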

Now we will make some general observations based on Table 1, focusing on the results from the TMLE. For past action takers (AT), the advertisement results in 400 extra conversions for every 100,000 times it is displayed. However, this difference is only borderline significant, suggesting that there may or may not be an effect of showing the advertisement to past action takers. This is not particularly surprising, considering that retargeting, displaying the advertisement to people who have visited the site in the past, is a common practice in display advertising: the untreated browsers in the AT segment have most probably been shown the display advertisement several times by other firms that perform display advertising. For network neighbors (NN), if all of them were shown an advertisement, 800 people out of 100,000 would have converted, whereas if none were shown an advertisement, 520 would have converted. The advertisement thus results in an extra 280 conversions for every 100,000 times it is displayed, or a 1.5 times greater conversion rate. This difference in conversion rates is extremely significant, with a p-value of less than 10^{-16}. For run-of-network (RON), if all of them were shown an advertisement, 350 people out of 100,000 would have converted, whereas if none were shown an advertisement, 150 would have converted. Within RON, the advertisement thus results in an extra 200 conversions for every 100,000 times it is displayed, or a 2.4 times greater conversion rate. (Ratios may differ from simply dividing the displayed treated and untreated probabilities because of rounding.) The effect of the advertisement within RON is not significant at the 10 percent level.

Now for some general observations comparing the effects of advertising between sub-groups. The base conversion rates for ATs are much larger than for NNs: unexposed ATs provide 3,180 more conversions than unexposed NNs per 100,000 browsers. In addition, the base (unexposed) rate for RON is 150 conversions per 100,000 browsers, while the base rate for NNs is 520. This suggests that the machine learning mechanism used for choosing individuals to target is successfully choosing individuals who have higher base conversion rates than RON; in fact, people grouped into the NN segment are 3.5 times more likely than RON to convert even when not shown the advertisement. Furthermore, the effect of advertising to the NNs is larger than the effect of advertising to RON: 280 extra conversions per 100,000 advertisements shown versus 200.^6 This suggests that the machine learning algorithms used to choose individuals to target for display advertisements based on their propensity to convert are successfully choosing individuals who are more likely to respond to the display advertisement.

8. INTERNAL METHOD VALIDATION

In this section we present some evidence that the methods described above are actually performing as expected. Several simulation studies for TMLE display how it performs under different scenarios and compare its performance to the other methods presented above (see, e.g., [13], [22], [10]). Here we present some additional analyses that verify that the methods are working in our current setting.

First, a negative test was performed. In this test we analyzed whether the advertisement for a different marketer, a telecommunications company, had any effect on conversions for the fast food chain analyzed above. This test shows whether the methods we have implemented return spurious results when, in fact, there is no effect of the advertisement. We would expect the telecommunications company's advertisement to have no effect on fast food site-visit conversions.

^6 For comparison between sub-groups that have different untreated conversion rates, we prefer comparing based on additive impact, since relative impact is highly influenced by the level of the untreated conversion rate, as discussed above.

Table 3 presents the results of this test for the marketer of interest's ATs. The TMLE estimates that browsers shown the telecommunications advertisement convert at a rate of 3.79 percent and that those not shown it convert at a rate of 3.84 percent, for an additive difference of -0.06 percent (p-value 0.89). Thus, no effect of the telecommunications company's ad was observed.

                 UNADJ    MLE      IPTW     A-IPTW   TMLE
No Ad C-Rate     3.84%    3.85%    3.89%    3.84%    3.84%
Ad C-Rate        4.07%    3.59%    3.89%    3.79%    3.79%
Relative Impact  1.06     0.93     1.00     0.99     0.98
Additive Impact  0.23%    -0.26%   -0.00%   -0.05%   -0.06%
p-value          --       --       0.99     0.91     0.89

Table 3: Impact of the telecommunications company's advertisement on fast food conversion.

As another test of the validity of our proposed approach, we compared the results to an A/B test that was recently run for a different marketing campaign, that of a clothing retailer. The A/B test revealed a conversion rate of 3.26 conversions per 1,000 people when the advertisement was shown, compared to a TMLE estimate of 3.19 conversions per 1,000 people. For the untreated, the A/B test estimate was 9.9 conversions per 10,000, compared to a TMLE estimate of 8.2 conversions per 10,000. Again, the IPTW and estimating equation-based (A-IPTW) methods did not perform as well as the TMLE: their untreated estimates were close, but both methods estimated only 2.2 conversions per 1,000 people when the advertisement was shown.

9. CONCLUSION AND FURTHER WORK

The results displayed above show that by using methods developed in other fields for estimating causal effects, we can estimate the effect of advertising from observational data and alleviate the need for an A/B, or randomized, test. These methods may also be used to estimate the effect of the advertisement within particular sub-populations of interest. We showed that for a particular fast food marketing campaign, the display advertisement resulted in 280 extra conversions per 100,000 non-past action takers who were targeted for advertising (NNs), and 200 additional conversions for run-of-network (RON). The effect estimated within the NNs is extremely significant, while the effect within RON is not significant at the 10 percent level. We also showed that advertising to past action takers results in a borderline-significant increase of 400 extra conversions per every 100,000 times the advertisement is displayed. Although the estimated additive impact for ATs is higher than for NNs (400 vs. 280), the conversions of non-past action takers (NNs and RON) are potentially more valuable from the company's perspective, because they represent a new stream of potential income, and once they convert they become ATs.

The above results also displayed the stability and robustness of using a double robust substitution estimator, targeted maximum likelihood estimation (TMLE), to estimate the causal effect of advertising. Inverse probability weighted estimators (IPTW) and estimating equation-based double robust estimators (A-IPTW) tend to be unstable when estimating the causal effect of advertising in situations where there are levels of baseline variables that are highly predictive of browsers seeing a particular display ad within the sub-population of interest. The stability of TMLE relative to these other methods, in such situations where the positivity assumption is violated, makes it particularly appealing for estimating the causal effect of online display advertising.

The analysis presented here is just one example of how causal effect estimation methods may be implemented in the display advertising environment. Extensions of the approach presented here may be used to answer other causal business questions of interest. For example, the approach may be used to estimate the effect of the intensity of display advertising, the timing of display advertising, or the creative being displayed. Furthermore, extensions of these methods may be used to estimate the effect of advertising on the time until customer conversion.

10. ACKNOWLEDGEMENTS

We would like to thank Tom Phillips, Rod Hook, Brian May, and Andre Comeau for their valuable insights and suggestions. We would also like to thank Edward Capriolo, whose patience and support in using the Hadoop distributed file system and the HIVE query language were critical, and without whom this analysis would not have been possible.

11. REFERENCES

[1] D. Chan, R. Ge, O. Gershony, T. Hesterberg, and D. Lambert. Evaluating online ad campaigns in a pipeline: causal models at scale. In Proceedings of KDD, KDD '10, pages 7-16, New York, NY, USA, 2010. ACM.

[2] J. Ebbert. The CTR means nothing, say HP researchers Leighton and Satiroglu. http://www.adexchanger.com/research/clickthrough-rate-rethink11/, Mar. 2011.

[3] S. Gruber and M. van der Laan. Targeted maximum likelihood estimation: A gentle introduction. UC Berkeley Division of Biostatistics Working Paper Series, paper 252, 2009.

[4] S. Gruber and M. van der Laan. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. The International Journal of Biostatistics, 6(1):18, 2010.

[5] R. Kohavi and R. Longbotham. Unexpected results in online controlled experiments. ACM SIGKDD Explorations Newsletter, 12(2):31-35, 2010.

[6] R. Kohavi, R. Longbotham, D. Sommerfield, and R. Henne. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18:140-181, 2009. doi:10.1007/s10618-008-0114-1.

[7] R. Lewis and D. Reiley. Does retail advertising work? Measuring the effects of advertising on sales via a controlled experiment on Yahoo! Technical report, working paper, 2010.

[8] R. Lewis, D. Reiley, and T. Schreiner. Can online display advertising attract new customers? Measuring an advertiser's new accounts with a large-scale experiment on Yahoo! Technical report, working paper, 2010.

[9] S. Lewis. Mendelian randomization as applied to coronary heart disease, including recent advances incorporating new technology. Circulation: Cardiovascular Genetics, 3(1):109, 2010.

[10] K. Moore and M. van der Laan. Covariate adjustment in randomized trials with binary outcomes: Targeted maximum likelihood estimation. Statistics in Medicine, 28(1):39-64, 2009.

[11] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2008.

[12] M. Petersen, K. Porter, S. Gruber, Y. Wang, and M. van der Laan. Diagnosing and responding to violations in the positivity assumption. UC Berkeley Division of Biostatistics Working Paper Series, paper 269, 2010.

[13] K. Porter, S. Gruber, M. van der Laan, and J. Sekhon. The relative performance of targeted maximum likelihood estimators. UC Berkeley Division of Biostatistics Working Paper Series, paper 279, 2011.

[14] F. Provost, B. Dalessandro, R. Hook, X. Zhang, and A. Murray. Audience selection for on-line brand advertising: privacy-friendly social network targeting. In Proceedings of KDD, pages 707-716. ACM, 2009.

[15] J. Robins. A new approach to causal inference in mortality studies with a sustained exposure period: application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393-1512, 1986.

[16] J. Robins. A new approach to causal inference in mortality studies with sustained exposure periods: application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393-1512, 1986.

[17] J. Robins. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association Section on Bayesian Statistical Science, volume 6, 1999.

[18] D. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688-701, 1974.

[19] O. Stitelman and M. van der Laan. Collaborative targeted maximum likelihood for time to event data. The International Journal of Biostatistics, 6(1):21, 2010.

[20] A. Tsiatis. Semiparametric Theory and Missing Data. Springer Verlag, 2006.

[21] M. van der Laan. Targeted maximum likelihood based causal inference: Part I. The International Journal of Biostatistics, 6, 2010.

[22] M. van der Laan and S. Gruber. Collaborative double robust targeted maximum likelihood estimation. The International Journal of Biostatistics, 6(1):17, 2010.

[23] M. van der Laan and J. Robins. Unified Methods for Censored Longitudinal Data and Causality. Springer, New York, 2003.

[24] M. van der Laan and S. Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York, NY, 2011.

[25] M. van der Laan and D. Rubin. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1), 2006.