Unbiased Learning-to-Rank with Biased Feedback

Thorsten Joachims

Cornell University, Ithaca, NY

tj@cs.cornell.edu

Adith Swaminathan

Cornell University, Ithaca, NY

adith@cs.cornell.edu

Tobias Schnabel

Cornell University, Ithaca, NY

tbs49@cornell.edu

ABSTRACT

Implicit feedback (e.g., clicks, dwell times, etc.) is an abun-

dant source of data in human-interactive systems. While

implicit feedback has many advantages (e.g., it is inexpen-

sive to collect, user centric, and timely), its inherent biases

are a key obstacle to its eﬀective use. For example, posi-

tion bias in search rankings strongly inﬂuences how many

clicks a result receives, so that directly using click data as a

training signal in Learning-to-Rank (LTR) methods yields

sub-optimal results. To overcome this bias problem, we

present a counterfactual inference framework that provides

the theoretical basis for unbiased LTR via Empirical Risk

Minimization despite biased data. Using this framework,

we derive a Propensity-Weighted Ranking SVM for discrim-

inative learning from implicit feedback, where click models

take the role of the propensity estimator. In contrast to

most conventional approaches to de-biasing the data using

click models, this allows training of ranking functions even

in settings where queries do not repeat. Beyond the theoret-

ical support, we show empirically that the proposed learning

method is highly eﬀective in dealing with biases, that it is

robust to noise and propensity model misspeciﬁcation, and

that it scales eﬃciently. We also demonstrate the real-world

applicability of our approach on an operational search en-

gine, where it substantially improves retrieval performance.

1. INTRODUCTION

Batch training of retrieval systems requires annotated test

collections that take substantial eﬀort and cost to amass.

While economically feasible for Web Search, eliciting rele-

vance annotations from experts is infeasible or impossible

for most other ranking applications (e.g., personal collec-

tion search, intranet search). For these applications, im-

plicit feedback from user behavior is an attractive source of

data. Unfortunately, existing approaches for Learning-to-

Rank (LTR) from implicit feedback – and clicks on search

results in particular – have several limitations or drawbacks.

WSDM 2017, February 06-10, 2017, Cambridge, United Kingdom. © 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-4675-7/17/02...$15.00. DOI: http://dx.doi.org/10.1145/3018661.3018699

First, the naïve approach of treating a click/no-click as a

positive/negative relevance judgment is severely biased. In

particular, the order of presentation has a strong inﬂuence

on where users click [11]. This presentation bias leads to an

incomplete and skewed sample of relevance judgments that

is far from uniform, thus leading to biased learning-to-rank.

Second, treating clicks as preferences between clicked and

skipped documents has been found to be accurate [9, 11],

but it can only infer preferences that oppose the presented

order. This again leads to severely biased data, and learning

algorithms trained with these preferences tend to reverse the

presented order unless additional heuristics are used [9].

Third, probabilistic click models (see [4]) have been used

to model how users produce clicks, and they can take posi-

tion and context biases into account. By estimating latent

parameters of these generative click models, one can infer the

relevance of a given document for a given query. However,

inferring reliable relevance judgments typically requires that

the same query is seen multiple times, which is unrealistic

in many retrieval settings (e.g., personal collection search)

and for tail queries.

Fourth, allowing the LTR algorithm to randomize what is

presented to the user, as in online learning algorithms [17, 6] and batch learning from bandit feedback (BLBF) [25], can

overcome the problem of bias in click data in a principled

manner. However, requiring that rankings be actively per-

turbed during system operation whenever we collect training

data decreases ranking quality and, therefore, incurs a cost

compared to observational data collection.

In this paper we present a theoretically principled and em-

pirically eﬀective approach for learning from observational

implicit feedback that can overcome the limitations outlined

above. By drawing on counterfactual estimation techniques

from causal inference [8], we ﬁrst develop a provably un-

biased estimator for evaluating ranking performance using

biased feedback data. Based on this estimator, we propose

a Propensity-Weighted Empirical Risk Minimization (ERM)

approach to LTR, which we implement eﬃciently in a new

learning method we call Propensity SVM-Rank. While our

approach uses a click model, the click model is merely used

to assign propensities to clicked results in hindsight, not to

extract aggregate relevance judgments. This means that our

Propensity SVM-Rank does not require queries to repeat,

making it applicable to a large range of ranking scenarios.

Finally, our methods can use observational data and we do

not require that the system randomizes rankings during data

collection, except for a small pilot experiment to estimate

the propensity model.

When deriving our approach, we provide theoretical justification for each step, leading to a rigorous end-to-end approach that neither makes unspecified assumptions nor employs heuristics. This provides a principled basis for further improving components of the approach (e.g., the click

propensity model, the ranking performance measure, the

learning algorithm). We present an extensive empirical eval-

uation testing the limits of the approach on synthetic click

data, ﬁnding that it performs robustly over a large range

of bias, noise, and misspeciﬁcation levels. Furthermore, we

ﬁeld our method in a real-world application on an opera-

tional search engine, ﬁnding that it is robust in practice and

manages to substantially improve retrieval performance.

2. RELATED WORK

There are two groups of approaches for handling biases

in implicit feedback for learning-to-rank. The ﬁrst group

assumes the feedback collection step is ﬁxed, and tries to

interpret the observationally collected data so as to mini-

mize bias eﬀects. Approaches in the second group intervene

during feedback collection, trying to present rankings that

will lead to less biased feedback data overall.

Approaches in the ﬁrst group commonly assume some

model of user behavior in order to explain bias eﬀects. For

example, in a cascade model [5], users are assumed to se-

quentially go down a ranking and click on a document if

it is relevant. Clicks, under this model, let us learn pref-

erences between skipped and clicked documents. Learning

from these relative preferences lowers the impact of some

biases [9]. Other click models ([5, 3, 1], also see [4]) have

been proposed, and are trained to maximize log-likelihood of

observed clicks. In these click modeling approaches, perfor-

mance on downstream learning-to-rank algorithms is merely

an afterthought. In contrast, we separate click propensity

estimation and learning-to-rank in a principled way, and we

optimize for ranking performance directly. Our framework

allows us to plug-and-play more sophisticated user models

in place of the simple click models we use in this work.

The key technique used by approaches in the second group to obtain more reliable click data is the randomized experiment. For instance, randomizing documents across all

ranks lets us learn unbiased relevances for each document,

and swapping neighboring pairs of documents [16] lets us

learn reliable pairwise preferences. Similarly, randomized

interleaving can detect preferences between diﬀerent rankers

reliably [2]. Diﬀerent from online learning via bandit algo-

rithms and interleaving [30, 22], batch learning from ban-

dit feedback (BLBF) [25] still uses randomization during

feedback collection, and then performs oﬄine learning. Our

problem formulation can be interpreted as being half way

between the BLBF setting (loss function is unknown and no

assumptions on loss function) and learning-to-rank from ed-

itorial judgments (components of ranking are fully labeled

and loss function is given) since we know the form of the loss

function but labels for only some parts of the ranking are

revealed. All approaches that use randomization suﬀer from

two limitations. First, randomization typically degrades

ranking quality during data collection; second, deploying

non-deterministic ranking functions introduces bookkeeping

overhead. In this paper, the system can be deterministic and

we merely exploit and model stochasticity in user behavior.

Moreover, our framework also allows (but does not require)

the use of randomized data collection in order to mitigate

the eﬀect of biases and to lower estimator variance.

Our approach uses inverse propensity scoring (IPS), orig-

inally employed in survey sampling [7] and causal inference

from observational studies [19], but more recently also in

whole page optimization [29], IR evaluation with manual

judgments [20], and recommender evaluation [13, 21]. We

use randomized interventions similar to [5, 12, 28] to esti-

mate propensities in a position discount model. Unlike the

uniform ranking randomization of [28] (with its high perfor-

mance impact) or swapping adjacent pairs as in [5], we swap

documents in diﬀerent ranks to the top position randomly

as in [12]. See Section 5.3 for details.

Finally, our approach is similar in spirit to [28], where

propensity-weighting is used to correct for selection bias

when discarding queries without clicks during learning-to-

rank. The key insight of our work is to recognize that inverse

propensity scoring can be employed much more powerfully,

to account for position bias, trust bias, contextual eﬀects,

document popularity etc. using appropriate click models to

estimate the propensity of each click rather than the propen-

sity for a query to receive a click as in [28].

3. FULL-INFO LEARNING TO RANK

Before we derive our approach for LTR from biased im-

plicit feedback, we ﬁrst review the conventional problem of

LTR from editorial judgments. In conventional LTR, we are given a sample $X$ of i.i.d. queries $x_i \sim P(x)$ for which we assume the relevances $\mathrm{rel}(x, y)$ of all documents $y$ are known. Since all relevances are assumed to be known, we call this the Full-Information Setting. The relevances can be used to compute the loss $\Delta(\mathbf{y}|x)$ (e.g., negative DCG) of any ranking $\mathbf{y}$ for query $x$. Aggregating the losses of individual rankings by taking the expectation over the query distribution, we can define the overall risk of a ranking system $S$ that returns rankings $S(x)$ as

$$R(S) = \int \Delta(S(x)|x)\, dP(x). \tag{1}$$

The goal of learning is to find a ranking function $S \in \mathcal{S}$ that minimizes $R(S)$ for the query distribution $P(x)$. Since $R(S)$ cannot be computed directly, it is typically estimated via the empirical risk

$$\hat{R}(S) = \frac{1}{|X|} \sum_{x_i \in X} \Delta(S(x_i)|x_i).$$

A common learning strategy is Empirical Risk Minimization (ERM) [26], which corresponds to picking the system $\hat{S} \in \mathcal{S}$ that optimizes the empirical risk

$$\hat{S} = \operatorname*{argmin}_{S \in \mathcal{S}} \left\{ \hat{R}(S) \right\},$$

possibly subject to some regularization in order to control

overﬁtting. There are several LTR algorithms that follow

this approach (see [15]), and we use SVM-Rank [9] as a

representative algorithm in this paper.

The relevances rel(x, y) are typically elicited via expert

judgments. Apart from being expensive and often infeasi-

ble (e.g., in personal collection search), expert judgments

come with at least two other limitations. First, since it

is clearly impossible to get explicit judgments for all docu-

ments, pooling techniques [23] are used such that only the

most promising documents are judged. While cutting down

on judging eﬀort, this introduces an undesired pooling bias

because all unjudged documents are typically assumed to be

irrelevant. The second limitation is that expert judgments

rel(x, y) have to be aggregated over all intents that underlie

the same query string, and it can be challenging for a judge

to properly conjecture the distribution of query intents to

assign an appropriate rel(x, y).

4. PARTIAL-INFO LEARNING TO RANK

Learning from implicit feedback has the potential to over-

come the above-mentioned limitations of full-information

LTR. By drawing the training signal directly from the user,

it naturally reﬂects the user’s intent, since each user acts

upon their own relevance judgement subject to their speciﬁc

context and information need. It is therefore more appropriate to talk about query instances $x_i$ that include contextual information about the user, instead of query strings $x$. For a given query instance $x_i$, we denote with $r_i(y)$ the user-specific relevance of result $y$ for query instance $x_i$. One may argue that what expert assessors try to capture with $\mathrm{rel}(x, y)$ is the mean of the relevances $r_i(y)$ over all query instances that share the query string, so using implicit feedback for learning can remove much of the guesswork about what the distribution of users meant by a query.

However, when using implicit feedback as a relevance sig-

nal, unobserved feedback is an even greater problem than

missing judgments in the pooling setting. In particular, im-

plicit feedback is distorted by presentation bias, and it is

not missing completely at random [14]. To nevertheless de-

rive well-founded learning algorithms, we adopt the follow-

ing counterfactual model. It closely follows [20], which uni-

ﬁes several prior works on evaluating information retrieval

systems.

For concreteness and simplicity, assume that relevances are binary, $r_i(y) \in \{0, 1\}$, and our performance measure of interest is the sum of the ranks of the relevant results

$$\Delta(\mathbf{y}|x_i, r_i) = \sum_{y \in \mathbf{y}} \mathrm{rank}(y|\mathbf{y}) \cdot r_i(y). \tag{2}$$

Analogous to (1), we can define the risk of a system as

$$R(S) = \int \Delta(S(x)|x, r)\, dP(x, r). \tag{3}$$

In our counterfactual model, there exists a true vector of relevances $r_i$ for each incoming query instance $(x_i, r_i) \sim P(x, r)$. However, only a part of these relevances is observed for each query instance, while typically most remain unobserved. In particular, given a presented ranking $\bar{\mathbf{y}}_i$ we are more likely to observe the relevance signals (e.g., clicks) for the top-ranked results than for results ranked lower in the list. Let $o_i$ denote the 0/1 vector indicating which relevance values were revealed, $o_i \sim P(o|x_i, \bar{\mathbf{y}}_i, r_i)$. For each element of $o_i$, denote with $Q(o_i(y) = 1|x_i, \bar{\mathbf{y}}_i, r_i)$ the marginal probability of observing the relevance $r_i(y)$ of result $y$ for query $x_i$, if the user was presented the ranking $\bar{\mathbf{y}}_i$. We refer to this probability value as the propensity of the observation. We will discuss how $o_i$ and $Q$ can be obtained in Section 5.

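Using this counterfactual modeling setup, we can get an unbiased estimate of $\Delta(\mathbf{y}|x_i, r_i)$ for any new ranking $\mathbf{y}$ (typically different from the presented ranking $\bar{\mathbf{y}}_i$) via the inverse propensity scoring (IPS) estimator [7, 19, 8]

$$\hat{\Delta}_{IPS}(\mathbf{y}|x_i, \bar{\mathbf{y}}_i, o_i) = \sum_{y:\, o_i(y)=1} \frac{\mathrm{rank}(y|\mathbf{y}) \cdot r_i(y)}{Q(o_i(y)=1|x_i, \bar{\mathbf{y}}_i, r_i)} = \sum_{y:\, o_i(y)=1 \,\wedge\, r_i(y)=1} \frac{\mathrm{rank}(y|\mathbf{y})}{Q(o_i(y)=1|x_i, \bar{\mathbf{y}}_i, r_i)}.$$

This is an unbiased estimate of $\Delta(\mathbf{y}|x_i, r_i)$ for any $\mathbf{y}$, if $Q(o_i(y) = 1|x_i, \bar{\mathbf{y}}_i, r_i) > 0$ for all $y$ that are relevant, $r_i(y) = 1$ (but not necessarily for the irrelevant $y$).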

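$$
\begin{aligned}
\mathbb{E}_{o_i}\left[\hat{\Delta}_{IPS}(\mathbf{y}|x_i, \bar{\mathbf{y}}_i, o_i)\right]
&= \mathbb{E}_{o_i}\left[\sum_{y:\, o_i(y)=1} \frac{\mathrm{rank}(y|\mathbf{y}) \cdot r_i(y)}{Q(o_i(y)=1|x_i, \bar{\mathbf{y}}_i, r_i)}\right] \\
&= \sum_{y \in \mathbf{y}} \mathbb{E}_{o_i}\left[\frac{o_i(y) \cdot \mathrm{rank}(y|\mathbf{y}) \cdot r_i(y)}{Q(o_i(y)=1|x_i, \bar{\mathbf{y}}_i, r_i)}\right] \\
&= \sum_{y \in \mathbf{y}} \frac{Q(o_i(y)=1|x_i, \bar{\mathbf{y}}_i, r_i) \cdot \mathrm{rank}(y|\mathbf{y}) \cdot r_i(y)}{Q(o_i(y)=1|x_i, \bar{\mathbf{y}}_i, r_i)} \\
&= \sum_{y \in \mathbf{y}} \mathrm{rank}(y|\mathbf{y}) \cdot r_i(y) \\
&= \Delta(\mathbf{y}|x_i, r_i).
\end{aligned}
$$

The second step uses linearity of expectation, and the fourth step uses $Q(o_i(y) = 1|x_i, \bar{\mathbf{y}}_i, r_i) > 0$.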
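The unbiasedness argument can also be checked numerically. The following is a minimal Monte Carlo sketch, not the authors' code: the four documents, their relevances, and the propensity values are hypothetical, and observations are assumed to be drawn independently per result. It averages the IPS estimate over many simulated observation patterns and compares it with the true loss from (2).

```python
import random


def true_loss(ranking, rel):
    # Equation (2): sum of the 1-based ranks of the relevant results.
    return sum(pos for pos, doc in enumerate(ranking, 1) if rel[doc])


def ips_loss(ranking, rel, observed, propensity):
    # IPS estimate: only results that are observed AND relevant contribute,
    # each weighted by the inverse of its observation propensity.
    return sum(pos / propensity[doc]
               for pos, doc in enumerate(ranking, 1)
               if observed[doc] and rel[doc])


def simulate(n_trials=100_000, seed=0):
    rng = random.Random(seed)
    rel = {"a": 1, "b": 0, "c": 1, "d": 0}                   # hypothetical relevances
    propensity = {"a": 1.0, "b": 0.5, "c": 0.25, "d": 0.1}  # hypothetical Q values
    new_ranking = ["c", "b", "d", "a"]                      # ranking to evaluate
    total = 0.0
    for _ in range(n_trials):
        # Each relevance value is revealed independently with probability Q.
        observed = {d: rng.random() < q for d, q in propensity.items()}
        total += ips_loss(new_ranking, rel, observed, propensity)
    return total / n_trials, true_loss(new_ranking, rel)


if __name__ == "__main__":
    estimate, truth = simulate()
    print(f"mean IPS estimate: {estimate:.3f}, true loss: {truth}")
```

Individual IPS estimates vary widely (a rarely observed relevant result contributes a large term when it is observed), but their average approaches the true loss as the number of trials grows.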

An interesting property of $\hat{\Delta}_{IPS}(\mathbf{y}|x_i, \bar{\mathbf{y}}_i, o_i)$ is that only those results $y$ with $[o_i(y) = 1 \wedge r_i(y) = 1]$ (i.e., clicked results, as we will see later) contribute to the estimate. We therefore only need the propensities $Q(o_i(y) = 1|x_i, \bar{\mathbf{y}}_i, r_i)$ for relevant results. Since we will eventually need to estimate the propensities $Q(o_i(y) = 1|x_i, \bar{\mathbf{y}}_i, r_i)$, an additional requirement for making $\hat{\Delta}_{IPS}(\mathbf{y}|x_i, \bar{\mathbf{y}}_i, o_i)$ computable while remaining unbiased is that the propensities only depend on observable information (i.e., unconfoundedness, see [8]).

To define the empirical risk to optimize during learning, we begin by collecting a sample of $N$ query instances $x_i$, recording the partially-revealed relevances $r_i$ as indicated by $o_i$, and the propensities $Q(o_i(y) = 1|x_i, \bar{\mathbf{y}}_i, r_i)$ for the observed relevant results in the ranking $\bar{\mathbf{y}}_i$ presented by the system. Then, the empirical risk of a system is simply the IPS estimates averaged over query instances:

$$\hat{R}_{IPS}(S) = \frac{1}{N} \sum_{i=1}^{N} \sum_{y:\, o_i(y)=1 \,\wedge\, r_i(y)=1} \frac{\mathrm{rank}(y|S(x_i))}{Q(o_i(y)=1|x_i, \bar{\mathbf{y}}_i, r_i)}. \tag{4}$$

Since $\hat{\Delta}_{IPS}(\mathbf{y}|x_i, \bar{\mathbf{y}}_i, o_i)$ is unbiased for each query instance, the aggregate $\hat{R}_{IPS}(S)$ is also unbiased for $R(S)$ from (3),

$$\mathbb{E}[\hat{R}_{IPS}(S)] = R(S).$$

Furthermore, it is easy to verify that $\hat{R}_{IPS}(S)$ converges to the true $R(S)$ under mild additional conditions (i.e., propensities bounded away from 0) as we increase the sample size $N$ of query instances. So, we can perform ERM using this propensity-weighted empirical risk,

$$\hat{S} = \operatorname*{argmin}_{S \in \mathcal{S}} \left\{ \hat{R}_{IPS}(S) \right\}.$$

Finally, using standard results from statistical learning theory [26], consistency of the empirical risk paired with capacity control implies consistency also for ERM. In intuitive terms, this means that given enough training data, the learning algorithm is guaranteed to find the best system in $\mathcal{S}$.
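To make the propensity-weighted ERM step concrete, here is a small sketch that evaluates the empirical risk (4) for a finite set of candidate systems and picks the minimizer. The log format (a query paired with a dictionary of observed relevant results and their propensities), the two toy systems, and all numbers are assumptions made for illustration; a real learner would search a parameterized class rather than enumerate systems.

```python
def ips_risk(system, log):
    """Empirical risk (4): IPS estimates averaged over logged query instances.

    `log` is a list of (query, revealed) pairs, where `revealed` maps each
    observed relevant result to its propensity (hypothetical log format).
    """
    total = 0.0
    for query, revealed in log:
        ranking = system(query)
        for doc, q in revealed.items():
            total += (ranking.index(doc) + 1) / q  # rank(y | S(x_i)) / Q
    return total / len(log)


def erm(systems, log):
    # Pick the candidate system with the lowest propensity-weighted risk.
    return min(systems, key=lambda name: ips_risk(systems[name], log))


# Toy setup: a "query" is simply its tuple of candidate results.
systems = {
    "identity": lambda candidates: list(candidates),
    "reversed": lambda candidates: list(reversed(candidates)),
}
log = [
    (("a", "b", "c"), {"c": 0.25}),  # relevant result seen at a low-propensity rank
    (("d", "e", "f"), {"d": 1.0}),   # relevant result seen at the top rank
]

if __name__ == "__main__":
    for name, system in systems.items():
        print(name, ips_risk(system, log))  # identity 6.5, reversed 3.5
    print("ERM choice:", erm(systems, log))
```

With all propensities set to 1 the two systems tie (both average 2.0), so in this toy log the inverse-propensity weights are what separates them.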

5. FEEDBACK PROPENSITY MODELS

In Section 4, we showed that the relevance signal $r_i$, the observation pattern $o_i$, and the propensities of the observations $Q(o_i(y) = 1|x_i, \bar{\mathbf{y}}_i, r_i)$ are the key components for unbiased LTR from biased observational feedback. We now

outline how these quantities can be elicited and modeled in

a typical search-engine application. However, the general

framework of Section 4 extends beyond this particular ap-

plication, and beyond the particular feedback model below.

5.1 Position-Based Propensity Model

Search engine click logs provide a sample of query instances $x_i$, the presented ranking $\bar{\mathbf{y}}_i$, and a (sparse) click-vector where each $c_i(y) \in \{0, 1\}$ indicates whether result $y$ was clicked or not. To derive propensities of observed clicks, we will employ a click propensity model. For simplicity, we consider a straightforward examination model analogous to [18], where a click on a search result depends on the probability that a user examines a result (i.e., $e_i(y)$) and then decides to click on it (i.e., $c_i(y)$) in the following way:

$$P(e_i(y) = 1|\mathrm{rank}(y|\bar{\mathbf{y}})) \cdot P(c_i(y) = 1|r_i(y), e_i(y) = 1).$$

In this model, examination depends only on the rank of $y$ in $\bar{\mathbf{y}}$. So, $P(e_i(y) = 1|\mathrm{rank}(y|\bar{\mathbf{y}}_i))$ can be represented by a vector of examination probabilities $p_r$, one for each rank $r$. These examination probabilities can model presentation bias documented in eye-tracking studies [11], where users are more likely to see results at the top of the ranking than those further down.

For the probability of click on an examined result $P(c_i(y) = 1|r_i(y), e_i(y) = 1)$, we first consider the simplest model where clicking is a deterministic noise-free function of the user's private relevance assessment $r_i(y)$. Under this model, users click if and only if the result is examined and relevant ($c_i(y) = 1 \leftrightarrow [e_i(y) = 1 \wedge r_i(y) = 1]$). This means that for examined results (i.e., $e_i(y) = 1$) clicking is synonymous with relevance ($e_i(y) = 1 \rightarrow [c_i(y) = r_i(y)]$). Furthermore, it means that we observe the value of $r_i(y)$ perfectly when $e_i(y) = 1$ ($e_i(y) = 1 \rightarrow o_i(y) = 1$), and that we gain no knowledge of the true $r_i(y)$ when a result is not examined ($e_i(y) = 0 \rightarrow o_i(y) = 0$). Therefore, examination equals observation and $Q(o_i(y)|x_i, \bar{\mathbf{y}}_i, r_i) \equiv P(e_i(y)|\mathrm{rank}(y|\bar{\mathbf{y}}_i))$.

Using these equivalences, we can simplify the IPS estimator from (4) by substituting $p_r$ as the propensities and by using $c_i(y) = 1 \leftrightarrow [o_i(y) = 1 \wedge r_i(y) = 1]$:

$$\hat{R}_{IPS}(S) = \frac{1}{n} \sum_{i=1}^{n} \sum_{y:\, c_i(y)=1} \frac{\mathrm{rank}(y|S(x_i))}{p_{\mathrm{rank}(y|\bar{\mathbf{y}}_i)}}. \tag{5}$$

$\hat{R}_{IPS}(S)$ is an unbiased estimate of $R(S)$ under the position-based propensity model if $p_r > 0$ for all ranks. While absence of a click does not imply that the result is not relevant (i.e., $c_i(y) = 0 \not\rightarrow r_i(y) = 0$), the IPS estimator has the nice property that such explicit negative judgments are not needed to compute an unbiased estimate of $R(S)$ for the loss in (2). Similarly, while absence of a click leaves us unsure about whether the result was examined (i.e., $e_i(y) = ?$), the IPS estimator only needs to know the indicators $o_i(y) = 1$ for results that are also relevant (i.e., clicked results).

Finally, note the conceptual diﬀerence in how we use this

standard examination model compared to most prior work.

We do not try to estimate an average relevance rating rel(x, y)

by taking repeat instances of the same query x, but we use

the model as a propensity estimator to de-bias individual

observed user judgments ri(y) to be used directly in ERM.
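As an illustration of how estimator (5) could be evaluated from a click log, the sketch below looks up each clicked result's rank in the presented ranking to obtain its propensity, and its rank in the new system's ranking to obtain the numerator. The function name, the log layout, and the examination probabilities are hypothetical, chosen only for this example.

```python
def position_ips_risk(new_rankings, presented_rankings, clicks, p):
    """Estimator (5) under the position-based propensity model.

    new_rankings[i]       -- ranking S(x_i) produced by the system to evaluate
    presented_rankings[i] -- ranking that was shown when the clicks were logged
    clicks[i]             -- results clicked for query instance i
    p[r - 1]              -- examination probability at rank r (must be > 0)
    """
    total = 0.0
    for new, shown, clicked in zip(new_rankings, presented_rankings, clicks):
        for doc in clicked:
            shown_rank = shown.index(doc) + 1  # rank under the presented ranking
            new_rank = new.index(doc) + 1      # rank under the evaluated system
            total += new_rank / p[shown_rank - 1]
    return total / len(presented_rankings)


# Hypothetical examination probabilities for ranks 1, 2, 3.
p = [1.0, 0.5, 0.25]
presented = [["a", "b", "c"], ["d", "e"]]
clicks = [["c"], ["d"]]
new = [["c", "a", "b"], ["e", "d"]]

if __name__ == "__main__":
    print(position_ips_risk(new, presented, clicks, p))  # (1/0.25 + 2/1.0) / 2 = 3.0
```

Only clicked results enter the sum, so no explicit negative judgments and no examination indicators for unclicked results are needed.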

5.2 Incorporating Click Noise

In Section 5.1, we assumed that clicks reveal the user's true $r_i$ in a noise-free way. This is clearly unrealistic. In addition to the stochasticity in the examination distribution $P(e_i(y) = 1|\mathrm{rank}(y|\mathbf{y}))$, we now also consider noise in the distribution that generates the clicks. In particular, we no longer require that a relevant result is clicked with probability 1 and an irrelevant result is clicked with probability 0, but instead, for $1 \geq \epsilon_+ > \epsilon_- \geq 0$,

$$
\begin{aligned}
P(c_i(y) = 1|r_i(y) = 1, o_i(y) = 1) &= \epsilon_+, \\
P(c_i(y) = 1|r_i(y) = 0, o_i(y) = 1) &= \epsilon_-.
\end{aligned}
$$

The first line means that users click on a relevant result only with probability $\epsilon_+$, while the second line means that users may erroneously click on an irrelevant result with probability $\epsilon_-$. An alternative and equivalent way of thinking about click noise is that users still click deterministically as in the previous section, but based on a noisily corrupted version $\tilde{r}_i$ of $r_i$. This means that all reasoning regarding observation (examination) events $o_i$ and their propensities $p_r$ still holds, and that we still have that $c_i(y) = 1 \rightarrow o_i(y) = 1$. What does change, though, is that we no longer observe the "correct" $r_i(y)$ but instead get feedback according to the noise-corrupted version $\tilde{r}_i(y)$. What happens to our learning process if we estimate risk using (5), but now with $\tilde{r}_i$?

Fortunately, the noise does not affect ERM's ability to find the best ranking system given enough data. While using noisy clicks leads to biased empirical risk estimates w.r.t. the true $r_i$ (i.e., $\mathbb{E}[\hat{R}_{IPS}(S)] \neq R(S)$), in expectation this bias is order preserving for $R(S)$ such that the risk minimizer remains the same.

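$$
\begin{aligned}
& \mathbb{E}[\hat{R}_{IPS}(S_1)] > \mathbb{E}[\hat{R}_{IPS}(S_2)] \\
\Leftrightarrow\; & \mathbb{E}_{x, r, \bar{\mathbf{y}}}\left[\mathbb{E}_{o}\,\mathbb{E}_{c|o}\left[\sum_{y:\, c(y)=1} \frac{\mathrm{rank}(y|S_1(x)) - \mathrm{rank}(y|S_2(x))}{p_{\mathrm{rank}(y|\bar{\mathbf{y}})}}\right]\right] > 0 \\
\Leftrightarrow\; & \mathbb{E}_{x, r}\left[\sum_{y} P(c(y) = 1|o(y) = 1, r(y))\, \delta\mathrm{rank}(y|x)\right] > 0 \\
\Leftrightarrow\; & \mathbb{E}_{x, r}\left[\sum_{y} \delta\mathrm{rank}(y|x) \cdot \left(\epsilon_+ r(y) + \epsilon_- (1 - r(y))\right)\right] > 0 \\
\Leftrightarrow\; & \mathbb{E}_{x, r}\left[\sum_{y} \delta\mathrm{rank}(y|x) \cdot \left((\epsilon_+ - \epsilon_-)\, r(y) + \epsilon_-\right)\right] > 0 \\
\overset{*}{\Leftrightarrow}\; & \mathbb{E}_{x, r}\left[\sum_{y} \delta\mathrm{rank}(y|x) \cdot (\epsilon_+ - \epsilon_-)\, r(y)\right] > 0 \\
\Leftrightarrow\; & \mathbb{E}_{x, r}\left[\sum_{y} \delta\mathrm{rank}(y|x) \cdot r(y)\right] > 0 \\
\Leftrightarrow\; & R(S_1) > R(S_2),
\end{aligned}
$$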

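where $\delta\mathrm{rank}(y|x)$ is short for $\mathrm{rank}(y|S_1(x)) - \mathrm{rank}(y|S_2(x))$ and we use the fact that $\epsilon_- \sum_{y \in \bar{\mathbf{y}}} \delta\mathrm{rank}(y|x) = 0$ in the step marked $*$. This implies that our propensity-weighted ERM is a consistent approach for finding a ranking function with the best true $R(S)$,

$$\hat{S} = \operatorname*{argmin}_{S \in \mathcal{S}} \left\{ R(S) \right\} = \operatorname*{argmin}_{S \in \mathcal{S}} \left\{ \mathbb{E}[\hat{R}_{IPS}(S)] \right\}, \tag{6}$$

even when the objective is corrupted by click noise as specified above.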
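The step marked ∗ can be spelled out a little further. The following worked equation is a sketch under the assumption that both systems rank the same candidate set of $m$ results for a query ($m$ is notation introduced here, and $\epsilon_+$, $\epsilon_-$ denote the click probabilities for relevant and irrelevant results). Each ranking then assigns the ranks $1, \dots, m$ exactly once, so the rank differences sum to zero and the constant noise term drops out:

```latex
\sum_{y \in \bar{\mathbf{y}}} \delta\mathrm{rank}(y|x)
  = \sum_{y} \mathrm{rank}(y|S_1(x)) - \sum_{y} \mathrm{rank}(y|S_2(x))
  = \frac{m(m+1)}{2} - \frac{m(m+1)}{2}
  = 0,
\qquad\text{hence}\qquad
\sum_{y} \delta\mathrm{rank}(y|x)\,\bigl((\epsilon_+ - \epsilon_-)\, r(y) + \epsilon_-\bigr)
  = (\epsilon_+ - \epsilon_-) \sum_{y} \delta\mathrm{rank}(y|x)\, r(y).
```

Because $\epsilon_+ > \epsilon_-$, the remaining factor is strictly positive, so dividing by it does not change the direction of the inequality.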

5.3 Propensity Estimation

As the last step of deﬁning the click propensity model,

we need to address the question of how to estimate its parameters (i.e., the vector of examination probabilities $p_r$)

for a particular search engine [12]. The following shows that

we can get estimates using data from a simple intervention

similar to [28], but without the strong negative impact of

presenting uniformly random results to some users. This

also relates to the Click@1 metric proposed by [3].

First, note that it suffices to estimate the $p_r$ up to some positive multiplicative constant, since any such constant does not change how the IPS estimator (5) orders different systems. We therefore merely need to estimate how much $p_r$ changes relative to $p_k$ for some "landmark" rank $k$. This suggests the following experimental intervention for estimating $p_r$: before presenting the ranking to the user, swap the result at rank $k$ with the result at rank $r$. If we denote with $y'$ the result originally in rank $k$, our click model before and after the intervention indicates that

$$
\begin{aligned}
P(c_i(y') = 1|\text{no-swap}) &= p_k \cdot P(c_i(y') = 1|e_i(y') = 1) \\
P(c_i(y') = 1|\text{swap-}k\text{-and-}r) &= p_r \cdot P(c_i(y') = 1|e_i(y') = 1)
\end{aligned}
$$

where

$$P(c_i(y') = 1|e_i(y') = 1) = \sum_{v \in \{0, 1\}} P(c_i(y') = 1|r_i(y') = v, e_i(y') = 1) \cdot P(r_i(y') = v)$$

is constant regardless of the intervention. This means that the clickthrough rates $P(c_i(y') = 1|\text{swap-}k\text{-and-}r)$, which we can estimate from the intervention data, are proportional to the parameters $p_r$ for any $r$. By performing the swapping intervention between rank $k$ and all other ranks $r$, we can estimate all the $p_r$ parameters.

This swap-intervention experiment is of much lower im-

pact than the uniform randomization proposed in [28] for a

different propensity estimation problem, and careful consideration of which rank $k$ to choose can further reduce the impact of the swap experiment. From a practical perspective, it may also be unnecessary to separately estimate $p_r$ for each rank.

Instead, one may want to interpolate between estimates at

well-chosen ranks and/or employ smoothing. Finally, note

that the intervention only needs to be applied on a small sub-

set of the data used for ﬁtting the click propensity model,

while the actual data used for training the ERM learning

algorithm does not require any interventions.
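A minimal sketch of the swap-based estimation could look as follows. The data layout (click and impression counts for the landmark result, keyed by the rank at which it was shown) and the counts themselves are hypothetical. Since the propensities are only needed up to a positive constant, the sketch normalizes by the clickthrough rate at the landmark rank.

```python
def estimate_relative_propensities(swap_logs, landmark=1):
    """Estimate p_r / p_k from swap-intervention clickthrough rates.

    swap_logs[r] = (clicks, impressions) for the landmark result when it was
    shown at rank r; the entry for r == landmark is the no-swap condition.
    The returned values are propensities up to a multiplicative constant.
    """
    ctr = {r: clicks / shown for r, (clicks, shown) in swap_logs.items()}
    base = ctr[landmark]
    return {r: ctr[r] / base for r in sorted(ctr)}


# Hypothetical counts: the landmark result is clicked less often the lower it is shown.
swap_logs = {1: (400, 1000), 2: (200, 1000), 4: (100, 1000)}
relative_p = estimate_relative_propensities(swap_logs, landmark=1)

if __name__ == "__main__":
    print(relative_p)
```

Ranks that were not part of the intervention (rank 3 in this toy data) would have to be filled in by interpolation or smoothing, as suggested above.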

5.4 Alternative Feedback Propensity Models

The click propensity model we deﬁne above is arguably

one of the simplest models one can employ for propensity

modeling in LTR, and there is broad scope for extensions.

First, one could extend the model by incorporating other

biases, for example, trust bias [11] which aﬀects perceived

relevance of a result based on its position in the ranking.

This can be captured by conditioning click probabilities also on the position, $P(c_i(y') = 1|r_i(y'), e_i(y') = 1, \mathrm{rank}(y|\bar{\mathbf{y}}_i))$. We have already explored extending the model to include trust bias, but omit the details due to space constraints. Furthermore, it is possible to model saliency biases [31] by replacing the $p_r$ with a regression function.

Second, we conjecture that a wide range of other click

models (e.g., cascade model [5] and others [5, 3, 1, 4]) can

be adapted as propensity models. The main requirement

is that we can compute marginal click probabilities for the

clicked documents in hindsight, which may be feasible for

other existing models.

Third, we may be able to deﬁne and train new types of

click models. In particular, for our propensity ERM approach we only need the propensities $Q(o_i(y) = 1|x_i, \bar{\mathbf{y}}_i, r_i)$ for observed and relevant documents to evaluate the IPS estimator, but not for irrelevant documents. This can be substantially easier than a full generative model of how people reveal relevance judgments through implicit feedback. In particular, this model can condition on all the revealed relevances $r_i(y_j)$ in hindsight, and it does not need to treat them as latent variables.

Finally, the ERM learning approach is not limited to bi-

nary click feedback, but applies to a large range of feedback

settings. For example, the feedback may be explicit star

ratings in a movie recommendation system, and the propen-

sities may be the results of self-selection by the users as in

[21]. In such an explicit feedback setting, $o_i$ is fully known,

which simpliﬁes propensity estimation substantially.

6. PROPENSITY-WEIGHTED SVM-RANK

We now derive a concrete learning method that imple-

ments propensity-weighted LTR. It is based on SVM-Rank

[9, 10], but we conjecture that propensity-weighted versions

of other LTR methods can be derived as well.

Consider a dataset of $n$ examples of the following form. For each query-result pair $(x_j, y_j)$ that is clicked, we compute the propensity $q_j = Q(o_j(y_j) = 1|x_j, \bar{\mathbf{y}}_j, r_j)$ of the click according to our click propensity model. We also record the candidate set $Y_j$ of all results for query $x_j$. Typically, $Y_j$ contains a few hundred documents, selected by a stage-one ranker [27], that we aim to rerank. Note that each click generates a separate training example, even if multiple clicks occur for the same query.

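Given this propensity-scored click data, we define Propensity SVM-Rank as a generalization of conventional SVM-Rank. Propensity SVM-Rank learns a linear scoring function $f(x, y) = w \cdot \phi(x, y)$ that can be used for ranking results, where $w$ is a weight vector and $\phi(x, y)$ is a feature vector that describes the match between query $x$ and result $y$.

Propensity SVM-Rank optimizes the following objective,

$$
\begin{aligned}
\hat{w} = \operatorname*{argmin}_{w, \xi}\;\; & \frac{1}{2}\, w \cdot w + \frac{C}{n} \sum_{j=1}^{n} \frac{1}{q_j} \sum_{y \in Y_j} \xi_{jy} \\
\text{s.t.}\;\; & \forall y \in Y_1 \setminus \{y_1\}:\; w \cdot [\phi(x_1, y_1) - \phi(x_1, y)] \geq 1 - \xi_{1y} \\
& \qquad\vdots \\
& \forall y \in Y_n \setminus \{y_n\}:\; w \cdot [\phi(x_n, y_n) - \phi(x_n, y)] \geq 1 - \xi_{ny} \\
& \forall j\, \forall y:\; \xi_{jy} \geq 0.
\end{aligned}
$$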

$C$ is a regularization parameter that is typically selected via cross-validation. The training objective optimizes an upper bound on the regularized IPS estimated empirical risk of (5), since each line of constraints corresponds to the rank of a relevant document (minus 1). In particular, for any feasible $(w, \xi)$

$$
\begin{aligned}
\mathrm{rank}(y_i|\mathbf{y}) - 1 &= \sum_{y \neq y_i} \mathbb{1}_{w \cdot [\phi(x_i, y) - \phi(x_i, y_i)] > 0} \\
&\leq \sum_{y \neq y_i} \max\left(1 - w \cdot [\phi(x_i, y_i) - \phi(x_i, y)],\, 0\right) \\
&\leq \sum_{y \neq y_i} \xi_{iy}.
\end{aligned}
$$

We can solve this type of Quadratic Program eﬃ-

ciently via a one-slack formulation [10], and we are using

SVM-Rank with appropriate modiﬁcations to include IPS

weights 1/qj. The modiﬁcations are integrated into the

latest version of SVM-Rank, and the code is available at

http://www.joachims.org/svm_light/svm_proprank.html.

In the empirical evaluation, we compare against the naive

application of SVM-Rank, which minimizes the rank of the

clicked documents while ignoring presentation bias. In particular, Naive SVM-Rank sets all the $q_j$ uniformly to the same constant (e.g., 1).
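To illustrate how the propensity weights enter the training objective, the sketch below evaluates the objective for a fixed weight vector, using for each constraint the smallest feasible slack (the hinge loss). It is not the one-slack solver used by SVM-Rank and does no optimization; the data layout, the feature vectors, and the propensities are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def propensity_svm_objective(w, examples, C):
    """Value of the propensity-weighted objective for a fixed weight vector w.

    examples: list of (phi_clicked, phi_others, q) with the feature vector of
    the clicked result, the feature vectors of the other candidates for the
    same query, and the click propensity q (hypothetical layout).
    """
    n = len(examples)
    slack_term = 0.0
    for phi_clicked, phi_others, q in examples:
        # Smallest slack satisfying w.[phi(x, y_j) - phi(x, y)] >= 1 - xi and xi >= 0.
        slacks = sum(max(0.0, 1.0 - (dot(w, phi_clicked) - dot(w, phi)))
                     for phi in phi_others)
        slack_term += slacks / q  # inverse-propensity weight 1 / q_j
    return 0.5 * dot(w, w) + (C / n) * slack_term


w = [1.0, 0.0]
examples = [
    ([2.0, 0.0], [[0.0, 0.0], [1.5, 0.0]], 0.5),  # click with propensity 0.5
    ([0.0, 1.0], [[1.0, 0.0]], 1.0),              # click with propensity 1.0
]

if __name__ == "__main__":
    weighted = propensity_svm_objective(w, examples, C=2.0)
    # Naive variant: same data with every propensity set to 1.
    naive = propensity_svm_objective(
        w, [(c, others, 1.0) for c, others, _ in examples], C=2.0)
    print(weighted, naive)  # 3.5 3.0
```

The only difference between the two values is the factor 1/q applied to the first example's slack, which mirrors how the naive baseline and the propensity-weighted method differ.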

7. EMPIRICAL EVALUATION

We take a two-pronged approach to evaluating our ap-

proach empirically. First, we use synthetically generated

click data to explore the behavior of our methods over the

whole spectrum of presentation bias severity, click noise, and

propensity misspeciﬁcation. Second, we explore the real-

world applicability of our approach by evaluating on an op-

erational search engine using real click-logs from live traﬃc.

7.1 Synthetic Data Experiments

To be able to explore the full spectrum of biases and noise,

we conducted experiments using click data derived from the

Yahoo Learning to Rank Challenge corpus (set 1). This

corpus contains a large number of manually judged queries,

where we binarized relevance by assigning ri(y) = 1 to all

documents that got rated 3 or 4, and ri(y) = 0 for ratings

0,1,2. We adopt the train, validation, test splits in the cor-

pus. This means that queries in the three sets are disjoint,

and we never train on any data from queries in the test set.

To have a gold standard for reporting test-set performance,

we measure performance on the binarized full-information

ratings using (2).

To generate click data from this full-information dataset of ratings, we first trained a normal Ranking SVM using 1 percent of the full-information training data to get a ranking function $S_0$. We employ $S_0$ as the "Production Ranker", and it is used to "present" rankings $\bar{y}$ when generating the click data. We generate clicks using the rankings $\bar{y}$ and ground-truth binarized relevances from the Yahoo dataset according to the following process. Depending on whether we are generating a training or a validation sample of click data, we first randomly draw a query $x$ from the respective full-information dataset. For this query we compute $\bar{y} = S_0(x)$ and generate clicks based on the model from Section 5. Whenever a click is generated, we record a training example with its associated propensity $Q(o(y) = 1 \mid x, \bar{y}, r)$. For the experiments, we model presentation bias via

$$
Q(o(y) = 1 \mid x, \bar{y}, r) = p_{\operatorname{rank}(y \mid \bar{y})} = \left( \frac{1}{\operatorname{rank}(y \mid \bar{y})} \right)^{\eta}. \tag{7}
$$

[Figure 1 plot: Avg. Rank of Relevant Results (y-axis) vs. Number of Training Clicks (x-axis, 1.7E3 to 1.7E6) for Production Ranker, Propensity SVM-Rank, Clipped Propensity SVM-Rank, Naive SVM-Rank, and Noise-free Full-info Skyline.]

Figure 1: Test set performance in terms of (2) for Propensity SVM-Rank with and without clipping compared to SVM-Rank naively ignoring the bias in clicks ($\eta = 1$, $\epsilon_- = 0.1$). The skyline is a Ranking SVM trained on all data without noise in the full-information setting, and the baseline is the production ranker $S_0$.

The parameter $\eta$ lets us control the severity of the presentation bias. We also introduce noise into the clicks according to the model described in Section 5. When not mentioned otherwise, we use the parameters $\eta = 1$, $\epsilon_- = 0.1$, and $\epsilon_+ = 1$, which leads to click data where about 33% of the clicks are noisy clicks on irrelevant results and where the result at rank 10 has a 10% probability of being examined. We also explore other bias profiles and noise levels in the following experiments.
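The click-generation process can be sketched as follows. This is a hedged illustration rather than the authors' simulator: it assumes that the noise model of Section 5 (not reproduced in this section) clicks an examined relevant result with probability $\epsilon_+$ and an examined irrelevant result with probability $\epsilon_-$. The function names and the seeded generator are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def examination_propensity(rank: int, eta: float) -> float:
    """Position-based propensity (1 / rank) ** eta, as in Eq. (7)."""
    return (1.0 / rank) ** eta


def simulate_clicks(
    relevances: List[int],
    eta: float = 1.0,
    eps_plus: float = 1.0,
    eps_minus: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, float]]:
    """Simulate clicks on one presented ranking.

    relevances[k] is the binary relevance of the result shown at rank k + 1.
    Returns (rank, propensity) for every generated click.
    Assumption: examined relevant results are clicked with probability
    eps_plus, examined irrelevant results with probability eps_minus.
    """
    rng = rng or random.Random(0)
    clicks = []
    for rank, rel in enumerate(relevances, start=1):
        propensity = examination_propensity(rank, eta)
        if rng.random() >= propensity:
            continue  # result was not examined, so it cannot be clicked
        click_prob = eps_plus if rel == 1 else eps_minus
        if rng.random() < click_prob:
            clicks.append((rank, propensity))
    return clicks


# With eta = 1, the result at rank 10 is examined with probability 0.1.
p10 = examination_propensity(10, eta=1.0)
```

With `eta=0` every result is examined, so the sketch reduces to a setting without presentation bias.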

In all experiments, we select any parameters (e.g., C) of

the learning methods via cross-validation on a validation set.

The validation set is generated using the same click model as

the training set, but using the queries in the validation-set

portion of the Yahoo dataset. For Propensity SVM-Rank,

we always use the (unclipped) IPS estimator (5) to estimate

validation set performance. Keeping with the proportions

of the original Yahoo data, the validation set size is always

about 15% the size of the training set.

The primary baseline we compare against is a naive application of SVM-Rank that simply ignores the bias in the click data. We call this method Naive SVM-Rank. It is equivalent to a standard ranking SVM [9], but is most easily explained as equivalent to Propensity SVM-Rank with all $q_j$ set to 1. Analogously, we use the corresponding naive version of (5) with propensities set to 1 to estimate validation set performance for Naive SVM-Rank.

7.2 How does ranking performance scale with

training set size?

We first explore how the test-set ranking performance changes as the learning algorithm is given more and more click data. The resulting learning curves are given in Figure 1, and the performance of $S_0$ is given as a baseline. The click data has presentation bias according to (7) with $\eta = 1$ and noise $\epsilon_- = 0.1$. For small datasets, results are averaged over 5 draws of the click data.

[Figure 2 plot: Avg. Rank of Relevant Results (y-axis) vs. Severity of Presentation Bias (x-axis, 0 to 2) for Propensity SVM-Rank, 5x Propensity SVM-Rank, Naive SVM-Rank, and 5x Naive SVM-Rank.]

Figure 2: Test set performance for Propensity SVM-Rank and Naive SVM-Rank as presentation bias becomes more severe in terms of $\eta$ ($n = 45K$ and $n = 225K$, $\epsilon_- = 0$).

With increasing amounts of click data, Propensity SVM-Rank approaches the skyline performance of the full-information SVM-Rank trained on the complete training set of manual ratings without noise. This is in stark contrast to Naive SVM-Rank, which fails to account for the bias in the data and does not reach this level of performance. Furthermore, Naive SVM-Rank cannot make effective use of additional data and its learning curve is essentially flat. This is consistent with the theoretical insight that estimation error in Naive SVM-Rank's empirical risk $\hat{R}(S)$ is dominated by asymptotic bias due to biased clicks, which does not decrease with more data and leads to suboptimal learning. The unbiased risk estimate $\hat{R}_{IPS}(S)$ of Propensity SVM-Rank, however, has estimation error only due to finite sample variance, which is decreased by more data and leads to consistent learning.

While unbiasedness is an important property when click data is plentiful, the increased variance of $\hat{R}_{IPS}(S)$ can be a drawback for small datasets. This can be seen in Figure 1, where Naive SVM-Rank outperforms Propensity SVM-Rank for small datasets. This can be remedied using techniques like "propensity clipping" [24], where small propensities are clipped to some threshold value $\tau$ to trade bias for variance:

$$
\hat{R}_{CIPS}(S) = \frac{1}{n} \sum_{x_i} \sum_{y \in S(x_i)} \frac{\operatorname{rank}(y \mid S(x_i)) \cdot r_i(y)}{\max\{\tau,\, Q(o_i(y) = 1 \mid x_i, \bar{y}_i, r_i)\}}.
$$
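The effect of the clipping threshold can be illustrated with a short sketch. It assumes that every logged click marks an observed relevant result, so the sum runs over clicks only. The function name, the `(rank, propensity)` input format, and the numbers are hypothetical.

```python
from typing import List, Tuple


def clipped_ips_risk(
    clicks: List[Tuple[int, float]],
    n_queries: int,
    tau: float = 0.0,
) -> float:
    """Clipped IPS estimate of the average rank-based risk.

    clicks: one (rank under the evaluated ranker S, logged propensity) pair
            per clicked result (assumed relevant and observed).
    tau:    clipping threshold; tau = 0 leaves propensities unclipped,
            tau = 1 replaces every propensity by 1 (the naive estimator).
    """
    total = sum(rank / max(tau, q) for rank, q in clicks)
    return total / n_queries


# Hypothetical log: three clicks collected over three queries.
log = [(1, 1.0), (4, 0.25), (10, 0.05)]
unclipped = clipped_ips_risk(log, n_queries=3)          # weights 1, 4, 20
clipped = clipped_ips_risk(log, n_queries=3, tau=0.1)   # weights 1, 4, 10
naive = clipped_ips_risk(log, n_queries=3, tau=1.0)     # weights 1, 1, 1
```

In this toy log a single low-propensity click dominates the unclipped estimate. Raising `tau` caps its weight, which lowers variance at the cost of bias.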

Figure 1 shows the learning curve of Propensity SVM-Rank with clipping, cross-validating both the clipping threshold $\tau$ and $C$. Clipping indeed improves performance for small datasets. While $\tau = 1$ is equivalent to Naive SVM-Rank, the validation set is too small (and hence, the finite sample error of the validation performance estimate too high) to reliably select this model in every run. In practice, however, we expect click data to be plentiful such that lack of training data is unlikely to be a persistent issue.

7.3 How much presentation bias can be toler-

ated?

We now vary the severity of the presentation bias via $\eta$ to understand its impact on Propensity SVM-Rank. Figure 2 shows that inverse propensity weighting is beneficial whenever substantial bias exists. Furthermore, increasing the amount of training data by a factor of 5 leads to further improvement for Propensity SVM-Rank, while the added training data has no effect on Naive SVM-Rank. This is consistent with our arguments from Section 4: more training data does not help when bias dominates estimation error, but it can reduce estimation error from variance in the unbiased risk estimate of Propensity SVM-Rank.

[Figure 3 plot: Avg. Rank of Relevant Results (y-axis) vs. Noise Level (x-axis, 0 to 0.3) for Propensity SVM-Rank, 5x Propensity SVM-Rank, Naive SVM-Rank, and 5x Naive SVM-Rank.]

Figure 3: Test set performance for Propensity SVM-Rank and Naive SVM-Rank as the noise level increases in terms of $\epsilon_-$ ($n = 170K$ and $n = 850K$, $\eta = 1$).

7.4 How robust are the methods to click noise?

Figure 3 shows that Propensity SVM-Rank also enjoys a substantial advantage when it comes to noise. When increasing the noise level in terms of $\epsilon_-$ from 0 up to 0.3 (resulting in click data where 59.8% of all clicks are on irrelevant documents), Propensity SVM-Rank increasingly outperforms Naive SVM-Rank. And, again, the unbiasedness of the empirical risk estimate allows Propensity SVM-Rank to benefit from more data.

7.5 How robust is Propensity SVM-Rank to

misspeciﬁed propensities?

So far all experiments have assumed that Propensity SVM-

Rank has access to accurate propensities. In practice, how-

ever, propensities need to be estimated and are subject to

model assumptions. We now evaluate how robust Propen-

sity SVM-Rank is to misspeciﬁed propensities. Figure 4

shows the performance of Propensity SVM-Rank when the

training data is generated with η= 1, but the propensities

used by Propensity SVM-Rank are misspeciﬁed using the η

given in the x-axis of the plot. The plot shows that even

misspeciﬁed propensities can give substantial improvement

over naively ignoring the bias, as long as the misspeciﬁcation

is “conservative” – i.e., overestimating small propensities is

tolerable (which happens when η < 1), but underestimat-

ing small propensities can be harmful (which happens when

η > 1). This is consistent with theory, and clipping is one

particular way of overestimating small propensities that can

even improve performance. Overall, we conclude that even

a mediocre propensity model can improve over the naive ap-

proach – after all, the naive approach can be thought of as a

particularly poor propensity model that implicitly assumes

no presentation bias and uniform propensities.
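To make the direction of the misspecification concrete, the sketch below compares the IPS weight implied by an assumed $\eta$ with the weight under the true $\eta$, using the position-based model of (7). The helper names are hypothetical and the sketch only illustrates the weights, not the learning outcome.

```python
def ips_weight(rank: int, eta: float) -> float:
    """Inverse propensity weight 1 / p_rank for p_rank = (1 / rank) ** eta."""
    return float(rank) ** eta


def weight_ratio(rank: int, eta_assumed: float, eta_true: float = 1.0) -> float:
    """Weight under the assumed model divided by the weight under the true model.

    A ratio below 1 means the small propensity was overestimated (conservative);
    a ratio above 1 means it was underestimated, inflating the click's weight.
    """
    return ips_weight(rank, eta_assumed) / ips_weight(rank, eta_true)


# Ratios at rank 10 when the true eta is 1.
conservative = weight_ratio(10, eta_assumed=0.5)  # below 1: click under-weighted
aggressive = weight_ratio(10, eta_assumed=2.0)    # above 1: click over-weighted
naive = ips_weight(10, eta=0.0)                   # eta = 0: uniform weight of 1
```

Assuming $\eta = 0$ recovers the uniform weights of the naive approach, which is the extreme case of overestimating small propensities.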

[Figure 4 plot: Avg. Rank of Relevant Results (y-axis) vs. Assumed Propensity Model $\eta$ (x-axis, 0 to 2, true value 1) for Propensity SVM-Rank and Naive SVM-Rank.]

Figure 4: Test set performance for Propensity SVM-Rank and Naive SVM-Rank as propensities are misspecified (true $\eta = 1$, $n = 170K$, $\epsilon_- = 0.1$).

7.6 Real-World Experiment

We now examine the performance of Propensity SVM-Rank when learning a new ranking function for the Arxiv Full-Text Search (http://search.arxiv.org:8081/) based on real-world click logs from this system. The search engine uses a linear scoring function as outlined in Section 6. Query-document features $\phi(x, y)$ are represented by a 1000-dimensional vector, and the production ranker used for collecting training clicks employs a hand-crafted weight vector $w$ (denoted Prod). Observed clicks on rankings served by this ranker over a period of 21 days provide implicit feedback data for LTR as outlined in Section 6.

To estimate the propensity model, we consider the simple position-based model of Section 5.1 and we collect new click data via randomized interventions for 7 days as outlined in Section 5.3 with landmark rank $k = 1$. Before presenting the ranking, we take the top-ranked document and swap it with the document at a rank $j \in \{1, \ldots, 21\}$ chosen uniformly at random. The ratio of the observed click-through rate (CTR) on the formerly top-ranked document now at position $j$ vs. its CTR at position 1 gives a noisy estimate of $p_j / p_1$ in the position-based click model. We additionally smooth these estimates by interpolating with the overall observed CTR at position $j$ (normalized so that $CTR@1 = 1$). This yields $p_r$ that approximately decay with rank $r$, with the smallest $p_r \simeq 0.12$. For $r > 21$, we impute $p_r = p_{21}$.
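The estimation procedure can be sketched as follows. The interpolation weight `alpha` and the dictionary-based inputs are assumptions made for illustration: the text does not state how the two estimates are weighted, and the counts in the example are made up.

```python
from typing import Dict


def estimate_relative_propensities(
    swap_clicks: Dict[int, int],
    swap_impressions: Dict[int, int],
    overall_ctr: Dict[int, float],
    alpha: float = 0.5,
) -> Dict[int, float]:
    """Estimate p_j / p_1 for every intervention rank j.

    swap_clicks[j] / swap_impressions[j] is the CTR of the formerly
    top-ranked document when it was shown at rank j (j = 1 means no swap).
    overall_ctr[j] is the overall observed CTR at position j.
    alpha is a hypothetical interpolation weight (not given in the text).
    """
    ctr_at_1 = swap_clicks[1] / swap_impressions[1]
    estimates = {}
    for j in sorted(swap_impressions):
        ratio = (swap_clicks[j] / swap_impressions[j]) / ctr_at_1
        smoothed = overall_ctr[j] / overall_ctr[1]  # normalized so CTR@1 = 1
        estimates[j] = alpha * ratio + (1.0 - alpha) * smoothed
    return estimates


def propensity(rank: int, estimates: Dict[int, float]) -> float:
    """Look up p_r, imputing the value of the deepest estimated rank beyond it."""
    return estimates[min(rank, max(estimates))]


# Made-up counts for three ranks, for illustration only.
est = estimate_relative_propensities(
    swap_clicks={1: 50, 2: 30, 3: 10},
    swap_impressions={1: 100, 2: 100, 3: 100},
    overall_ctr={1: 0.4, 2: 0.2, 3: 0.1},
)
```

Because only the ratios $p_j / p_1$ are estimated, the resulting propensities are relative to rank 1, for which the estimate is 1 by construction.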

We partition the click-logs into a train-validation split: the first 16 days are the train set and provide 5437 click-events for SVM-Rank, while the remaining 5 days are the validation set with 1755 click events. The hyper-parameter $C$ is picked via cross validation. Analogous to Section 7.1, we use the IPS estimator for Propensity SVM-Rank, and the naive estimator with $Q(o(y) = 1 \mid x, \bar{y}, r) = 1$ for Naive SVM-Rank. With the best hyper-parameter settings, we re-train on all 21 days worth of data to derive the final weight vectors for either method.

We fielded these learnt weight vectors in two online interleaving experiments [2], the first comparing Propensity SVM-Rank against Prod and the second comparing Propensity SVM-Rank against Naive SVM-Rank. The results are summarized in Table 1. We find that Propensity SVM-Rank significantly outperforms the hand-crafted production ranker that was used to collect the click data for training (two-tailed binomial sign test p = 0.001 with relative risk 0.71 compared to null hypothesis). Furthermore, Propensity SVM-Rank similarly outperforms Naive SVM-Rank, demonstrating that even a simple propensity model provides benefits on real-world data (two-tailed binomial sign test p = 0.006 with relative risk 0.77 compared to null hypothesis). Note that Propensity SVM-Rank not only significantly, but also substantially outperforms both other rankers in terms of effect size, and the synthetic data experiments suggest that additional training data will further increase its advantage.

Table 1: Per-query balanced interleaving results for detecting relative performance between the hand-crafted production ranker used for click data collection (Prod), Naive SVM-Rank and Propensity SVM-Rank.

| Interleaving Experiment | Propensity SVM-Rank wins | Propensity SVM-Rank loses | ties |
|---|---|---|---|
| against Prod | 87 | 48 | 83 |
| against Naive SVM-Rank | 95 | 60 | 102 |
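The reported significance levels can be cross-checked from the win and loss counts in Table 1. The sketch below computes an exact two-tailed binomial sign test under the assumption that ties are discarded, which is the usual convention but is not stated explicitly in the text. The function name is hypothetical.

```python
from math import comb


def sign_test_two_tailed(wins: int, losses: int) -> float:
    """Exact two-tailed binomial sign test against the null p = 0.5.

    Ties are assumed to have been discarded before calling this function.
    """
    n = wins + losses
    k = max(wins, losses)
    # Upper-tail probability of observing k or more successes out of n.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


p_vs_prod = sign_test_two_tailed(wins=87, losses=48)
p_vs_naive = sign_test_two_tailed(wins=95, losses=60)
print(f"against Prod:  p = {p_vs_prod:.4f}")
print(f"against Naive: p = {p_vs_naive:.4f}")
```

Both values should come out on the order of the p-values quoted in the text. Small differences can arise from how ties and the two-tailed correction are handled.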

8. CONCLUSIONS AND FUTURE WORK

This paper introduced a principled approach for learning-

to-rank under biased feedback data. Drawing on counterfac-

tual modeling techniques from causal inference, we present a

theoretically sound Empirical Risk Minimization framework

for LTR. We instantiate this framework with a Propensity-

Weighted Ranking SVM, and provide extensive empirical

evidence that the resulting learning method is robust to se-

lection biases, noise, and model misspeciﬁcation. Further-

more, our real-world experiments on a live search engine

show that the approach leads to substantial retrieval im-

provements, without any heuristic or manual interventions

in the learning process.

Beyond the speciﬁc learning methods and propensity mod-

els we propose, this paper may have even bigger impact for

its theoretical contribution of developing the general coun-

terfactual model for LTR, thus articulating the key compo-

nents necessary for LTR under biased feedback. First, the

insight that propensity estimates are crucial for ERM learn-

ing opens a wide area of research on designing better propen-

sity models. Second, the theory demonstrates that LTR

methods should optimize propensity-weighted ERM objec-

tives, raising the question of which other learning methods

beyond the Ranking SVM can be adapted to the Propensity

ERM approach. Third, we conjecture that Propensity ERM

approaches can be developed also for pointwise and listwise

LTR methods using techniques from [20].

Beyond learning from implicit feedback, propensity-

weighted ERM techniques may prove useful even for opti-

mizing oﬄine IR metrics on manually annotated test collec-

tions. First, they can eliminate pooling bias, since the use of

sampling during judgment elicitation puts us in a controlled

setting where propensities are known (and can be optimized

[20]) by design. Second, propensities estimated via click

models can enable click-based IR metrics like click-DCG to

better correlate with test set DCG.

This work was supported in part through NSF Awards

IIS-1247637, IIS-1513692, IIS-1615706, and a gift from

Bloomberg. We thank Maarten de Rijke, Alexey Borisov,

Artem Grotov, and Yuning Mao for valuable feedback and

discussions.

9. REFERENCES

[1] A. Borisov, I. Markov, M. de Rijke, and P. Serdyukov.

A neural click model for web search. In Proceedings of

the 25th International Conference on World Wide

Web, pages 531–541, 2016.

[2] O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue.

Large-scale validation and analysis of interleaved

search evaluation. ACM Transactions on Information

Systems (TOIS), 30(1):6:1–6:41, 2012.

[3] O. Chapelle and Y. Zhang. A dynamic bayesian

network click model for web search ranking. In

International Conference on World Wide Web

(WWW), pages 1–10. ACM, 2009.

[4] A. Chuklin, I. Markov, and M. de Rijke. Click Models

for Web Search. Synthesis Lectures on Information

Concepts, Retrieval, and Services. Morgan & Claypool

Publishers, 2015.

[5] N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An

experimental comparison of click position-bias models.

In International Conference on Web Search and Data

Mining (WSDM), pages 87–94. ACM, 2008.

[6] K. Hofmann, A. Schuth, S. Whiteson, and

M. de Rijke. Reusing historical interaction data for

faster online learning to rank for ir. In International

Conference on Web Search and Data Mining

(WSDM), pages 183–192, 2013.

[7] D. G. Horvitz and D. J. Thompson. A generalization

of sampling without replacement from a ﬁnite

universe. Journal of the American Statistical

Association, 47(260):663–685, 1952.

[8] G. Imbens and D. Rubin. Causal Inference for

Statistics, Social, and Biomedical Sciences. Cambridge

University Press, 2015.

[9] T. Joachims. Optimizing search engines using

clickthrough data. In ACM SIGKDD Conference on

Knowledge Discovery and Data Mining (KDD), pages

133–142, 2002.

[10] T. Joachims. Training linear SVMs in linear time. In

ACM SIGKDD International Conference On

Knowledge Discovery and Data Mining (KDD), pages

217–226, 2006.

[11] T. Joachims, L. Granka, B. Pan, H. Hembrooke,

F. Radlinski, and G. Gay. Evaluating the accuracy of

implicit feedback from clicks and query reformulations

in web search. ACM Transactions on Information

Systems (TOIS), 25(2), April 2007.

[12] J. Langford, A. Strehl, and J. Wortman. Exploration

scavenging. In Proceedings of the 25th International

Conference on Machine Learning, pages 528–535,

2008.

[13] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased

oﬄine evaluation of contextual-bandit-based news

article recommendation algorithms. In International

Conference on Web Search and Data Mining

(WSDM), pages 297–306, 2011.

[14] R. J. A. Little and D. B. Rubin. Statistical Analysis

with Missing Data. John Wiley, 2002.

[15] T.-Y. Liu. Learning to rank for information retrieval.

Foundations and Trends in Information Retrieval,

3(3):225–331, Mar. 2009.

[16] K. Raman and T. Joachims. Learning socially optimal

information systems from egoistic users. In European

Conference on Machine Learning (ECML), pages

128–144, 2013.

[17] K. Raman, T. Joachims, P. Shivaswamy, and

T. Schnabel. Stable coactive learning via perturbation.

In International Conference on Machine Learning

(ICML), pages 837–845, 2013.

[18] M. Richardson, E. Dominowska, and R. Ragno.

Predicting clicks: Estimating the click-through rate

for new ads. In International Conference on World

Wide Web (WWW), pages 521–530. ACM, 2007.

[19] P. R. Rosenbaum and D. B. Rubin. The central role of

the propensity score in observational studies for causal

eﬀects. Biometrika, 70(1):41–55, 1983.

[20] T. Schnabel, A. Swaminathan, P. Frazier, and

T. Joachims. Unbiased comparative evaluation of

ranking functions. In ACM International Conference

on the Theory of Information Retrieval (ICTIR), 2016.

[21] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak,

and T. Joachims. Recommendations as treatments:

Debiasing learning and evaluation. In International

Conference on Machine Learning (ICML), 2016.

[22] A. Schuth, H. Oosterhuis, S. Whiteson, and

M. de Rijke. Multileave gradient descent for fast

online learning to rank. In International Conference

on Web Search and Data Mining (WSDM), pages

457–466, 2016.

[23] K. Sparck-Jones and C. J. V. Rijsbergen. Report on

the need for and provision of an “ideal” information

retrieval test collection. Technical report, University of

Cambridge, 1975.

[24] A. L. Strehl, J. Langford, L. Li, and S. Kakade.

Learning from logged implicit exploration data. In

Conference on Neural Information Processing Systems

(NIPS), pages 2217–2225, 2010.

[25] A. Swaminathan and T. Joachims. Batch learning

from logged bandit feedback through counterfactual

risk minimization. Journal of Machine Learning

Research (JMLR), 16:1731–1755, Sep 2015.

[26] V. Vapnik. Statistical Learning Theory. Wiley,

Chichester, GB, 1998.

[27] L. Wang, J. J. Lin, and D. Metzler. A cascade ranking

model for eﬃcient ranked retrieval. In ACM

Conference on Research and Development in

Information Retrieval (SIGIR), pages 105–114, 2011.

[28] X. Wang, M. Bendersky, D. Metzler, and M. Najork.

Learning to rank with selection bias in personal search.

In ACM Conference on Research and Development in

Information Retrieval (SIGIR). ACM, 2016.

[29] Y. Wang, D. Yin, L. Jie, P. Wang, M. Yamada,

Y. Chang, and Q. Mei. Beyond ranking: Optimizing

whole-page presentation. In Proceedings of the Ninth

ACM International Conference on Web Search and

Data Mining, WSDM ’16, pages 103–112, 2016.

[30] Y. Yue and T. Joachims. Interactively optimizing

information retrieval systems as a dueling bandits

problem. In International Conference on Machine

Learning (ICML), pages 151–159, 2009.

[31] Y. Yue, R. Patel, and H. Roehrig. Beyond position

bias: examining result attractiveness as a source of

presentation bias in clickthrough data. In

International Conference on World Wide Web

(WWW), pages 1011–1018. ACM, 2010.