Unbiased Learning-to-Rank with Biased Feedback
Thorsten Joachims
Cornell University, Ithaca, NY
tj@cs.cornell.edu
Adith Swaminathan
Cornell University, Ithaca, NY
adith@cs.cornell.edu
Tobias Schnabel
Cornell University, Ithaca, NY
tbs49@cornell.edu
ABSTRACT
Implicit feedback (e.g., clicks, dwell times, etc.) is an abun-
dant source of data in human-interactive systems. While
implicit feedback has many advantages (e.g., it is inexpen-
sive to collect, user centric, and timely), its inherent biases
are a key obstacle to its effective use. For example, posi-
tion bias in search rankings strongly influences how many
clicks a result receives, so that directly using click data as a
training signal in Learning-to-Rank (LTR) methods yields
sub-optimal results. To overcome this bias problem, we
present a counterfactual inference framework that provides
the theoretical basis for unbiased LTR via Empirical Risk
Minimization despite biased data. Using this framework,
we derive a Propensity-Weighted Ranking SVM for discrim-
inative learning from implicit feedback, where click models
take the role of the propensity estimator. In contrast to
most conventional approaches to de-biasing the data using
click models, this allows training of ranking functions even
in settings where queries do not repeat. Beyond the theoret-
ical support, we show empirically that the proposed learning
method is highly effective in dealing with biases, that it is
robust to noise and propensity model misspecification, and
that it scales efficiently. We also demonstrate the real-world
applicability of our approach on an operational search en-
gine, where it substantially improves retrieval performance.
1. INTRODUCTION
Batch training of retrieval systems requires annotated test
collections that take substantial effort and cost to amass.
While economically feasible for Web Search, eliciting rele-
vance annotations from experts is infeasible or impossible
for most other ranking applications (e.g., personal collec-
tion search, intranet search). For these applications, im-
plicit feedback from user behavior is an attractive source of
data. Unfortunately, existing approaches for Learning-to-
Rank (LTR) from implicit feedback – and clicks on search
results in particular – have several limitations or drawbacks.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from permissions@acm.org.
WSDM 2017, February 06 - 10, 2017, Cambridge, United Kingdom
© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ISBN 978-1-4503-4675-7/17/02 ... $15.00
DOI: http://dx.doi.org/10.1145/3018661.3018699
First, the naïve approach of treating a click/no-click as a
positive/negative relevance judgment is severely biased. In
particular, the order of presentation has a strong influence
on where users click [11]. This presentation bias leads to an
incomplete and skewed sample of relevance judgments that
is far from uniform, thus leading to biased learning-to-rank.
Second, treating clicks as preferences between clicked and
skipped documents has been found to be accurate [9, 11],
but it can only infer preferences that oppose the presented
order. This again leads to severely biased data, and learning
algorithms trained with these preferences tend to reverse the
presented order unless additional heuristics are used [9].
Third, probabilistic click models (see [4]) have been used
to model how users produce clicks, and they can take posi-
tion and context biases into account. By estimating latent
parameters of these generative click models, one can infer the
relevance of a given document for a given query. However,
inferring reliable relevance judgments typically requires that
the same query is seen multiple times, which is unrealistic
in many retrieval settings (e.g., personal collection search)
and for tail queries.
Fourth, allowing the LTR algorithm to randomize what is
presented to the user, like in online learning algorithms [17,
6] and batch learning from bandit feedback (BLBF) [25] can
overcome the problem of bias in click data in a principled
manner. However, requiring that rankings be actively per-
turbed during system operation whenever we collect training
data decreases ranking quality and, therefore, incurs a cost
compared to observational data collection.
In this paper we present a theoretically principled and em-
pirically effective approach for learning from observational
implicit feedback that can overcome the limitations outlined
above. By drawing on counterfactual estimation techniques
from causal inference [8], we first develop a provably un-
biased estimator for evaluating ranking performance using
biased feedback data. Based on this estimator, we propose
a Propensity-Weighted Empirical Risk Minimization (ERM)
approach to LTR, which we implement efficiently in a new
learning method we call Propensity SVM-Rank. While our
approach uses a click model, the click model is merely used
to assign propensities to clicked results in hindsight, not to
extract aggregate relevance judgments. This means that our
Propensity SVM-Rank does not require queries to repeat,
making it applicable to a large range of ranking scenarios.
Finally, our methods can use observational data and we do
not require that the system randomizes rankings during data
collection, except for a small pilot experiment to estimate
the propensity model.
When deriving our approach, we provide theoretical jus-
tification for each step, leading to a rigorous end-to-end ap-
proach that does not make unspecified assumptions or
employ heuristics. This provides a principled basis for fur-
ther improving components of the approach (e.g., the click
propensity model, the ranking performance measure, the
learning algorithm). We present an extensive empirical eval-
uation testing the limits of the approach on synthetic click
data, finding that it performs robustly over a large range
of bias, noise, and misspecification levels. Furthermore, we
field our method in a real-world application on an opera-
tional search engine, finding that it is robust in practice and
manages to substantially improve retrieval performance.
2. RELATED WORK
There are two groups of approaches for handling biases
in implicit feedback for learning-to-rank. The first group
assumes the feedback collection step is fixed, and tries to
interpret the observationally collected data so as to mini-
mize bias effects. Approaches in the second group intervene
during feedback collection, trying to present rankings that
will lead to less biased feedback data overall.
Approaches in the first group commonly assume some
model of user behavior in order to explain bias effects. For
example, in a cascade model [5], users are assumed to se-
quentially go down a ranking and click on a document if
it is relevant. Clicks, under this model, let us learn pref-
erences between skipped and clicked documents. Learning
from these relative preferences lowers the impact of some
biases [9]. Other click models ([5, 3, 1], also see [4]) have
been proposed, and are trained to maximize log-likelihood of
observed clicks. In these click modeling approaches, perfor-
mance on downstream learning-to-rank algorithms is merely
an afterthought. In contrast, we separate click propensity
estimation and learning-to-rank in a principled way, and we
optimize for ranking performance directly. Our framework
allows us to plug-and-play more sophisticated user models
in place of the simple click models we use in this work.
The key technique used by approaches in the second group
to obtain more reliable click data are randomized exper-
iments. For instance, randomizing documents across all
ranks lets us learn unbiased relevances for each document,
and swapping neighboring pairs of documents [16] lets us
learn reliable pairwise preferences. Similarly, randomized
interleaving can detect preferences between different rankers
reliably [2]. Different from online learning via bandit algo-
rithms and interleaving [30, 22], batch learning from ban-
dit feedback (BLBF) [25] still uses randomization during
feedback collection, and then performs offline learning. Our
problem formulation can be interpreted as being half way
between the BLBF setting (loss function is unknown and no
assumptions on loss function) and learning-to-rank from ed-
itorial judgments (components of ranking are fully labeled
and loss function is given) since we know the form of the loss
function but labels for only some parts of the ranking are
revealed. All approaches that use randomization suffer from
two limitations. First, randomization typically degrades
ranking quality during data collection; second, deploying
non-deterministic ranking functions introduces bookkeeping
overhead. In this paper, the system can be deterministic and
we merely exploit and model stochasticity in user behavior.
Moreover, our framework also allows (but does not require)
the use of randomized data collection in order to mitigate
the effect of biases and to lower estimator variance.
Our approach uses inverse propensity scoring (IPS), orig-
inally employed in survey sampling [7] and causal inference
from observational studies [19], but more recently also in
whole page optimization [29], IR evaluation with manual
judgments [20], and recommender evaluation [13, 21]. We
use randomized interventions similar to [5, 12, 28] to esti-
mate propensities in a position discount model. Unlike the
uniform ranking randomization of [28] (with its high perfor-
mance impact) or swapping adjacent pairs as in [5], we swap
documents in different ranks to the top position randomly
as in [12]. See Section 5.3 for details.
Finally, our approach is similar in spirit to [28], where
propensity-weighting is used to correct for selection bias
when discarding queries without clicks during learning-to-
rank. The key insight of our work is to recognize that inverse
propensity scoring can be employed much more powerfully,
to account for position bias, trust bias, contextual effects,
document popularity etc. using appropriate click models to
estimate the propensity of each click rather than the propen-
sity for a query to receive a click as in [28].
3. FULL-INFO LEARNING TO RANK
Before we derive our approach for LTR from biased im-
plicit feedback, we first review the conventional problem of
LTR from editorial judgments. In conventional LTR, we
are given a sample X of i.i.d. queries x_i ~ P(x) for which
we assume the relevances rel(x, y) of all documents y are
known. Since all relevances are assumed to be known, we
call this the Full-Information Setting. The relevances can be
used to compute the loss ∆(y|x) (e.g., negative DCG) of any
ranking y for query x. Aggregating the losses of individual
rankings by taking the expectation over the query distribution,
we can define the overall risk of a ranking system S
that returns rankings S(x) as

    R(S) = \int ∆(S(x) | x) \, dP(x).    (1)
The goal of learning is to find a ranking function S ∈ S that
minimizes R(S) for the query distribution P(x). Since R(S)
cannot be computed directly, it is typically estimated via
the empirical risk

    \hat{R}(S) = \frac{1}{|X|} \sum_{x_i ∈ X} ∆(S(x_i) | x_i).

A common learning strategy is Empirical Risk Minimization
(ERM) [26], which corresponds to picking the system \hat{S} ∈ S
that optimizes the empirical risk

    \hat{S} = argmin_{S ∈ S} \{ \hat{R}(S) \},
possibly subject to some regularization in order to control
overfitting. There are several LTR algorithms that follow
this approach (see [15]), and we use SVM-Rank [9] as a
representative algorithm in this paper.
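As an illustration only (not part of the original paper), a minimal Python sketch of full-information ERM with negative DCG as the loss ∆ could look as follows. The data structures and identifiers are hypothetical placeholders, not the SVM-Rank implementation used in the experiments.

    import numpy as np

    def neg_dcg(ranking, relevance):
        """Negative DCG of a ranking (ordered list of doc ids) given known relevances."""
        gains = np.array([relevance[doc] for doc in ranking], dtype=float)
        discounts = 1.0 / np.log2(np.arange(2, len(ranking) + 2))
        return -np.sum(gains * discounts)

    def empirical_risk(system, queries, relevances):
        """Average loss of `system` over a query sample (full-information setting)."""
        return float(np.mean([neg_dcg(system(x), relevances[x]) for x in queries]))

    def erm(candidate_systems, queries, relevances):
        """Pick the candidate ranking function with the smallest empirical risk."""
        return min(candidate_systems, key=lambda S: empirical_risk(S, queries, relevances))

In practice the minimization is over a parameterized family of scoring functions rather than a finite candidate list, but the selection principle is the same.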
The relevances rel(x, y) are typically elicited via expert
judgments. Apart from being expensive and often infeasi-
ble (e.g., in personal collection search), expert judgments
come with at least two other limitations. First, since it
is clearly impossible to get explicit judgments for all docu-
ments, pooling techniques [23] are used such that only the
most promising documents are judged. While cutting down
on judging effort, this introduces an undesired pooling bias
because all unjudged documents are typically assumed to be
irrelevant. The second limitation is that expert judgments
rel(x, y) have to be aggregated over all intents that underlie
the same query string, and it can be challenging for a judge
to properly conjecture the distribution of query intents to
assign an appropriate rel(x, y).
4. PARTIAL-INFO LEARNING TO RANK
Learning from implicit feedback has the potential to over-
come the above-mentioned limitations of full-information
LTR. By drawing the training signal directly from the user,
it naturally reflects the user’s intent, since each user acts
upon their own relevance judgement subject to their specific
context and information need. It is therefore more appropri-
ate to talk about query instances x_i that include contextual
information about the user, instead of query strings x. For
a given query instance x_i, we denote with r_i(y) the user-
specific relevance of result y for query instance x_i. One may
argue that what expert assessors try to capture with rel(x, y)
is the mean of the relevances r_i(y) over all query instances
that share the query string; using implicit feedback for
learning thus removes much of the guesswork about what the
distribution of users meant by a query.
However, when using implicit feedback as a relevance sig-
nal, unobserved feedback is an even greater problem than
missing judgments in the pooling setting. In particular, im-
plicit feedback is distorted by presentation bias, and it is
not missing completely at random [14]. To nevertheless de-
rive well-founded learning algorithms, we adopt the follow-
ing counterfactual model. It closely follows [20], which uni-
fies several prior works on evaluating information retrieval
systems.
For concreteness and simplicity, assume that relevances
are binary, r_i(y) ∈ {0, 1}, and our performance measure of
interest is the sum of the ranks of the relevant results

    ∆(\mathbf{y} | x_i, r_i) = \sum_{y ∈ \mathbf{y}} rank(y | \mathbf{y}) · r_i(y).    (2)
Analogous to (1), we can define the risk of a system as
    R(S) = \int ∆(S(x) | x, r) \, dP(x, r).    (3)
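For concreteness, the loss in (2) can be computed with a few lines of Python; the sketch below is illustrative only, assuming a ranking is an ordered list of document ids and relevances are a 0/1 dictionary (hypothetical representations).

    def delta_full_info(ranking, relevance):
        """Sum of the (1-based) ranks of the relevant results, as in Eq. (2)."""
        return sum(
            pos + 1                       # rank(y | y), 1-based
            for pos, doc in enumerate(ranking)
            if relevance.get(doc, 0) == 1
        )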
In our counterfactual model, there exists a true vector of
relevances r_i for each incoming query instance (x_i, r_i) ~ P(x, r).
However, only a part of these relevances is observed
for each query instance, while typically most remain unobserved.
In particular, given a presented ranking \bar{y}_i we are
more likely to observe the relevance signals (e.g., clicks) for
the top-ranked results than for results ranked lower in the
list. Let o_i denote the 0/1 vector indicating which relevance
values were revealed, o_i ~ P(o | x_i, \bar{y}_i, r_i). For each element
of o_i, denote with Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i) the marginal
probability of observing the relevance r_i(y) of result y for query
x_i, if the user was presented the ranking \bar{y}_i. We refer to this
probability value as the propensity of the observation. We
will discuss how o_i and Q can be obtained in Section 5.
Using this counterfactual modeling setup, we can get an
unbiased estimate of ∆(\mathbf{y} | x_i, r_i) for any new ranking \mathbf{y} (typically
different from the presented ranking \bar{y}_i) via the inverse
propensity scoring (IPS) estimator [7, 19, 8]

    \hat{∆}_{IPS}(\mathbf{y} | x_i, \bar{y}_i, o_i) = \sum_{y : o_i(y) = 1} \frac{rank(y | \mathbf{y}) · r_i(y)}{Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i)}
                                        = \sum_{y : o_i(y) = 1 ∧ r_i(y) = 1} \frac{rank(y | \mathbf{y})}{Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i)}.

This is an unbiased estimate of ∆(\mathbf{y} | x_i, r_i) for any \mathbf{y}, if
Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i) > 0 for all y that are relevant (r_i(y) = 1),
but not necessarily for the irrelevant y:
    E_{o_i}[\hat{∆}_{IPS}(\mathbf{y} | x_i, \bar{y}_i, o_i)]
        = E_{o_i}\left[ \sum_{y : o_i(y) = 1} \frac{rank(y | \mathbf{y}) · r_i(y)}{Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i)} \right]
        = \sum_{y ∈ \mathbf{y}} E_{o_i}\left[ \frac{o_i(y) · rank(y | \mathbf{y}) · r_i(y)}{Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i)} \right]
        = \sum_{y ∈ \mathbf{y}} \frac{Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i) · rank(y | \mathbf{y}) · r_i(y)}{Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i)}
        = \sum_{y ∈ \mathbf{y}} rank(y | \mathbf{y}) · r_i(y)
        = ∆(\mathbf{y} | x_i, r_i).

The second step uses linearity of expectation, and the fourth
step uses Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i) > 0.
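As a small illustration (not from the paper), the per-query IPS estimate can be sketched in Python as below; the argument names and data layout are hypothetical, with revealed relevances and propensities given only for the documents whose relevance was observed.

    def delta_ips(new_ranking, revealed_relevance, propensity):
        """IPS estimate of Delta(new_ranking | x_i, r_i) from partially revealed relevances.

        new_ranking        -- ordered list of document ids (the ranking to evaluate)
        revealed_relevance -- dict doc -> 0/1, only for documents with o_i(doc) = 1
        propensity         -- dict doc -> Q(o_i(doc) = 1 | x_i, presented ranking, r_i)
        """
        return sum(
            (new_ranking.index(doc) + 1) / propensity[doc]   # rank(doc | y) / propensity
            for doc, rel in revealed_relevance.items()
            if rel == 1                                       # only revealed, relevant docs contribute
        )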
An interesting property of \hat{∆}_{IPS}(\mathbf{y} | x_i, \bar{y}_i, o_i) is that only
those results y with o_i(y) = 1 ∧ r_i(y) = 1 (i.e., clicked results,
as we will see later) contribute to the estimate. We
therefore only need the propensities Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i)
for relevant results. Since we will eventually need to estimate
the propensities Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i), an additional
requirement for making \hat{∆}_{IPS}(\mathbf{y} | x_i, \bar{y}_i, o_i) computable while
remaining unbiased is that the propensities only depend on
observable information (i.e., unconfoundedness, see [8]).
To define the empirical risk to optimize during learning,
we begin by collecting a sample of N query instances x_i,
recording the partially revealed relevances r_i as indicated
by o_i, and the propensities Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i) for the
observed relevant results in the ranking \bar{y}_i presented by the
system. Then, the empirical risk of a system is simply the
IPS estimates averaged over query instances:

    \hat{R}_{IPS}(S) = \frac{1}{N} \sum_{i=1}^{N} \; \sum_{y : o_i(y) = 1 ∧ r_i(y) = 1} \frac{rank(y | S(x_i))}{Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i)}.    (4)

Since \hat{∆}_{IPS}(\mathbf{y} | x_i, \bar{y}_i, o_i) is unbiased for each query instance,
the aggregate \hat{R}_{IPS}(S) is also unbiased for R(S) from (3),

    E[\hat{R}_{IPS}(S)] = R(S).
Furthermore, it is easy to verify that \hat{R}_{IPS}(S) converges to
the true R(S) under mild additional conditions (i.e., propensities
bounded away from 0) as we increase the sample size
N of query instances. So, we can perform ERM using this
propensity-weighted empirical risk,

    \hat{S} = argmin_{S ∈ S} \{ \hat{R}_{IPS}(S) \}.
Finally, using standard results from statistical learning the-
ory [26], consistency of the empirical risk paired with ca-
pacity control implies consistency also for ERM. In intuitive
terms, this means that given enough training data, the learn-
ing algorithm is guaranteed to find the best system in S.
5. FEEDBACK PROPENSITY MODELS
In Section 4, we showed that the relevance signal r_i, the
observation pattern o_i, and the propensities of the observations
Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i) are the key components for
unbiased LTR from biased observational feedback. We now
outline how these quantities can be elicited and modeled in
a typical search-engine application. However, the general
framework of Section 4 extends beyond this particular ap-
plication, and beyond the particular feedback model below.
5.1 Position-Based Propensity Model
Search engine click logs provide a sample of query in-
stances x_i, the presented ranking \bar{y}_i, and a (sparse) click
vector where each c_i(y) ∈ {0, 1} indicates whether result y
was clicked or not. To derive propensities of observed clicks,
we will employ a click propensity model. For simplicity, we
consider a straightforward examination model analogous to
[18], where a click on a search result depends on the prob-
ability that a user examines a result (i.e., e_i(y)) and then
decides to click on it (i.e., c_i(y)) in the following way:

    P(e_i(y) = 1 | rank(y | \bar{y})) · P(c_i(y) = 1 | r_i(y), e_i(y) = 1).
In this model, examination depends only on the rank of y
in \bar{y}. So, P(e_i(y) = 1 | rank(y | \bar{y}_i)) can be represented by
a vector of examination probabilities p_r, one for each rank
r. These examination probabilities can model presentation
bias documented in eye-tracking studies [11], where users
are more likely to see results at the top of the ranking than
those further down.
For the probability of a click on an examined result, P(c_i(y) =
1 | r_i(y), e_i(y) = 1), we first consider the simplest model,
where clicking is a deterministic, noise-free function of the
user's private relevance assessment r_i(y). Under this model,
users click if and only if the result is examined and relevant
(c_i(y) = 1 ⇔ [e_i(y) = 1 ∧ r_i(y) = 1]). This means that
for examined results (i.e., e_i(y) = 1) clicking is synonymous
with relevance (e_i(y) = 1 ⇒ [c_i(y) = r_i(y)]). Furthermore,
it means that we observe the value of r_i(y) perfectly when
e_i(y) = 1 (e_i(y) = 1 ⇒ o_i(y) = 1), and that we gain no
knowledge of the true r_i(y) when a result is not examined
(e_i(y) = 0 ⇒ o_i(y) = 0). Therefore, examination equals
observation and Q(o_i(y) | x_i, \bar{y}_i, r_i) ≡ P(e_i(y) | rank(y | \bar{y}_i)).
Using these equivalences, we can simplify the IPS estimator
from (4) by substituting p_r as the propensities and by
using c_i(y) = 1 ⇔ [o_i(y) = 1 ∧ r_i(y) = 1]:
    \hat{R}_{IPS}(S) = \frac{1}{n} \sum_{i=1}^{n} \; \sum_{y : c_i(y) = 1} \frac{rank(y | S(x_i))}{p_{rank(y | \bar{y}_i)}}.    (5)
\hat{R}_{IPS}(S) is an unbiased estimate of R(S) under the position-based
propensity model if p_r > 0 for all ranks. While absence
of a click does not imply that the result is not relevant
(i.e., c_i(y) = 0 does not imply r_i(y) = 0), the IPS estimator has the
nice property that such explicit negative judgments are not
needed to compute an unbiased estimate of R(S) for the loss
in (2). Similarly, while absence of a click leaves us unsure
about whether the result was examined (i.e., e_i(y) = ?), the
IPS estimator only needs to know the indicators o_i(y) = 1
for results that are also relevant (i.e., clicked results).
Finally, note the conceptual difference in how we use this
standard examination model compared to most prior work.
We do not try to estimate an average relevance rating rel(x, y)
by taking repeat instances of the same query x, but we use
the model as a propensity estimator to de-bias individual
observed user judgments r_i(y) to be used directly in ERM.
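Under the position-based model, the estimate (5) is particularly simple: every clicked document contributes its rank under the new ranking, weighted by the examination probability of the rank at which it was presented. The following is an illustrative Python sketch under that assumption; p_r is a hypothetical array of examination probabilities indexed by (1-based) rank.

    def delta_ips_position_based(new_ranking, presented_ranking, clicked_docs, p_r):
        """Per-query term of Eq. (5) under the position-based propensity model."""
        total = 0.0
        for doc in clicked_docs:
            presented_rank = presented_ranking.index(doc) + 1   # rank(doc | presented ranking)
            new_rank = new_ranking.index(doc) + 1               # rank(doc | S(x_i))
            total += new_rank / p_r[presented_rank - 1]
        return total

The empirical risk (5) is then the average of these per-query values over the n logged query instances.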
5.2 Incorporating Click Noise
In Section 5.1, we assumed that clicks reveal the user's
true r_i in a noise-free way. This is clearly unrealistic. In
addition to the stochasticity in the examination distribution
P(e_i(y) = 1 | rank(y | \bar{y})), we now also consider noise in the
distribution that generates the clicks. In particular, we no
longer require that a relevant result is clicked with probability
1 and an irrelevant result is clicked with probability 0,
but instead, for 1 ≥ ε_+ > ε_− ≥ 0,

    P(c_i(y) = 1 | r_i(y) = 1, o_i(y) = 1) = ε_+,
    P(c_i(y) = 1 | r_i(y) = 0, o_i(y) = 1) = ε_−.

The first line means that users click on a relevant result only
with probability ε_+, while the second line means that users
may erroneously click on an irrelevant result with probability
ε_−. An alternative and equivalent way of thinking about
click noise is that users still click deterministically as in the
previous section, but based on a noisily corrupted version
\tilde{r}_i of r_i. This means that all reasoning regarding observation
(examination) events o_i and their propensities p_r still
holds, and that we still have that c_i(y) = 1 ⇒ o_i(y) = 1.
What does change, though, is that we no longer observe
the “correct” r_i(y) but instead get feedback according to the
noise-corrupted version \tilde{r}_i(y). What happens to our learning
process if we estimate risk using (5), but now with \tilde{r}_i?
Fortunately, the noise does not affect ERM’s ability to
find the best ranking system given enough data. While using
noisy clicks leads to biased empirical risk estimates w.r.t. the
true r_i (i.e., E[\hat{R}_{IPS}(S)] ≠ R(S)), in expectation this bias
is order preserving for R(S) such that the risk minimizer
remains the same:
    E[\hat{R}_{IPS}(S_1)] > E[\hat{R}_{IPS}(S_2)]
    ⇔ E_{x,r,\bar{y}} E_o E_{c|o} \left[ \sum_{y : c(y) = 1} \frac{rank(y | S_1(x)) − rank(y | S_2(x))}{p_{rank(y | \bar{y})}} \right] > 0
    ⇔ E_{x,r} \left[ \sum_{y} P(c(y) = 1 | o(y) = 1, r(y)) · δrank(y | x) \right] > 0
    ⇔ E_{x,r} \left[ \sum_{y} δrank(y | x) · (ε_+ r(y) + ε_− (1 − r(y))) \right] > 0
    ⇔ E_{x,r} \left[ \sum_{y} δrank(y | x) · ((ε_+ − ε_−) r(y) + ε_−) \right] > 0
 (*) ⇔ E_{x,r} \left[ \sum_{y} δrank(y | x) · (ε_+ − ε_−) r(y) \right] > 0
    ⇔ E_{x,r} \left[ \sum_{y} δrank(y | x) · r(y) \right] > 0
    ⇔ R(S_1) > R(S_2),
where δrank(y | x) is short for rank(y | S_1(x)) − rank(y | S_2(x)),
and we use the fact that \sum_{y ∈ \bar{y}} δrank(y | x) = 0 in the step
marked (*). This implies that our propensity-weighted ERM
is a consistent approach for finding a ranking function with
the best true R(S),

    \hat{S} = argmin_{S ∈ S} \{ R(S) \} = argmin_{S ∈ S} \{ E[\hat{R}_{IPS}(S)] \},    (6)
even when the objective is corrupted by click noise as spec-
ified above.
5.3 Propensity Estimation
As the last step of defining the click propensity model,
we need to address the question of how to estimate its pa-
rameters (i.e., the vector of examination probabilities p_r)
for a particular search engine [12]. The following shows that
we can get estimates using data from a simple intervention
similar to [28], but without the strong negative impact of
presenting uniformly random results to some users. This
also relates to the Click@1 metric proposed by [3].
First, note that it suffices to estimate the p_r up to some
positive multiplicative constant, since any such constant does
not change how the IPS estimator (5) orders different systems.
We therefore merely need to estimate how much p_r
changes relative to p_k for some “landmark” rank k. This suggests
the following experimental intervention for estimating
p_r: before presenting the ranking to the user, swap the result
at rank k with the result at rank r. If we denote with y_0
the result originally at rank k, our click model before and
after the intervention indicates that

    P(c_i(y_0) = 1 | no-swap) = p_k · P(c_i(y_0) = 1 | e_i(y_0) = 1),
    P(c_i(y_0) = 1 | swap-k-and-r) = p_r · P(c_i(y_0) = 1 | e_i(y_0) = 1),
where
    P(c_i(y_0) = 1 | e_i(y_0) = 1) = \sum_{v ∈ \{0,1\}} P(c_i(y_0) = 1 | r_i(y_0) = v, e_i(y_0) = 1) · P(r_i(y_0) = v)

is constant regardless of the intervention. This means that
the clickthrough rates P(c_i(y_0) = 1 | swap-k-and-r), which we
can estimate from the intervention data, are proportional to
the parameters p_r for any r. By performing the swapping
intervention between rank k and all other ranks r, we can
estimate all the p_r parameters.
This swap-intervention experiment is of much lower im-
pact than the uniform randomization proposed in [28] for a
different propensity estimation problem, and careful consid-
eration of which rank kto choose can further reduce impact
of the swap experiment. From a practical perspective, it may
also be unnecessary to separately estimate p_r for each rank.
Instead, one may want to interpolate between estimates at
well-chosen ranks and/or employ smoothing. Finally, note
that the intervention only needs to be applied on a small sub-
set of the data used for fitting the click propensity model,
while the actual data used for training the ERM learning
algorithm does not require any interventions.
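For illustration, a minimal sketch (not the production code) of how p_r could be estimated from the logged swap intervention is given below; the log format, with one record per impression of the swapped landmark result, is a hypothetical simplification.

    from collections import defaultdict

    def estimate_propensities(swap_logs, k=1, num_ranks=10):
        """Estimate p_r up to a constant from a swap intervention with landmark rank k.

        swap_logs -- iterable of (swap_rank, clicked) pairs, where swap_rank is the rank
                     the landmark result was swapped to and clicked is 1 if it was clicked.
        Returns a list of p_r / p_k for r = 1, ..., num_ranks.
        """
        clicks = defaultdict(int)
        impressions = defaultdict(int)
        for swap_rank, clicked in swap_logs:
            impressions[swap_rank] += 1
            clicks[swap_rank] += clicked
        # CTR of the swapped-in result at each rank is proportional to p_r.
        ctr = [clicks[r] / impressions[r] if impressions[r] else 0.0
               for r in range(1, num_ranks + 1)]
        ctr_k = ctr[k - 1] if ctr[k - 1] > 0 else 1e-12
        return [c / ctr_k for c in ctr]   # normalized so that p_k = 1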
5.4 Alternative Feedback Propensity Models
The click propensity model we define above is arguably
one of the simplest models one can employ for propensity
modeling in LTR, and there is broad scope for extensions.
First, one could extend the model by incorporating other
biases, for example, trust bias [11] which affects perceived
relevance of a result based on its position in the ranking.
This can be captured by conditioning click probabilities also
on the position: P(c_i(y) = 1 | r_i(y), e_i(y) = 1, rank(y | \bar{y}_i)).
We have already explored extending the model to include
trust bias, but omit it here due to space constraints.
Furthermore, it is possible to model saliency biases [31] by
replacing the p_r with a regression function.
Second, we conjecture that a wide range of other click
models (e.g., cascade model [5] and others [5, 3, 1, 4]) can
be adapted as propensity models. The main requirement
is that we can compute marginal click probabilities for the
clicked documents in hindsight, which may be feasible for
other existing models.
Third, we may be able to define and train new types of
click models. In particular, for our propensity ERM ap-
proach we only need the propensities Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i)
for observed and relevant documents to evaluate the IPS
estimator, but not for irrelevant documents. This can be
substantially easier than a full generative model of how peo-
ple reveal relevance judgments through implicit feedback.
In particular, this model can condition on all the revealed
relevances r_i(y_j) in hindsight, and it does not need to treat
them as latent variables.
Finally, the ERM learning approach is not limited to bi-
nary click feedback, but applies to a large range of feedback
settings. For example, the feedback may be explicit star
ratings in a movie recommendation system, and the propen-
sities may be the results of self-selection by the users as in
[21]. In such an explicit feedback setting, o_i is fully known,
which simplifies propensity estimation substantially.
6. PROPENSITY-WEIGHTED SVM-RANK
We now derive a concrete learning method that imple-
ments propensity-weighted LTR. It is based on SVM-Rank
[9, 10], but we conjecture that propensity-weighted versions
of other LTR methods can be derived as well.
Consider a dataset of n examples of the following form.
For each query-result pair (x_j, y_j) that is clicked, we compute
the propensity q_j = Q(o_j(y_j) = 1 | x_j, \bar{y}_j, r_j) of the click
according to our click propensity model. We also record the
candidate set Y_j of all results for query x_j. Typically, Y_j
contains a few hundred documents – selected by a stage-one
ranker [27] – that we aim to rerank. Note that each click
generates a separate training example, even if multiple clicks
occur for the same query.
Given this propensity-scored click data, we define Propen-
sity SVM-Rank as a generalization of conventional SVM-
Rank. Propensity SVM-Rank learns a linear scoring func-
tion f(x, y) = w · φ(x, y) that can be used for ranking results,
where w is a weight vector and φ(x, y) is a feature vector
that describes the match between query x and result y.
Propensity SVM-Rank optimizes the following objective,
    \hat{w} = argmin_{w, ξ}  \frac{1}{2} w · w + \frac{C}{n} \sum_{j=1}^{n} \frac{1}{q_j} \sum_{y ∈ Y_j} ξ_{jy}

    s.t.  ∀y ∈ Y_1 \ {y_1}:  w · [φ(x_1, y_1) − φ(x_1, y)] ≥ 1 − ξ_{1y}
          ...
          ∀y ∈ Y_n \ {y_n}:  w · [φ(x_n, y_n) − φ(x_n, y)] ≥ 1 − ξ_{ny}
          ∀j ∀y:  ξ_{jy} ≥ 0.
C is a regularization parameter that is typically selected
via cross-validation. The training objective optimizes an
upper bound on the regularized IPS-estimated empirical risk
of (5), since each line of constraints corresponds to the rank
of a relevant document (minus 1). In particular, for any
feasible (w, ξ),

    rank(y_i | \mathbf{y}) − 1 = \sum_{y ≠ y_i} 1_{\{w · [φ(x_i, y) − φ(x_i, y_i)] > 0\}}
                      ≤ \sum_{y ≠ y_i} \max(1 − w · [φ(x_i, y_i) − φ(x_i, y)], 0)
                      ≤ \sum_{y ≠ y_i} ξ_{iy}.
We can solve this type of Quadratic Program effi-
ciently via a one-slack formulation [10], and we use
SVM-Rank with appropriate modifications to include the IPS
weights 1/q_j. The modifications are integrated into the
latest version of SVM-Rank, and the code is available at
http://www.joachims.org/svm_light/svm_proprank.html.
In the empirical evaluation, we compare against the naive
application of SVM-Rank, which minimizes the rank of the
clicked documents while ignoring presentation bias. In par-
ticular, Naive SVM-Rank sets all the q_j uniformly to the
same constant (e.g., 1).
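As an illustration of the objective being optimized (and not of the actual SVM-rank solver), a Python sketch of the propensity-weighted pairwise hinge-loss upper bound and its subgradient could look as follows; the data layout is a hypothetical simplification with one record per click.

    import numpy as np

    def propensity_hinge_objective(w, examples, C):
        """Objective value and subgradient of the propensity-weighted pairwise hinge loss.

        examples -- list of (phi_pos, Phi_others, q):
            phi_pos    : feature vector of the clicked result, shape (d,)
            Phi_others : feature matrix of the remaining candidates, shape (m, d)
            q          : propensity of the click
        """
        n = len(examples)
        obj = 0.5 * float(np.dot(w, w))
        grad = np.array(w, dtype=float)
        for phi_pos, Phi_others, q in examples:
            diffs = phi_pos - Phi_others          # phi(x, y_clicked) - phi(x, y), shape (m, d)
            margins = 1.0 - diffs @ w             # hinge margins, one per other candidate
            violated = margins > 0
            obj += (C / (n * q)) * float(np.sum(margins[violated]))
            grad -= (C / (n * q)) * diffs[violated].sum(axis=0)
        return obj, grad

A few steps of subgradient descent on w with this objective would play the role of the quadratic-program solver; the experiments in this paper instead use SVM-rank's one-slack cutting-plane implementation.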
7. EMPIRICAL EVALUATION
We take a two-pronged approach to evaluating our ap-
proach empirically. First, we use synthetically generated
click data to explore the behavior of our methods over the
whole spectrum of presentation bias severity, click noise, and
propensity misspecification. Second, we explore the real-
world applicability of our approach by evaluating on an op-
erational search engine using real click-logs from live traffic.
7.1 Synthetic Data Experiments
To be able to explore the full spectrum of biases and noise,
we conducted experiments using click data derived from the
Yahoo Learning to Rank Challenge corpus (set 1). This
corpus contains a large number of manually judged queries,
where we binarized relevance by assigning r_i(y) = 1 to all
documents that were rated 3 or 4, and r_i(y) = 0 for ratings
0, 1, and 2. We adopt the train, validation, and test splits in the cor-
pus. This means that queries in the three sets are disjoint,
and we never train on any data from queries in the test set.
To have a gold standard for reporting test-set performance,
we measure performance on the binarized full-information
ratings using (2).
To generate click data from this full-information dataset
of ratings, we first trained a normal Ranking SVM using 1
percent of the full-information training data to get a ranking
function S_0. We employ S_0 as the “Production Ranker”,
and it is used to “present” rankings \bar{y} when generating the
click data. We generate clicks using the rankings \bar{y} and
ground-truth binarized relevances from the Yahoo dataset
according to the following process. Depending on whether
we are generating a training or a validation sample of click
data, we first randomly draw a query x from the respective
full-information dataset. For this query we compute \bar{y} =
S_0(x) and generate clicks based on the model from Section 5.
Whenever a click is generated, we record a training example
with its associated propensity Q(o(y) = 1 | x, \bar{y}, r).
[Figure 1 (plot omitted). x-axis: Number of Training Clicks; y-axis: Avg. Rank of Relevant Results; curves: Production Ranker, Propensity SVM-Rank, Clipped Propensity SVM-Rank, Naive SVM-Rank, Noise-free Full-info Skyline.]
Figure 1: Test set performance in terms of (2) for
Propensity SVM-Rank with and without clipping
compared to SVM-Rank naively ignoring the bias
in clicks (η = 1, ε_− = 0.1). The skyline is a Ranking
SVM trained on all data without noise in the
full-information setting, and the baseline is the production
ranker S_0.
For the experiments, we model presentation bias via

    Q(o(y) = 1 | x, \bar{y}, r) = p_{rank(y | \bar{y})} = \left( \frac{1}{rank(y | \bar{y})} \right)^{η}.    (7)

The parameter η lets us control the severity of the presentation
bias. We also introduce noise into the clicks according
to the model described in Section 5. When not mentioned
otherwise, we use the parameters η = 1, ε_− = 0.1, and
ε_+ = 1, which leads to click data where about 33% of the
clicks are noisy clicks on irrelevant results and where the
result at rank 10 has a 10% probability of being examined.
We also explore other bias profiles and noise levels in the
following experiments.
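A minimal Python sketch of this click-generation process (illustrative only; the ranking function, document ids, and relevance dictionaries are placeholders) is given below: examination follows (7) with parameter η, and an examined result is clicked with probability ε_+ if relevant and ε_− otherwise.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_clicks(presented_ranking, relevance, eta=1.0, eps_plus=1.0, eps_minus=0.1):
        """Generate clicks for one presented ranking under the position-based model (7)
        with click noise; returns the clicked documents and their propensities."""
        clicked, propensities = [], {}
        for rank, doc in enumerate(presented_ranking, start=1):
            p_examine = (1.0 / rank) ** eta            # examination propensity p_r, Eq. (7)
            if rng.random() >= p_examine:
                continue                               # not examined: relevance stays hidden
            p_click = eps_plus if relevance.get(doc, 0) == 1 else eps_minus
            if rng.random() < p_click:
                clicked.append(doc)
                propensities[doc] = p_examine          # propensity recorded with the click
        return clicked, propensities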
In all experiments, we select any parameters (e.g., C) of
the learning methods via cross-validation on a validation set.
The validation set is generated using the same click model as
the training set, but using the queries in the validation-set
portion of the Yahoo dataset. For Propensity SVM-Rank,
we always use the (unclipped) IPS estimator (5) to estimate
validation set performance. Keeping with the proportions
of the original Yahoo data, the validation set size is always
about 15% the size of the training set.
The primary baseline we compare against is a naive appli-
cation of SVM-Rank that simply ignores the bias in the click
data. We call this method Naive SVM-Rank. It is equivalent
to a standard ranking SVM [9], but is most easily explained
as equivalent to Propensity SVM-Rank with all q_j set to 1.
Analogously, we use the corresponding naive version of (5)
with propensities set to 1 to estimate validation set perfor-
mance for Naive SVM-Rank.
7.2 How does ranking performance scale with
training set size?
We first explore how the test-set ranking performance
changes as the learning algorithm is given more and more
click data. The resulting learning curves are given in Fig-
ure 1, and the performance of S_0 is given as a baseline. The
click data has presentation bias according to (7) with η = 1
and noise ε_− = 0.1. For small datasets, results are averaged
over 5 draws of the click data.

[Figure 2 (plot omitted). x-axis: Severity of Presentation Bias; y-axis: Avg. Rank of Relevant Results; curves: Propensity SVM-Rank, 5x Propensity SVM-Rank, Naive SVM-Rank, 5x Naive SVM-Rank.]
Figure 2: Test set performance for Propensity SVM-Rank
and Naive SVM-Rank as presentation bias
becomes more severe in terms of η (n = 45K and
n = 225K, ε_− = 0).
With increasing amounts of click data, Propensity SVM-
Rank approaches the skyline performance of the full-
information SVM-Rank trained on the complete training set
of manual ratings without noise. This is in stark contrast to
Naive SVM-Rank which fails to account for the bias in the
data and does not reach this level of performance. Further-
more, Naive SVM-Rank cannot make effective use of addi-
tional data and its learning curve is essentially flat. This
is consistent with the theoretical insight that estimation er-
ror in Naive SVM-Rank’s empirical risk \hat{R}(S) is dominated
by asymptotic bias due to biased clicks, which does not de-
crease with more data and leads to suboptimal learning. The
unbiased risk estimate \hat{R}_{IPS}(S) of Propensity SVM-Rank,
however, has estimation error only due to finite sample vari-
ance, which is decreased by more data and leads to consis-
tent learning.
While unbiasedness is an important property when click
data is plenty, the increased variance of \hat{R}_{IPS}(S) can be a
drawback for small datasets. This can be seen in Figure 1,
where Naive SVM-Rank outperforms Propensity SVM-Rank
for small datasets. This can be remedied using techniques
like “propensity clipping” [24], where small propensities are
clipped to some threshold value τ to trade bias for variance:

    \hat{R}_{CIPS}(S) = \frac{1}{n} \sum_{x_i} \sum_{y ∈ S(x_i)} \frac{rank(y | S(x_i)) · r_i(y)}{\max\{τ, \, Q(o_i(y) = 1 | x_i, \bar{y}_i, r_i)\}}.
Figure 1 shows the learning curve of Propensity SVM-Rank
with clipping, cross-validating both the clipping threshold
τ and C. Clipping indeed improves performance for small
datasets. While τ = 1 is equivalent to Naive SVM-Rank,
the validation set is too small (and hence, the finite sample
error of the validation performance estimate too high) to
reliably select this model in every run. In practice, however,
we expect click data to be plentiful such that lack of training
data is unlikely to be a persistent issue.
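As an illustration of the clipping idea (under the same hypothetical data layout as the earlier position-based sketch), the per-query clipped estimate simply replaces each propensity by max(τ, propensity); τ would be chosen by cross-validation as in the experiments.

    def delta_clipped_ips(new_ranking, presented_ranking, clicked_docs, p_r, tau):
        """Clipped-IPS estimate: like Eq. (5), but propensities below tau are raised to tau."""
        total = 0.0
        for doc in clicked_docs:
            presented_rank = presented_ranking.index(doc) + 1
            new_rank = new_ranking.index(doc) + 1
            total += new_rank / max(tau, p_r[presented_rank - 1])
        return total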
7.3 How much presentation bias can be toler-
ated?
We now vary the severity of the presentation bias via η
to understand its impact on Propensity SVM-Rank.

[Figure 3 (plot omitted). x-axis: Noise Level; y-axis: Avg. Rank of Relevant Results; curves: Propensity SVM-Rank, 5x Propensity SVM-Rank, Naive SVM-Rank, 5x Naive SVM-Rank.]
Figure 3: Test set performance for Propensity SVM-Rank
and Naive SVM-Rank as the noise level increases
in terms of ε_− (n = 170K and n = 850K, η = 1).

Figure 2 shows that inverse propensity weighting is beneficial
whenever substantial bias exists. Furthermore, increasing
the amount of training data by a factor of 5 leads to fur-
ther improvement for the Propensity SVM-Rank, while the
added training data has no effect on Naive SVM-Rank. This
is consistent with our arguments from Section 4 – more train-
ing data does not help when bias dominates estimation er-
ror, but it can reduce estimation error from variance in the
unbiased risk estimate of Propensity SVM-Rank.
7.4 How robust are the methods to click noise?
Figure 3 shows that Propensity SVM-Rank also enjoys a
substantial advantage when it comes to noise. When in-
creasing the noise level in terms of from 0 up to 0.3
(resulting in click data where 59.8% of all clicks are on irrel-
evant documents), Propensity SVM-Rank increasingly out-
performs Naive SVM-Rank. And, again, the unbiasedness
of the empirical risk estimate allows Propensity SVM-Rank
to benefit from more data.
7.5 How robust is Propensity SVM-Rank to
misspecified propensities?
So far all experiments have assumed that Propensity SVM-
Rank has access to accurate propensities. In practice, how-
ever, propensities need to be estimated and are subject to
model assumptions. We now evaluate how robust Propen-
sity SVM-Rank is to misspecified propensities. Figure 4
shows the performance of Propensity SVM-Rank when the
training data is generated with η = 1, but the propensities
used by Propensity SVM-Rank are misspecified using the η
given in the x-axis of the plot. The plot shows that even
misspecified propensities can give substantial improvement
over naively ignoring the bias, as long as the misspecification
is “conservative” – i.e., overestimating small propensities is
tolerable (which happens when η < 1), but underestimat-
ing small propensities can be harmful (which happens when
η > 1). This is consistent with theory, and clipping is one
particular way of overestimating small propensities that can
even improve performance. Overall, we conclude that even
a mediocre propensity model can improve over the naive ap-
proach – after all, the naive approach can be thought of as a
particularly poor propensity model that implicitly assumes
no presentation bias and uniform propensities.
[Figure 4 (plot omitted). x-axis: Assumed Propensity Model (η); y-axis: Avg. Rank of Relevant Results; curves: Propensity SVM-Rank, Naive SVM-Rank.]
Figure 4: Test set performance for Propensity SVM-Rank
and Naive SVM-Rank as propensities are misspecified
(true η = 1, n = 170K, ε_− = 0.1).
7.6 Real-World Experiment
We now examine the performance of Propensity SVM-
Rank when learning a new ranking function for the Arxiv
Full-Text Search (http://search.arxiv.org:8081/) based on
real-world click logs from this system. The search en-
gine uses a linear scoring function as outlined in Sec-
tion 6. Query-document features φ(x, y) are represented
by a 1000-dimensional vector, and the production ranker
used for collecting training clicks employs a hand-crafted
weight vector w (denoted Prod). Observed clicks on rank-
ings served by this ranker over a period of 21 days provide
implicit feedback data for LTR as outlined in Section 6.
To estimate the propensity model, we consider the simple
position-based model of Section 5.1 and we collect new click
data via randomized interventions for 7 days as outlined in
Section 5.3 with landmark rank k= 1. Before presenting
the ranking, we take the top-ranked document and swap it
with the document at a rank chosen uniformly at random from
j ∈ {1, ..., 21}. The ratio of observed click-through rates
(CTR) on the formerly top-ranked document now at position
j vs. its CTR at position 1 gives a noisy estimate of p_j / p_1
in the position-based click model. We additionally smooth
these estimates by interpolating with the overall observed
CTR at position j (normalized so that CTR@1 = 1). This
yields p_r that approximately decay with rank r, with the
smallest p_r ≈ 0.12. For r > 21, we impute p_r = p_21.
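A sketch of such a smoothing step is given below for illustration only; the interpolation weight and array layout are hypothetical, and the paper does not specify the exact interpolation used.

    import numpy as np

    def smooth_propensities(swap_based, overall_ctr, alpha=0.5):
        """Interpolate swap-based estimates of p_j / p_1 with the overall CTR profile.

        swap_based  -- swap_based[j-1] is the noisy swap-intervention estimate of p_j / p_1
        overall_ctr -- overall_ctr[j-1] is the observed CTR at position j on regular traffic
        alpha       -- interpolation weight (a hypothetical choice)
        """
        ctr_norm = np.asarray(overall_ctr, dtype=float)
        ctr_norm = ctr_norm / ctr_norm[0]              # normalize so that CTR@1 = 1
        p = alpha * np.asarray(swap_based, dtype=float) + (1.0 - alpha) * ctr_norm
        return p                                       # for ranks beyond the last estimate, reuse the last value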
We partition the click-logs into a train-validation split:
the first 16 days are the train set and provide 5437 click
events for SVM-Rank, while the remaining 5 days are the
validation set with 1755 click events. The hyper-parameter
C is picked via cross-validation. Analogous to Section 7.1, we
use the IPS estimator for Propensity SVM-Rank, and the naive
estimator with Q(o(y) = 1 | x, \bar{y}, r) = 1 for Naive SVM-Rank.
With the best hyper-parameter settings, we re-train on all
21 days worth of data to derive the final weight vectors for
either method.
We fielded these learnt weight vectors in two online in-
terleaving experiments [2], the first comparing Propensity
SVM-Rank against Prod and the second comparing Propen-
sity SVM-Rank against Naive SVM-Rank. The results are
summarized in Table 1. We find that Propensity SVM-
Rank significantly outperforms the hand-crafted production
ranker that was used to collect the click data for training
Table 1: Per-query balanced interleaving results for
detecting relative performance between the hand-crafted
production ranker used for click data collection
(Prod), Naive SVM-Rank, and Propensity SVM-Rank.

                                 Propensity SVM-Rank
    Interleaving Experiment      wins   loses   ties
    against Prod                   87      48     83
    against Naive SVM-Rank         95      60    102
(two-tailed binomial sign test p = 0.001 with relative risk
0.71 compared to null hypothesis). Furthermore, Propensity
SVM-Rank similarly outperforms Naive SVM-Rank, demonstrating
that even a simple propensity model provides benefits
on real-world data (two-tailed binomial sign test p = 0.006
with relative risk 0.77 compared to null hypothesis).
Note that Propensity SVM-Rank not only significantly, but
also substantially outperforms both other rankers in terms of
effect size – and the synthetic data experiments suggest that
additional training data will further increase its advantage.
8. CONCLUSIONS AND FUTURE WORK
This paper introduced a principled approach for learning-
to-rank under biased feedback data. Drawing on counterfac-
tual modeling techniques from causal inference, we present a
theoretically sound Empirical Risk Minimization framework
for LTR. We instantiate this framework with a Propensity-
Weighted Ranking SVM, and provide extensive empirical
evidence that the resulting learning method is robust to se-
lection biases, noise, and model misspecification. Further-
more, our real-world experiments on a live search engine
show that the approach leads to substantial retrieval im-
provements, without any heuristic or manual interventions
in the learning process.
Beyond the specific learning methods and propensity mod-
els we propose, this paper may have even bigger impact for
its theoretical contribution of developing the general coun-
terfactual model for LTR, thus articulating the key compo-
nents necessary for LTR under biased feedback. First, the
insight that propensity estimates are crucial for ERM learn-
ing opens a wide area of research on designing better propen-
sity models. Second, the theory demonstrates that LTR
methods should optimize propensity-weighted ERM objec-
tives, raising the question of which other learning methods
beyond the Ranking SVM can be adapted to the Propensity
ERM approach. Third, we conjecture that Propensity ERM
approaches can be developed also for pointwise and listwise
LTR methods using techniques from [20].
Beyond learning from implicit feedback, propensity-
weighted ERM techniques may prove useful even for opti-
mizing offline IR metrics on manually annotated test collec-
tions. First, they can eliminate pooling bias, since the use of
sampling during judgment elicitation puts us in a controlled
setting where propensities are known (and can be optimized
[20]) by design. Second, propensities estimated via click
models can enable click-based IR metrics like click-DCG to
better correlate with test set DCG.
This work was supported in part through NSF Awards
IIS-1247637, IIS-1513692, IIS-1615706, and a gift from
Bloomberg. We thank Maarten de Rijke, Alexey Borisov,
Artem Grotov, and Yuning Mao for valuable feedback and
discussions.
9. REFERENCES
[1] A. Borisov, I. Markov, M. de Rijke, and P. Serdyukov.
A neural click model for web search. In Proceedings of
the 25th International Conference on World Wide
Web, pages 531–541, 2016.
[2] O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue.
Large-scale validation and analysis of interleaved
search evaluation. ACM Transactions on Information
Systems (TOIS), 30(1):6:1–6:41, 2012.
[3] O. Chapelle and Y. Zhang. A dynamic bayesian
network click model for web search ranking. In
International Conference on World Wide Web
(WWW), pages 1–10. ACM, 2009.
[4] A. Chuklin, I. Markov, and M. de Rijke. Click Models
for Web Search. Synthesis Lectures on Information
Concepts, Retrieval, and Services. Morgan & Claypool
Publishers, 2015.
[5] N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey. An
experimental comparison of click position-bias models.
In International Conference on Web Search and Data
Mining (WSDM), pages 87–94. ACM, 2008.
[6] K. Hofmann, A. Schuth, S. Whiteson, and
M. de Rijke. Reusing historical interaction data for
faster online learning to rank for ir. In International
Conference on Web Search and Data Mining
(WSDM), pages 183–192, 2013.
[7] D. G. Horvitz and D. J. Thompson. A generalization
of sampling without replacement from a finite
universe. Journal of the American Statistical
Association, 47(260):663–685, 1952.
[8] G. Imbens and D. Rubin. Causal Inference for
Statistics, Social, and Biomedical Sciences. Cambridge
University Press, 2015.
[9] T. Joachims. Optimizing search engines using
clickthrough data. In ACM SIGKDD Conference on
Knowledge Discovery and Data Mining (KDD), pages
133–142, 2002.
[10] T. Joachims. Training linear SVMs in linear time. In
ACM SIGKDD International Conference On
Knowledge Discovery and Data Mining (KDD), pages
217–226, 2006.
[11] T. Joachims, L. Granka, B. Pan, H. Hembrooke,
F. Radlinski, and G. Gay. Evaluating the accuracy of
implicit feedback from clicks and query reformulations
in web search. ACM Transactions on Information
Systems (TOIS), 25(2), April 2007.
[12] J. Langford, A. Strehl, and J. Wortman. Exploration
scavenging. In Proceedings of the 25th International
Conference on Machine Learning, pages 528–535,
2008.
[13] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased
offline evaluation of contextual-bandit-based news
article recommendation algorithms. In International
Conference on Web Search and Data Mining
(WSDM), pages 297–306, 2011.
[14] R. J. A. Little and D. B. Rubin. Statistical Analysis
with Missing Data. John Wiley, 2002.
[15] T.-Y. Liu. Learning to rank for information retrieval.
Foundations and Trends in Information Retrieval,
3(3):225–331, Mar. 2009.
[16] K. Raman and T. Joachims. Learning socially optimal
information systems from egoistic users. In European
Conference on Machine Learning (ECML), pages
128–144, 2013.
[17] K. Raman, T. Joachims, P. Shivaswamy, and
T. Schnabel. Stable coactive learning via perturbation.
In International Conference on Machine Learning
(ICML), pages 837–845, 2013.
[18] M. Richardson, E. Dominowska, and R. Ragno.
Predicting clicks: Estimating the click-through rate
for new ads. In International Conference on World
Wide Web (WWW), pages 521–530. ACM, 2007.
[19] P. R. Rosenbaum and D. B. Rubin. The central role of
the propensity score in observational studies for causal
effects. Biometrika, 70(1):41–55, 1983.
[20] T. Schnabel, A. Swaminathan, P. Frazier, and
T. Joachims. Unbiased comparative evaluation of
ranking functions. In ACM International Conference
on the Theory of Information Retrieval (ICTIR), 2016.
[21] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak,
and T. Joachims. Recommendations as treatments:
Debiasing learning and evaluation. In International
Conference on Machine Learning (ICML), 2016.
[22] A. Schuth, H. Oosterhuis, S. Whiteson, and
M. de Rijke. Multileave gradient descent for fast
online learning to rank. In International Conference
on Web Search and Data Mining (WSDM), pages
457–466, 2016.
[23] K. Sparck-Jones and C. J. V. Rijsbergen. Report on
the need for and provision of an “ideal” information
retrieval test collection. Technical report, University of
Cambridge, 1975.
[24] A. L. Strehl, J. Langford, L. Li, and S. Kakade.
Learning from logged implicit exploration data. In
Conference on Neural Information Processing Systems
(NIPS), pages 2217–2225, 2010.
[25] A. Swaminathan and T. Joachims. Batch learning
from logged bandit feedback through counterfactual
risk minimization. Journal of Machine Learning
Research (JMLR), 16:1731–1755, Sep 2015.
[26] V. Vapnik. Statistical Learning Theory. Wiley,
Chichester, GB, 1998.
[27] L. Wang, J. J. Lin, and D. Metzler. A cascade ranking
model for efficient ranked retrieval. In ACM
Conference on Research and Development in
Information Retrieval (SIGIR), pages 105–114, 2011.
[28] X. Wang, M. Bendersky, D. Metzler, and M. Najork.
Learning to rank with selection bias in personal search.
In ACM Conference on Research and Development in
Information Retrieval (SIGIR). ACM, 2016.
[29] Y. Wang, D. Yin, L. Jie, P. Wang, M. Yamada,
Y. Chang, and Q. Mei. Beyond ranking: Optimizing
whole-page presentation. In Proceedings of the Ninth
ACM International Conference on Web Search and
Data Mining, WSDM ’16, pages 103–112, 2016.
[30] Y. Yue and T. Joachims. Interactively optimizing
information retrieval systems as a dueling bandits
problem. In International Conference on Machine
Learning (ICML), pages 151–159, 2009.
[31] Y. Yue, R. Patel, and H. Roehrig. Beyond position
bias: examining result attractiveness as a source of
presentation bias in clickthrough data. In
International Conference on World Wide Web
(WWW), pages 1011–1018. ACM, 2010.
... For example, position bias occurs when items at the top of a ranked list receive more clicks than those relevant lower down: higher items in a list absorb more exposure. Studies show that such a bias, if left uncorrected, degrades the ranking quality of a system trained on the user interactions [2,19,48,51]. As a result, a system should return rankings that strive to a certain extent for fairness of exposure. ...
... We follow previous work on implicit bias [23] and model group bias with a multiplicative factor. This allows us to use the inverse propensity scoring (IPS) method to correct for the bias [19,51]. Measuring group bias, however, is not as simple as measuring position or trust bias. ...
... unless is equal across all groups, i.e., there is no group bias. Similar to studies on position and trust bias [1,19,[46][47][48], in our experiments, we analyze the effect of group bias on the ranking quality of the LTR model (RQ1). Due to the relationship between group bias and fairness concerns, we go one step further and assess how leaving the group bias uncorrected affects the optimization of fairness metrics. ...
Preprint
When learning to rank from user interactions, search and recommendation systems must address biases in user behavior to provide a high-quality ranking. One type of bias that has recently been studied in the ranking literature is when sensitive attributes, such as gender, have an impact on a user's judgment about an item's utility. For example, in a search for an expertise area, some users may be biased towards clicking on male candidates over female candidates. We call this type of bias group membership bias or group bias for short. Increasingly, we seek rankings that not only have high utility but are also fair to individuals and sensitive groups. Merit-based fairness measures rely on the estimated merit or utility of the items. With group bias, the utility of the sensitive groups is under-estimated, hence, without correcting for this bias, a supposedly fair ranking is not truly fair. In this paper, first, we analyze the impact of group bias on ranking quality as well as two well-known merit-based fairness metrics and show that group bias can hurt both ranking and fairness. Then, we provide a correction method for group bias that is based on the assumption that the utility score of items in different groups comes from the same distribution. This assumption has two potential issues of sparsity and equality-instead-of-equity, which we use an amortized approach to solve. We show that our correction method can consistently compensate for the negative impact of group bias on ranking quality and fairness metrics.
... We focus on off-policy estimation in the standard contextual multi-armed bandit setting, but we note that our work is applicable to counterfactual learning-to-rank [8,9] and slate recommendation [18]. ...
... This lead to a situation where both logging and target policy had full support (all actions had a positive probability of being selected) but the logging policy "favored" actions in order 8 , 7 , 6 , ..., 1 , whereas the target policy "favored" actions in order 1 , 2 , 3 , ..., 8 . ...
Preprint
Full-text available
"Clipping" (a.k.a. importance weight truncation) is a widely used variance-reduction technique for counterfactual off-policy estimators. Like other variance-reduction techniques, clipping reduces variance at the cost of increased bias. However, unlike other techniques, the bias introduced by clipping is always a downward bias (assuming non-negative rewards), yielding a lower bound on the true expected reward. In this work we propose a simple extension, called $\textit{double clipping}$, which aims to compensate this downward bias and thus reduce the overall bias, while maintaining the variance reduction properties of the original estimator.
... Unfortunately, neither users' browsing history nor bounce position is included in the dataset. So we construct a dataset in the pre-processes proposed in [7] according to Yahoo dataset. In the following, we introduce the pre-processes in more details. ...
... Exposure sequences. We adopt a click data generation method like [7] to generate exposure sequences. More specifically, first train a ranking model MART with the raw data, note that an item is labeled as clicked if its rating score is above 2 which is consistent with the original data. ...
Preprint
Full-text available
Ranking is a crucial module in recommender systems. In particular, the ranking module in our YoungTao recommendation scenario provides an ordered list of items to users, with the goal of maximizing the number of clicks over the whole recommendation session for each user. However, we found that the traditional ranking approach of optimizing click-through rate (CTR) does not address our scenario well, since it completely ignores user leaving, and CTR is an optimization goal for one-step recommendation. To properly serve the purpose of our ranking module, we propose a long-term optimization goal, named CTE (Click-Through quantity Expectation), which explicitly takes the behavior of user leaving into account. Based on CTE, we propose an effective model trained by reinforcement learning. Moreover, we build a simulation environment from offline log data for estimating PBR and CTR. We conduct extensive experiments on offline datasets and on an online e-commerce scenario in TaoBao. Experimental results show that our method boosts performance effectively.
... The conceptual framework we developed in this section still applies, as long as the popularity ranking is taken into account in calculating the score. Further, recent algorithmic approaches estimate the objective utility or relevance u_i of different items by debiasing the number of clicks from attention imbalances [1,34]. Even for these algorithms, however, ranking-based rich-get-richer dynamics can be at play if a link's actual or perceived utility for the users depends on the object's popularity [4,47]. ...
Article
Full-text available
We study a discrete-time Markov process X_n ∈ R^d for which the distribution of future increments depends only on the relative ranking of its components (descending order by value). We endow the process with a rich-get-richer assumption and show that, together with a finite-second-moments assumption, this is enough to guarantee almost sure convergence of X_n/n. We characterize the possible limits if one is free to choose the initial state, and we give a condition under which the initial state is irrelevant. Finally, we show how our framework can account for ranking-based Pólya urns and can be used to study ranking algorithms for web interfaces.
... In recent years, as described in two surveys [3,6], several methods from causality have been applied in Recommender Systems (RS) [9,18]. Their main application is to address the bias problem in RSs, using the Inverse Probability Weighting (IPW) estimator [11,20,26] or doubly robust estimators [19]. Moreover, Structural Causal Model (SCM) approaches have also been applied to the RS problem. ...
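For reference, a minimal sketch of the Inverse Probability Weighting estimator mentioned in this excerpt, under the usual assumption that each observed interaction comes with an estimated observation propensity; all names below are illustrative.

```python
import numpy as np

def ipw_risk(losses, observed, propensities):
    """IPW estimate of the average loss over *all* user-item pairs,
    using only the observed entries re-weighted by 1/propensity."""
    losses = np.asarray(losses, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    propensities = np.asarray(propensities, dtype=float)
    n_total = losses.size
    return float(np.sum(losses[observed] / propensities[observed]) / n_total)

# Toy example: 6 user-item pairs, 3 observed; popular items are observed more often.
losses       = np.array([0.2, 0.9, 0.1, 0.5, 0.3, 0.7])
observed     = np.array([True, False, True, False, True, False])
propensities = np.array([0.8, 0.1, 0.6, 0.1, 0.4, 0.1])
print(ipw_risk(losses, observed, propensities))
```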
Preprint
Full-text available
We approached the causal discovery task in the recommender system domain to learn a causal graph by combining observational data provided by a meta-search booking platform for online hotel search with prior knowledge made available by domain experts. The results show that it is possible to learn a causal graph coherent with previous findings in the recommender systems literature about the relations between different factors. Furthermore, we also discovered new insights that could help in the recommendation process.
... Algorithms are known to have interventional effects that cause the underlying data distribution to change. For example, the feedback effect, well known in interactive machine learning systems such as recommendation systems [429,430,431,432,433] and search engines [434,435], possibly also exists in LLMs, because human feedback data are used to fine-tune LLMs such as InstructGPT [1]. The feedback effect describes the observation that existing disparities in data among different user groups can create differentiated experiences when users interact with an algorithmic system (e.g., a recommendation system), which further reinforces the bias. ...
Preprint
Full-text available
Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category is further divided into several sub-categories, resulting in a total of 29 sub-categories. Additionally, a subset of 8 sub-categories is selected for further investigation, where corresponding measurement studies are designed and conducted on several widely-used LLMs. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. This highlights the importance of conducting more fine-grained analyses, testing, and making continuous improvements on LLM alignment. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications.
Article
In leading collaborative filtering (CF) models, representations of users and items are prone to learn popularity bias in the training data as shortcuts. The popularity shortcut tricks are good for in-distribution (ID) performance but generalize poorly to out-of-distribution (OOD) data, i.e., when the popularity distribution of the test data shifts w.r.t. the training one. To close the gap, debiasing strategies try to assess the shortcut degrees and mitigate them from the representations. However, there exist two deficiencies: (1) when measuring the shortcut degrees, most strategies only use statistical metrics on a single aspect (i.e., item frequency on the item aspect and user frequency on the user aspect), failing to accommodate the compositional degree of a user-item pair; (2) when mitigating shortcuts, many strategies assume that the test distribution is known in advance. This results in low-quality debiased representations. Worse still, these strategies achieve OOD generalizability with a sacrifice on ID performance. In this work, we present a simple yet effective debiasing strategy, PopGo, which quantifies and reduces the interaction-wise popularity shortcut without any assumptions on the test data. It first learns a shortcut model, which yields a shortcut degree of a user-item pair based on their popularity representations. Then, it trains the CF model by adjusting the predictions with the interaction-wise shortcut degrees. By taking both causal- and information-theoretical looks at PopGo, we can justify why it encourages the CF model to capture the critical popularity-agnostic features while leaving the spurious popularity-relevant patterns out. We use PopGo to debias two high-performing CF models (MF [28], LightGCN [19]) on four benchmark datasets. On both ID and OOD test sets, PopGo achieves significant gains over the state-of-the-art debiasing strategies (e.g., DICE [71], MACR [58]). Codes and datasets are available at https://github.com/anzhang314/PopGo.
Conference Paper
Full-text available
Eliciting relevance judgments for ranking evaluation is labor-intensive and costly, motivating careful selection of which documents to judge. Unlike traditional approaches that make this selection deterministically, probabilistic sampling enables the design of estimators that are provably unbiased even when reusing data with missing judgments. In this paper, we first unify and extend these sampling approaches by viewing the evaluation problem as a Monte Carlo estimation task that applies to a large number of common IR metrics. Drawing on the theoretical clarity that this view offers, we tackle three practical evaluation scenarios: comparing two systems, comparing k systems against a baseline, and ranking k systems. For each scenario, we derive an estimator and a variance-optimizing sampling distribution while retaining the strengths of sampling-based evaluation, including unbiasedness, reusability despite missing data, and ease of use in practice. In addition to the theoretical contribution, we empirically evaluate our methods against previously used sampling heuristics and find that they often cut the number of required relevance judgments at least in half.
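The sampling-based evaluation described above can be pictured with a Horvitz-Thompson style estimator: judge only a sampled subset of documents and re-weight each judged document by its inclusion probability. The metric below (a DCG-style weighted sum over one ranking) and the sampling distribution are illustrative assumptions, not the paper's derived variance-optimal designs.

```python
import numpy as np

rng = np.random.default_rng(1)

def sampled_metric_estimate(ranking, judge, inclusion_probs):
    """Unbiased estimate of sum_i w_i * rel(d_i) when only a sampled
    subset of documents is judged. `judge(doc)` returns the relevance
    of a document; `inclusion_probs[i]` is the probability that the
    document at rank i is selected for judging."""
    total = 0.0
    for i, doc in enumerate(ranking):
        rank_weight = 1.0 / np.log2(i + 2)          # e.g., a DCG-style weight
        if rng.random() < inclusion_probs[i]:        # sample this judgment
            total += rank_weight * judge(doc) / inclusion_probs[i]
    return total

docs = ["d1", "d2", "d3", "d4", "d5"]
true_rel = {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 1}
probs = [1.0, 0.8, 0.6, 0.4, 0.2]                    # judge top ranks more often
print(sampled_metric_estimate(docs, lambda d: true_rel[d], probs))
```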
Article
Full-text available
We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method -- called Policy Optimizer for Exponential Models (POEM) -- for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems showing substantially improved robustness and generalization performance compared to the state-of-the-art.
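A minimal sketch of the counterfactual risk minimization objective described above: the propensity-weighted empirical risk plus a penalty on its empirical variability. The regularization constant is an illustrative hyperparameter, and the sketch only evaluates the objective for a fixed policy rather than optimizing an exponential model as POEM does.

```python
import numpy as np

def crm_objective(losses, new_probs, logging_probs, lam=0.1):
    """CRM-style objective: IPS-weighted empirical risk plus a penalty
    proportional to the empirical standard error of the weighted losses
    (the variance term in the generalization bound)."""
    w = np.asarray(new_probs) / np.asarray(logging_probs)
    weighted = w * np.asarray(losses)
    risk = weighted.mean()
    variance_penalty = lam * np.sqrt(weighted.var(ddof=1) / len(weighted))
    return float(risk + variance_penalty)

# Toy bandit log: losses under the logged actions and the propensities
# of those actions under the logging policy and a candidate policy.
losses        = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
logging_probs = np.array([0.2, 0.4, 0.1, 0.3, 0.25])
new_probs     = np.array([0.3, 0.2, 0.2, 0.2, 0.10])
print(crm_objective(losses, new_probs, logging_probs))
```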
Conference Paper
Understanding user browsing behavior in web search is key to improving web search effectiveness. Many click models have been proposed to explain or predict user clicks on search engine results. They are based on the probabilistic graphical model (PGM) framework, in which user behavior is represented as a sequence of observable and hidden events. The PGM framework provides a mathematically solid way to reason about a set of events given some information about other events. But the structure of the dependencies between the events has to be set manually. Different click models use different hand-crafted sets of dependencies. We propose an alternative based on the idea of distributed representations: to represent the user's information need and the information available to the user with a vector state. The components of the vector state are learned to represent concepts that are useful for modeling user behavior. And user behavior is modeled as a sequence of vector states associated with a query session: the vector state is initialized with a query, and then iteratively updated based on information about interactions with the search engine results. This approach allows us to directly understand user browsing behavior from click-through data, i.e., without the need for a predefined set of rules as is customary for PGM-based click models. We illustrate our approach using a set of neural click models. Our experimental results show that the neural click model that uses the same training data as traditional PGM-based click models has better performance on the click prediction task (i.e., predicting user clicks on search engine results) and the relevance prediction task (i.e., ranking documents by their relevance to a query). An analysis of the best-performing neural click model shows that it learns concepts similar to those used in traditional click models, and that it also learns other concepts that cannot be designed manually.
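A minimal sketch of the distributed-representation idea: a recurrent vector state is initialized from the query, updated after each result, and used to predict a click probability at every rank. The dimensions, the simple tanh update, and the randomly initialized weights are assumptions for illustration, not the specific architecture of the cited paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8                                           # state / embedding size (illustrative)

# Randomly initialized parameters stand in for learned ones.
W_q = rng.normal(scale=0.1, size=(d, d))        # query -> initial state
W_s = rng.normal(scale=0.1, size=(d, d))        # state -> state
W_x = rng.normal(scale=0.1, size=(d, d))        # result representation -> state
w_out = rng.normal(scale=0.1, size=d)           # state -> click logit

def click_probabilities(query_vec, result_vecs):
    """Vector-state click model: the state starts from the query and is
    updated with each result representation; a click probability is
    predicted from the state at every rank."""
    state = np.tanh(W_q @ query_vec)
    probs = []
    for x in result_vecs:
        state = np.tanh(W_s @ state + W_x @ x)
        probs.append(1.0 / (1.0 + np.exp(-w_out @ state)))
    return probs

query = rng.normal(size=d)
results = [rng.normal(size=d) for _ in range(3)]
print(np.round(click_probabilities(query, results), 3))
```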
Conference Paper
Modern search engines aggregate results from different verticals: webpages, news, images, video, shopping, knowledge cards, local maps, etc. Unlike "ten blue links", these search results are heterogeneous in nature and not even arranged in a list on the page. This revolution directly challenges the conventional "ranked list" formulation of ad hoc search. Therefore, finding a proper presentation for a gallery of heterogeneous results is critical for modern search engines. We propose a novel framework that learns the optimal page presentation to render heterogeneous results onto the search result page (SERP). Page presentation is broadly defined as the strategy to present a set of items on a SERP, which is much more expressive than a ranked list. It can specify item positions, image sizes, text fonts, and any other styles as long as variations are within business and design constraints. The learned presentation is content-aware, i.e., tailored to specific queries and returned results. Simulation experiments show that the framework automatically learns eye-catching presentations for relevant results. Experiments on real data show that simple instantiations of the framework already outperform the leading algorithm in federated search result presentation. This means the framework can learn its own result presentation strategy purely from data, without even knowing the "probability ranking principle".
Conference Paper
Click-through data has proven to be a critical resource for improving search ranking quality. Though a large amount of click data can be easily collected by search engines, various biases make it difficult to fully leverage this type of data. In the past, many click models have been proposed and successfully used to estimate the relevance of individual query-document pairs in the context of web search. These click models typically require a large quantity of clicks for each individual pair, which makes them difficult to apply in systems where click data is highly sparse due to personalized corpora and information needs, e.g., personal search. In this paper, we study how to leverage sparse click data in personal search, introduce a novel selection bias problem, and address it in the learning-to-rank framework. We propose several bias estimation methods, including a novel query-dependent one that captures queries with similar results and can successfully deal with sparse data. We empirically demonstrate that learning-to-rank that accounts for query-dependent selection bias yields significant improvements in search effectiveness through online experiments with one of the world's largest personal search engines.
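One way to picture the bias-corrected learning-to-rank setup described here is a weighted pairwise loss in which each clicked result is re-weighted by an estimated (possibly query-dependent) examination propensity. The propensity table and margin below are stand-in values for illustration, not the paper's estimation methods.

```python
import numpy as np

# Illustrative position propensities; in the query-dependent variant these
# would be estimated separately for clusters of similar queries.
propensity = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.3, 5: 0.2}

def weighted_pairwise_loss(scores, clicked_rank, other_ranks, margin=1.0):
    """Inverse-propensity-weighted hinge loss for one click: the clicked
    document should score higher than the documents it is compared to,
    and the loss is up-weighted when the click was unlikely to be observed."""
    weight = 1.0 / propensity[clicked_rank]
    loss = 0.0
    for r in other_ranks:
        loss += max(0.0, margin - (scores[clicked_rank] - scores[r]))
    return weight * loss

# Model scores indexed by the rank position of the logged results (toy values).
scores = {1: 1.2, 2: 0.8, 3: 1.0, 4: 0.1, 5: 0.3}
print(weighted_pairwise_loss(scores, clicked_rank=3, other_ranks=[1, 2]))
```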
Conference Paper
Many information systems aim to present results that maximize the collective satisfaction of the user population. The product search of an online store, for example, needs to present an appropriately diverse set of products to best satisfy the different tastes and needs of its user population. To address this problem, we propose two algorithms that can exploit observable user actions (e.g. clicks) to learn how to compose diverse sets (and rankings) that optimize expected utility over a distribution of utility functions. A key challenge is that individual users evaluate and act according to their own utility function, but that the system aims to optimize collective satisfaction. We characterize the behavior of our algorithms by providing upper bounds on the social regret for a class of submodular utility functions in the coactive learning model. Furthermore, we empirically demonstrate the efficacy and robustness of the proposed algorithms for the problem of search result diversification.
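The set and ranking composition described above relies on submodular utility functions. Below is a generic greedy sketch for one such function, probabilistic topic coverage, purely to illustrate the kind of objective involved; it is not the cited coactive-learning algorithm, and the topic probabilities are made up.

```python
import numpy as np

def greedy_diverse_set(doc_topic_probs, k):
    """Greedy maximization of a probabilistic-coverage utility
    (a submodular set function): at each step pick the document that
    adds the most still-uncovered topic probability mass."""
    doc_topic_probs = np.asarray(doc_topic_probs, dtype=float)
    uncovered = np.ones(doc_topic_probs.shape[1])    # per-topic mass still uncovered
    chosen = []
    for _ in range(k):
        gains = doc_topic_probs @ uncovered
        gains[chosen] = -np.inf                       # don't pick a document twice
        best = int(np.argmax(gains))
        chosen.append(best)
        uncovered *= (1.0 - doc_topic_probs[best])    # coverage update
    return chosen

# 4 documents x 3 topics: each row gives P(doc covers topic).
docs = [[0.9, 0.1, 0.0],
        [0.8, 0.2, 0.1],
        [0.1, 0.9, 0.0],
        [0.0, 0.1, 0.8]]
print(greedy_diverse_set(docs, k=3))
```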
Conference Paper
Modern search systems are based on dozens or even hundreds of ranking features. The dueling bandit gradient descent (DBGD) algorithm has been shown to effectively learn combinations of these features solely from user interactions. DBGD explores the search space by comparing a possibly improved ranker to the current production ranker. To this end, it uses interleaved comparison methods, which can infer with high sensitivity a preference between two rankings based only on interaction data. A limiting factor is that it can compare only to a single exploratory ranker. We propose an online learning to rank algorithm called multileave gradient descent (MGD) that extends DBGD to learn from so-called multileaved comparison methods that can compare a set of rankings instead of merely a pair. We show experimentally that MGD allows for better selection of candidates than DBGD without the need for more comparisons involving users. An important implication of our results is that orders of magnitude less user interaction data is required to find good rankers when multileaved comparisons are used within online learning to rank. Hence, fewer users need to be exposed to possibly inferior rankers and our method allows search engines to adapt more quickly to changes in user preferences.
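A minimal sketch of the multileave-gradient-descent loop described above: sample several exploratory rankers around the current weight vector, obtain feedback on which candidates beat the production ranker, and move the weights toward the winners. The `multileaved_winners` oracle below simulates that feedback with a hidden "true" weight vector; in practice it would be replaced by a multileaved comparison over real user interactions, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5                                             # number of ranking features

def multileaved_winners(current, candidates):
    """Stand-in for a multileaved comparison: a hidden 'true' weight
    vector decides which candidate rankers beat the current ranker."""
    true_w = np.ones(d)
    base = current @ true_w
    return [i for i, c in enumerate(candidates) if c @ true_w > base]

w = np.zeros(d)                                    # current production ranker
eta, delta, n_candidates = 0.1, 1.0, 4

for step in range(50):
    # Sample exploratory rankers on the unit sphere around w.
    directions = rng.normal(size=(n_candidates, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    candidates = w + delta * directions
    winners = multileaved_winners(w, candidates)
    if winners:                                    # move toward the mean winning direction
        w = w + eta * directions[winners].mean(axis=0)

print(np.round(w, 2))
```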