From a User Model for Query Sessions
to Session Rank Biased Precision (sRBP)
Aldo Lipani
University College London
London, United Kingdom
aldo.lipani@ucl.ac.uk
Ben Carterette
Spotify
New York, NY, United States
carteret@acm.org
Emine Yilmaz
University College London
London, United Kingdom
emine.yilmaz@ucl.ac.uk
ABSTRACT
To satisfy their information needs, users usually carry out searches on retrieval systems by continuously trading off between the examination of search results retrieved by under-specified queries and the refinement of these queries through reformulation. In Information Retrieval (IR), a series of query reformulations is known as a query-session. Research in IR evaluation has traditionally been focused on the development of measures for the ad hoc task, for which a retrieval system aims to retrieve the best documents for a single query. Thus, most IR evaluation measures, with a few exceptions, are not suitable to evaluate retrieval scenarios that call for multiple refinements over a query-session. In this paper, by formally modeling a user's expected behaviour over query-sessions, we derive a session-based evaluation measure, which results in a generalization of the evaluation measure Rank Biased Precision (RBP). We demonstrate the quality of this new session-based evaluation measure, named Session RBP (sRBP), by evaluating its user model against the observed user behaviour over the query-sessions of the 2014 TREC Session track.
KEYWORDS
session search, retrieval evaluation, user model, sRBP
ACM Reference Format:
Aldo Lipani, Ben Carterette, and Emine Yilmaz. 2019. From a User Model
for Query Sessions to Session Rank Biased Precision (sRBP). In The 2019
ACM SIGIR International Conference on the Theory of Information Retrieval
(ICTIR ’19), October 2–5, 2019, Santa Clara, CA, USA. ACM, New York, NY,
USA, 8 pages.
1 INTRODUCTION
In order to improve search we need to evaluate it. Research in Information Retrieval (IR) evaluation plays a critical role in the improvement of search engines [13]. Traditionally, IR evaluation has been focused on the evaluation of search engines capable of serving the best results given a single query. However, this kind of evaluation does not reflect a more realistic search scenario where users are allowed to reformulate their queries.
Users usually start their search with under-specified queries that are later reformulated based on the new information acquired during the search [14]. Reformulations can be numerous and endure until the users have either satisfied their information need or given up on the search due to frustration. Carterette et al. [1], for the Text REtrieval Conference (TREC) Session track, define such a series of reformulations as a query-session.

However, the evaluation measures used in the TREC Session track are still traditional evaluation measures developed for single-query search. In this track, search engines are given as input a recording of a user session, which consists of an initial query and subsequent reformulations with their search results and clicks. The task of the engines is to provide a search result for the last reformulation of each session, for which no search result and clicks are provided. Thus, evaluating over the full session is rank-equivalent to evaluating only the last reformulation.
In this paper we introduce a novel session evaluation measure. We start by developing a user model for query-sessions. From this user model we then develop a query-session-based evaluation measure. This evaluation measure results in a generalization of RBP [11]; hence, we name it session RBP (sRBP). sRBP extends RBP by introducing a new parameter, named the balancer (b). This parameter quantifies the users' proclivity to persist in their search by reformulating the query rather than examining more documents from the current search result. Their persistence, as for RBP, is controlled by the other parameter p. With our experiments we aim to answer the following research questions:
RQ1. How well does the user model at the base of session-based evaluation measures predict the expected user behaviour on sessions? How does it compare to single-query measures?
RQ2. Are the single-query measures, DCG and RBP, similar to the session-based measures, sDCG and sRBP, in evaluating search engines?
RQ3. In the same context as RQ2, is sDCG similar to sRBP?
We evaluate these research questions using the 2014 TREC Session track query-session dataset. The contributions of this paper are: (i) a novel user model for query-sessions, (ii) a theoretically grounded derivation of the evaluation measure sRBP (and, with obvious simplifications, also of RBP), (iii) the derived evaluation measure itself, and (iv) a thorough comparison of the single-query measures (DCG and RBP) and the session-based measures (sDCG and sRBP).
The remainder of this paper is structured as follows. In Section
2 we present related work. In Section 3 we introduce the notation
used throughout the paper. The user model is presented in Section
4. In Section 5 we derive the evaluation measure sRBP. In Section 6
we present the experiments and results. We conclude in Section 7.
2 RELATED WORK
The main limitation of single-query search evaluation measures is that they force us, when considering sessions, to either (i) evaluate only the last reformulation, or (ii) aggregate the evaluations of the query and all its reformulations together. However, both approaches fail to capture the session length, which, like the length of search results, affects user satisfaction; both approaches consider a session with n reformulations equal to a session with n + 1 reformulations. This makes the evaluation biased towards longer sessions, because engines, by gaining more information, provide better results [12].
In the TREC Session track, search engines were given as input session information plus an additional reformulation whose search result needed to be provided by the engines. In this case, these two approaches make sense only when comparing sessions to themselves. When comparing search engines, we can analytically show that these approaches produce rank-equivalent results (their scores differ only by a constant). In fact, only the first approach was used in the TREC Session track; the second approach would have provided the same ranking of search engines.
Driven by the lack of evaluation measures able to evaluate query-session searches, Järvelin et al. [8] developed the evaluation measure session Discounted Cumulative Gain (sDCG), which is a generalization of the DCG evaluation measure [7]. sDCG evaluates a query-session by weighting each subsequent reformulation with a progressively smaller weight:

$$\mathrm{sDCG}(r, q) = \sum_{m=1}^{M} \sum_{n=1}^{N} \frac{1}{(1 + \log_{b_q} m)\, \log_b(n + 1)}\, j(r_{m,n}, q),$$

where M is the length of the query-session, N is the length of the search results, r is the list of results, j(r_{m,n}, q) returns the relevance value for the topic q associated to the document r_{m,n} ranked at position n for the query m, and b_q and b are free parameters of the evaluation measure. These two parameters can model various user behaviours: a small value for b_q (or b) models impatient users, who most likely will not reformulate their queries (or not examine the search result in depth), while a larger value models patient users, who most likely will make several reformulations (or examine the search result in depth). However, as we will show in this paper, the sDCG discount function does not characterize well the behaviour observed for Session track users.
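For concreteness, the sDCG formula above can be transcribed as a short function; this is our own sketch (the names sdcg and relevance, and the default parameter values, are illustrative and not taken from the authors' released software):

```python
import numpy as np

def sdcg(relevance, bq=4.0, b=4.0):
    """Session DCG (Järvelin et al.): relevance is an M x N array of graded
    relevance values, one row per reformulation of the query-session."""
    M, N = relevance.shape
    m = np.arange(1, M + 1)                         # query positions 1..M
    n = np.arange(1, N + 1)                         # document ranks 1..N
    query_discount = 1.0 + np.log(m) / np.log(bq)   # 1 + log_bq(m)
    rank_discount = np.log(n + 1) / np.log(b)       # log_b(n + 1)
    discount = 1.0 / (query_discount[:, None] * rank_discount[None, :])
    return float((discount * relevance).sum())
```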
Besides developing session-based evaluation measures, another branch of research focuses on developing simulation-based approaches. Kanoulas et al. [9] simulate all the possible browsing paths to transform query-sessions into classical search results in order to enable the use of standard evaluation measures. With a focus on novelty, Yang and Lad [15] simulate all the possible reformulations to compute an expected utility of a query-session. Contrary to these works, in this paper we take an analytical approach, which does not rely on simulations.

To develop our own user model, we took inspiration from the probabilistic framework developed for user models for predicting user clicks in single-query search [4-6, 10]. These models are rooted in the cascade model proposed by Craswell et al. [3], which says that users examine documents from top to bottom until the first click, after which they never go back. In this paper we extend the cascade model to query-sessions. However, in contrast to the click chain model [5] and the complex searcher model [10], we do not consider clicks but examinations, where an examination can be interpreted classically as the examination of a retrieved document or, as in the case of the TREC Session track test collection, as the examination of its snippet. This allows us to simplify the derivation of the evaluation measure. We leave to future work the extension of the user model to also consider clicks.
3 NOTATION
The following are the symbols, functions, and random variables used in this paper:

Symbols
M: the length of the query-session.
N: the length of search results.
m: the current reformulation.
n: the current document in the search result.
Q: a collection of query-sessions.
R: a set of runs.
q: a query-session.
r: a run, r ∈ R.
r_n: the document retrieved at rank n.
r_{m,n}: the document at reformulation m and rank n.

Functions
f(r, q): an evaluation measure for r and q.
d(r_{m,n}): returns the discount value for r_{m,n} defined for f.
j(r_{m,n}, q): returns the degree of relevance of the document r_{m,n} for q.

Random Variables
Q: system is queried, Q = {q, q̄}.
E: title and snippet are examined, E = {e, ē}.
L: user leaves topic or search, L = {ℓ, ℓ̈, ℓ̄}.
J: document relevance, J = {j, j̄}.
4 USER MODEL
Users start searching by querying the search system. The system responds with a search result for this initial query. Users then examine the first document in the search result, by which we mean either the examination of the title and snippet of the document or of its full content via clicking. After this examination, users face three options: (1) continue examining the next document in the search result; (2) continue by re-querying the system with a reformulated query; or (3) leave the search, which can occur for two reasons: users have satisfied their information need or they have given up on the search due to frustration. A graphical representation of this user model is shown in the flow-chart in Figure 1.

We formalize this user model by associating to every user decision point a discrete random variable:
Q = {q, q̄} for querying or not querying the system;
E = {e, ē} for examining or not examining a ranked document;
L = {ℓ, ℓ̈, ℓ̄} for leaving search, continuing with a reformulation, or continuing to browse the ranked results, respectively;
J = {j, j̄}, which is the observed relevance of an examined document.
Figure 1: Flow-chart of the proposed user model.
Figure 2: Graphical model of the proposed user model.
Each one of these random variables is indexed with one or two indices: m identifies the query (or the result produced by this query) in the query-session, and n identifies the rank at which the document has been retrieved by the query m. M and N are the lengths of the query-session and of the search result ranking, respectively.
The graphical model in Figure 2 defines (a) the dependence structure among these random variables, and (b) how these variables interact at each examination step. The former formally translates into the following:

$$p(Q_m, E_{m,n}, L_{m,n}) = p(Q_m)\, p(E_{m,n} \mid Q_m)\, p(L_{m,n} \mid E_{m,n}), \quad (1)$$

where we define that the probability of leaving depends on the examination of the document, and the probability of examining a document in the search ranking depends on the outcome of the querying of the system. The latter is presented in the following paragraphs.
We dene the probability of querying the system
p
(
Qm
)in func-
tion of the random variables associated to the previous query (
m
1),
in the following recursive way:
p(Q1=q)=1
p(Qm=q) =
N
X
n=1
p(Qm1=q,Lm1,n=Ü
).(2)
The rst equation says that the probability of issuing the rst query
is certain. The second equation says that the probability of querying
the system at step
m
is equal to the sum over the documents of the
search result for the query
m
1of the joint probability between
having queried the system and having left the search result in order
to query the system with a reformulated query.
We dene the probability of examining a document
P
(
Em,n
)
in function of the random variables associated to the previous
document (n1), in the following recursive way:
p(Em,1=e|Qm=q)=1
p(Em,n=e|Qm=q) = p(Em,n1=e,Lm,n1=|Qm=q).(3)
The rst equation says that the probability of examining the rst
document is certain. The second equation says that the probability
of examining a document at rank
n
is equal to the joint probability
between having examined a document at rank
n
1and then
continued search by not leaving the search result.
The next two propositions will be useful in the next section for developing an evaluation measure. The first proposition is:

$$p(E_{m,n} = e \mid Q_m = \bar{q}) = 0. \quad (4)$$

This is the probability of examining a document if the user has not issued any query, which should be 0. The second proposition is:

$$p(L_{m,n} = \ell \mid E_{m,n} = \bar{e}) = 1. \quad (5)$$

This is the probability of leaving search if the user has not examined the current document, which should be 1.
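To make the user model concrete, the following is a small Monte Carlo sketch of the browsing process just described, under the constant-probability assumption introduced in Section 5.2; the function name, the sampling scheme, and the default values of alpha and beta are ours, chosen only for illustration:

```python
import random

def simulate_session(alpha=0.55, beta=0.20, max_queries=15, max_rank=61):
    """Simulate one user session; returns the list of (m, n) pairs examined.
    alpha: P(continue in current ranking), beta: P(reformulate),
    1 - alpha - beta: P(leave search). Probabilities assumed constant."""
    examined = []
    m = 1
    while m <= max_queries:               # Q_m = q: a query is issued
        for n in range(1, max_rank + 1):  # E_{m,1} = e with probability 1
            examined.append((m, n))
            u = random.random()
            if u < alpha:                 # continue browsing this ranking
                continue
            elif u < alpha + beta:        # leave the ranking to reformulate
                break
            else:                         # leave the search altogether
                return examined
        else:
            return examined               # ranking exhausted without reformulating
        m += 1
    return examined
```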
5 EVALUATION MEASURES
Most utility-based measures compute the effectiveness of a search engine as the product between a discount function d and a relevance function j, as follows [2]:

$$f(r, q) = \sum_{n=1}^{N} d(r_n) \cdot j(r_n, q), \quad (6)$$

where r_n is the document retrieved at rank n on a search result r returned by the search engine given as input a query q and a collection of documents. However, f is not suitable to evaluate
query-sessions, because it is limited to a single query, while sessions may span a number of queries.
To overcome this limitation, we generalize Eq. (6) as follows:

$$f(r, q) = \sum_{m=1}^{M} \sum_{n=1}^{N} d(r_{m,n}) \cdot j(r_{m,n}, q_m), \quad (7)$$

where r_{m,n} is the document retrieved at rank n in the search result returned by the search engine given as input the query q_m and a collection of documents, and q_m is one of the reformulations of the query-session q. This generalization expands the set of actions that can be taken into account by the discount function d. In Eq. (6), d can only model actions a user can take within a single search result; in Eq. (7), d extends this set of actions by including actions a user can take over a query-session.
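Read as code, Eq. (7) says that any measure of this family is an element-wise product of a discount matrix and a relevance matrix; a minimal sketch, with discount_fn and relevance as our own illustrative names:

```python
import numpy as np

def session_measure(relevance, discount_fn):
    """Generic session measure of Eq. (7).
    relevance[m-1, n-1] = j(r_{m,n}, q_m);
    discount_fn(m, n) = d(r_{m,n}), with 1-based indices m and n."""
    M, N = relevance.shape
    discount = np.array([[discount_fn(m, n) for n in range(1, N + 1)]
                         for m in range(1, M + 1)])
    return float((discount * relevance).sum())
```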
5.1 User Model Driven Derivation of a Discount Function
Based on the user model developed in Section 4, we define the discount function for a document to be equal to its probability of being examined. Formally, this can be written as:

$$d(r_{m,n}) = p(E_{m,n} = e), \quad (8)$$

where r_{m,n} is the document retrieved by the search engine for the query q_m at rank n.
Based on Eq. 1, Eq. 8 can be expanded as:

$$p(E_{m,n} = e) = \sum_{Q_m} \sum_{L_{m,n}} p(Q_m, E_{m,n} = e, L_{m,n})$$
$$= \sum_{Q_m} \sum_{L_{m,n}} p(Q_m)\, p(E_{m,n} = e \mid Q_m)\, p(L_{m,n} \mid E_{m,n} = e)$$
$$= \sum_{Q_m} p(Q_m)\, p(E_{m,n} = e \mid Q_m) \underbrace{\sum_{L_{m,n}} p(L_{m,n} \mid E_{m,n} = e)}_{1}$$
$$= p(Q_m = q)\, p(E_{m,n} = e \mid Q_m = q),$$

where the last equality is simplified by using Eq. 4.
Now we need to derive p(Q_m = q) and p(E_{m,n} = e | Q_m = q). We start from the latter. In order to estimate p(E_{m,n} = e | Q_m = q), according to Eq. 3 we need to quantify p(E_{m,n} = e, L_{m,n} = ℓ̄ | Q_m = q). Based on Eq. 1 we obtain:

$$p(E_{m,n} = e, L_{m,n} = \bar{\ell} \mid Q_m = q) = p(E_{m,n} = e \mid Q_m = q)\, \underbrace{p(L_{m,n} = \bar{\ell} \mid E_{m,n} = e)}_{\alpha_{m,n}},$$
where, for the sake of clarity, we denote by α_{m,n} the probability of not leaving the search result given that the document for the query m at rank n has been examined. Substituting this equation into Eq. 3, we can simplify it as follows:

$$p(E_{m,n} = e \mid Q_m = q) = \prod_{n'=1}^{n-1} \alpha_{m,n'}. \quad (9)$$

This is possible thanks to a property of the product operator, which returns 1 when its upper bound is lower than its lower bound.
In order to estimate p(Q_m = q), according to Eq. 2 we need to quantify the sum over n of p(Q_m = q, L_{m,n} = ℓ̈). Based on Eq. 1 we obtain:

$$\sum_{n=1}^{N} p(Q_m = q, L_{m,n} = \ddot{\ell}) = \sum_{n=1}^{N} \sum_{E_{m,n}} p(Q_m = q, E_{m,n}, L_{m,n} = \ddot{\ell})$$
$$= p(Q_m = q) \sum_{n=1}^{N} \sum_{E_{m,n}} p(E_{m,n} \mid Q_m = q)\, p(L_{m,n} = \ddot{\ell} \mid E_{m,n})$$
$$= p(Q_m = q) \sum_{n=1}^{N} p(E_{m,n} = e \mid Q_m = q)\, \underbrace{p(L_{m,n} = \ddot{\ell} \mid E_{m,n} = e)}_{\beta_{m,n}},$$
where the last equality is simplified by using Eq. 5. For the sake of clarity, we denote by β_{m,n} the probability of continuing the search by reformulating the query given that the document for the query m at rank n has been examined. Substituting this equation into Eq. 2, we can simplify it as follows:

$$p(Q_m = q) = \prod_{m'=1}^{m-1} \sum_{n=1}^{N} \prod_{n'=1}^{n-1} \alpha_{m',n'}\, \beta_{m',n}. \quad (10)$$
Combining Eq. 9 and Eq. 10, we can now compute the discount function as:

$$d(r_{m,n}) = \prod_{m'=1}^{m-1} \sum_{n'=1}^{N} \prod_{n''=1}^{n'-1} \alpha_{m',n''}\, \beta_{m',n'} \prod_{n'=1}^{n-1} \alpha_{m,n'}. \quad (11)$$

Two probabilities need to be estimated in order to calculate this discount function: the ones denoted by α_{m,n} and β_{m,n}.
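A direct, unoptimized transcription of Eq. (11) may help as a reference point before the simplification of the next subsection; here alpha and beta are passed as M x N arrays, and all names are ours:

```python
import numpy as np

def discount_general(m, n, alpha, beta):
    """Eq. (11): probability that the document at reformulation m, rank n is
    examined, given alpha[m-1, n-1] = P(continue browsing | examined) and
    beta[m-1, n-1] = P(reformulate | examined). Indices m, n are 1-based."""
    N = alpha.shape[1]
    # Probability of reaching query m: product over the previous queries of the
    # probability of leaving that ranking via a reformulation (Eq. 10).
    p_query = 1.0
    for mp in range(1, m):                              # m' = 1 .. m-1
        total = 0.0
        for np_ in range(1, N + 1):                     # n' = 1 .. N
            prod_alpha = np.prod(alpha[mp - 1, :np_ - 1])   # empty product = 1
            total += prod_alpha * beta[mp - 1, np_ - 1]
        p_query *= total
    # Probability of examining rank n within query m (Eq. 9).
    p_exam = np.prod(alpha[m - 1, :n - 1])
    return p_query * p_exam
```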
5.2 Session Rank-Biased Precision
In order to estimate the two probabilities α_{m,n} and β_{m,n}, we make the simple assumption that these probabilities do not change over m and n, i.e., they are constant. If these probabilities do not change (α_{m,n} = α and β_{m,n} = β), we can simplify Eq. 11 as:

$$d(r_{m,n}) = \left( \sum_{n'=1}^{N} \alpha^{n'-1} \beta \right)^{m-1} \alpha^{n-1}.$$

Moreover, we can observe that by taking the limit of N to infinity, the discount function simplifies as follows:

$$d(r_{m,n}) = \left( \frac{\beta}{1 - \alpha} \right)^{m-1} \alpha^{n-1}.$$
We can then interpret α and β as the probabilities of, after having examined a document, continuing the search by not leaving the current ranking and by leaving the current ranking to reformulate, respectively. The sum of these probabilities plus the probability of leaving search, p(L_{m,n} = ℓ | E_{m,n} = e), which we denote by γ, has, by definition, to be one:

$$\alpha + \beta + \gamma = 1.$$

This means that the range of values these probabilities can take is restricted by this last equation. To avoid this problem, and to give a more human-friendly interpretation of these parameters, we apply the following substitutions:

$$\alpha = bp, \qquad \beta = (1 - b)p, \qquad \gamma = 1 - p,$$
where we name b ∈ [0, 1] as the balance parameter, which balances between reformulating queries and examining more documents
in the search result, and we name p ∈ [0, 1] as the persistence parameter, because it is similar to the persistence parameter of RBP [11], which defines the persistence of users in continuing the search. Applying these substitutions to the discount function, we obtain:

$$d(r_{m,n}) = \left( \frac{p - bp}{1 - bp} \right)^{m-1} (bp)^{n-1}. \quad (12)$$

It turns out that the sum of the discount values so defined over a query-session of infinite length is equal to 1/(1 − p). This value can be used as a normalization factor, similarly to how RBP is normalized.
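The 1/(1 − p) total can be verified numerically by summing Eq. (12) over a large grid of m and n; a quick check, assuming for illustration the values b = 0.64 and p = 0.86 learned later in Section 6:

```python
import numpy as np

b, p = 0.64, 0.86
M, N = 2000, 2000                  # "infinite" session, truncated far beyond convergence
m = np.arange(1, M + 1)
n = np.arange(1, N + 1)
session_factor = ((p - b * p) / (1 - b * p)) ** (m - 1)
rank_factor = (b * p) ** (n - 1)
total = session_factor.sum() * rank_factor.sum()
print(total, 1 / (1 - p))          # both are approximately 7.1429
```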
Based on these observations, substituting the discount function into Eq. 7, we define the new evaluation measure sRBP as follows:

$$\mathrm{sRBP}(r, q) = (1 - p) \sum_{m=1}^{M} \left( \frac{p - bp}{1 - bp} \right)^{m-1} \sum_{n=1}^{N} (bp)^{n-1} \cdot j(r_{m,n}, q).$$

When b = 1, sRBP simplifies to RBP for the first query and ignores the rest of the reformulations in the query-session (this is the case since 0^0 = 1 while 0^n = 0 for n > 0). When b = 0, sRBP scores only the first document of every reformulation.

We can now show that sRBP is a generalization of RBP. If we set b = 1 and consider a query-session with only a single query (|q| = M = 1), sRBP is equal to:

$$\mathrm{sRBP}_{b=1}(r, q) = (1 - p) \sum_{n=1}^{N} p^{n-1} \cdot j(r_n, q) = \mathrm{RBP}(r, q).$$

This equality not only demonstrates that this user model is consistent with the user model at the base of RBP, but also provides additional intuition about RBP, since this derivation is grounded in probability theory.
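A minimal sketch of sRBP as defined above, together with the b = 1, M = 1 reduction to RBP used as a sanity check; function and argument names are ours, and relevance is assumed to be given as an M x N array of graded judgments:

```python
import numpy as np

def srbp(relevance, b=0.64, p=0.86):
    """Session RBP. relevance[m-1, n-1] = j(r_{m,n}, q); b is the balance
    parameter, p the persistence parameter."""
    M, N = relevance.shape
    m = np.arange(1, M + 1)
    n = np.arange(1, N + 1)
    session_weight = ((p - b * p) / (1 - b * p)) ** (m - 1)
    rank_weight = (b * p) ** (n - 1)
    discount = session_weight[:, None] * rank_weight[None, :]
    return float((1 - p) * (discount * relevance).sum())

def rbp(relevance, p=0.86):
    """Standard RBP over a single ranking (1-D relevance vector)."""
    n = np.arange(1, len(relevance) + 1)
    return float((1 - p) * ((p ** (n - 1)) * np.asarray(relevance)).sum())

# Sanity check: with b = 1 and a single query, sRBP reduces to RBP.
rel = np.array([[1, 0, 1, 0, 0, 1]])
assert abs(srbp(rel, b=1.0, p=0.8) - rbp(rel[0], p=0.8)) < 1e-12
```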
6 EXPERIMENTS
This experimental section aims to answer the three research questions presented in the introduction. The software used in this paper is available on the website of the first author.
6.1 Material
We used the 2014 TREC Session track test collection [1]. This dataset contains 1257 query-sessions, including queries and reformulations, ranked results from a search engine, and clicks. Out of these 1257 query-sessions, only 101 have been judged for relevance, using a pool of retrieved documents from 73 participating teams. The judgment process produced 16,949 judged documents. These 101 judged query-sessions span 52 unique topics.
6.2 Experimental Setup
To evaluate the quality of sRBP with respect to sDCG in predicting the expected user behaviour (RQ1), we compare their user models with the user behaviour observed over the 1257 query-sessions. In order to observe the examination of a document, we assume that if a document at rank n has been clicked, then all the documents with rank lower than or equal to n have been examined. Using these query-sessions, we compute the probability of a user examining a document at rank n for a query m; we do this for every m and n. These probabilities are reported in Table 1.
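The observed examination table can be derived from click logs roughly as follows, under the stated assumption that a click at rank n implies examination of ranks 1 to n. The session data structure below is hypothetical (not the actual TREC Session track log format), reformulations without clicks contribute no examinations here, and normalizing the counts over the whole table to obtain a joint distribution over (m, n) is our reading of how Table 1 is built:

```python
import numpy as np

def examination_probabilities(sessions, M=15, N=61):
    """sessions: list of sessions; each session is a list of reformulations,
    and each reformulation is the list of clicked ranks (1-based).
    Returns an M x N matrix of examination probabilities (rows: reformulation m,
    columns: rank n); Table 1 in the paper shows the transpose."""
    counts = np.zeros((M, N))
    for session in sessions:
        for m, clicked_ranks in enumerate(session[:M], start=1):
            if not clicked_ranks:
                continue                       # no clicks: no observed examinations
            deepest = min(max(clicked_ranks), N)
            counts[m - 1, :deepest] += 1       # ranks 1..deepest were examined
    return counts / counts.sum()
```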
The observed user behaviour is compared against the user models of the two session-based evaluation measures, sDCG and sRBP. To compute the probability of a user examining a document at rank n, we use the discount functions of the two evaluation measures. The discount function for sDCG is:

$$d_{\mathrm{sDCG}}(r_{m,n}) = \frac{1}{(1 + \log_{b_q} m)\, \log_b(n + 1)},$$

while the discount function for sRBP is given in Eq. 12. To compute the probability of a user examining a document at rank n for a query m, we compute the discount function, which is then normalized by its sum in order to generate a probability distribution over the queries (m) and documents (n).
To compare the user models with the observed user behaviour, we use the following three measures of error: Total Squared Error (TSE), Total Absolute Error (TAE), and Kullback-Leibler Divergence (KLD). All our evaluation measures have parameters. To find the best parameter values, we perform a grid search minimizing the TSE measure, with a grid step of 0.01. For sDCG, we search the parameter values in their recommended ranges [8], 1 < b_q ≤ 1000 and 1 < b ≤ 20. For sRBP, we search the parameter values in their full ranges, 0 ≤ p ≤ 1 and 0 ≤ b ≤ 1.
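A sketch of the parameter fit for sRBP: build the normalized model distribution for a candidate (b, p), compare it with the observed distribution, and keep the parameters minimizing TSE. The error definitions below are the standard ones and may differ in detail from the authors' implementation:

```python
import numpy as np

def srbp_distribution(b, p, M, N):
    """Normalized sRBP discounts (Eq. 12) as a distribution over (m, n)."""
    m = np.arange(1, M + 1)
    n = np.arange(1, N + 1)
    d = (((p - b * p) / (1 - b * p)) ** (m - 1))[:, None] * ((b * p) ** (n - 1))[None, :]
    return d / d.sum()

def tse(obs, model):
    return float(((obs - model) ** 2).sum())

def tae(obs, model):
    return float(np.abs(obs - model).sum())

def kld(obs, model, eps=1e-12):
    return float((obs * np.log((obs + eps) / (model + eps))).sum())

def fit_srbp(observed, step=0.01):
    """Grid search over b and p minimizing TSE against the observed M x N
    examination distribution; returns (best_b, best_p, best_error)."""
    M, N = observed.shape
    best = (None, None, np.inf)
    for b in np.arange(step, 1.0, step):    # endpoints excluded to stay non-degenerate
        for p in np.arange(step, 1.0, step):
            err = tse(observed, srbp_distribution(b, p, M, N))
            if err < best[2]:
                best = (b, p, err)
    return best
```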
In order to compare the quality of the single-query measures, RBP and DCG, against the two session-based measures, we perform an experiment similar to the one above. However, in this case we consider every query and reformulation in the query-sessions as an individual query, assuming each is an independent event. The learning strategy, the ranges of the parameters, and the measure of error used during learning are the same as above. This experiment, in addition to showing how the single-query measures compare against the observed user behaviour, will also show how the learned parameters change when we evaluate engines without query-session information.

To analyze whether single-query measures provide a different perspective with respect to session-based measures (RQ2), we perform a correlation analysis between the measures DCG and sDCG, and between RBP and sRBP. We use Kendall's tau correlation coefficient. This analysis is performed using the parameters learned in the previous experiment.
We do this first over the 101 judged query-sessions, and then over the combination of the 73 search results and the 101 judged query-sessions. In the former case, when comparing query-sessions, for the single-query measures we use two approaches, as also mentioned in the introduction: (i) we evaluate only the last reformulation, and (ii) we evaluate all queries and reformulations. When presenting results, we refer to these two approaches as "RBP (i)" and "RBP (ii)" or "DCG (i)" and "DCG (ii)".

Finally, we compare sRBP against sDCG (RQ3) by performing the same correlation analysis as done for RQ2. This analysis will inform us about how similar the information provided by sRBP is to that provided by sDCG.
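The correlation analysis itself is straightforward with scipy's kendalltau; a sketch, assuming the per-session (or per-run) score vectors for each measure have already been computed:

```python
from itertools import combinations
from scipy.stats import kendalltau

def correlation_table(scores):
    """scores: dict mapping a measure name (e.g. 'RBP (i)', 'sRBP') to the list
    of its scores over the same set of query-sessions or search results.
    Returns Kendall's tau for every pair of measures, as in Tables 6 and 7."""
    table = {}
    for a, b in combinations(scores, 2):
        tau, _ = kendalltau(scores[a], scores[b])
        table[(a, b)] = tau
    return table
```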
6.3 Results
6.3.1 RQ1: Model accuracy. We first address the question of whether our user model provides an accurate prediction of actual user behaviour. Table 1 shows the observed user behaviour measured on query-sessions, while Tables 2 and 3 show the predictions made by the sRBP and sDCG user models. Comparing the model predictions (in Table 2 and Table 3) by eye, we can clearly see that sRBP better
Table 1: Observed user behaviour on the Session Track 2014 query-sessions. Every cell contains the probability of a user examining a document retrieved at the n-th rank (rows) in the search result produced by the m-th reformulation (columns).
n\m 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 0.1598 0.0968 0.0698 0.0462 0.0286 0.0149 0.0086 0.0045 0.0020 0.0015 0.0008 0.0006 0.0005 0.0003 0.0001
2 0.0429 0.0290 0.0154 0.0096 0.0055 0.0026 0.0011 0.0006 0.0003 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001
3 0.0326 0.0218 0.0123 0.0080 0.0045 0.0019 0.0009 0.0006 0.0003 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001
4 0.0246 0.0179 0.0097 0.0063 0.0038 0.0015 0.0008 0.0005 0.0003 0.0001 0.0000 0.0000 0.0001 0.0000 0.0000
5 0.0194 0.0147 0.0086 0.0050 0.0033 0.0014 0.0006 0.0004 0.0003 0.0001 0.0000 0.0000 0.0001 0.0000 0.0000
6 0.0162 0.0120 0.0070 0.0042 0.0027 0.0011 0.0006 0.0003 0.0003 0.0001 0.0000 0.0000 0.0001 0.0000 0.0000
7 0.0135 0.0102 0.0061 0.0037 0.0027 0.0011 0.0006 0.0003 0.0003 0.0000 0.0000 0.0000 0.0001 0.0000 0.0000
8 0.0111 0.0086 0.0052 0.0033 0.0027 0.0010 0.0006 0.0003 0.0003 0.0000 0.0000 0.0000 0.0001 0.0000 0.0000
9 0.0088 0.0073 0.0046 0.0033 0.0026 0.0009 0.0006 0.0001 0.0003 0.0000 0.0000 0.0000 0.0001 0.0000 0.0000
10 0.0063 0.0057 0.0042 0.0031 0.0026 0.0009 0.0005 0.0001 0.0003 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ···
61 0.0000 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Table 2: Normalized sRBP discount values (b = 0.64 and p = 0.86). This table is read in the same way as Table 1.
n\m 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 0.1504 0.1019 0.0690 0.0467 0.0316 0.0214 0.0145 0.0098 0.0066 0.0045 0.0030 0.0021 0.0014 0.0009 0.0006
2 0.0806 0.0545 0.0369 0.0250 0.0169 0.0115 0.0078 0.0053 0.0036 0.0024 0.0016 0.0011 0.0007 0.0005 0.0003
3 0.0431 0.0292 0.0198 0.0134 0.0091 0.0061 0.0042 0.0028 0.0019 0.0013 0.0009 0.0006 0.0004 0.0003 0.0002
4 0.0231 0.0156 0.0106 0.0072 0.0049 0.0033 0.0022 0.0015 0.0010 0.0007 0.0005 0.0003 0.0002 0.0001 0.0001
5 0.0124 0.0084 0.0057 0.0038 0.0026 0.0018 0.0012 0.0008 0.0005 0.0004 0.0003 0.0002 0.0001 0.0001 0.0001
6 0.0066 0.0045 0.0030 0.0021 0.0014 0.0009 0.0006 0.0004 0.0003 0.0002 0.0001 0.0001 0.0001 0.0000 0.0000
7 0.0035 0.0024 0.0016 0.0011 0.0007 0.0005 0.0003 0.0002 0.0002 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000
8 0.0019 0.0013 0.0009 0.0006 0.0004 0.0003 0.0002 0.0001 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
9 0.0010 0.0007 0.0005 0.0003 0.0002 0.0001 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
10 0.0005 0.0004 0.0002 0.0002 0.0001 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ···
61 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Table 3: Normalized sDCG discount values (b_q = 1.07 and b = 4.54). This table is read in the same way as Table 1.
n\m 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 0.0489 0.0032 0.0021 0.0017 0.0014 0.0013 0.0012 0.0011 0.0011 0.0010 0.0010 0.0009 0.0009 0.0009 0.0009
2 0.0309 0.0020 0.0013 0.0010 0.0009 0.0008 0.0008 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005
3 0.0245 0.0016 0.0010 0.0008 0.0007 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005 0.0005 0.0005 0.0004 0.0004
4 0.0211 0.0014 0.0009 0.0007 0.0006 0.0006 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004
5 0.0189 0.0012 0.0008 0.0006 0.0006 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003 0.0003
6 0.0174 0.0011 0.0007 0.0006 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0003 0.0003 0.0003 0.0003 0.0003
7 0.0163 0.0011 0.0007 0.0006 0.0005 0.0004 0.0004 0.0004 0.0004 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003
8 0.0154 0.0010 0.0007 0.0005 0.0005 0.0004 0.0004 0.0004 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003
9 0.0147 0.0010 0.0006 0.0005 0.0004 0.0004 0.0004 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003
10 0.0141 0.0009 0.0006 0.0005 0.0004 0.0004 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003
··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ··· ···
61 0.0082 0.0005 0.0003 0.0003 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0001 0.0001
characterizes the observed user behaviour. The measures of error comparing these user models are provided in Table 4, which shows that the visual inspection is correct: sRBP gives an order of magnitude better prediction than sDCG.

Table 5 shows the errors measured on the transformed query-sessions, where the query and reformulations of each session are treated independently. This indicates that the user models behind sRBP and sDCG are no worse than their single-query variants.

Figure 3 shows the sensitivity of the sRBP parameters. In the first plot we depict the full TSE landscape; the two black lines identify the points of minimum for the two parameters b and p. In the second and third plots we show how the error changes when varying b (p) while fixing p (b) to its optimal value. The sRBP parameter sensitivity analysis shows that the TSE landscape has a convex shape for sRBP. This is not true for sDCG; its TSE landscape is concave.
Figure 3: Sensitivity plots of sRBP parameters. The two lines in the first plot on the left delimit the points where the gradient is maximum. The two plots on the right are 2D projections of the first plot on the left, made by fixing a dimension to its best parameter value.
Figure 4: Scatter plots of evaluation measures over the 101 judged query-sessions. Every point is a query-session.
Figure 5: Scatter plots of evaluation measures over the 73 search results combined with the 101 judged query-sessions. Every point is a search result.
The convexity of the error landscape guarantees the stability of the learned parameters. In particular, we notice that the best parameter value for p is in the optimal range suggested by Moffat and Zobel [11] for a standard web user (0.8). We conclude that session-based models better predict the expected user behaviour and that, among the evaluated models, sRBP performs the best.
6.3.2 RQ2: Single-query versus session measures. Next, we address the use of single-query measures versus full-session measures for the task of evaluating systems. In Figures 4 and 5, the first two plots (from the left) show the results obtained when comparing the single-query evaluation measures against the session-based measures, on query-sessions and on search results respectively.
Table 4: User models against the observed user behaviour on the Session Track 2014 query-sessions.
Parameters TSE TAE KLD
sRBP b = 0.64, p = 0.86 0.0046 0.4950 0.9475
sDCG b_q = 1.07, b = 4.54 0.0362 1.3357 2.2710
Table 5: User models against the observed user behaviour on the Session Track 2014, treating every query and reformulation of the query-sessions as independent.
Parameters TSE TAE KLD
RBP p = 0.59 0.0252 0.4242 0.6624
DCG b = 1.29 0.1521 1.2162 1.5035
sRBP b = 0.92, p = 0.64 0.0252 0.4238 0.6679
sDCG b_q = 1.01, b = 1.26 0.1521 1.2162 1.5035
Table 6: Kendall’s tau correlations over query-sessions.
RBP (ii) DCG (i) DCG (ii) sRBP sDCG
RBP (i) 0.675 0.906 0.666 0.555 0.532
RBP (ii) - 0.659 0.869 0.801 0.747
DCG (i) - - 0.702 0.532 0.560
DCG (ii) - - - 0.746 0.780
sRBP - - - - 0.772
Table 7: Kendall’s tau correlations over search results.
DCG sRBP sDCG
RBP 0.293 0.843 0.270
DCG - 0.315 0.950
sRBP - - 0.290
The last plot on the right compares the two query-session-based evaluation measures. In Tables 6 and 7 we report the Kendall's tau correlation coefficients over all possible pairs of evaluation measures, on query-sessions and on search results respectively.
The evaluation of query-sessions varies a great deal between single-query measures and session-based measures (Figure 4). The correlation between RBP (i) and sRBP, and between DCG (i) and sDCG, is low (0.555 and 0.560). However, when considering the second evaluation approach, RBP (ii) and DCG (ii), the correlation increases (0.801 and 0.780). This suggests that the first part of the session provides different information than the last reformulation alone (as in (i)), and that considering the full session (as in (ii)) correlates better with session-based measures.

When evaluating search results, we observe that the correlation between RBP and sRBP, and between DCG and sDCG, is higher (0.843 and 0.950). However, although DCG and sDCG correlate well, this is not true for RBP and sRBP. We conclude that query-session evaluation measures provide a different perspective when evaluating sessions or search results. In particular, we observe that the difference between how sRBP ranks search results with respect to RBP is wider than the one we would get using sDCG with respect to DCG.
6.3.3 RQ3: sRBP versus sDCG. Finally, we compare the two session-based measures to each other. The last plots on the right in Figures 4 and 5 compare the two query-session-based evaluation measures on query-sessions and search results.

sRBP and sDCG are very different in ranking sessions (Figure 4). Their correlation coefficient is 0.772, which is similar to the correlations between RBP (i) and DCG (i) and between RBP (ii) and DCG (ii) (0.659 and 0.869). This difference is particularly exacerbated when evaluating search results (Figure 5): their correlation coefficient is in this case much lower (0.290). However, it is again consistent with the correlation between RBP and DCG (0.293). We conclude that sRBP and sDCG provide two different evaluation perspectives, and that they are consistent with how RBP differs from DCG in single-query evaluation.
7 CONCLUSION
In this paper we have developed a user model for query-sessions under a well-defined probabilistic framework. This user model can easily be extended by making more realistic assumptions about the probabilities of leaving search, continuing to examine the documents of the search result, and continuing to reformulate a query. Under simplifying assumptions, namely that these probabilities are constant over time and independent of the relevance of the examined documents, we have, on the one hand, derived a new session-based evaluation measure, sRBP, and, on the other hand, demonstrated that this user model well approximates the expected behaviour as measured on the 2014 TREC Session track query-sessions, and justified its existence by showing that this evaluation measure provides a different perspective with respect to sDCG.
ACKNOWLEDGMENTS
This project was funded by the EPSRC Fellowship titled "Task Based
Information Retrieval", grant reference number EP/P024289/1.
REFERENCES
[1] Ben Carterette, Evangelos Kanoulas, Mark Hall, and Paul Clough. 2014. Overview of the TREC 2014 Session Track. Technical Report.
[2] Praveen Chandar and Ben Carterette. 2012. Using Preference Judgments for Novel Document Retrieval. In Proc. of SIGIR.
[3] Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. 2008. An Experimental Comparison of Click Position-bias Models. In Proc. of WSDM.
[4] Georges Dupret and Ciya Liao. 2010. A Model to Estimate Intrinsic Document Relevance from the Clickthrough Logs of a Web Search Engine. In Proc. of WSDM.
[5] Fan Guo, Chao Liu, Anitha Kannan, Tom Minka, Michael Taylor, Yi-Min Wang, and Christos Faloutsos. 2009. Click Chain Model in Web Search. In Proc. of WWW.
[6] Fan Guo, Chao Liu, and Yi-Min Wang. 2009. Efficient Multiple-click Models in Web Search. In Proc. of WSDM.
[7] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated Gain-based Evaluation of IR Techniques. ACM Trans. Inf. Syst. 20, 4 (2002).
[8] Kalervo Järvelin, Susan L. Price, Lois M. L. Delcambre, and Marianne Lykke Nielsen. 2008. Discounted Cumulated Gain Based Evaluation of Multiple-Query IR Sessions. In Proc. of ECIR.
[9] Evangelos Kanoulas, Ben Carterette, Paul D. Clough, and Mark Sanderson. 2011. Evaluating Multi-query Sessions. In Proc. of SIGIR.
[10] David Maxwell, Leif Azzopardi, Kalervo Järvelin, and Heikki Keskustalo. 2015. Searching and Stopping: An Analysis of Stopping Rules and Strategies. In Proc. of CIKM.
[11] Alistair Moffat and Justin Zobel. 2008. Rank-biased Precision for Measurement of Retrieval Effectiveness. ACM Trans. Inf. Syst. 27, 1 (2008).
[12] Anne Schuth, Floor Sietsma, Shimon Whiteson, and Maarten de Rijke. 2014. Optimizing Base Rankers Using Clicks. In Proc. of ECIR.
[13] Ellen M. Voorhees, Donna K. Harman, et al. 2005. TREC: Experiment and Evaluation in Information Retrieval. Vol. 1. MIT Press, Cambridge.
[14] Dietmar Wolfram, Amanda Spink, Bernard J. Jansen, and Tefko Saracevic. 2001. Vox Populi: The Public Searching of the Web. JASIST 52, 12 (2001).
[15] Yiming Yang and Abhimanyu Lad. 2009. Modeling Expected Utility of Multi-session Information Distillation. In Proc. of ICTIR.