
Evaluating Multi-Query Sessions

Evangelos Kanoulas∗, Ben Carterette†, Paul D. Clough∗, Mark Sanderson‡

e.kanoulas@shef.ac.uk, carteret@cis.udel.edu,

p.d.clough@shef.ac.uk, mark.sanderson@rmit.edu.au

∗Information School, University of Shefﬁeld, Shefﬁeld, UK

†Department of Computer & Information Sciences, University of Delaware, Newark, DE, USA

‡School of Computer Science & Information Technology, RMIT, Melbourne, Australia

ABSTRACT

The standard system-based evaluation paradigm has focused on assessing the performance of retrieval systems in serving the best results for a single query. Real users, however, often begin an interaction with a search engine with a sufficiently under-specified query that they will need to reformulate before they find what they are looking for. In this work we consider the problem of evaluating retrieval systems over test collections of multi-query sessions. We propose two families of measures: a model-free family that makes no assumption about the user's behavior over a session, and a model-based family with a simple model of user interactions over the session. In both cases we generalize traditional evaluation metrics such as average precision to multi-query session evaluation. We demonstrate the behavior of the proposed metrics by using the new TREC 2010 Session track collection and simulations over the TREC-9 Query track collection.

Categories and Subject Descriptors: H.3.4 [Informa-

tion Storage and Retrieval] Performance Evaluation

General Terms: Experimentation, Measurement

Keywords: information retrieval, test collections, evalua-

tion, sessions

1. INTRODUCTION

Evaluation measures play a critical role in the develop-

ment of retrieval systems, both as measures in compara-

tive evaluation experiments and as objective functions for

optimizing system eﬀectiveness. The standard evaluation

paradigm has focused on assessing the performance of re-

trieval systems in serving the best results for a single query,

for varying deﬁnitions of “best”: for ad hoc tasks, the most

relevant results; for diversity tasks, the results that do the

best job of covering a space of information needs; for known-

item tasks, the single document the user is looking for. There

are many test collections for repeatable experiments on these

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for proﬁt or commercial advantage and that copies

bear this notice and the full citation on the ﬁrst page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior speciﬁc

permission and/or a fee.

SIGIR’11, July 24–28, 2011, Beijing, China.

Copyright 2011 ACM 978-1-4503-0757-4/11/07 ...$5.00.

tasks, and dozens of evaluation measures assessing diﬀerent

aspects of task eﬀectiveness.

Real users, however, often begin an interaction with a

search engine with a query that they will need to reformulate

one or more times before they ﬁnd what they are looking for.

Early studies of web search query logs showed that about half of all Web users reformulated their initial query: 52% of the users in the 1997 Excite data set, 45% of the users in the 2001 Excite data set [15].

The standard evaluation paradigm of single-query test col-

lections seems unable to assess the eﬀectiveness of retrieval

systems over sequences of query reformulations. Interactive

evaluation has been employed as an alternative. In inter-

active evaluation the user is part of the evaluation cycle

and freely interacts with the results of a retrieval system.

Measures such as instance recall [11] and session discounted

cumulative gain [6] have been proposed to capture the ef-

fectiveness of systems in these settings. Even though an

interactive evaluation paradigm can better capture the ac-

tual user experience, it is both noisy due to the high degrees

of freedom of user interactions and expensive due to its low

reusability and need for many test subjects. Furthermore,

conducting interactive comparative evaluation experiments

is by no means an easy task.

The TREC 2010 Session track [7] proposed an experiment

for the evaluation of retrieval systems over multi-query ses-

sions. We deﬁned a session as a sequence of reformulations

in the service of satisfying a general information need, and

constructed a test collection of two query reformulations (an

initial and a follow-up query) for each of 150 information

needs. This collection makes compromises for simplicity and

tractability, but it provides a starting point for investigation

of questions about test collection-based session evaluation.

In addition to a test collection, new evaluation measures

are necessary as well. Traditional evaluation measures only

capture per-query eﬀectiveness; they are not necessarily ap-

propriate for evaluating the eﬀectiveness of a retrieval sys-

tem over a multi-query session. While one could evaluate

results for each query in the session in isolation, it may not

be the case that the system is serving results for each query

independently. Doing so would lose potentially valuable in-

formation about the ability of the system to provide results

for the session as a unit, and thus reduce our ability to op-

timize system performance across sessions.

Due to the lack of appropriate measures, Järvelin et al. [6]

extended the normalized discounted cumulative gain (nDCG)

measure to a measure that considers multi-query sessions.

The measure—called normalized session discounted cumula-

tive gain (nsDCG)—discounts documents that appear lower

in the ranked list for a given query as well as documents

that appear after more query reformulations. In a sense the

new model incorporates a cost for reformulating a query as

well as scanning down a ranked list.

The nsDCG measure is computed as follows: for each

query in a series of reformulations, DCG is computed in

isolation of all other queries in the series. Each DCG is

then discounted by a function of the position q of the query in the series. The measure can evaluate the effectiveness

of retrieval systems over multiple queries in an interactive

retrieval scenario, in which a user moves down a ranked

list of documents and at some rank reformulates the query.

Since the reformulation points are known (from observing

the users), DCG is computed at those points for each query

and at the stopping point for the last reformulation. In

a test collection of static sessions, however, reformulation

points are unknown. Using nsDCG requires the selection of

a ﬁxed reformulation cut-oﬀ, which clearly does not reﬂect

the fact that diﬀerent retrieval results may trigger diﬀerent

user behavior. Further, the measure does not model early

abandonment of a query session; our TREC session collec-

tion comes with a ﬁxed number of reformulations, but a

user may choose to abandon the session before reaching the

last reformulation (either due to satisfaction or due to frus-

tration). A multi-query session measure should be able to

model such behavior.

Yang and Lad [16] overcame the need to deﬁne a ﬁxed

reformulation point by deﬁning a session measure as an ex-

pectation over a set of possible browsing paths. Based on

this they proposed a measure of expected utility for a multi-

query information distillation task. Given a series of m reformulations, the proposed measure accounts for all possible browsing paths that end in the kth reformulation. Each

path has a certain probability to be realized by a user. To

deﬁne the probability of a user following a certain path Yang

and Lad [16] follow the rank-biased precision (RBP) framework [10], replacing RBP's stopping condition with a refor-

mulation condition. The utility of each path is a function of

the relevance and novelty of the returned documents being

considered. The system eﬀectiveness is then deﬁned as the

expected utility calculated over the aforementioned proba-

bilistic space of browsing paths. Though the expected utility

solves the problem of variable reformulation points, it still

does not allow early abandonment of the query session.

In this work we consider the problem of evaluating re-

trieval systems over test collections of static multi-query ses-

sions. We propose two families of measures: one that makes

no assumptions about the user’s behavior over a session,

and another with a simple model of user interactions over

the session. In the latter case we provide a general frame-

work to accommodate diﬀerent models of user interactions

in the course of a session, avoiding predeﬁned reformulation

cut-oﬀs and allowing early abandonment. In both cases we

generalize traditional evaluation measures such as average

precision to multi-query session evaluation.

2. MULTI-QUERY SESSION COLLECTION

AND USER MODEL

We define a session test collection as one with a set of topics, each of which consists of a description of an information need and a static sequence of m title queries (an initial query

and m − 1 reformulations), and judgments of the relevance of documents to the topics.

rank   ranking 1 (q1,1)   ranking 2 (q1,2)   ranking 3 (q1,3)
 1     d1   N             d′1   R            d″1   R
 2     d2   N             d′2   R            d″2   R
 3     d3   N             d′3   R            d″3   R
 4     d4   N             d′4   R            d″4   R
 5     d5   N             d′5   R            d″5   R
 6     d6   N             d′6   N            d″6   R
 7     d7   N             d′7   N            d″7   R
 8     d8   N             d′8   N            d″8   R
 9     d9   N             d′9   N            d″9   R
10     d10  N             d′10  N            d″10  R
...    ...                ...                ...

Table 1: Example rankings (document IDs and relevance judgments) for three queries q1,1, q1,2, q1,3 in a session for topic number 1. Here we assume all documents are unique.

For simplicity, we assume that all reformulations are directed towards a single information need, so there is a global set of relevant documents of size R.¹ This definition is similar to those used by Järvelin et al. [6] and Yang & Lad [16], and it is essentially the definition used for the TREC 2010 Session track in the case of specification and generalization reformulations [7].

Following Cooper in his proposal for the expected search length measure [5], we assume a user stepping down the ranked list of results until some decision point. To this we add an additional possible action: the decision point can be either a stopping point for the search, or a point at which the user reformulates their query. Thus, a user experiences results by either moving down a ranking (i.e. moving from rank k to rank k + 1 in ranking ~r_i) or moving to the top of the next ranking by reformulating (i.e. moving from rank k in ranking ~r_i to rank 1 in ranking ~r_{i+1}).

Consider the example in Table 1. A user with a certain information need formulates a query and submits it to a retrieval system. The retrieval system returns a ranked list of documents, ~r_1 = (d1, d2, ..., d10, ...). Suppose the user's first decision point occurs at document d5. After seeing that document, the user decides to reformulate the original query and resubmit it to the retrieval system. The retrieval system responds with a second ranked list of documents, ~r_2 = (d′1, d′2, ..., d′10, ...). The user reads the documents once again from top to bottom and abandons the session. If we only consider the documents the user has examined over the session of reformulations then a final composite ranked list, ~cl, can be composed: ~cl = (d1, d2, ..., d5, d′1, d′2, d′3, ...).

Given the relevance of the documents in the composite ranked list ~cl, any traditional evaluation measure can be calculated in the usual manner. We may require assumptions about the relevance of duplicates, e.g. if d1 and d′2 are the same relevant document, how they should count towards the evaluation measure; we will consider these in Section 5.

¹ In practice users' information needs may change during a search session and over time [1]; assuming it is fixed is a modeling assumption we make for tractability. Some of our measures require this assumption, but it can be relaxed for other measures.

2.1 Evaluation over paths

The composite list ~cl is the outcome of a series of decisions. We define a path ω through the results as a series of decisions to either move down a ranking, reformulate and start at the top of the next ranking, or abandon the search. We assume that at least one document (the first one) is viewed in each ranking. A path of length k is a path that results in k documents viewed. We denote the set of unique paths of length k as Ω_k, and the set of all unique paths as Ω.

A path can be represented as a series of actions, e.g. ω = {down, down, ..., reformulate, down, ..., abandon}; as a series of document IDs viewed, e.g. ω = ~cl above; or as a series of ranks at which reformulations or abandonment occurred, e.g. ω = {5, ...}. The three are equivalent in the sense of providing complete information about the series of decisions; the last, being most compact, is the one we will use.

Different paths result in different documents being seen, and in many cases different numbers of relevant documents. Precision after k documents viewed may result in very different values depending on the path chosen: a user that views 10 documents in ~r_1 (the relevance of which is shown in Table 1) experiences 0 precision, while one that reformulates immediately after the first document in ~r_1 and steps down ~r_2 until rank 9 experiences precision of 5/10. In an interactive evaluation scenario where real users interact with the ranked list of documents returned by a retrieval system, the point at which a user decides either to reformulate their previous query or to abandon the search can be explicitly recorded by observation, or implicitly inferred by looking at (for instance) the last clicked document. In batch laboratory experiments with static sessions, however, the reformulation and stopping points are undefined (there is no user from which to record them). This presents a challenge for defining evaluation measures.

In this work we propose evaluating static sessions by sum-

marizing evaluation results over all paths through the re-

sults. We will consider two directions: one a “model-free”

approach inspired by interpolated precision and recall, the

other a model-based approach that explicitly deﬁnes prob-

abilities for certain actions, then averages over paths. In

both approaches we would like to make as few assumptions

as possible about the reasons a user reformulates.

2.2 Counting paths

The number of possible paths grows fairly fast. Consider a path of length k ending in reformulation number j. For example, the paths of length 4 ending at reformulation 2 are {d1, d2, d3, d′1}, {d1, d2, d′1, d′2}, and {d1, d′1, d′2, d′3}. For any given k and j, we can count the number of possible paths as follows: imagine a ranking as a list of k documents, then place j − 1 reformulation points between any two documents in that list. The number of different places we can insert them is $\binom{k-1}{j-1}$, and this is therefore the number of paths of length k that end at reformulation j.

The total number of paths of length k is:

$$|\Omega_k| = \sum_{j=1}^{m} \binom{k-1}{j-1}$$

This is the definition of elements in the Bernoulli triangle. Its rate of growth is unknown, but it is $O(k^2)$ for m = 2 and approaches $2^{k-1}$ as m increases to k. The total number of paths of any length is $|\Omega| = \sum_{k=1}^{n} |\Omega_k| = O(2^n)$.

        j = 1   j = 2           j = 3
k = 1   {0}     –               –
k = 2   {0}     {1}             –
k = 3   {0}     {1,2}           {2}
k = 4   {0}     {1,2,3}         {2,3,3}
k = 5   {0}     {1,2,3,4}       {2,3,3,4,4,4}
...

Table 2: Relevant document counts for different paths of length k ending at ranking j from the example in Table 1.

On the other hand, if we only consider paths that end at reformulation j but continue down ranked list ~r_j indefinitely, the number is more manageable. We can enumerate these by simply iterating over stopping points k_1 = 1...|~r_1|, and for each of those over stopping points k_2 = 1...|~r_2|, and so on. Within the (j−1)st loop, ω = {k_1, ..., k_{j−1}} is the path to that point. This takes $|\vec{r}_1| \times |\vec{r}_2| \times \cdots \times |\vec{r}_{m-1}| = O(n^m)$ time, which, while not exactly fast, is at least manageable.
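As a concrete illustration of this enumeration, the following short Python sketch (our own; the function name and toy data are not from the paper) iterates over the reformulation cut-offs exactly as described and yields each path, identified by the ranking it ends at together with the cut-offs chosen in the earlier rankings.

from itertools import product

def enumerate_paths(rankings):
    # Yield (j, cutoffs): the reformulation j the path ends at, plus the
    # reformulation cut-offs chosen in rankings 1..j-1, mirroring the
    # nested-loop enumeration described above.
    m = len(rankings)
    for j in range(1, m + 1):
        choices = [range(1, len(r) + 1) for r in rankings[:j - 1]]
        for cutoffs in product(*choices):
            yield j, cutoffs

# toy session: three rankings of three documents each (hypothetical data)
rankings = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]
print(len(list(enumerate_paths(rankings))))   # 1 + 3 + 9 = 13 paths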

3. MODEL-FREE SESSION EVALUATION

Our ﬁrst goal is to develop a method for evaluating the

eﬀectiveness of a system over a set of reformulations mak-

ing no assumptions about when or why users reformulate.

The approach is inspired by interpolated precision: there is

no formal user model behind interpolated precision, but it

reduces the full evaluation data (precision at every rank) to

a manageable set while still providing useful intuition about

system performance, particularly when plotted against re-

call values. Likewise, there is no formal user model behind

these measures, but they give some intuition while greatly

reducing the amount of evaluation data, which as we saw

above grows exponentially.

Consider all paths of length k that end at reformulation j. On each of those paths the user will see a certain number of relevant documents. Let us define a set of relevant counts rR@j,k as the set of counts of relevant documents seen on all such paths.² In the example in Table 1, there is only one possible way for a user to see 4 documents without reformulating, and none of those documents are relevant; therefore rR@1,4 = {0}. There are three ways for a user to see 4 documents over two queries: {d1, d2, d3, d′1}; {d1, d2, d′1, d′2}; {d1, d′1, d′2, d′3}. These paths have 1, 2, and 3 relevant documents respectively. Therefore rR@2,4 = {1,2,3}. All sets rR for j = 1..3 and k = 1..5 are shown in Table 2; the size of a set is $\binom{k-1}{j-1}$ as described above.

² We use boldface to denote sets and italics to denote scalars.

We can then define session versions of precision and recall by dividing the relevant counts rR@j,k by k (for precision) or R (for recall). We will call these rPC@j,k and rRC@j,k. This gives the session generalization of precision and recall: precision and recall for each possible path through the results. In traditional systems-based evaluation there is only one possible path of length k, and precision/recall for that path is identical to precision/recall at rank k.

Precision is sometimes interpolated to a particular recall point r by finding the first rank at which recall r is achieved, then taking the maximum of all precisions at that rank or deeper. Let us consider an analogous process for sessions by defining precision at a particular recall point in a particular reformulation.

[Figure 1: Reformulation precision-recall surface for the example in Table 1. Axes: recall, reformulation, precision.]

[Figure 2: Reformulation precision-recall cross-sections of Figure 1 for the example in Table 1 (panels: ranking 1, ranking 2, ranking 3; axes: recall vs. precision). Note that these are not precision-recall curves for the three rankings independently.]

At recall 1/R in ~r_2, there is a set of possible precision values {1/2, 1/3, 1/4, ...}, each of which is achieved by a user looking at k = 1, 2, 3, ... documents in ranking 1, then reformulating and looking at the first document in ranking 2. At recall 2/R in ~r_2, the set is {2/3, 2/4, 2/5, ...}. Now we will define sPC@r,j as the maximum value of the set of possible precisions at the first rank in ~r_j at which recall r is achieved. This reduces the amount of data to m·R precision values (with m being the number of queries in the static session) that reflect the best possible effectiveness a user could experience. Note that this is not interpolation in the traditional sense: sPC@1/R,2 may still be less than sPC@2/R,2, which is not possible in the usual definition.

Once we have computed sPC@r,j for each r and j, we can plot a “precision-recall surface” over reformulations. Figure 1 shows the surface for the example in Table 1 with R = 20 under the assumption that all relevant documents in ranking 3 are unique (meaning there are five additional relevant documents that were not retrieved). We can see that precision increases with both recall and reformulation number, suggesting that the system is doing a better job with the later queries. (It may be easier to read cross-sections of the surface; they are shown in Figure 2.)

Finally, just as average precision is computed as the area under the precision-recall curve, we can define a model-free “session average precision” (sAP) as the volume under the precision-recall surface.

reformulation order     sAP
~r1, ~r2, ~r3           0.261
~r1, ~r3, ~r2           0.335
~r2, ~r1, ~r3           0.344
~r2, ~r3, ~r1           0.519
~r3, ~r1, ~r2           0.502
~r3, ~r2, ~r1           0.602

Table 3: Session average precisions for different permutations of the three ranked lists in Table 1. A system is rewarded for finding more relevant documents in earlier queries.

An expression for sAP is:

$$\mathrm{sAP} = \frac{1}{mR} \sum_{j=1}^{m} \sum_{r=1}^{R} \mathrm{sPC}@\tfrac{r}{R}, j$$

where sPC@r,j is the max of the set of all possible precisions at the first point in ~r_j at which recall r is achieved. Computing sPC@r,j can be done with the $O(n^m)$ approach described in Section 2.2:³ within the jth loop, calculate precision and recall r over the documents on the path to that point; if precision is greater than the current known maximum for sPC@r,j, update sPC@r,j to that precision.

In this example the volume under the surface is 0.261.
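The nested-loop computation of sPC@r,j and sAP can be sketched directly in Python. The code below is our own illustrative implementation under the stated assumptions (unique documents, R = 20) and uses the relevance pattern of Table 1; it reproduces the 0.261 value reported for this example.

from itertools import product

# Relevance of the three rankings from Table 1 (True = relevant);
# R = 20 relevant documents exist in total for the topic.
rankings = [
    [False] * 10,                    # ranking 1: all nonrelevant
    [True] * 5 + [False] * 5,        # ranking 2: first five relevant
    [True] * 10,                     # ranking 3: all relevant
]
R = 20

def session_ap(rankings, R):
    m = len(rankings)
    # sPC[j][r]: best precision at the first rank in ranking j+1 at which
    # recall level r is reached, over all cut-offs in the earlier rankings.
    sPC = [[0.0] * (R + 1) for _ in range(m)]
    for j in range(m):
        cutoff_ranges = [range(1, len(rk) + 1) for rk in rankings[:j]]
        for cutoffs in product(*cutoff_ranges):
            seen = []
            for rk, k in zip(rankings, cutoffs):
                seen.extend(rk[:k])
            rel = sum(seen)
            # walk down ranking j, recording precision at each new recall level
            for depth, is_rel in enumerate(rankings[j], start=1):
                if is_rel:
                    rel += 1
                    if rel <= R:
                        prec = rel / (len(seen) + depth)
                        sPC[j][rel] = max(sPC[j][rel], prec)
    return sum(sPC[j][r] for j in range(m) for r in range(1, R + 1)) / (m * R)

print(round(session_ap(rankings, R), 3))   # prints 0.261 for this example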

To test whether sAP follows our intuition that it should be

greater when more relevant documents are found for earlier

queries, we calculate it for each permutation of the three

rankings in our example. The results are shown in Table 3.

4. MODEL-BASED SESSION EVALUATION

In the previous section we extended three traditional evaluation measures to the case of multi-query session collections in a model-free fashion. The session-based system measures capture the optimal contribution of a system to answering an information need over an entire session. In this section we look at the problem from a user perspective.

Note that our definition of sPC in the previous section takes the maximum value of a set. We could instead take the expected value; this has the advantage of using all of the data as well as not assuming that a user will have the optimal experience with the system. However, taking such an expectation requires a probability distribution over paths; formulating such a distribution requires a user model.⁴

To simplify the space of all possible browsing paths we follow the user model described in Section 2: a user steps down a ranked list of documents until some decision point. It is important that any realization of the distribution over possible paths allows for paths that end before the last reformulation in the static collection. Then, if Ω is the set of all possible browsing paths that follow the simple user model described earlier, P(ω) the probability of a certain path ω ∈ Ω, and M_ω a measure over the path ω, then we define a session-based measure as the expectation

$$esM = \sum_{\omega \in \Omega} P(\omega) M_\omega$$

³ There is also an O(nm) dynamic programming approach, but we have not included it in this paper for reasons of space.
⁴ A uniform distribution would not work, since most of the paths are very long and therefore have approximately zero precision.

4.1 Probability of browsing paths

As noted in Section 2, we can express a path ω as a set of reformulation points. Let us therefore formulate P(ω = {k_1, k_2, ..., k_i}) as a joint probability distribution of a user abandoning the session at reformulation i, while reformulating at positions ref = {k_1, k_2, ..., k_{i−1}}. Note that we do not include k_i, the abandonment cut-off at ranking i, in the probability. For the sake of generalizing traditional measures, we will assume that once the user arrives at the ith reformulation, they continue down that ranking as far as necessary to compute the measure.

We express the probability of a path ω as

$$P(\omega) = P(r_i, \mathrm{ref}) = P(r_i) \cdot P(\mathrm{ref} \mid r_i)$$

Here we introduce a simplifying assumption: the reformulation position is independent across the ranked lists 1..r_i. Then P(ref | r_i) can be expressed as

$$P(\mathrm{ref} \mid r_i) = P(k_1 \mid r_i) P(k_2 \mid r_i) \cdots P(k_{i-1} \mid r_i)$$

In general, we could make each reformulation point dependent on the reformulation number and possibly even on the relevance of documents in the ranking; in this work we have elected to keep them independent for simplicity.

For the realization of the probability distribution over different browsing paths we follow Moffat and Zobel [10] in their definition of rank-biased precision and use two geometric distributions. The first gives the probability that the ith reformulation is the last; it has an adjustable parameter p_reform representing the probability that the user reformulates again from their current query. They will only arrive at reformulation i if they reformulate i − 1 times, so:

$$P(r_i) = p_{\mathrm{reform}}^{i-1} (1 - p_{\mathrm{reform}})$$

Similarly, the second distribution gives the probability that the kth rank is a stopping or reformulation point, with adjustable parameter p_down. A user will arrive at rank k only after deciding to progress down k − 1 times, so:

$$P(k) = p_{\mathrm{down}}^{k-1} (1 - p_{\mathrm{down}})$$

The probability of a path is then

$$P(r_i, \mathrm{ref}) = p_{\mathrm{reform}}^{i-1} (1 - p_{\mathrm{reform}}) \prod_{j=1}^{i-1} p_{\mathrm{down}}^{k_j - 1} (1 - p_{\mathrm{down}}) = P(r_i) \prod_{j=1}^{i-1} P(k_j)$$

Our definition of P(r_i) may give non-zero probability to a path that is not valid for a particular collection of static sessions, e.g. one that ends at some point past the last (mth) reformulation in the collection. To address this, we truncate the distribution P(r_i) and renormalize it to ensure that the probabilities of stopping at different reformulations sum to 1. To do this we simply renormalize the probabilities from 1 to m by Pr{r ≤ r_m} = 1 − p_reform^m. That is,

$$P'(r_i) = \frac{p_{\mathrm{reform}}^{i-1} (1 - p_{\mathrm{reform}})}{1 - p_{\mathrm{reform}}^{m}}$$

We could similarly truncate and renormalize P(k_j). However, paths that extend beyond the last retrieved document for a reformulation (typically ranks beyond 1000) will have very low probability and thus will not contribute in any significant way to the calculation of a measure.
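A minimal Python sketch of this path probability, assuming the hypothetical parameter values used later in the experiments (p_reform = 0.5, p_down = 0.8); the function name is ours.

def path_probability(cutoffs, i, m, p_reform=0.5, p_down=0.8):
    # Probability of a path that reformulates at the ranks in `cutoffs`
    # (one per ranking 1..i-1) and abandons the session at reformulation i,
    # out of the m reformulations available in the static collection.
    assert len(cutoffs) == i - 1 and 1 <= i <= m
    # P'(r_i): truncated, renormalized geometric over the last reformulation
    p_last = p_reform ** (i - 1) * (1 - p_reform) / (1 - p_reform ** m)
    prob = p_last
    for k in cutoffs:                          # P(k_j): geometric over ranks
        prob *= p_down ** (k - 1) * (1 - p_down)
    return prob

# the example path from Section 2: view five documents for the first query,
# reformulate, then abandon at the second (last) reformulation
print(path_probability(cutoffs=[5], i=2, m=2))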

4.2 Expectations over paths

Given a measure M_ω to be computed over the documents viewed on a particular path ω, along with the probability distribution over all paths ω ∈ Ω, we can define a session measure as the expectation of M_ω over the probabilistic space of paths:

$$esM = E_\Omega[M] = \sum_{\omega \in \Omega} P(\omega) M_\omega$$

Let us consider a path-averaging generalization of precision at cut-off k. First define PC@k(ω) as the precision of the first k documents experienced on path ω. Then:

$$esPC@k = \sum_{\omega \in \Omega} P(\omega)\, PC@k(\omega) = \sum_{\omega \in \Omega} PC@k(\omega)\, P'(r_i) \prod_{j=1}^{i-1} P(k_j)$$

PC@k(ω) is the total proportion of relevant documents at ranks 1..k_1 in ~r_1, 1..k_2 in ~r_2, and so on:

$$PC@k(\omega) = \frac{1}{k} \left( \sum_{j=1}^{k_1} rel_{1j} + \sum_{j=1}^{k_2} rel_{2j} + \cdots + \sum_{j=1}^{k_i} rel_{ij} \right)$$

where k_i, the abandonment cut-off at ranking i, is equal to k − (k_1 + k_2 + ··· + k_{i−1}). Plugging that into the expression for E[PC@k] completes the formula.

Similarly, the expectation of recall after k documents can be computed as

$$esRC@k = \sum_{\omega \in \Omega} RC@k(\omega)\, P'(r_i) \prod_{j=1}^{i-1} P(k_j)$$

where

$$RC@k(\omega) = \frac{1}{R} \left( \sum_{j=1}^{k_1} rel_{1j} + \cdots + \sum_{j=1}^{k_i} rel_{ij} \right)$$

and k_i is defined as above.

A path-averaging generalization of average precision is:

$$esAP = \sum_{\omega \in \Omega} AP(\omega)\, P'(r_i) \prod_{j=1}^{i-1} P(k_j)$$

where AP(ω) is the average precision of the concatenated list of documents on path ω.

We can continue to define any measure this way. We will conclude with a path-averaging generalization of nDCG:

$$esnDCG@k = \sum_{\omega \in \Omega} nDCG@k(\omega)\, P'(r_i) \prod_{j=1}^{i-1} P(k_j)$$

where nDCG@k(ω) is the nDCG@k of the concatenated list.

All of the above formulations involve summing over paths ω. In general, summing a function f(ω) over all paths can be expressed in a brute-force way as:

$$\sum_{\omega \in \Omega} f(\omega) = \sum_{k_1=1}^{|r_1|} \sum_{k_2=1}^{|r_2|} \cdots \sum_{k_{i-1}=1}^{|r_{i-1}|} f(\{k_1, k_2, ..., k_{i-1}\})$$

Note that computing it involves on the order of $|r_1| \times |r_2| \times \cdots \times |r_{m-1}| = O(n^m)$ steps.
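For a short session the brute-force sum can be written out directly. The following Python sketch (our own helper names and toy data, not the authors' code) computes esAP exactly for a two-query session by enumerating the path that never reformulates plus every reformulation cut-off in the first ranking; paths whose cut-off lies beyond the retrieved depth are dropped, following the argument above that their probability mass is negligible.

def average_precision(ranked_rels, R):
    # AP of a list of 0/1 relevance values, with R relevant documents in total.
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / R

def exact_esap_two_queries(r1, r2, R, p_reform=0.5, p_down=0.8):
    # Exact esAP for a two-query session (m = 2) by enumerating all paths.
    norm = 1 - p_reform ** 2
    p_stop1 = (1 - p_reform) / norm                # P'(r_1): abandon after query 1
    p_stop2 = p_reform * (1 - p_reform) / norm     # P'(r_2): abandon after query 2
    esap = p_stop1 * average_precision(r1, R)      # the path that never reformulates
    for k1 in range(1, len(r1) + 1):               # reformulate after rank k1 of r1
        p_path = p_stop2 * p_down ** (k1 - 1) * (1 - p_down)
        esap += p_path * average_precision(r1[:k1] + r2, R)
    return esap

r1 = [0] * 10                 # ranking 1 from Table 1: no relevant documents
r2 = [1] * 5 + [0] * 5        # ranking 2: first five documents relevant
print(round(exact_esap_two_queries(r1, r2, R=20), 4))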

4.3 Monte Carlo estimation

A running time of $O(n^m)$ is manageable, but it is not fast, especially as m grows. Since our model for these measures is fully probabilistic, a faster alternative approach to estimating them uses simulation. A Monte Carlo simulation method allows the estimation of a measure via repeated random sampling. Running a Monte Carlo experiment requires defining a domain of possible inputs, generating inputs randomly from the domain using specific probability distributions, performing a deterministic computation using the inputs, and finally aggregating the results of the individual computations into the final result.

In the case of the user-model based measures proposed above, the input space is the ranking r_i at which the user abandons the query and the reformulation cut-offs at all previous queries {k_1, ..., k_{i−1}}.

Each of the above path-averaging measures can be thought of as the expected outcome of the following random experiment:

1. Sample the last reformulation r_i from P'(r_i).

2. Sample (k_1, k_2, ..., k_{i−1}) i.i.d. from P(k_j) to form a path ω.

3. Create a ranked list of documents by concatenating ranks 1...k_1 from r_1, 1...k_2 from r_2, ..., 1...k_{i−1} from r_{i−1}, and 1...n from r_i. These are the documents seen along path ω.

4. Output measure M over that ranked list.

This random experiment defines one round of the Monte Carlo experiment. Executing the first two steps requires sampling from a geometric distribution. This can easily be performed assuming access to an algorithm that generates pseudo-random numbers uniformly distributed in the interval (0,1). Regarding the distribution of the last reformulation, since it is renormalized, we can first partition the interval (0,1) into ((0 .. P'(r_1)), (P'(r_1) .. P'(r_1)+P'(r_2)), ···, (1−P'(r_m) .. 1)). We then use the random number generator to obtain a number in (0,1), and output j if this number is in the jth partition. In the case of the cut-off distribution the same process can be followed. As mentioned earlier we did not renormalize this distribution and thus the last partition does not end at 1; however, renormalization can easily be performed in the case of Monte Carlo by simply rejecting any sample larger than the upper bound of the last partition.

Repeating the process above B times and averaging the results gives an estimate of the expectation of measure M. For most purposes B = 1000 (which will usually be much less than $n^m$) should be sufficient; we explore the errors in estimates due to B in Section 6.
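The random experiment above maps directly onto code. Below is our own Python sketch of the Monte Carlo estimator for esAP, using inverse-transform sampling for the two geometric distributions and rejection to truncate them; any other measure could replace the AP computation in step 4.

import math
import random

def average_precision(ranked_rels, R):       # same helper as in the exact sketch
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / R

def sample_geometric(p_continue, rng):
    # Inverse-transform sample: k >= 1 with probability p**(k-1) * (1-p).
    u = rng.random()
    return int(math.log(1.0 - u) / math.log(p_continue)) + 1

def monte_carlo_esap(rankings, R, B=1000, p_reform=0.5, p_down=0.8, seed=0):
    rng = random.Random(seed)
    m = len(rankings)
    total = 0.0
    for _ in range(B):
        # step 1: last reformulation, truncated to 1..m by rejection
        i = sample_geometric(p_reform, rng)
        while i > m:
            i = sample_geometric(p_reform, rng)
        # step 2: reformulation cut-offs for the i-1 earlier rankings,
        # rejecting samples that fall past the retrieved depth
        cutoffs = []
        for rk in rankings[:i - 1]:
            k = sample_geometric(p_down, rng)
            while k > len(rk):
                k = sample_geometric(p_down, rng)
            cutoffs.append(k)
        # step 3: concatenate the documents seen along the path
        path = []
        for rk, k in zip(rankings, cutoffs):
            path.extend(rk[:k])
        path.extend(rankings[i - 1])
        # step 4: the measure over that list (AP here)
        total += average_precision(path, R)
    return total / B

print(round(monte_carlo_esap([[0] * 10, [1] * 5 + [0] * 5], R=20, B=1000), 4))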

5. DUPLICATE DOCUMENTS

Our measures to this point make a strong assumption: that retrieval systems return unique documents for each query reformulation. Under this assumption, the concatenated ranked list of documents ~cl which corresponds to a certain browsing path ω resembles a typical ranked list of documents in the standard evaluation paradigm. We certainly do not expect this assumption to hold in real systems; it is likely that documents retrieved for a second query will overlap with those retrieved for the first. When relaxing this assumption, we need to consider how these duplicate documents should be treated from the perspective of the evaluation measure.

The first question raised is whether documents repeated in ranked lists for subsequent reformulations have any value for a user. Järvelin et al. [6] noticed that in an empirical interactive search study conducted by Price et al. [12], searchers overlooked documents in early queries but recognized them in later reformulations. Because of this, the proposed sDCG measure does not give any special treatment to duplicate relevant documents; it considers them relevant regardless of the number of times they have been seen before.

But measures with a recall component (such as average precision or recall at k) cannot count duplicates in a sound way. Since there are multiple possible paths through the results, and these paths could contain duplicate relevant documents, it is possible that more than R relevant documents could be observed along any given path. The computed measure may then exceed the desired maximum value of 1.

We can instead consider duplicate documents nonrelevant in all cases. This has certain implications as well. For one, penalizing a retrieval system for returning duplicate documents may lead to systems that are less transparent to their users. Imagine a user that reformulates a query and expects to see previously observed relevant documents at higher ranks than before. If they are not there, the user may question whether the system can ever be useful to them.

Furthermore, by definition the expected measures reward a retrieval system that responds in an optimal way to a population of users. These different users may very well follow different browsing paths. In an actual retrieval setup, a system can infer whether a document has been observed by a user (e.g. by observing the user's clicks). In the case of a batch experiment, however, a certain document may be a duplicate for one user but not for another, depending on the browsing path each of them has followed. This information is hidden in the parameters of the particular evaluation measure (which simulates the population of users). Taking these parameters into account, a system could respond optimally by removing the expected duplicate documents. However, the need for such an inference process and a ranking function that accounts for the average browsing path is just an artifact of the batch nature of the experiments; a retrieval system running in the real world would not need to employ such an algorithm.

Yang and Lad [16] take an approach in between the two extremes. Although they did not explicitly consider the problem of exact duplicates, their proposed measure is a measure of information novelty over multi-query sessions and thus takes the typical approach of other novelty measures [4]: defining information nuggets and discounting documents⁵ that contain the same relevant information nugget.

⁵ The measure by Yang and Lad [16] was proposed for the task of information distillation and thus it operates over passages instead of documents; the same approach however could be used for the case of documents.

In this work we consider an alternative treatment of duplicates inspired by Sakai's compressed ranked lists [14] and Yilmaz & Aslam's inducedAP [17]. When considering a path ω out of the population of all paths, we construct the concatenated list ~cl that corresponds to this path. We then walk down the concatenated list and simply remove duplicate documents, effectively pushing subsequent documents one rank up. This way, we neither penalize systems for repeating information possibly useful to the user, nor do we push unnecessary complexity to the retrieval system side. Further, the measures still reward the retrieval of new relevant documents. Note that such an approach assumes that a system ranks documents independently of each other in a ranked list (the probabilistic ranking principle [13]). If this is not true, i.e. if ranking a document depends on previously ranked documents and the retrieval system is agnostic to our removal policy, then this may also lead to unsound evaluation.
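A minimal sketch of this removal policy (our own helper name and toy document IDs): build the concatenated list for a path, keep only the first occurrence of each document, and compute the measure over the compressed list.

def compressed_path_list(rankings, cutoffs):
    # Concatenate the documents seen along a path (cut-offs for the earlier
    # rankings, full depth for the last one), keeping only the first occurrence
    # of each document and pushing later documents up one rank.
    segments = [rk[:k] for rk, k in zip(rankings, cutoffs)]
    segments.append(rankings[len(cutoffs)])
    seen, compressed = set(), []
    for segment in segments:
        for doc in segment:
            if doc not in seen:
                seen.add(doc)
                compressed.append(doc)
    return compressed

# hypothetical example: d3 is returned again for the reformulation
r1 = ["d1", "d2", "d3", "d4"]
r2 = ["d3", "d5", "d6"]
print(compressed_path_list([r1, r2], cutoffs=[3]))   # ['d1', 'd2', 'd3', 'd5', 'd6']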

6. EXPERIMENTS

In this section we demonstrate the behavior of the pro-

posed measures. There is currently no “gold standard” for

session evaluation measures; our goal in this section is to

evaluate whether our measures provide information about

system eﬀectiveness over a session and whether they cap-

ture diﬀerent attributes of performance in a similar way to

traditional precision, recall, and average precision. We will

compare our measures to the session nDCG measure proposed by Järvelin et al., though we consider none of these,

nor any other measure, to be the one true session measure.

We use two collections towards these goals: (a) the TREC

2010 Session track collection [7], and (b) the TREC-8 (1999)

Query track collection [2]. Though the latter is not a collec-

tion of multi-query sessions, we will ﬁnd it useful to explore

properties of our measures. Both of these collections are

described in more detail below.

The instantiation of session nDCG@k we use is calculated as follows: we start by concatenating the top k results from each ranked list of results in the session. For each rank i in the concatenated list, we compute the discounted gain as

$$DG@i = \frac{2^{rel(i)} - 1}{\log_b(i + (b - 1))}$$

where b is a log base, typically chosen to be 2. These are the summands of DCG as implemented by Burges et al. [3] and used by many others. We then apply an additional discount to documents retrieved for later reformulations. For rank i between 1 and k, there is no discount. For rank i between k + 1 and 2k, the discount is $1/\log_{bq}(2 + (bq - 1))$, where bq is the log base. In general, if the document at rank i came from the jth reformulation, then

$$sDG@i = \frac{1}{\log_{bq}(j + (bq - 1))}\, DG@i$$

Session DCG is then the sum over sDG@i:

$$sDCG@k = \sum_{i=1}^{mk} \frac{2^{rel(i)} - 1}{\log_{bq}(j + (bq - 1)) \log_b(i + (b - 1))}$$

with $j = \lfloor (i-1)/k \rfloor + 1$ and m the length of the session. We use bq = 4. This implementation resolves a problem present in the original definition by Järvelin et al. [6], by which documents in the top positions of an earlier ranked list are penalized more than documents in later ranked lists.

As with the standard definition of DCG, we can also compute an "ideal" score based on an optimal ranking of documents in decreasing order of relevance to the query, and then normalize sDCG by that ideal score to obtain nsDCG@k.
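For concreteness, here is our own small Python sketch of this instantiation of sDCG@k; the function name and the toy relevance grades are ours. Normalizing the returned score by the same computation over an ideal reordering of the judged documents gives nsDCG@k.

import math

def sdcg_at_k(session_rels, k, b=2, bq=4):
    # session_rels: one list of relevance grades per query in the session.
    # The top k results of each query are concatenated; rank i receives the
    # usual log_b rank discount and an extra log_bq discount for the position
    # j of its query in the session.
    concatenated = []
    for rels in session_rels:
        concatenated.extend(rels[:k])
    score = 0.0
    for i, rel in enumerate(concatenated, start=1):
        j = (i - 1) // k + 1                  # which reformulation rank i came from
        gain = 2 ** rel - 1
        rank_discount = math.log(i + (b - 1), b)
        query_discount = math.log(j + (bq - 1), bq)
        score += gain / (rank_discount * query_discount)
    return score

# toy two-query session with binary relevance grades (hypothetical data)
print(round(sdcg_at_k([[1, 0, 1], [1, 1, 0]], k=3), 3))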

nsDCG@k essentially assumes a specific browsing path: ranks 1 through k in each subsequent ranked list, thereby giving a path of length mk. Therefore, we set the cut-offs of our expected session measures to mk (with the exception of AP). For the computation of the expected session measures the parameter p_down is set to 0.8, following the recommendation by Zhang et al. [18]; in expectation, users stop at rank 5. The parameter p_reform is set arbitrarily to 0.5. With m = 2 reformulations, the probability of a user stopping at the first reformulation is then 67% and of moving to the next reformulation 33%, which is not far off the percentage of users reformulating their initial queries in the Excite logs [15].

6.1 Session track collection

The Session track collection consists of a set of 150 two-

query sessions (initial query followed by one reformulation).

Out of the 150 sessions, 136 were judged. The judged 136

topics include 47 for which the reformulation involves greater

speciﬁcation of the information need, 42 for which the re-

formulation involves more generalization, and 47 for which

the information need “drifts” slightly. In the case of spec-

iﬁcation and generalization, both the initial query and its

reformulation represent the same information need, while in

the case of drifting the two queries in the session represent

two diﬀerent (but related) information needs. Given that

some of the proposed session measures make the assump-

tion of a single information need per session—these are the

recall-based measures such as AP and recall at cutoﬀ k—we

drop the 47 drifting sessions from our experiments.

Each participating group submitted runs consisting of three

ranked lists of documents:

RL1 ranked results for the initial query;

RL2 ranked results for the query reformulation indepen-

dently of the initial query;

RL3 ranked results for the query reformulation when the

initial query was taken into consideration.

Thus, each submission consists of two pairs of runs over

the two-query session: RL1→RL2, and RL1→RL3. The

document corpus was the ClueWeb09 collection. Judging

was based on a shallow depth-10 pool from all submitted

ranked lists. Kanoulas et al. detail the collection further [7].

Figure 3 shows example precision-recall surfaces for two submissions, CengageS10R3 [9] and RMITBase [8]. In both cases there is a moderate improvement in performance from the first query to the second. The decrease in precision is rapid in both, but slightly less so in RMITBase. As a result, though CengageS10R3 starts out with higher precisions at lower recalls, the model-free mean sAPs are close: 0.240 and 0.225 respectively. In general, these surfaces, like traditional precision-recall curves, provide a good sense of relative effectiveness differences between systems and where in the ranking they occur.

We use the submitted RL1 and RL2 runs (27 submissions in total) to compare the proposed model-based measures with normalized session DCG. nsDCG is computed at cut-off 10. We compute all measures in Section 2 with cut-off 2·10 = 20 (to ensure the same number of documents are used). Scatter plots of nsDCG@10 versus expected session nDCG@20 (esnDCG), PC@20 (esPC), RC@20 (esRC), and AP (esAP) are shown in Figure 4. Each point in the plot corresponds to a participant's RL1→RL2 submission; measures are averaged over 89 topics.

[Figure 3: Precision-recall surfaces for CengageS10R3 and RMITBase (axes: recall, reformulation, precision).]

[Figure 4: Scatter plots of the mean performance of systems in the Session track for the session RL1→RL2. The x-axis is nsDCG@10; the y-axes are expected session nDCG@20 (esnDCG), PC@20 (esPC), RC@20 (esRC), and AP (esAP) for the four plots from left to right and top to bottom. Kendall's tau with nsDCG: esnDCG 0.7972, esPC 0.7334, esRC 0.6987, esAP 0.5247.]

The strongest correlation is between esnDCG and nsDCG (as expected). Interestingly, esAP strongly disagrees with nsDCG; this demonstrates that esAP measures different aspects of system effectiveness over sessions than nsDCG. Table 4 shows τ correlations between all the expected measures as well as the model-free sAP; overall the correlations are within the range expected for these measures with a relatively small number of input systems. They are high relative to random orderings, but low enough that it is clear that all the measures are capturing different aspects of effectiveness. esAP and sAP do not correlate very well, but recall that sAP is based on maximums while esAP is based on averages.

Kendall's tau correlation

           esRC    esAP    esnDCG    sAP
esPC       0.88    0.79    0.93      0.78
esRC               0.80    0.84      0.87
esAP                       0.72      0.78
esnDCG                               0.74

Table 4: Kendall's tau correlation across the four expected session measures and sAP.

6.2 Query track collection

The Query track [2] was an effort to understand the system-query-topic interaction often observed in IR experiments, where certain systems perform well for certain queries but under-perform for others. A set of 50 topics (topics 51-100 from the TREC-1 collection) was provided to participants; they responded with a set of queries for each topic. The constructed queries were in one of the following forms: (a) very short: 2-4 words based on the topic and few relevance judgments; (b) sentences: 1-2 sentences using the topic and few relevant documents; (c) sentences+feedback: 1-2 sentences using only the relevant documents; and (d) weighted terms. Overall 23 query sets were produced, each consisting of 50 queries corresponding to the 50 topics of TREC-1. Participants ran their retrieval systems over each of the query sets and submitted results for each system for each query set. We use subsets of the 23 query sets and the submitted runs to simulate query sessions.

The goal of our ﬁrst experiment is to empirically verify

that the proposed measures reward early retrieval of relevant

documents in the session. In this experiment we simulate

four sets of session retrieval algorithms over 50 two-query

sessions. The ﬁrst simulated algorithm (“good”–“good”) per-

forms well on both the initial query and its reformulation

in a hypothetical query session; the second (“good”–“bad”)

performs well on the initial query but not on its reformu-

lation, the third (“bad”–“good”) does not perform well on

the initial query but does on its reformulation, and the last (“bad”–“bad”) does not perform well on either query.

To simulate these algorithms, we note that the systems participating in the Query track performed particularly well on short queries (with an average MAP of 0.227), while they did not perform well on the sentence-feedback queries (average MAP of 0.146) [2]. Taking this into consideration, we simulate the best algorithm using runs over two short formulations of the same query: the nine runs over the query set named INQ1a, with the reformulation results simulated by the nine runs over query set INQ1b (with average MAP equal to 0.110 and 0.127 respectively, as computed by trec_eval), simulated a set of systems performing well both over the initial query of a hypothetical session and over its reformulation. We simulate the worst system by runs over two sentence-feedback formulations: the runs over query set INQ3a, with reformulations being the runs over query set INQ3b (with average MAP 0.078 and 0.072 respectively), simulated a set of systems that performed well neither over the initial nor over the reformulated query. The other two combinations, INQ1a to INQ3b and INQ3a to INQ1b, simulated medium performance systems.⁶

⁶ The query sets INQ1a, INQ1b, INQ3a, INQ3b were manually constructed by students at UMass Amherst [2].

Average mean session measures

                  esMPC@20   esMRC@20   esMAP
“good”–“good”     0.378      0.036      0.122
“good”–“bad”      0.363      0.034      0.112
“bad”–“good”      0.271      0.023      0.083
“bad”–“bad”       0.254      0.022      0.073

Table 5: esPC@20, esRC@20, and esAP, each averaged over nine Query track systems simulating four session retrieval systems.

What we expect to observe in this simulation is that session measures reward the “good”–“good” simulated systems the most, followed by the “good”–“bad”, the “bad”–“good” and finally the “bad”–“bad” systems. Table 5 shows the average mean session scores produced by three of the expected session measures over the four simulated sets of retrieval algorithms. The results verify the behavior of the proposed measures. This is the exact same behavior we observed for the sAP measure in Table 3.

We conclude by testing the accuracy of the Monte Carlo method in computing the expected session measures. We simulated a two-query session by randomly selecting two sets of query formulations and their corresponding runs and forming a session from those. We run our $O(n^m)$ exact algorithm along with the Monte Carlo method. For the latter we use B = 10, B = 100 and B = 1000 trials. The results are illustrated in the left-hand plot in Figure 5, showing the high accuracy of the Monte Carlo simulation even with as few as 10 repetitions. Obviously, the accuracy of the method depends on the length of the session together with the set parameters p_down and p_reform. We did not test different parameters; however, we did repeat the same process over 3-query sessions. The results are illustrated in the right-hand plot in Figure 5. There is somewhat more variance, but even with B = 10 the results are very accurate.

7. CONCLUSIONS

In this work we considered the problem of evaluating re-

trieval systems over static query sessions. In our eﬀort to

make as few assumptions about user behavior over a ses-

sion as possible, we proposed two families of measures: a

model-free family inspired by interpolated precision, and a

model-based family with a simple model of user interaction

described by paths through ranked results. With a novel ses-

sion test collection and a session collection simulated from

existing data, we showed that the measure behaviors cor-

respond to our intuitions about their traditional counter-

parts as well as our intuitions about systems that are able

to ﬁnd more relevant documents for queries earlier in the

session. The measures capture diﬀerent aspects of system

performance, and also capture diﬀerent aspects than what

is captured by the primary alternative, nsDCG.

There are surely other ways to generalize traditional evaluation measures and paradigms to session evaluation than

those we have presented. Our goal with this work is to de-

ﬁne measures as intuitively as possible while keeping models

of user behavior simple. In the future we will want to con-

sider explicit models of when and why users reformulate: is

it based on the relevance of the documents they see? Does it

depend on how many times they have already reformulated?

Are they willing to go deeper down the ranking for a later

reformulation than an earlier one?

[Figure 5: Exact method versus Monte Carlo in the computation of expected session AP (esAP) for B = 10, B = 100 and B = 1000, for 2- and 3-query sessions. Left (2-query sessions): B=10, tau=0.957; B=100, tau=0.981; B=1000, tau=0.983. Right (3-query sessions): B=10, tau=0.896; B=100, tau=0.947; B=1000, tau=0.97.]

There are many cases we have not explicitly considered.

“Drifting” information needs, which were part of the TREC

2010 Session track, may require special treatment for eval-

uation since the set of relevant documents can change as

the need drifts. Furthermore, there are many examples of

sessions in which a query and its results serve to guide the

user in selecting future queries rather than immediately pro-

vide relevant documents; while we can apply our measures

to these types of sessions, they are clearly not designed to

measure a system’s eﬀectiveness at completing them.

Deeper understanding of session measures and their rela-

tionship to the user experience will come from future work

on session test collections, application to “real” sessions in

query log data, and extensive experimentation and analysis.

We plan to continue all these lines of research in the future.

8. ACKNOWLEDGMENTS

We gratefully acknowledge the support provided by the

European Commission grant FP7-PEOPLE-2009-IIF-254562

and FP7/2007-2013-270082 and by the University of Delaware

Research Foundation (UDRF).

9. REFERENCES

[1] M. J. Bates. The design of browsing and berrypicking

techniques for the online search interface. Online review,

13(5):407–431, 1989.

[2] C. Buckley and J. A. Walz. The TREC-8 Query Track. In The Eighth Text REtrieval Conference Proceedings (TREC 1999), 1999.

[3] C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, NIPS, pages 193–200. MIT Press, 2006.

[4] C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova,

A. Ashkan, S. Büttcher, and I. MacKinnon. Novelty and

diversity in information retrieval evaluation. In SIGIR ’08:

Proceedings of the 31st annual international ACM SIGIR

conference on Research and development in information

retrieval, pages 659–666, New York, NY, USA, 2008. ACM.

[5] W. S. Cooper. Expected search length: a single measure of

retrieval eﬀectiveness based on the weak ordering action of

retrieval systems. American Documentation, 19:30–41,

1968.

[6] K. Järvelin, S. L. Price, L. M. L. Delcambre, and M. L. Nielsen. Discounted cumulated gain based evaluation of multiple-query IR sessions. In ECIR, pages 4–15, 2008.

[7] E. Kanoulas, B. Carterette, P. Clough, and M. Sanderson.

Session track overview. In The Nineteenth Text REtrieval

Conference Notebook Proceedings (TREC 2010), December

2010.

[8] S. Kharazmi, F. Scholer, and M. Wu. RMIT University at

TREC 2010: Session track. In Proceedings of TREC, 2010.

[9] B. King and I. Provalov. Cengage learning at the TREC

2010 Session track. In Proceedings of TREC, 2010.

[10] A. Moﬀat and J. Zobel. Rank-biased precision for

measurement of retrieval eﬀectiveness. ACM Trans. Inf.

Syst., 27(1):1–27, 2008.

[11] P. Over. TREC-7 interactive track report. In The Seventh

Text REtrieval Conference Proceedings (TREC 1998),

pages 33–39, 1998.

[12] S. Price, M. L. Nielsen, L. M. L. Delcambre, and

P. Vedsted. Semantic components enhance retrieval of

domain-speciﬁc documents. In M. J. Silva, A. H. F.

Laender, R. A. Baeza-Yates, D. L. McGuinness, B. Olstad,

Ø. H. Olsen, and A. O. Falcão, editors, CIKM, pages

429–438. ACM, 2007.

[13] S. E. Robertson. The probability ranking principle in IR.

Journal of Documentation, 33(4):294–304, 1977.

[14] T. Sakai. Alternatives to bpref. In SIGIR ’07: Proceedings

of the 30th annual international ACM SIGIR conference

on Research and development in information retrieval,

pages 71–78, New York, NY, USA, 2007. ACM.

[15] D. Wolfram, A. Spink, B. J. Jansen, and T. Saracevic. Vox

populi: The public searching of the web. JASIST,

52(12):1073–1074, 2001.

[16] Y. Yang and A. Lad. Modeling expected utility of

multi-session information distillation. In L. Azzopardi,

G. Kazai, S. E. Robertson, S. M. Rüger, M. Shokouhi,

D. Song, and E. Yilmaz, editors, ICTIR, pages 164–175,

2009.

[17] E. Yilmaz and J. A. Aslam. Estimating average precision

with incomplete and imperfect judgments. In P. S. Yu,

V. Tsotras, E. Fox, and B. Liu, editors, Proceedings of the

Fifteenth ACM International Conference on Information

and Knowledge Management, pages 102–111. ACM Press,

November 2006.

[18] Y. Zhang, L. A. Park, and A. Moﬀat. Click-based evidence

for decaying weight distributions in search eﬀectiveness

metrics. Inf. Retr., 13:46–69, February 2010.