Evaluating Multi-Query Sessions
Evangelos Kanoulas, Ben Carterette, Paul D. Clough, Mark Sanderson
Information School, University of Sheffield, Sheffield, UK
Department of Computer & Information Sciences, University of Delaware, Newark, DE, USA
School of Computer Science & Information Technology, RMIT, Melbourne, Australia
The standard system-based evaluation paradigm has focused
on assessing the performance of retrieval systems in serving
the best results for a single query. Real users, however, often
begin an interaction with a search engine with a sufficiently
under-specified query that they will need to reformulate be-
fore they find what they are looking for. In this work we con-
sider the problem of evaluating retrieval systems over test
collections of multi-query sessions. We propose two families
of measures: a model-free family that makes no assumption
about the user’s behavior over a session, and a model-based
family with a simple model of user interactions over the ses-
sion. In both cases we generalize traditional evaluation met-
rics such as average precision to multi-query session evalua-
tion. We demonstrate the behavior of the proposed metrics
by using the new TREC 2010 Session track collection and
simulations over the TREC-9 Query track collection.
Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval] Performance Evaluation
General Terms: Experimentation, Measurement
Keywords: information retrieval, test collections, evaluation, sessions
Evaluation measures play a critical role in the develop-
ment of retrieval systems, both as measures in compara-
tive evaluation experiments and as objective functions for
optimizing system effectiveness. The standard evaluation
paradigm has focused on assessing the performance of re-
trieval systems in serving the best results for a single query,
for varying definitions of “best”: for ad hoc tasks, the most
relevant results; for diversity tasks, the results that do the
best job of covering a space of information needs; for known-
item tasks, the single document the user is looking for. There
are many test collections for repeatable experiments on these tasks, and dozens of evaluation measures assessing different aspects of task effectiveness.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR’11, July 24–28, 2011, Beijing, China.
Copyright 2011 ACM 978-1-4503-0757-4/11/07 ...$5.00.
Real users, however, often begin an interaction with a
search engine with a query that they will need to reformulate
one or more times before they find what they are looking for.
Early studies on web search query logs showed that half of
all Web users reformulated their initial query: 52% of the users in the 1997 Excite data set, and 45% of the users in the 2001 Excite data set [15].
The standard evaluation paradigm of single-query test col-
lections seems unable to assess the effectiveness of retrieval
systems over sequences of query reformulations. Interactive
evaluation has been employed as an alternative. In inter-
active evaluation the user is part of the evaluation cycle
and freely interacts with the results of a retrieval system.
Measures such as instance recall [11] and session discounted
cumulative gain [6] have been proposed to capture the ef-
fectiveness of systems in these settings. Even though an
interactive evaluation paradigm can better capture the ac-
tual user experience, it is both noisy due to the high degrees
of freedom of user interactions and expensive due to its low
reusability and need for many test subjects. Furthermore,
conducting interactive comparative evaluation experiments
is by no means an easy task.
The TREC 2010 Session track [7] proposed an experiment
for the evaluation of retrieval systems over multi-query ses-
sions. We defined a session as a sequence of reformulations
in the service of satisfying a general information need, and
constructed a test collection of two query reformulations (an
initial and a follow-up query) for each of 150 information
needs. This collection makes compromises for simplicity and
tractability, but it provides a starting point for investigation
of questions about test collection-based session evaluation.
In addition to a test collection, new evaluation measures
are necessary as well. Traditional evaluation measures only
capture per-query effectiveness; they are not necessarily ap-
propriate for evaluating the effectiveness of a retrieval sys-
tem over a multi-query session. While one could evaluate
results for each query in the session in isolation, it may not
be the case that the system is serving results for each query
independently. Doing so would lose potentially valuable in-
formation about the ability of the system to provide results
for the session as a unit, and thus reduce our ability to op-
timize system performance across sessions.
Due to the lack of appropriate measures, Järvelin et al. [6]
extended the normalized discounted cumulative gain (nDCG)
measure to a measure that considers multi-query sessions.
The measure—called normalized session discounted cumula-
tive gain (nsDCG)—discounts documents that appear lower
in the ranked list for a given query as well as documents
that appear after more query reformulations. In a sense the
new model incorporates a cost for reformulating a query as
well as scanning down a ranked list.
The nsDCG measure is computed as follows: for each
query in a series of reformulations, DCG is computed in
isolation of all other queries in the series. Each DCG is
then discounted by a function of the position q of the query
in the series. The measure can evaluate the effectiveness
of retrieval systems over multiple queries in an interactive
retrieval scenario, in which a user moves down a ranked
list of documents and at some rank reformulates the query.
Since the reformulation points are known (from observing
the users), DCG is computed at those points for each query
and at the stopping point for the last reformulation. In
a test collection of static sessions, however, reformulation
points are unknown. Using nsDCG requires the selection of
a fixed reformulation cut-off, which clearly does not reflect
the fact that different retrieval results may trigger different
user behavior. Further, the measure does not model early
abandonment of a query session; our TREC session collec-
tion comes with a fixed number of reformulations, but a
user may choose to abandon the session before reaching the
last reformulation (either due to satisfaction or due to frus-
tration). A multi-query session measure should be able to
model such behavior.
Yang and Lad [16] overcame the need to define a fixed
reformulation point by defining a session measure as an ex-
pectation over a set of possible browsing paths. Based on
this they proposed a measure of expected utility for a multi-
query information distillation task. Given a series of m reformulations, the proposed measure accounts for all possible browsing paths that end in the kth reformulation. Each
path has a certain probability to be realized by a user. To
define the probability of a user following a certain path Yang
and Lad [16] follow the rank biased precision (RBP) frame-
work [10], replacing RBP’s stopping condition with a refor-
mulation condition. The utility of each path is a function of
the relevance and novelty of the returned documents being
considered. The system effectiveness is then defined as the
expected utility calculated over the aforementioned proba-
bilistic space of browsing paths. Though the expected utility
solves the problem of variable reformulation points, it still
does not allow early abandonment of the query session.
In this work we consider the problem of evaluating re-
trieval systems over test collections of static multi-query ses-
sions. We propose two families of measures: one that makes
no assumptions about the user’s behavior over a session,
and another with a simple model of user interactions over
the session. In the latter case we provide a general frame-
work to accommodate different models of user interactions
in the course of a session, avoiding predefined reformulation
cut-offs and allowing early abandonment. In both cases we
generalize traditional evaluation measures such as average
precision to multi-query session evaluation.
We define a session test collection as one with a set of topics, each of which consists of a description of an information need and a static sequence of m title queries (an initial query and m − 1 reformulations), and judgments of the relevance of documents to the topics. For simplicity, we assume that all reformulations are directed towards a single information need, so there is a global set of relevant documents of size R.¹ This definition is similar to those used by Järvelin et al. [6] and Yang & Lad [16], and it is essentially the definition used for the TREC 2010 Session track in the case of specification and generalization reformulations [7].

query       q1,1        q1,2        q1,3
            ranking 1   ranking 2   ranking 3
            ...         ...         ...
            d10  N      d′10  N     d′′10  R
            ...         ...         ...

Table 1: Example rankings (document IDs and relevance judgments) for three queries q1,1, q1,2, q1,3 in a session for topic number 1. Here we assume all documents are unique.
Following Cooper in his proposal for the expected search length measure [5], we assume a user stepping down the ranked list of results until some decision point. To this we add an additional possible action: the decision point can be either a stopping point for the search, or a point at which the user reformulates their query. Thus, a user experiences results by either moving down a ranking (i.e. moving from rank k to rank k + 1 in ranking r_i) or moving to the top of the next ranking by reformulating (i.e. moving from rank k in ranking r_i to rank 1 in ranking r_{i+1}).
Consider the example in Table 1. A user with a certain
information need formulates a query and submits it to a re-
trieval system. The retrieval system returns a ranked list
of documents, r1 = (d1, d2, ..., d10, ...). Suppose the user's first decision point occurs at document d5. After seeing that document, the user decides to reformulate the original query and resubmit it to the retrieval system. The retrieval system responds with a second ranked list of documents, r2 = (d′1, d′2, ..., d′10, ...). The user reads the documents once again from top to bottom and abandons the session. If we only consider the documents the user has examined over the session of reformulations then a final composite ranked list, cl, can be composed: cl = (d1, d2, ..., d5, d′1, d′2, d′3, ...).
Given the relevance of the documents in the composite ranked list cl, any traditional evaluation measure can be calculated in the usual manner. We may require assumptions about the relevance of duplicates, e.g. if d1 and d′1 are the same relevant document, how they should count towards the evaluation measure; we will consider these in Section 5.
¹ In practice users’ information needs may change during a
search session and over time [1]; assuming it is fixed is a
modeling assumption we make for tractability. Some of our
measures require this assumption, but it can be relaxed for
other measures.
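As an illustrative sketch (not code from the paper; document IDs and relevance values below are hypothetical, chosen to echo the running example), the composite list for a path can be assembled by truncating each ranking at its cut-off:

```python
def composite_list(rankings, cutoffs):
    """Concatenate the documents a user sees along a session path:
    the top cutoffs[i] documents of ranking i.  Each ranking is a
    list of (doc_id, relevance) pairs."""
    cl = []
    for ranking, k in zip(rankings, cutoffs):
        cl.extend(ranking[:k])
    return cl

def precision(cl):
    """Fraction of viewed documents that are relevant."""
    return sum(rel for _, rel in cl) / len(cl)

# Hypothetical rankings: the user views 5 documents of ranking 1,
# then reformulates and views 3 documents of ranking 2.
r1 = [(f"d{i}", 0) for i in range(1, 11)]
r2 = [(f"d'{i}", 1) for i in range(1, 11)]
cl = composite_list([r1, r2], [5, 3])   # (d1, ..., d5, d'1, d'2, d'3)
```

Any per-query measure can then be applied to `cl` directly, which is the point of the construction.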
2.1 Evaluation over paths
The composite list cl is the outcome of a series of decisions. We define a path ω through the results as a series of decisions to either move down a ranking, reformulate and start at the top of the next ranking, or abandon the search. We assume that at least one document—the first one—is viewed in each ranking. A path of length k is a path that results in k documents viewed. We denote the set of unique paths of length k as Ω_k, and the set of all unique paths as Ω. A path can be represented as a series of actions, e.g. ω = {down, down, ..., reformulate, down, ..., abandon}; as a series of document IDs viewed, e.g. ω = cl above; or as a series of ranks at which reformulations or abandonment occurred, e.g. ω = {5, ...}. The three are equivalent in the sense of providing complete information about the series of decisions; the last, being most compact, is the one we will use.
Different paths result in different documents being seen,
and in many cases different numbers of relevant documents.
Precision after k documents viewed may result in very different values depending on the path chosen: a user that views 10 documents in r1 (the relevance of which is shown in Table 1) experiences 0 precision, while one that reformulates immediately after the first document in r1 and steps down r2 until rank 9 experiences precision of 5/10. In an interactive
evaluation scenario where real users interact with the ranked
list of documents returned by a retrieval system, the point
at which a user decides either to reformulate their previous
query or to abandon the search can be explicitly recorded
by observation, or implicitly inferred by looking at (for in-
stance) the last clicked document. In batch laboratory ex-
periments with static sessions, however, the reformulation
and stopping points are undefined—there is no user from
which to record them. This presents a challenge for defining
evaluation measures.
In this work we propose evaluating static sessions by sum-
marizing evaluation results over all paths through the re-
sults. We will consider two directions: one a “model-free”
approach inspired by interpolated precision and recall, the
other a model-based approach that explicitly defines prob-
abilities for certain actions, then averages over paths. In
both approaches we would like to make as few assumptions
as possible about the reasons a user reformulates.
2.2 Counting paths
The number of possible paths grows fairly fast. Consider a path of length k ending in reformulation number j. For example, the paths of length 4 ending at reformulation 2 are {d1, d2, d3, d′1}, {d1, d2, d′1, d′2}, and {d1, d′1, d′2, d′3}. For any given k and j, we can count the number of possible paths as follows: imagine a ranking as a list of k documents, then place j − 1 reformulation points between any two documents in that list. The number of different places we can insert them is $\binom{k-1}{j-1}$, and this is therefore the number of paths of length k that end at reformulation j.
The total number of paths of length k is:

$$|\Omega_k| = \sum_{j=1}^{m} \binom{k-1}{j-1}$$

This is the definition of elements in the Bernoulli triangle. Its rate of growth is unknown, but it is O(k^2) for m = 2 and approaches 2^{k−1} as m increases to k. The total number of paths of any length is $|\Omega| = \sum_{k=1}^{n} |\Omega_k| = O(2^n)$.
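A quick sanity check on this count (a Python sketch; the function name is ours, not the paper's):

```python
from math import comb

def num_paths(k, m):
    """|Omega_k|: number of distinct paths of length k over a session
    of m queries, summing C(k-1, j-1) over the possible last
    reformulations j (rows of the Bernoulli triangle)."""
    return sum(comb(k - 1, j - 1) for j in range(1, min(k, m) + 1))
```

For m = 2 the count grows linearly in k, while as m approaches k it reaches 2^{k−1}, matching the growth described above.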
        j = 1   j = 2           j = 3
k = 1   {0}     –               –
k = 2   {0}     {1}             –
k = 3   {0}     {1,2}           {2}
k = 4   {0}     {1,2,3}         {2,3,3}
k = 5   {0}     {1,2,3,4}       {2,3,3,4,4,4}

Table 2: Relevant document counts for different paths of length k ending at ranking j from the example in Table 1.
On the other hand, if we only consider paths that end at reformulation j but continue down ranked list r_j indefinitely, the number is more manageable. We can enumerate these by simply iterating over stopping points k_1 = 1...|r_1|, and for each of those over stopping points k_2 = 1...|r_2|, and so on. Within the (j−1)st loop, ω = {k_1, ..., k_{j−1}} is the path to that point. This takes |r_1| × |r_2| × ··· × |r_{m−1}| = O(n^m) time, which, while not exactly fast, is at least manageable.
Our first goal is to develop a method for evaluating the
effectiveness of a system over a set of reformulations mak-
ing no assumptions about when or why users reformulate.
The approach is inspired by interpolated precision: there is
no formal user model behind interpolated precision, but it
reduces the full evaluation data (precision at every rank) to
a manageable set while still providing useful intuition about
system performance, particularly when plotted against re-
call values. Likewise, there is no formal user model behind
these measures, but they give some intuition while greatly
reducing the amount of evaluation data, which as we saw
above grows exponentially.
Consider all paths of length k that end at reformulation j. On each of those paths the user will see a certain number of relevant documents. Let us define a set of relevant counts rR@j,k as the set of counts of relevant documents seen on all such paths.² In the example in Table 1, there is only one possible way for a user to see 4 documents without reformulating, and none of those documents are relevant; therefore rR@1,4 = {0}. There are three ways for a user to see 4 documents over two queries: {d1, d2, d3, d′1}; {d1, d2, d′1, d′2}; and {d1, d′1, d′2, d′3}. These paths have 1, 2, and 3 relevant documents respectively. Therefore rR@2,4 = {1,2,3}. All sets rR for j = 1..3 and k = 1..5 are shown in Table 2; the size of a set is $\binom{k-1}{j-1}$ as described above.
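These sets can be enumerated directly by placing the j − 1 reformulation points among the k − 1 gaps between viewed documents. The sketch below is ours; the 0/1 relevance vectors are a hypothetical assignment consistent with Table 2 (ranking 1 non-relevant at the top, rankings 2 and 3 relevant at the top), since Table 1 does not fully specify them:

```python
from itertools import combinations

def relevant_counts(rankings, j, k):
    """rR@j,k: multiset of relevant-document counts over all paths of
    length k that end in ranking j (1-based).  rankings[i] is the 0/1
    relevance vector of the ranking for query i+1."""
    counts = []
    for cut in combinations(range(1, k), j - 1):   # reformulation gaps
        bounds = (0,) + cut + (k,)
        counts.append(sum(sum(rankings[i][:bounds[i + 1] - bounds[i]])
                          for i in range(j)))
    return sorted(counts)

# Hypothetical relevance vectors consistent with Table 2.
example = [[0, 0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1]]
```

Running this on the example reproduces the rows of Table 2, e.g. rR@2,4 = {1,2,3} and rR@3,5 = {2,3,3,4,4,4}.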
We can then define session versions of precision and recall by dividing the relevant counts rR@j,k by k (for precision) or R (for recall). We will call these rPC@j,k and rRC@j,k.
This gives the session generalization to precision and recall:
precision and recall for each possible path through the re-
sults. In traditional systems-based evaluation there is only
one possible path of length k, and precision/recall for that
path is identical to precision/recall at rank k.
Precision is sometimes interpolated to a particular recall point r by finding the first rank at which recall r is achieved, then taking the maximum of all precisions at that rank or deeper. Let us consider an analogous process for sessions by defining precision at a particular recall point in a particular
² We use boldface to denote sets and italics to denote scalars.
Figure 1: Reformulation precision-recall surface for the example in Table 1.
Figure 2: Reformulation precision-recall cross-sections of Figure 1 for the example in Table 1, one panel per ranking (ranking 1, ranking 2, ranking 3). Note that these are not precision-recall curves for the three rankings independently.
reformulation. At recall 1/R in r2, there is a set of possible precision values {1/2, 1/3, 1/4, ...}, each of which is achieved by a user looking at k = 1, 2, 3, ... documents in ranking 1, then reformulating and looking at the first document in ranking 2. At recall 2/R in r2, the set is {2/3, 2/4, 2/5, ...}. Now we will define sPC@r,j as the maximum value of the set of possible precisions at the first rank in r_j at which recall r is achieved. This reduces the amount of data to m·R precision values (with m being the number of queries in the static session) that reflect the best possible effectiveness a user could experience. Note that this is not interpolation in the traditional sense: sPC@1/R,2 may still be less than sPC@2/R,2, which is not possible in the usual definition.
Once we have computed sPC@r,j for each r and j, we can plot a “precision-recall surface” over reformulations. Figure 1 shows the surface for the example in Table 1 with R = 20 under the assumption that all relevant documents in ranking 3 are unique (meaning there are five additional relevant documents that were not retrieved). We can see that precision increases with both recall and reformulation number, suggesting that the system is doing a better job with the later queries. (It may be easier to read cross-sections of the surface; they are shown in Figure 2.)
Finally, just as average precision is computed as the area
under the precision-recall curve, we can define a model-free
“session average precision” (sAP) as the volume under the precision-recall surface. An expression for sAP is:

$$sAP = \frac{1}{mR} \sum_{j=1}^{m} \sum_{r=1}^{R} sPC@(r/R,\, j)$$

where sPC@r,j is the max of the set of all possible precisions at the first point in r_j at which recall r is achieved. Computing sPC@r,j can be done with the O(n^m) approach described in Section 2.2:³ within the jth loop, calculate precision and recall r over the documents on the path to that point; if precision is greater than the current known maximum for sPC@r,j, update sPC@r,j to that precision.

reformulation order     sAP
r1, r2, r3              0.261
r1, r3, r2              0.335
r2, r1, r3              0.344
r2, r3, r1              0.519
r3, r1, r2              0.502
r3, r2, r1              0.602

Table 3: Session average precisions for different permutations of the three ranked lists in Table 1. A system is rewarded for finding more relevant documents in earlier queries.
In this example the volume under the surface is 0.261.
To test whether sAP follows our intuition that it should be
greater when more relevant documents are found for earlier
queries, we calculate it for each permutation of the three
rankings in our example. The results are shown in Table 3.
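The nested-loop computation of sAP can be sketched as follows (Python; this is our reading of the definitions above, not code from the paper, and the names are ours):

```python
from itertools import product

def session_AP(rankings, R):
    """Model-free session average precision: the volume under the
    reformulation precision-recall surface, computed by enumerating
    every prefix of rankings 1..j-1 and walking ranking j to its end."""
    m = len(rankings)                            # 0/1 relevance vectors
    sPC = [[0.0] * (R + 1) for _ in range(m)]    # sPC[j][r] for r = 1..R
    for j in range(m):                           # ranking the path ends in
        for depths in product(*(range(1, len(rankings[i]) + 1)
                                for i in range(j))):
            rel = sum(sum(rankings[i][:d]) for i, d in zip(range(j), depths))
            pos = sum(depths)
            for doc in rankings[j]:              # walk the last ranking
                pos += 1
                if doc and rel < R:
                    rel += 1
                    # first time recall rel/R is reached inside ranking j
                    sPC[j][rel] = max(sPC[j][rel], rel / pos)
    return sum(sPC[j][r] for j in range(m) for r in range(1, R + 1)) / (m * R)
```

On a single-query session this reduces to ordinary average precision, and across permutations of rankings it rewards relevant documents retrieved for earlier queries, consistent with Table 3.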
In the previous section we extended three traditional eval-
uation measures to the case of multi-query session collections
in a model-free fashion. The session-based system measures
capture the optimal contribution of a system to answer an
information need over an entire session. In this section we
look at the problem from a user perspective.
Note that our definition of sPC in the previous section
takes the maximum value of a set. We could instead take
the expected value; this has the advantage of using all of
the data as well as not assuming that a user will have the
optimal experience with the system. However, taking such
an expectation requires a probability distribution over paths;
formulating such a distribution requires a user model.⁴
To simplify the space of all possible browsing paths we
follow the user model described in Section 2: a user steps
down a ranked list of documents until some decision point.
It is important that any realization of the distribution over
possible paths allows for paths that end before the last reformulation in the static collection. Then, if Ω is the set of all possible browsing paths that follow the simple user model described earlier, P(ω) the probability of a certain path ω ∈ Ω, and M_ω a measure over the path ω, then we define a session-based measure as the expectation

$$esM = \sum_{\omega \in \Omega} P(\omega)\, M_\omega$$

³ There is also an O(nm) dynamic programming approach, but we have not included it in this paper for reasons of space.
⁴ A uniform distribution would not work, since most of the paths are very long and therefore have approximately zero precision.
4.1 Probability of browsing paths
As noted in Section 2, we can express a path ω as a set of reformulation points. Let us therefore formulate P(ω = {k_1, k_2, ..., k_i}) as a joint probability distribution of a user abandoning the session at reformulation i, while reformulating at positions ref = {k_1, k_2, ..., k_{i−1}}. Note that we do not include k_i, the abandonment cut-off at ranking i, in the probability. For the sake of generalizing traditional measures, we will assume that once the user arrives at the ith reformulation, they continue down that ranking as far as necessary to compute the measure.
We express the probability of a path ω as

$$P(\omega) = P(r_i, \mathit{ref}) = P(r_i) \cdot P(\mathit{ref} \mid r_i)$$

Here we introduce a simplifying assumption: the reformulation position is independent across the ranked lists 1..r_i. Then P(ref | r_i) can be expressed as

$$P(\mathit{ref} \mid r_i) = P(k_1 \mid r_i)\, P(k_2 \mid r_i) \cdots P(k_{i-1} \mid r_i)$$
In general, we could make each reformulation point depen-
dent on the reformulation number and possibly even on the
relevance of documents in the ranking; in this work we have
elected to keep them independent for simplicity.
For the realization of the probability distribution over different browsing paths we follow Moffat and Zobel [10] in their definition of rank-biased precision and use two geometric distributions. The first gives the probability that the ith reformulation is the last; it has an adjustable parameter p_reform representing the probability that the user reformulates again from their current query. They will only arrive at reformulation i if they reformulate i − 1 times, so:

$$P(r_i) = p_{\mathrm{reform}}^{\,i-1}(1 - p_{\mathrm{reform}})$$

Similarly, the second distribution gives the probability that the kth rank is a stopping or reformulation point, with adjustable parameter p_down. A user will arrive at rank k only after deciding to progress down k − 1 times, so:

$$P(k) = p_{\mathrm{down}}^{\,k-1}(1 - p_{\mathrm{down}})$$

The probability of a path is then

$$P(r_i, \mathit{ref}) = p_{\mathrm{reform}}^{\,i-1}(1 - p_{\mathrm{reform}}) \prod_{j=1}^{i-1} p_{\mathrm{down}}^{\,k_j-1}(1 - p_{\mathrm{down}})$$
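Under these two geometric distributions, the probability of a concrete path is straightforward to compute (a Python sketch; the parameter values are illustrative defaults, not values fixed by the model):

```python
def path_probability(last_ref, ref_cutoffs, p_reform=0.5, p_down=0.8):
    """P(r_i, ref): the user abandons at reformulation last_ref = i
    after reformulating at ranks ref_cutoffs = [k_1, ..., k_{i-1}]."""
    assert len(ref_cutoffs) == last_ref - 1
    p = p_reform ** (last_ref - 1) * (1 - p_reform)        # P(r_i)
    for k in ref_cutoffs:
        p *= p_down ** (k - 1) * (1 - p_down)              # P(k_j)
    return p
```

For example, a user who never reformulates has probability 1 − p_reform, while one who reformulates once at rank 2 picks up an additional geometric factor for that cut-off.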
Our definition of P(r_i) may give non-zero probability to a path that is not valid for a particular collection of static sessions, e.g. one that ends at some point past the last (mth) reformulation in the collection. To address this, we will truncate the distribution P(r_i) and renormalize it to ensure that the probabilities to stop at different reformulations sum to 1. To do this we simply renormalize the probabilities from 1 to m by Pr{r ≤ r_m} = 1 − p_reform^m. That is,

$$P'(r_i) = \frac{p_{\mathrm{reform}}^{\,i-1}(1 - p_{\mathrm{reform}})}{1 - p_{\mathrm{reform}}^{\,m}}$$
We could similarly truncate and renormalize P(k_j). However, paths that extend beyond the last retrieved document for a reformulation (typically ranks beyond 1000) will have very low probability and thus will not contribute in any significant way to the calculation of a measure.
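The truncated distribution can be sketched in one line (Python; the parameter is an illustrative default):

```python
def p_last_ref(i, m, p_reform=0.5):
    """P'(r_i): truncated, renormalized geometric probability that
    reformulation i (1 <= i <= m) is the last in an m-query session."""
    return p_reform ** (i - 1) * (1 - p_reform) / (1 - p_reform ** m)
```

By construction the probabilities over i = 1..m now sum to one.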
4.2 Expectations over paths
Given a measure M_ω to be computed over the documents viewed on a particular path ω, along with the probability distribution over all paths ω ∈ Ω, we can define a session measure as the expectation of M_ω over the probabilistic space of paths:

$$esM = E[M] = \sum_{\omega \in \Omega} P(\omega)\, M_\omega$$
Let us consider a path-averaging generalization to precision at cut-off k. First define PC@k(ω) as the precision of the first k documents experienced on path ω. Then:

$$esPC@k = \sum_{\omega \in \Omega} P(\omega)\, PC@k(\omega)$$

PC@k(ω) is the total proportion of relevant documents at ranks 1..k_1 in r_1, 1..k_2 in r_2, and so on:

$$PC@k(\omega) = \frac{1}{k} \sum_{j=1}^{i} \sum_{l=1}^{k_j} rel_{jl}$$

where rel_{jl} is the relevance of the document at rank l of ranking j, and k_i, the abandonment cut-off at ranking i, is equal to k − (k_1 + k_2 + ··· + k_{i−1}). Plugging that into the expression for E[PC@k] completes the formula.
Similarly, the expectation of recall after k documents can be computed using

$$RC@k(\omega) = \frac{1}{R} \sum_{j=1}^{i} \sum_{l=1}^{k_j} rel_{jl}$$

where k_i is defined as above.
A path-averaging generalization to average precision is:

$$esAP = \sum_{\omega \in \Omega} P(\omega)\, AP(\omega)$$

where AP(ω) is the average precision of the concatenated list of documents on path ω.
We can continue to define any measure this way. We will conclude with a path-averaging generalization to nDCG:

$$esnDCG@k = \sum_{\omega \in \Omega} P(\omega)\, nDCG@k(\omega)$$

where nDCG@k(ω) is the nDCG@k of the concatenated list.
All of the above formulations involve summing over paths ω. In general, summing a function f(ω) over all paths can be expressed in a brute-force way as:

$$\sum_{\omega \in \Omega} f(\omega) = \sum_{i=1}^{m} \sum_{k_1=1}^{|r_1|} \cdots \sum_{k_{i-1}=1}^{|r_{i-1}|} f(\{k_1, k_2, ..., k_{i-1}\})$$

Note that computing it involves on the order of |r_1| × |r_2| × ··· × |r_{m−1}| = O(n^m) steps.
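The brute-force sum can be written directly as nested loops over the last reformulation and the cut-offs (a Python sketch of our reading; `measure` is any function of the concatenated 0/1 relevance list, and the geometric parameters are illustrative):

```python
from itertools import product

def expected_session_measure(rankings, measure, p_reform=0.5, p_down=0.8):
    """Brute-force esM = sum over paths of P(omega) * M(omega).
    rankings are 0/1 relevance vectors; the user walks the last
    ranking in full.  O(n^m) in the worst case."""
    m = len(rankings)
    norm = 1 - p_reform ** m              # truncation constant for P'(r_i)
    total = 0.0
    for i in range(1, m + 1):             # ranking the user abandons in
        p_i = p_reform ** (i - 1) * (1 - p_reform) / norm
        for cuts in product(*(range(1, len(rankings[j]) + 1)
                              for j in range(i - 1))):
            p_path = p_i
            docs = []
            for j, k in enumerate(cuts):
                p_path *= p_down ** (k - 1) * (1 - p_down)
                docs.extend(rankings[j][:k])
            docs.extend(rankings[i - 1])  # continue down the last ranking
            total += p_path * measure(docs)
    return total
```

Note that, as discussed above, the cut-off distribution is not renormalized here, so a small amount of probability mass falls on cut-offs past the end of each list.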
4.3 Monte Carlo estimation
A running time of O(n^m) is manageable, but it is not fast, especially as m grows. Since our model for these measures is fully probabilistic, a faster alternative approach to
estimating them uses simulation. A Monte Carlo simulation
method allows the estimation of a measure via repeated ran-
dom sampling. Running a Monte Carlo experiment requires
defining a domain of possible inputs, generating inputs ran-
domly from the domain using specific probability distribu-
tions, performing a deterministic computation using the in-
puts, and finally aggregating the results of the individual
computations into the final result.
In the case of the user-model based measures proposed
above, the input space is the ranking r_i at which the user abandons the query and the reformulation cut-offs at all previous queries {k_1, ..., k_{i−1}}.
Each of the above path-averaging measures can be thought of as the expected outcome of the following random experiment:
1. Sample the last reformulation r_i from P′(r_i).
2. Sample (k_1, k_2, ..., k_{i−1}) i.i.d. from P(k_j) to form a path ω.
3. Create a ranked list of documents by concatenating ranks 1...k_1 from r_1, 1...k_2 from r_2, ..., 1...k_{i−1} from r_{i−1}, and 1...n from r_i. These are the documents seen along path ω.
4. Output measure M over that ranked list.
This random experiment defines one round of the Monte
Carlo experiment. Executing the first two steps requires
sampling from a geometric distribution. This can be easily
performed assuming access to an algorithm that generates
pseudo-random numbers uniformly distributed in the interval (0,1). Regarding the distribution of the last reformulation, since it is renormalized, we can first partition the interval (0,1) into ((0..P′(r_1)), (P′(r_1)..P′(r_1)+P′(r_2)), ···, ((1 − P′(r_m))..1)). We then use the random number generator to obtain a number in (0,1), and output j if this number is in the jth partition. In the case of the cut-off distribution the same process can be followed. As mentioned earlier we did not renormalize this distribution and thus the last partition does not end in 1; however, renormalization can be easily performed in the case of Monte Carlo by simply rejecting any sample larger than the upper bound of the last partition.
Repeating the process above B times and averaging the results gives an estimate of the expectation of measure M. For most purposes B = 1000 (which will usually be much less than n^m) should be sufficient; we explore the errors in estimates due to B in Section 6.
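One round of the experiment, and the B-round average, can be sketched as follows (Python; inverse-CDF sampling for the truncated last-reformulation distribution and rejection for cut-offs past the end of a list, as described above — the parameters are illustrative):

```python
import random

def mc_session_measure(rankings, measure, p_reform=0.5, p_down=0.8,
                       B=1000, seed=0):
    """Monte Carlo estimate of E[M] over browsing paths: sample the
    last reformulation i from the truncated P'(r_i), sample cut-offs
    i.i.d. from the p_down geometric (rejecting cut-offs past the end
    of a list), concatenate the viewed documents, and average M."""
    rng = random.Random(seed)
    m = len(rankings)
    norm = 1 - p_reform ** m
    total = 0.0
    for _ in range(B):
        # inverse-CDF sample of the last reformulation index i
        u, i, cdf = rng.random(), 1, 0.0
        while True:
            cdf += p_reform ** (i - 1) * (1 - p_reform) / norm
            if u < cdf or i == m:
                break
            i += 1
        docs = []
        for j in range(i - 1):
            while True:                  # geometric cut-off; reject any
                k = 1                    # sample past the end of the list
                while rng.random() < p_down:
                    k += 1
                if k <= len(rankings[j]):
                    break
            docs.extend(rankings[j][:k])
        docs.extend(rankings[i - 1])     # walk the last ranking fully
        total += measure(docs)
    return total / B
```

With a fixed seed the estimate is reproducible, and for a single-query session it is exact, since the only possible path walks the whole ranking.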
Our measures to this point make a strong assumption:
that retrieval systems return unique documents for each
query reformulation. Under this assumption, the concatenated ranked list of documents cl which corresponds to a certain browsing path ω resembles a typical ranked list of documents in the standard evaluation paradigm. We certainly do not expect this assumption to hold in real systems; it is likely that documents retrieved for a second query will overlap with those retrieved for the first. When relaxing this assumption, we need to consider how these duplicate documents should be treated from the perspective of the evaluation measure.
The first question raised is whether documents repeated in
ranked lists for subsequent reformulations have any value for
a user. Järvelin et al. [6] noticed that in an empirical interactive search study conducted by Price et al. [12] searchers
overlooked documents in early queries but recognized them
in later reformulations. Due to this, the proposed sDCG
measure does not give any special treatment to duplicate
relevant documents; it considers them relevant regardless of
the number of times they have been seen before.
But measures with a recall component (such as average
precision or recall at k) cannot count duplicates in a sound
way. Since there are multiple possible paths through the
results, and these paths could have duplicate relevant documents, it is possible that more than R relevant documents
could be observed along any given path. The computed
measure may exceed the desired maximum value of 1.
We can instead consider duplicate documents nonrelevant
in all cases. This has certain implications as well. For one,
penalizing a retrieval system for returning duplicate doc-
uments may lead to systems that are less transparent to
their users. Imagine a user that reformulates a query and
expects to see previously observed relevant documents at
higher ranks than before. If they are not there, the user may
question whether the system can ever be useful to them.
Furthermore, by definition the expected measures reward
a retrieval system that responds in an optimal way to a pop-
ulation of users. These different users may very well follow
different browsing paths. In an actual retrieval setup, a sys-
tem can infer whether a document has been observed by a
user (e.g. by observing the users’ clicks). In the case of a
batch experiment, however, a certain document may be a
duplicate for one user but not for another, depending on the
browsing path each one of them has followed. This informa-
tion is hidden in the parameters of the particular evaluation
measure (which simulates the population of users). Tak-
ing these parameters into account a system could respond
optimally by removing the expected duplicate documents.
However, the need for such an inference process and a rank-
ing function that accounts for the average browsing path
is just an artifact of the batch nature of experiments. A
retrieval system running in the real world will not need to
employ such an algorithm.
Yang and Lad [16] take an approach in between the two extremes. Although they did not explicitly consider the problem of exact duplicates, the proposed measure is a measure of information novelty over multi-query sessions and thus it takes the typical approach other novelty measures take [4] by defining information nuggets and discounting documents that contain the same relevant information nugget.5

5 The measure by Yang and Lad [16] was proposed for the task of information distillation and thus it operates over passages instead of documents; the same approach however could be used for the case of documents.

In this work we consider an alternative treatment of duplicates inspired by Sakai's compressed ranked lists [14] and Yilmaz & Aslam's inducedAP [17]. When considering a path ω out of the population of all paths, we construct the concatenated list cl that corresponds to this path. We then walk down the concatenated list and simply remove duplicate documents, effectively pushing subsequent documents one rank up. This way, we neither penalize systems for repeating information possibly useful to the user, nor do we
push unnecessary complexity to the retrieval system side.
Further, the measures still reward the retrieval of new relevant documents. Note here that such an approach assumes that a system ranks documents independently of each other in a ranked list (the probabilistic ranking principle [13]). If this is not true, i.e. if ranking a document depends on previously ranked documents and the retrieval system is agnostic to our removal policy, then this may also lead to unsound evaluation results.
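The removal policy just described, walking down the concatenated list, dropping previously seen documents, and moving later documents up one rank, can be sketched as follows. This is a minimal illustration; the function name is hypothetical.

```python
def dedup_concatenated(concatenated):
    """Remove duplicate documents from a concatenated ranked list,
    keeping only the first occurrence of each document and shifting
    subsequent documents up one rank."""
    seen = set()
    deduped = []
    for doc in concatenated:
        if doc not in seen:
            seen.add(doc)
            deduped.append(doc)
    return deduped
```

The resulting compressed list is then scored with the usual measure, so repeated documents neither gain nor lose credit.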
6 Experiments

In this section we demonstrate the behavior of the proposed measures. There is currently no “gold standard” for session evaluation measures; our goal in this section is to
evaluate whether our measures provide information about
system effectiveness over a session and whether they cap-
ture different attributes of performance in a similar way to
traditional precision, recall, and average precision. We will
compare our measures to the session nDCG measure proposed by Järvelin et al., though we consider none of these,
nor any other measure, to be the one true session measure.
We use two collections towards these goals: (a) the TREC
2010 Session track collection [7], and (b) the TREC-8 (1999)
Query track collection [2]. Though the latter is not a collec-
tion of multi-query sessions, we will find it useful to explore
properties of our measures. Both of these collections are
described in more detail below.
The instantiation of session nDCG@k we use is calculated as follows: we start by concatenating the top k results from each ranked list of results in the session. For each rank i in the concatenated list, we compute the discounted gain as

    DG@i = (2^rel(i) − 1) / log_b(i + (b − 1))

where b is a log base typically chosen to be 2. These are the summands of DCG as implemented by Burges et al. [3] and used by many others. We then apply an additional discount to documents retrieved for later reformulations. For rank i between 1 and k, there is no discount. For rank i between k + 1 and 2k, the discount is 1/log_bq(2 + (bq − 1)), where bq is the log base. In general, if the document at rank i came from the jth reformulation, then

    sDG@i = DG@i / log_bq(j + (bq − 1))

Session DCG is then the sum over sDG@i:

    sDCG@k = Σ_{i=1..mk} (2^rel(i) − 1) / (log_bq(j + (bq − 1)) · log_b(i + (b − 1)))

with j = ⌊(i − 1)/k⌋ + 1 and m the length of the session. We use bq = 4. This implementation resolves a problem present in the original definition by Järvelin et al. [6] by which documents in top positions of an earlier ranked list are penalized more than documents in later ranked lists.
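As an illustration, the instantiation above can be computed in a few lines. This is a reading of the formulas given here (gains 2^rel − 1, reformulation discount log_bq(j + bq − 1), defaults b = 2 and bq = 4), not an official implementation, and the names are hypothetical.

```python
import math

def sdcg(session_lists, rel, k=10, b=2, bq=4):
    """Session DCG over the top-k results of each ranked list in a
    session, with an extra per-reformulation discount.

    session_lists : list of ranked lists of document ids, one per query
    rel           : dict mapping document id to graded relevance
    """
    score = 0.0
    i = 0  # rank in the concatenated list
    for j, ranked in enumerate(session_lists, start=1):
        # Discount for the jth reformulation: log_bq(j + bq - 1);
        # equals 1 for j = 1, so the initial query is not discounted.
        reform_discount = math.log(j + bq - 1, bq)
        for doc in ranked[:k]:
            i += 1
            gain = 2 ** rel.get(doc, 0) - 1
            score += gain / (reform_discount * math.log(i + b - 1, b))
    return score
```

A single relevant document at rank 1 of the initial query contributes a gain of exactly 1 under these defaults, which matches the formula term by term.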
As with the standard definition of DCG, we can also com-
pute an “ideal” score based on an optimal ranking of docu-
ments in decreasing order of relevance to the query and then
normalize sDCG by that ideal score to obtain nsDCG@k.
nsDCG@k essentially assumes a specific browsing path: ranks 1 through k in each subsequent ranked list, thereby giving a path of length mk. Therefore, we set the cut-offs of our expected session measures to mk (with the exception of AP). For the computation of the expected session measures the parameter p_down is set to 0.8 following the recommendation by Zhang et al. [18]; in expectation, users stop at rank 5. The parameter p_reform is set arbitrarily to 0.5. With m = 2 reformulations, the probability of a user stopping at the first reformulation is then 67% and of moving to the next reformulation 33%, which is not far off the percentage of users reformulating their initial queries in the Excite logs [15].
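The 67%/33% split follows if the probability of stopping after the jth query is geometric in p_reform and renormalized over the m queries available in the session. This is our reading of the numbers above, sketched with hypothetical names:

```python
def stop_probabilities(p_reform, m):
    """P(user stops the session at query j), j = 1..m, assuming the
    user issues another reformulation with probability p_reform and
    the truncated geometric distribution is renormalized over the m
    queries actually available in the session."""
    raw = [(p_reform ** (j - 1)) * (1 - p_reform) for j in range(1, m + 1)]
    total = sum(raw)
    return [w / total for w in raw]
```

For p_reform = 0.5 and m = 2 this gives 2/3 and 1/3, i.e. the 67%/33% quoted in the text; similarly, the geometric rank model with p_down = 0.8 has expected stopping rank 1/(1 − 0.8) = 5.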
6.1 Session track collection
The Session track collection consists of a set of 150 two-
query sessions (initial query followed by one reformulation).
Out of the 150 sessions, 136 were judged. The judged 136
topics include 47 for which the reformulation involves greater
specification of the information need, 42 for which the re-
formulation involves more generalization, and 47 for which
the information need “drifts” slightly. In the case of spec-
ification and generalization, both the initial query and its
reformulation represent the same information need, while in
the case of drifting the two queries in the session represent
two different (but related) information needs. Given that
some of the proposed session measures make the assump-
tion of a single information need per session—these are the
recall-based measures such as AP and recall at cutoff k—we
drop the 47 drifting sessions from our experiments.
Each participating group submitted runs consisting of three
ranked lists of documents:
RL1 ranked results for the initial query;
RL2 ranked results for the query reformulation indepen-
dently of the initial query;
RL3 ranked results for the query reformulation when the
initial query was taken into consideration.
Thus, each submission consists of two pairs of runs over
the two-query session: RL1→RL2 and RL1→RL3. The
document corpus was the ClueWeb09 collection. Judging
was based on a shallow depth-10 pool from all submitted
ranked lists. Kanoulas et al. detail the collection further [7].
Figure 3 shows example precision-recall surfaces for two
submissions, CengageS10R3 [9] and RMITBase [8]. In both
cases there is a moderate improvement in performance from
the first query to the second. The decrease in precision is
rapid in both, but slightly less so in RMITBase. As a result,
though CengageS10R3 starts out with higher precisions at
lower recalls, the model-free mean sAP values are close: 0.240
and 0.225 respectively. In general, these surfaces, like tradi-
tional precision-recall curves, provide a good sense of relative
effectiveness differences between systems and where in the
ranking they occur.
We use the submitted RL1 and RL2 runs (27 submissions
in total) to compare the proposed model-based measures
with normalized session DCG. nsDCG is computed at cut-
off 10. We compute all measures in Section 2 with cut-off
2·10 = 20 (to ensure the same number of documents are
used). Scatter plots of nsDCG@10 versus expected session
nDCG@20 (esnDCG), PC@20 (esPC), RC@20 (esRC), and
AP (esAP) are shown in Figure 4. Each point in the plot
corresponds to a participant’s RL1→RL2 submission; mea-
sures are averaged over 89 topics. The strongest correlation
[Figure 3: Precision-recall surfaces for CengageS10R3 and RMITBase.]

[Figure 4: Scatter plots of the mean performance of systems in the Session track for the session RL1→RL2. The x-axis is nsDCG@10; the y-axis is expected session nDCG@20 (esnDCG), PC@20 (esPC), RC@20 (esRC), and AP (esAP) for the four plots from left to right and top to bottom. Kendall’s tau with nsDCG@10: esnDCG 0.7972, esPC 0.7334, esRC 0.6987, esAP 0.5247.]
is between esnDCG and snDCG (as expected). Interestingly,
esAP strongly disagrees with snDCG; this demonstrates that
esAP measures different aspects of system effectiveness over
sessions than snDCG. Table 4 shows τ correlations between
all the expected measures as well as the model-free sAP;
overall the correlations are within the range expected for
these measures with a relatively small number of input sys-
tems. They are high relative to random orderings, but low
enough that it is clear that all the measures are capturing
different aspects of effectiveness. esAP and sAP do not cor-
relate very well, but recall that sAP is based on maximums
while esAP is based on averages.
Kendall’s tau correlation
           esRC   esAP   esnDCG   sAP
esPC       0.88   0.79   0.93     0.78
esRC              0.80   0.84     0.87
esAP                     0.72     0.78
esnDCG                            0.74

Table 4: Kendall’s tau correlation across the four expected session measures and sAP.
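For reference, the τ values in Table 4 are plain pairwise rank-correlation statistics over per-system mean scores. A minimal tau-a sketch (illustrative only; it ignores the tie corrections applied by standard statistical packages):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists:
    (concordant pairs - discordant pairs) / total pairs."""
    assert len(x) == len(y) and len(x) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Two measures that rank all systems identically get τ = 1, fully reversed rankings get τ = −1, and the 0.5–0.9 range in Table 4 indicates related but distinct orderings.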
6.2 Query track collection
The Query track [2] was an effort to understand the system-
query-topic interaction often observed in IR experiments,
where certain systems perform well for certain queries but
under-perform for others. A set of 50 topics (topics 51-100
from the TREC-1 collection) was provided to participants;
they responded with a set of queries for each topic. The con-
structed queries were in one of the following forms, (a) very
short: 2-4 words based on the topic and few relevance judg-
ments; (b) sentences: 1-2 sentences using the topic and few
relevant documents; (c) sentences+feedback: 1-2 sentences
using only the relevant documents; and (d) weighted terms.
Overall 23 query sets were produced, each consisting of 50
queries corresponding to the 50 topics of TREC-1. Partici-
pants ran their retrieval systems over each of the query sets
and submitted results for each system for each query set.
We use subsets of the 23 query sets and the submitted runs
to simulate query sessions.
The goal of our first experiment is to empirically verify
that the proposed measures reward early retrieval of relevant
documents in the session. In this experiment we simulate
four sets of session retrieval algorithms over 50 two-query
sessions. The first simulated algorithm (“good”–“good”) per-
forms well on both the initial query and its reformulation
in a hypothetical query session; the second (“good”–“bad”)
performs well on the initial query but not on its reformulation; the third (“bad”–“good”) does not perform well on the initial query but does on its reformulation; and the last (“bad”–“bad”) does not perform well on either query.
To simulate these algorithms, we note that the systems
participating in the Query track performed particularly well
on short queries (with an average MAP of 0.227), while they
did not perform well on the sentence-feedback queries (av-
erage MAP of 0.146) [2]. Taking this into consideration we
simulate the best algorithm using runs over two short formu-
lations of the same query. In particular, the nine runs over
the query set named INQ1a with simulated reformulation
results from the nine runs over query set INQ1b (with aver-
age MAP equal to 0.110 and 0.127 respectively, as computed
by trec_eval) simulated a set of systems performing well
both over the initial query of a hypothetical session and its
reformulation. We simulate the worst system by runs over
two sentence-feedback formulations. In particular the runs
over query set INQ3a with reformulations being the runs
over query set INQ3b (with average MAP 0.078 and 0.072
respectively) simulated a set of systems that performed well
neither over the initial nor over the reformulated query. The
other two combinations—INQ1a to INQ3b and INQ3a to
INQ1b—simulated medium performance systems.6
What we expect to observe in this simulation is that session measures reward the “good”–“good” simulated systems the most, followed by the “good”–“bad”, the “bad”–“good”, and finally the “bad”–“bad” systems. Table 5 shows the average mean session scores produced by three of the expected session measures over the four simulated sets of retrieval algorithms. The results verify the expected behavior of the proposed measures. This is the exact same behavior we observed for the sAP measure in Table 3.

6 The query sets INQ1a, INQ1b, INQ3a, INQ3b were manually constructed by students at UMass Amherst [2].

Average mean session measures
                esMPC@20   esMRC@20   esMAP
“good”–“good”   0.378      0.036      0.122
“good”–“bad”    0.363      0.034      0.112
“bad”–“good”    0.271      0.023      0.083
“bad”–“bad”     0.254      0.022      0.073

Table 5: esPC@20, esRC@20, and esAP, each averaged over nine Query track systems simulating four session retrieval systems.
We conclude by testing the accuracy of the Monte Carlo method in computing the expected session measures. We simulated a two-query session by randomly selecting two sets of query formulations and their corresponding runs and forming a session from those. We run our O(n^m) exact algorithm along with the Monte Carlo method. For the latter we use B = 10, B = 100 and B = 1000 trials. The results are illustrated in the left-hand plot in Figure 5, showing the high accuracy of the Monte Carlo simulation even with as few as 10 repetitions. Obviously, the accuracy of the method depends on the length of the session together with the set parameters p_down and p_reform. We did not test for different parameters; however, we did repeat the same process over 3-query sessions. The results are illustrated in the right-hand plot in Figure 5. There is somewhat more variance, but even with B = 10 the results are very accurate.
7 Conclusions

In this work we considered the problem of evaluating retrieval systems over static query sessions. In our effort to make as few assumptions about user behavior over a session as possible, we proposed two families of measures: a
model-free family inspired by interpolated precision, and a
model-based family with a simple model of user interaction
described by paths through ranked results. With a novel ses-
sion test collection and a session collection simulated from
existing data, we showed that the measure behaviors cor-
respond to our intuitions about their traditional counter-
parts as well as our intuitions about systems that are able
to find more relevant documents for queries earlier in the
session. The measures capture different aspects of system
performance, and also capture different aspects than what
is captured by the primary alternative, snDCG.
There are surely other ways to generalize traditional eval-
uation measures and paradigms to session evaluation than
those we have presented. Our goal with this work is to de-
fine measures as intuitively as possible while keeping models
of user behavior simple. In the future we will want to con-
sider explicit models of when and why users reformulate: is
it based on the relevance of the documents they see? Does it
depend on how many times they have already reformulated?
Are they willing to go deeper down the ranking for a later
reformulation than an earlier one?
[Figure 5: Exact method versus Monte Carlo in the computation of expected session AP (esAP) for B = 10, B = 100 and B = 1000, for 2- and 3-query sessions. Kendall’s tau between the exact and Monte Carlo esAP: 2-query sessions: B=10 tau=0.957, B=100 tau=0.981, B=1000 tau=0.983; 3-query sessions: B=10 tau=0.896, B=100 tau=0.947, B=1000 tau=0.97.]
There are many cases we have not explicitly considered.
“Drifting” information needs, which were part of the TREC
2010 Session track, may require special treatment for eval-
uation since the set of relevant documents can change as
the need drifts. Furthermore, there are many examples of
sessions in which a query and its results serve to guide the
user in selecting future queries rather than immediately pro-
vide relevant documents; while we can apply our measures
to these types of sessions, they are clearly not designed to
measure a system’s effectiveness at completing them.
Deeper understanding of session measures and their rela-
tionship to the user experience will come from future work
on session test collections, application to “real” sessions in
query log data, and extensive experimentation and analysis.
We plan to continue all these lines of research in the future.
Acknowledgments

We gratefully acknowledge the support provided by the European Commission grants FP7-PEOPLE-2009-IIF-254562 and FP7/2007-2013-270082 and by the University of Delaware Research Foundation (UDRF).
References

[1] M. J. Bates. The design of browsing and berrypicking techniques for the online search interface. Online Review, 13(5):407–431, 1989.
[2] C. Buckley and J. A. Walz. The trec-8 query track. In The
Eighth Text REtrieval Conference Proceedings (TREC
1999), 1999.
[3] C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, NIPS, pages 193–200. MIT Press, 2006.
[4] C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova,
A. Ashkan, S. Büttcher, and I. MacKinnon. Novelty and
diversity in information retrieval evaluation. In SIGIR ’08:
Proceedings of the 31st annual international ACM SIGIR
conference on Research and development in information
retrieval, pages 659–666, New York, NY, USA, 2008. ACM.
[5] W. S. Cooper. Expected search length: a single measure of
retrieval effectiveness based on the weak ordering action of
retrieval systems. American Documentation, 19:30–41, 1968.
[6] K. Järvelin, S. L. Price, L. M. L. Delcambre, and M. L. Nielsen. Discounted cumulated gain based evaluation of multiple-query ir sessions. In ECIR, pages 4–15, 2008.
[7] E. Kanoulas, B. Carterette, P. Clough, and M. Sanderson.
Session track overview. In The Nineteenth Text REtrieval Conference Notebook Proceedings (TREC 2010), December 2010.
[8] S. Kharazmi, F. Scholer, and M. Wu. RMIT University at
TREC 2010: Session track. In Proceedings of TREC, 2010.
[9] B. King and I. Provalov. Cengage learning at the TREC
2010 Session track. In Proceedings of TREC, 2010.
[10] A. Moffat and J. Zobel. Rank-biased precision for
measurement of retrieval effectiveness. ACM Trans. Inf.
Syst., 27(1):1–27, 2008.
[11] P. Over. Trec-7 interactive track report. In The Seventh
Text REtrieval Conference Proceedings (TREC 1998),
pages 33–39, 1998.
[12] S. Price, M. L. Nielsen, L. M. L. Delcambre, and
P. Vedsted. Semantic components enhance retrieval of
domain-specific documents. In M. J. Silva, A. H. F.
Laender, R. A. Baeza-Yates, D. L. McGuinness, B. Olstad,
Ø. H. Olsen, and A. O. Falcão, editors, CIKM, pages
429–438. ACM, 2007.
[13] S. E. Robertson. The probability ranking principle in IR.
Journal of Documentation, 33(4):294–304, 1977.
[14] T. Sakai. Alternatives to bpref. In SIGIR ’07: Proceedings
of the 30th annual international ACM SIGIR conference
on Research and development in information retrieval,
pages 71–78, New York, NY, USA, 2007. ACM.
[15] D. Wolfram, A. Spink, B. J. Jansen, and T. Saracevic. Vox
populi: The public searching of the web. JASIST,
52(12):1073–1074, 2001.
[16] Y. Yang and A. Lad. Modeling expected utility of
multi-session information distillation. In L. Azzopardi,
G. Kazai, S. E. Robertson, S. M. Rüger, M. Shokouhi, D. Song, and E. Yilmaz, editors, ICTIR, pages 164–175, 2009.
[17] E. Yilmaz and J. A. Aslam. Estimating average precision
with incomplete and imperfect judgments. In P. S. Yu,
V. Tsotras, E. Fox, and B. Liu, editors, Proceedings of the
Fifteenth ACM International Conference on Information
and Knowledge Management, pages 102–111. ACM Press,
November 2006.
[18] Y. Zhang, L. A. Park, and A. Moffat. Click-based evidence
for decaying weight distributions in search effectiveness
metrics. Inf. Retr., 13:46–69, February 2010.
... With regards to evaluation metrics for user search performance, we considered the factors of incomplete relevance judgments in the test collection [5,48], models of user behavior [6,26,44], multiplequery sessions [27,29,52] and alignment between system and user performance [6,20]. Since we were concerned about user performance in a search session, we chose the bpref of the best performing query (bpref_bq) [9] and the R-Precision of the best performing query (Rprec_bq). ...
... Since the search performance measures of nDCG_mean and sDCG/q are used to summarize the overall performance with multiquery sessions by applying a query discount [e.g., 26,27,29], it's not surprising that either more queries or typed queries are correlated with session-based metrics. The low minimum of nDCG in some sessions reveals some unsuccessful searches, which may reflect the fact that users were struggling during the search process, or additional support was needed to make the best use of the suggested keywords. ...
Conference Paper
The objective of this controlled information retrieval (IR) user experiment is to gain an understanding of domain experts’ interactions with novel search interfaces within the context of biomedical information search, with a goal of better search interface design. In this paper, we examine the relationships among user perception, gaze and search behaviour and user search performance. An eye-tracking study of biomedical domain experts’ interactions with novel search interfaces was conducted. A total of thirty-two users participated and searched for documents answering eight complex exploratory search tasks, using four different search interfaces. The findings suggest that gaze behaviour in terms of fixation durations based measures of areas of interest (AOI), i.e., visual attention to the elements of title, author, abstract and MeSH (Medical Subject Headings) terms in document surrogates is correlated with search performance. Users are more likely to achieve better search performance by precision-based measures when 1) search tasks are perceived as difficult; 2) users attend to the element of abstract; and 3) users can recall using the per-query suggestions during the search processes. More importantly, our findings suggest that a user search interface design that displays contextual information between the suggested keywords and the document may better support users reformulating their queries for complex search tasks in the biomedical domain. We discuss implications for the design of search user interfaces for biomedical searching.
... Os métodos que utilizam alguma forma de interação com o usuário, por exemplo, via feedback do usuário para indicar se uma informação é relevante ou não, são chamados de sistemas interativos. Cada ciclo de interação do usuário com o resultado de uma busca define uma iteração (Järvelin et al ., 2008); (Kanoulas et al ., 2011). Para mensurar a eficácia, comparar sistemas, construir análises gráficas e realizar testes estatísticos para este tipo de sistema é necessária a utilização de um conjunto diverso de ferramentas, de modo trabalhoso e suscetível a erros e incompatibilidades. ...
Atualmente, há várias ferramentas de captura e armazenamento de informações que são utilizadas diariamente por grande parte da população.
... sDCG basically sums up the nDCG values of all the queries in the session and gives higher weight to the earlier queries. Kanoulas et al. (2011) later introduced a normalized variation of sDCG, called nsDCG. Jiang and Allan (2016) conducted a user study to measure the correlation between these metrics and user's opinion. ...
Conversational information seeking (CIS) is concerned with a sequence of interactions between one or more users and an information system. Interactions in CIS are primarily based on natural language dialogue, while they may include other types of interactions, such as click, touch, and body gestures. This monograph provides a thorough overview of CIS definitions, applications, interactions, interfaces, design, implementation, and evaluation. This monograph views CIS applications as including conversational search, conversational question answering, and conversational recommendation. Our aim is to provide an overview of past research related to CIS, introduce the current state-of-the-art in CIS, highlight the challenges still being faced in the community. and suggest future directions.
... They suggest calculating the Expected Global Utility over a set of possible interaction patterns. Kanoulas et al. (2011a) introduced a suite of model-based and model-free session evaluation measures. Tang and Yang (2017) investigated the challenge of calculating an upper bound for a number of session search metrics and suggested a new normalised measure. ...
A user model is a fundamental component in user-centred information retrieval systems. It enables personalization of a user's search experience. The development of such a model involves three phases: collecting information about each user, representing such information, and integrating the model into a retrieval application. Progress in this area is typically met with privacy and scalability challenges that hinder the ability to synthesize collective knowledge from each user's search behaviour. In this thesis, I propose a framework that addresses each of these three phases. The proposed framework is based on social role theory from the social science literature and at the centre of this theory is the concept of a social position. A social position is a label for a group of users with similar behavioural patterns. Examples of such positions are traveller, patient, movie fan, and computer scientist. In this thesis, a social position acts as a label for users who are expected to have similar interests. The proposed framework does not require real users' data; rather it uses the web as a resource to model users. The proposed framework offers a data-driven and modular design for each of the three phases of building a user model. First, I present an approach to identify social positions from natural language sentences. I formulate this task as a binary classification task and develop a method to enumerate candidate social positions. The proposed classifier achieves an accuracy score of 85.8%, which indicates that social positions can be identified with good accuracy. Through an inter-annotator agreement study, I further show a reasonable level of agreement between users when identifying social positions. Second, I introduce a novel topic modelling-based approach to represent each social position as a multinomial distribution over words. This approach estimates a topic from a document collection for each position. 
To construct such a collection for a particular position, I propose a seeding algorithm that extracts a set of terms relevant to the social position. Coherence-based evaluation shows that the proposed approach learns significantly more coherent representations when compared with a relevance modelling baseline. Third, I present a diversification approach based on the proposed framework. Diversification algorithms aim to return a result list for a search query that would potentially satisfy users with diverse information needs. I propose to identify social positions that are relevant to a search query. These positions act as an implicit representation of the many possible interpretations of the search query. Then, relevant positions are provided to a diversification technique that proportionally diversifies results based on each social position's importance. I evaluate my approach using four test collections provided by the diversity task of the Text REtrieval Conference (TREC) web tracks for 2009, 2010, 2011, and 2012. Results demonstrate that my proposed diversification approach is effective and provides statistically significant improvements over various implicit diversification approaches. Fourth, I introduce a session-based search system under the framework of learning to rank. Such a system aims to improve the retrieval performance for a search query using previous user interactions during the search session. I present a method to match a search session to its most relevant social positions based on the session's interaction data. I then suggest identifying related sessions from query logs that are likely to be issued by users with similar information needs. Novel learning features are then estimated from the session's social positions, related sessions, and interaction data. I evaluate the proposed system using four test collections from the TREC session track. 
This approach achieves state-of-the-art results compared with effective session-based search systems. I demonstrate that such a strong performance is mainly attributed to features that are derived from social positions' data.
... Sloan and Wang [67] used these metrics to evaluate multi-page search systems. The session Discounted Cumulative Gain (sDCG) [38], that models graded relevance, and user effort is used to evaluate dynamic search methods in multi-query search sessions. Recently, the TREC Dynamic Domain (DD) track [79] adopted the Cube Test (CT) [47], and the Average Cube Test (ACT) [79] metrics, which model relevance, diversity, and effort, to evaluate dynamic search systems tackling multi-aspect information needs. ...
Full-text available
In many search scenarios, such as exploratory, comparative, or survey-oriented search, users interact with dynamic search systems to satisfy multi-aspect information needs. These systems utilize different dynamic approaches that exploit various user feedback granularity types. Although studies have provided insights about the role of many components of these systems, they used black-box and isolated experimental setups. Therefore, the effects of these components or their interactions are still not well understood. We address this by following a methodology based on ANalysis Of VAriance (ANOVA). We built a Grid Of Points that consists of systems based on different ways to instantiate three components: initial rankers, dynamic rerankers, and user feedback granularity. Using evaluation scores based on the TREC Dynamic Domain collections, we built several ANOVA models to estimate the effects. We found that: (i) although all components significantly affect search effectiveness, the initial ranker has the largest effective size; (ii) the effect sizes of these components vary based on the length of the search session and the used effectiveness metric, and (iii) initial rankers and dynamic rerankers have more prominent effects than user feedback granularity. To improve effectiveness, we recommend improving the quality of initial rankers and dynamic rerankers. This does not require eliciting detailed user feedback, which might be expensive or invasive.
... The main difference between our evaluation measure [51], i.e. Equation 1, and existing evaluation metrics, such as session-ndcg [22], is that instead of using an overall relevance judgement for an entire session, we evaluate the retrieval effectiveness of the anticipatory list of each query separately. ...
Reducing user effort in finding relevant information is one of the key objectives of search systems. Existing approaches have been shown to effectively exploit the current search session context of users for automatically suggesting queries to reduce their search effort. However, these approaches do not accomplish the end-goal of a search system: that of retrieving a set of potentially relevant documents for the evolving information need during a search session. This paper takes the problem of query prediction one step further by investigating the problem of contextual recommendation within a search session. More specifically, given the partial context information of a session in the form of a small number of queries, we investigate how a search system can effectively predict the documents that a user would have been presented with, had they continued the search session by submitting subsequent queries. To address the problem, we propose a model of contextual recommendation that seeks to capture the underlying semantics of information need transitions of a current user's search context. This model leverages information from a number of past interactions of other users with similar interactions from an existing search log. To identify similar interactions, as a novel contribution, we propose an embedding approach that jointly learns representations of both individual query terms and of queries (in their entirety) from search log data by leveraging session-level containment relationships. Our experiments, conducted on a large query log, namely the AOL log, demonstrate that using a joint embedding of queries and their terms within our proposed framework of document retrieval outperforms a number of text-only and sequence modeling based baselines.
Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems in multiple rounds through natural language dialogues. Evaluating such systems is very challenging, given that any natural language responses could be generated, and users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies proposed many evaluation metrics, the extent to which those measures capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability : the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity : the ability to agree with ultimate user preference; and (3) intuitiveness : the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across different scenarios, whereas, consistent with prior studies, existing metrics achieve only weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively speaking, the best existing single-turn metric considering all three perspectives. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.
Recommender systems can support everyday digital tasks by retrieving and recommending useful information contextually. This is becoming increasingly relevant in services and operating systems. Previous research often focuses on specific recommendation tasks with data captured from interactions with an individual application. The quality of recommendations is also often evaluated addressing only computational measures of accuracy, without investigating the usefulness of recommendations in realistic tasks. The aim of this work is to synthesize the research in this area through a novel approach by (1) demonstrating comprehensive digital activity monitoring, (2) introducing entity-based computing and interaction, and (3) investigating the previously overlooked usefulness of entity recommendations and their actual impact on user behavior in real tasks. The methodology exploits context from screen frames recorded every 2 seconds to recommend information entities related to the current task. We embodied this methodology in an interactive system and investigated the relevance and influence of the recommended entities in a study with participants resuming their real-world tasks after a 14-day monitoring phase. Results show that the recommendations allowed participants to find more relevant entities than in a control condition without the system. In addition, the recommended entities were also used in the actual tasks. In the discussion, we reflect on a research agenda for entity recommendation in context, revisiting comprehensive monitoring to include the physical world, considering entities as actionable recommendations, capturing drifting intent and routines, and considering explainability and transparency of recommendations, ethics, and ownership of data.
The Third Workshop on Evaluation of Personalisation in Information Retrieval (WEPIR 2021) was held in conjunction with the ACM SIGIR Conference on Human Information Interaction & Retrieval (CHIIR 2021) in Canberra, Australia, as a virtual event. WEPIR 2021 followed on from the first and second WEPIRs held at CHIIR 2018 and 2019. The purpose of the workshop was again to bring together researchers from different backgrounds, interested in advancing the evaluation of personalisation in information retrieval. The workshop focused on further development of a common understanding of the challenges, requirements and practical limitations of personalisation in information retrieval and its evaluation.
The principle that, for optimal retrieval, documents should be ranked in order of the probability of relevance or usefulness has been brought into question by Cooper. It is shown that the principle can be justified under certain assumptions, but that in cases where these assumptions do not hold, the principle is not valid. The major problem appears to lie in the way the principle considers each document independently of the rest. The nature of the information on the basis of which the system decides whether or not to retrieve the documents determines whether the document-by-document approach is valid.
The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions.
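The lambda-gradient idea the abstract describes (working with implicit cost functions by defining pairwise gradients directly) can be sketched as below. This is an illustrative reimplementation, not the authors' code: the pairwise force is a logistic term scaled by the |ΔnDCG| of swapping the pair, and the `sigma` parameter and the use of raw gains (rather than the 2^rel − 1 gains common in practice) are simplifying assumptions.

```python
import math

def delta_ndcg(gains, ideal_dcg, i, j):
    """|Change in nDCG| if the documents at 0-indexed ranks i and j swap."""
    def disc(r):
        return 1.0 / math.log2(r + 2)
    d = abs((gains[i] - gains[j]) * (disc(i) - disc(j)))
    return d / ideal_dcg if ideal_dcg > 0 else 0.0

def lambdas(scores, gains, sigma=1.0):
    """Per-document lambda gradients for one query's ranked list.

    Sign convention: a negative lambda pushes a document's score up
    under a gradient-descent update s_i <- s_i - eta * lambda_i.
    """
    ideal_dcg = sum(g / math.log2(r + 2)
                    for r, g in enumerate(sorted(gains, reverse=True)))
    lam = [0.0] * len(scores)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if gains[i] > gains[j]:  # i should be ranked above j
                force = -sigma / (1.0 + math.exp(sigma * (scores[i] - scores[j])))
                force *= delta_ndcg(gains, ideal_dcg, i, j)
                lam[i] += force
                lam[j] -= force
    return lam
```

For example, a relevant document scored below an irrelevant one receives a negative lambda (pulling it upward), and the pair's lambdas cancel.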
Evaluation measures act as objective functions to be optimized by information retrieval systems. Such objective functions must accurately reflect user requirements, particularly when tuning IR systems and learning ranking functions. Ambiguity in queries and redundancy in retrieved documents are poorly reflected by current evaluation measures. In this paper, we present a framework for evaluation that systematically rewards novelty and diversity. We develop this framework into a specific evaluation measure, based on cumulative gain. We demonstrate the feasibility of our approach using a test collection based on the TREC question answering track.
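The novelty reward described here, where a piece of information contributes less gain each time it reappears in the ranking (as in the α-nDCG family), can be sketched as follows. The nugget representation and the `alpha` value are illustrative assumptions, not the paper's exact formulation.

```python
def novelty_gains(docs_nuggets, alpha=0.5):
    """Per-rank gains where each information nugget's contribution
    decays by a factor of (1 - alpha) every time it reappears.

    docs_nuggets: ranked list of sets of nugget identifiers."""
    seen = {}  # nugget -> number of times already returned
    gains = []
    for nuggets in docs_nuggets:
        g = 0.0
        for n in nuggets:
            g += (1 - alpha) ** seen.get(n, 0)
            seen[n] = seen.get(n, 0) + 1
        gains.append(g)
    return gains
```

A document repeating an already-seen nugget thus earns half the gain (at `alpha=0.5`) of one covering a new nugget, which is the diversity reward the framework builds its cumulative-gain measure on.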
A measure of document retrieval system performance called the “expected search length reduction factor” is defined and compared with indicators, such as precision and recall, that have been suggested by other workers. The new measure is based on calculations of the expected number of irrelevant documents in the collection which would have to be searched through before the desired number of relevant documents could be found. Its advantages are: (1) it provides a single index for the property it attempts to measure; (2) it allows for gradations of retrieval status, through the mathematical concept of a “weak ordering”; (3) it evaluates retrieval performance relative to random searching; and (4) it takes into account the amount of relevant material desired by the requester.
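For a simple total ordering, the quantity underlying the measure (the number of irrelevant documents examined before the desired number of relevant ones is found) can be sketched as below; Cooper's full definition additionally takes an expectation over ties in a weak ordering, which this sketch omits.

```python
def expected_search_length(relevance, wanted):
    """Irrelevant documents examined before the `wanted`-th relevant
    one, for a strict total ordering.

    relevance: 0/1 judgements in ranked order.
    Returns None if fewer than `wanted` relevant documents are ranked."""
    found = irrelevant = 0
    for rel in relevance:
        if rel:
            found += 1
            if found == wanted:
                return irrelevant
        else:
            irrelevant += 1
    return None
```

E.g. for the ranking judged [0, 1, 0, 0, 1] and a request for two relevant documents, three irrelevant documents are examined first.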
First, a new model of searching in online and other information systems, called 'berrypicking', is discussed. This model, it is argued, is much closer to the real behavior of information searchers than the traditional model of information retrieval is, and, consequently, will guide our thinking better in the design of effective interfaces. Second, the research literature of manual information seeking behavior is drawn on for suggestions of capabilities that users might like to have in online systems. Third, based on the new model and the research on information seeking, suggestions are made for how new search capabilities could be incorporated into the design of search interfaces. Particular attention is given to the nature and types of browsing that can be facilitated.
We seek to leverage knowledge about information organization in a domain to effectively and efficiently meet targeted information needs of expert users. The semantic components model represents document content in a manner that is complementary to full text and keyword indexing. Semantic component instances are segments of text about a particular aspect of the main topic of the document and may not correspond to structural elements in the document. This paper describes the semantic components model and presents experimental evidence from a large interactive searching study showing that semantic components, used to supplement full text and keyword indexing and to extend the query language, enhanced the retrieval of domain-specific documents in response to realistic queries posed by real users.
IR research has a strong tradition of laboratory evaluation of systems. Such research is based on test collections, pre-defined test topics, and standard evaluation metrics. While recent research has emphasized the user viewpoint by proposing user-based metrics and non-binary relevance assessments, the methods are insufficient for truly user-based evaluation. The common assumption of a single query per topic and session poorly represents real life. On the other hand, one well-known metric for multiple queries per session, instance recall, does not capture early (within session) retrieval of (highly) relevant documents. We propose an extension to the Discounted Cumulated Gain (DCG) metric, the Session-based DCG (sDCG) metric for evaluation scenarios involving multiple query sessions, graded relevance assessments, and open-ended user effort including decisions to stop searching. The sDCG metric discounts relevant results from later queries within a session. We exemplify the sDCG metric with data from an interactive experiment, we discuss how the metric might be applied, and we present research questions for which the metric is helpful.
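A minimal sketch of the session-based discount this abstract describes: within each query, a standard log-b rank discount; across queries, results of the i-th query are further discounted by 1 + log_bq(i). The parameter values `b=2` and `bq=4` are illustrative, not prescribed.

```python
import math

def dcg(gains, b=2):
    """Discounted cumulated gain for one ranked list (log-b rank discount;
    ranks at or above the log base get no discount)."""
    return sum(g / max(1.0, math.log(rank, b))
               for rank, g in enumerate(gains, start=1))

def sdcg(session, b=2, bq=4):
    """Session DCG: each query's DCG is divided by 1 + log_bq(query position),
    so relevant results found by later reformulations count less.

    session: list of per-query gain lists, in query order."""
    return sum(dcg(gains, b) / (1.0 + math.log(qpos, bq))
               for qpos, gains in enumerate(session, start=1))
```

For a two-query session with gains [3, 2, 0] then [1, 0, 0], the second query's DCG of 1.0 is discounted by 1 + log_4(2) = 1.5, so the later result contributes only two thirds of its within-query value.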
Recently, a number of TREC tracks have adopted a retrieval effectiveness metric called bpref which has been designed for evaluation environments with incomplete relevance data. A graded-relevance version of this metric called rpref has also been proposed. However, we show that the application of Q-measure, normalised Discounted Cumulative Gain (nDCG) or Average Precision (AveP) to condensed lists, obtained by filtering out all unjudged documents from the original ranked lists, is actually a better solution to the incompleteness problem than bpref. Furthermore, we show that the use of graded relevance boosts the robustness of IR evaluation to incompleteness and therefore that Q-measure and nDCG based on condensed lists are the best choices. To this end, we use four graded-relevance test collections from NTCIR to compare ten different IR metrics in terms of system ranking stability and pairwise discriminative power.
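The condensed-list approach described here amounts to dropping unjudged documents before computing the metric. A minimal nDCG sketch under that assumption (the log base and the truncation of the ideal list to the condensed length are simplifying choices):

```python
import math

def condensed_ndcg(ranking, qrels, b=2):
    """nDCG over the condensed list: unjudged documents are removed
    before gains are computed, then nDCG is taken as usual.

    ranking: ranked list of document ids.
    qrels: dict mapping judged document ids to graded relevance."""
    def dcg(gs):
        return sum(g / max(1.0, math.log(r, b))
                   for r, g in enumerate(gs, start=1))
    gains = [qrels[d] for d in ranking if d in qrels]  # condensed list
    ideal = sorted(qrels.values(), reverse=True)[:len(gains)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Note that an unjudged document between two judged ones simply vanishes, so it neither hurts (as a presumed non-relevant document would) nor helps the score.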