How Am I Doing?: Evaluating Conversational Search Systems Offline
ALDO LIPANI, University College London, UK
BEN CARTERETTE, Spotify, USA
EMINE YILMAZ, University College London & Amazon, UK
As conversational agents like Siri and Alexa gain in popularity and use, conversation is becoming a more and more important mode of
interaction for search. Conversational search shares some features with traditional search, but differs in some important respects:
conversational search systems are less likely to return ranked lists of results (a SERP), more likely to involve iterated interactions,
and more likely to feature longer, well-formed user queries in the form of natural language questions. Because of these differences,
traditional methods for search evaluation (such as the Cranfield paradigm) do not translate easily to conversational search. In this work,
we propose a framework for offline evaluation of conversational search, which includes a methodology for creating test collections
with relevance judgments, an evaluation measure based on a user interaction model, and an approach to collecting user interaction
data to train the model. The framework is based on the idea of “subtopics”, often used to model novelty and diversity in search and
recommendation, and the user model is similar to the geometric browsing model introduced by RBP and used in ERR. As far as we
know, this is the first work to combine these ideas into a comprehensive framework for offline evaluation of conversational search.
Additional Key Words and Phrases: information retrieval, conversational search, evaluation, test collections
1 INTRODUCTION
Conversation is increasingly becoming an important mode of interaction with search systems. As use of handheld and
in-car mobile devices and in-home “smart speakers” grows, people are utilizing voice as a mode of interaction more and
more. And since search remains one of the most common ways people find and access information, search via voice
interfaces is more important than ever.
Thus search engines increasingly need to be built for “dialogues” with users. A full ranked list of results—a SERP—is
not likely to be useful in such interactions; systems should instead provide a single high-precision answer. Since a user
may only get that one answer, it is likely that their next query will be dependent on what that answer is, whether it is
to directly follow up with another query, to clarify what they wanted, to move to a different aspect of their need, or
to stop the search altogether. In other words, the dialogue the user has with the system is heavily influenced by the
system itself.
This is dicult to model in evaluation, particularly in oine evaluations that are meant to be reproducible. In typical
oine evaluations, each question in a dialogue is evaluated independently. This, however, does not capture the ability
the system has to inuence the direction of the conversation. It may fail to identify when the user is following up on a
system response. It leaves open whether the system could have better helped the user by providing dierent information
earlier in the conversation. In short, it makes it dicult to optimize or evaluate the system over complete dialogues.
These problems have been raised in the information retrieval literature before. In particular, the TREC Session Track
attempted to build test collections for session evaluation and optimization by including short logs of user interactions
with search systems [10]. But in the Session Track test collections, the previous session is fixed, and only the final query
is given as a candidate for retrieval. Thus it is more a relevance feedback task than a session task; it does not solve the
problems listed in the previous paragraph. More recently, the TREC Conversational Assistance Track (CAsT) provided
the user side of recorded “dialogues” with a conversational search system [18]. A candidate system would retrieve
answers for each of the fixed user inputs provided. While the previous session is not fixed, the user inputs cannot adapt
to varying system responses: the user dialogue remains static no matter what the system does.
Authors’ addresses: Aldo Lipani, aldo.lipani@acm.org, University College London, London, UK; Ben Carterette, Spotify, New York, New York, USA,
carteret@acm.org; Emine Yilmaz, emine.yilmaz@ucl.ac.uk, University College London & Amazon, London, UK.
In this paper we introduce a framework for reproducible offline evaluation of conversational search. The framework
includes a novel approach to building test collections as well as a new evaluation metric based on a user model. Our
approach is based on the idea of subtopics, typically used to evaluate search for novelty and diversity to determine
how much redundant information a search system is returning and the breadth of information the user is exposed
to. We essentially abstract queries and answers to a subtopic representation, then model the progression of the user
and system through the dialogue by modeling subtopic-to-subtopic transitions. We show empirically that our offline
framework correlates with online user satisfaction.
The rest of this paper is structured as follows: in Section 2 we discuss related work on similar evaluation problems
and conversational search. Section 3 provides a detailed overview of our framework for reproducible evaluation. In
Section 4, we describe a specific user model and metric for evaluating conversational search, and in Section 5 we
describe the test collection we have assembled. In Section 6 we analyze the results, the test collection, and the metric.
We conclude in Section 7.
2 RELATED WORK
A signicant amount of research has been devoted to the development of conversational systems [
9
,
15
,
33
]. Most
research on conversational systems has focused on devising user interfaces for conversational systems [
4
,
11
], using
knowledge graphs for question answering in a conversational dialogue [
23
], building neural models for developing
conversational systems [42], incorporating context in response generation [13] and asking clarifying questions [3].
Though a lot of progress has been made regarding the development of conversational systems, until recently, relatively
little work had been done in evaluating the quality of conversational systems. Hence, most researchers have been using
automatic evaluation metrics such as the BLEU score from the machine translation domain, or ROUGE from the text
summarization domain [31, 35]. While these metrics have the advantage of not requiring explicit human annotations,
they were shown not to correlate with actual user satisfaction [26].
Until a few years ago, there was a lack of high quality conversational datasets [37], which was a major challenge
for the development of evaluation metrics for measuring the quality of conversational systems. During the last
few years more effort has been devoted to creating such datasets and making them publicly available. The availability
of datasets such as MS Marco [30] and Alexa Prize [2] was a significant step forward in the development and
evaluation of conversational systems. These datasets were followed by other domain-specific conversational
datasets [32, 44]. Trippas et al. [40] defined a method for creating conversational datasets (which could serve as a guide
in constructing such datasets) and used it to build one such dataset.
More research has been devoted to the design of new methodologies for evaluating conversational systems during
the last few years. There has been some work on devising benchmark datasets and evaluation metrics for evaluating
the natural language understanding component of conversational systems [8, 28]. Focusing on non-goal-oriented
dialogues, such as the setup for Alexa Prize, Venkatesh et al. [41] proposed several metrics to evaluate user satisfaction
in the context of a conversational system. Choi et al. [14] proposed a new metric that can be used to predict user satisfaction
using implicit feedback signals from the users. Guo et al. [22] evaluate the quality of a conversation based on topical
coverage and depth of the conversation. The Conversational Assistance Track in TREC 2019 [18] focused on devising
test collections and evaluation methods for evaluating the quality of conversational systems.
Some dimensions of conversational search evaluation have been identified as similar to those of evaluation of
interactive information retrieval systems [6]. There has been significantly more work on the evaluation of interactive
information retrieval systems (which have more recently been studied in the context of dynamic search) than on the
evaluation of conversational search systems [39]. The Dynamic Domain Track aimed at devising evaluation methodologies for
evaluating dynamic search [43].
Most currently used conversational evaluation metrics including the ones used by the Conversational Assistance Track
and the Dynamic Domain Track are based on computing traditional information retrieval metrics on the conversation
session. Below we provide a summary of evaluation metrics for a traditional search scenario.
While there has been relatively little work in evaluating the quality of conversational search systems, a significant
amount of work has been devoted to devising evaluation metrics for traditional search and
recommender systems. However, analysis of some commonly used offline evaluation metrics used for this purpose
shows little correlation with actual user satisfaction in the context of recommender systems [7, 19] and moderate to negligible
correlation in the context of search [1].
Oine evaluation metrics based on actual user models have the potential to be more correlated with actual user
satisfaction as they are aiming at directly modeling the actual users, where parameters of these models can be learned
from user logs [
12
,
29
]. Moat et al. [
27
] also argued for having metrics that are attuned to the length of the ranked list,
to better align with users who may abandon search early.
One commonly used metric that is based on an explicit user model is Rank-Biased Precision (RBP) [29], which models
users’ persistence in examining the next retrieved document in a search result. The assumptions made by this metric
are that users examine documents from the top and in order, and that the examination of each document depends
solely on the users’ willingness to do so, that is, on their persistence:
\[ \mathrm{RBP}(q) = (1 - p) \sum_{n=1}^{N} p^{n-1} \, j(r_n, q), \]
where $q$ is a query, $r$ is the list of documents returned by the search system when querying it with $q$, $r_n$ is the document
retrieved at position $n$, $N$ is the number of retrieved results, $j$ is the relevance function, which returns 1 if $r_n$ is relevant
to $q$ and 0 otherwise, and $p \in [0, 1]$ is the persistence parameter.
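The RBP definition above translates almost directly into code. The following is a minimal Python sketch, assuming the ranked list has already been judged and is passed in as binary relevance labels in rank order; the function name `rbp` and its argument names are illustrative choices, not taken from the paper.

```python
from typing import Sequence


def rbp(relevance: Sequence[int], p: float) -> float:
    """Rank-Biased Precision for one ranked list of binary relevance labels.

    relevance[n - 1] plays the role of j(r_n, q); p is the persistence parameter.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # enumerate() starts at 0, so the exponent n here equals (rank - 1).
    return (1.0 - p) * sum(rel * p ** n for n, rel in enumerate(relevance))


# Relevant documents at ranks 1 and 3, persistence 0.5:
# (1 - 0.5) * (1 + 0 + 0.25) = 0.625
print(rbp([1, 0, 1], p=0.5))
```

A larger `p` spreads weight further down the ranking, which corresponds to a more persistent user.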
Another commonly used metric based on modeling user behavior is Expected Reciprocal Rank (ERR) [12]. ERR is based on a user model similar to
that of RBP, but assumes that the probability of the user stopping at each rank depends on the relevance of the document observed.
Queries in search can be ambiguous; even the same query could mean different things to different users. Hence,
evaluation measures that capture the diverse interests of different users are needed if the goal is to evaluate the
satisfaction of a random user using the search engine. Various evaluation metrics for diversity- and novelty-based
information retrieval have been developed [5, 16]. Some previous work [5, 36] analyzed several diversity
metrics and proposed new diversity evaluation metrics, which are based on an adaptation of the RBP user model to
diversity evaluation.
Most current evaluation metrics used for conversational search are based on session-based evaluation metrics, which
have been investigated in the context of the Session Track [10]. Session-based metrics have been widely studied in the
literature [10, 38]. Kanoulas et al. [24] proposed two families of measures: a model-free family that makes no assumption
about the user behavior over a session, and a model-based family with a simple model of user interactions over the
session. Most such session-based metrics are adaptations of traditional information retrieval metrics to search sessions.
Metrics used by the Conversational Assistance Track and the Dynamic Domain Track are also based on variants of
session-based evaluation metrics.
One of the most commonly used session-based metrics for evaluating the quality of conversational systems is the
fraction of correct answers in the session (i.e., precision of responses in the session) [41]. Lipani et al. [25] extended the
RBP user model towards modeling user behavior over the entire search session and proposed the session RBP (sRBP)
metric, which could be used for evaluating the quality of conversational search systems. In addition to modeling the
probability of a user persisting in (i.e., not abandoning) the search task, the sRBP metric also models the trade-off
between re-querying the system and examining a new document further down the search result via a new parameter named
the balancer:
\[ \mathrm{sRBP}(s) = (1 - p) \sum_{m=1}^{M} \left( \frac{p - bp}{1 - bp} \right)^{m-1} \sum_{n=1}^{N} (bp)^{n-1} \, j(r_{m,n}, s_m), \]
where $s = [q_1, \dots, q_M]$ is a session, i.e., a series of queries, $s_m = q_m$ is the $m$-th query of the session, $r_{m,\cdot}$ is the
search result returned when querying with $q_m$, $r_{m,n}$ is the document retrieved at position $n$ for $q_m$, $M$ is the length of
the session, that is, the number of queries submitted in $s$, and $b \in [0, 1]$ is the balancer parameter.
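The sRBP formula can be sketched in the same way. The Python below is an illustrative reading of the equation above, assuming binary relevance labels arranged as one list per query in session order; the function name `srbp` and the guard for the undefined case $bp = 1$ are assumptions made for this sketch.

```python
from typing import Sequence


def srbp(session: Sequence[Sequence[int]], p: float, b: float) -> float:
    """Session RBP over a session given as one relevance list per query.

    session[m - 1][n - 1] plays the role of j(r_{m,n}, s_m).
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= b <= 1.0):
        raise ValueError("p and b must lie in [0, 1]")
    bp = b * p
    if bp == 1.0:
        raise ValueError("the re-query weight is undefined when b * p == 1")
    requery = (p - bp) / (1.0 - bp)  # weight applied once per additional query
    total = 0.0
    for m, ranking in enumerate(session):  # m equals (query index - 1)
        within = sum(rel * bp ** n for n, rel in enumerate(ranking))
        total += requery ** m * within
    return (1.0 - p) * total


# Two queries with two results each.
print(srbp([[1, 0], [0, 1]], p=0.5, b=0.5))
```

Under this reading, setting `b = 1` makes the re-query weight zero, so only the first query contributes and the score equals RBP on that query's ranking.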
One of the primary problems with using all the aforementioned metrics, including the session-based
metrics, for evaluating conversational search systems is that they require that user sessions are known in
advance. However, the availability of such data requires that a system has already been deployed and is in use. This
is not always feasible. A new system or an academic research system may have no significant user base, and even if
it did, one would first want to have an offline evaluation to ensure that the system is of reasonable quality to deploy.
Furthermore, within a session, queries issued by a user may depend on the relevance of previous responses by the
system; hence, the session itself would be system dependent, and a different system shown to the same user could
lead to a completely different session. This means that these metrics cannot be reliably used to compare the quality of
multiple systems.
There has been some recent work on simulating dialogues and computing metrics on top of these simulated
dialogues [20, 21]. However, this work requires using an agent to simulate the conversations and then computing a
metric on top of these simulated conversations, a process in which metric computation is completely separated from
dialogue generation. Furthermore, the simulations used in this work are not based on a realistic user model, which is of
critical importance when the goal is to devise an evaluation metric correlated with user satisfaction.
Hence, evaluation of conversational search systems without the need for access to actual search sessions
is still an open problem, which was also discussed in a recent Dagstuhl seminar [6], where some of the complexities of
devising such metrics were identified.
In this paper, we propose a novel offline evaluation framework, which does not rely on having access to real user
sessions. Our evaluation metric is based on a user model, which has been validated in the context of search [25]. Our
experimental results suggest that our proposed offline evaluation metric is highly aligned with actual user satisfaction,
in contrast to most information retrieval evaluation metrics that are not based on realistic user models [1, 7, 19].
3 MODELING CONVERSATIONAL SEARCH
In this section we describe an abstract framework for modeling conversational search for reproducible offline evaluation.
We contrast it against previous approaches, which are based on assessing the relevance of answers to questions
independently of one another, as described above.
(a) A dialogue between a user and a candidate conversational search system on the topic of cold flowering plants.
(b) A dialogue between a user and a different candidate conversational search system on the same topic.
Fig. 1. Two dialogues on the same topic with different candidate conversational search systems. The first proceeds in a natural way,
with the user’s questions answered by the system, which in turn informs the user’s next question. The second is less natural, with the
second question not clearly following from the answer to the first, and the third question coming despite having just been answered.
Both candidate systems are retrieving equally relevant answers.
A fully interactive, dialogue-based conversational search system is difficult to evaluate offline in a holistic way. Each
turn in the dialogue may be influenced by previous interactions between user and system. Results generated by candidate
systems that have not been tested with real users will not necessarily be able to capture these influences. An offline
evaluation using fixed, recorded dialogues and independent relevance judgments will almost certainly misrepresent
the system’s effectiveness in a dialogue for this reason.
Consider the example conversational search dialogues shown in Figure 1. The sequence of user questions is the same
in each, but the system responses are very different. Both dialogues start out with the same question: the user asks
about flowering plants robust to cold climates. The first system (Fig. 1a) responds with a sentence about pansies. This
leads the user to ask how much cold pansies can tolerate; they receive a relevant answer which also motivates the third
question, about whether pansies can survive frost. The dialogue makes sense to us: the answers to the questions are
relevant; user questions proceed in a natural sequence, motivated by the system’s responses to the previous question.
The second system (Fig. 1b) responds to the same first question with a sentence about yarrows, which is relevant
to the question. But because the dialogue is static, there is no follow-up question. The user responds by asking about
pansies, which is not motivated by anything the system has done. This time, the answer is specifically about the ability
of pansies to survive frost, and then the follow-up question is about whether pansies can survive frost. This dialogue
makes much less sense: while the answers to the questions are relevant, the questions themselves seem to proceed
without much logic given the system responses.
In other words, the two systems shown in this example would perform equally well if judged on the relevance of
their answers, but to our eyes, one fares much better than the other. Note that we are not claiming that the second
dialogue would never happen. We claim only that it produces an evaluation result that is less satisfying to us than the
first: the dialogue being evaluated does not appear to be representative of real dialogues, and therefore we suspect that
the outcome of the evaluation is biased in some way.
One possible solution to this problem is to evaluate over many different dialogues from users with the same
information need. But where do these dialogues come from? Unless we are able to deploy a variety of candidate systems
to a large user base, it is unlikely we will be able to obtain them. Can we instead simulate these dialogues? Can we use
system responses to generate follow-up questions, and thus do a better job of testing candidate systems over responsive
dialogues? This sounds like a very difficult problem, potentially involving natural language understanding of system
responses and natural language generation of user queries.
(a) A dialogue identical to Fig. 1a, but including the subtopic representation of each question and answer. The subtopics follow an
intuitive progression.
(b) A dialogue identical to Fig. 1b, but including the subtopic representation of each question and answer, clearly showing that the
second question does not follow from the first answer, and the third question is on a subtopic that has already been answered.
(c) A new dialogue with the same candidate system as that in Fig. 2b, in which user questions proceed more naturally from the system
responses, and the candidate system’s ability to provide relevant results is less clear.
Fig. 2. Three dialogues on the same topic with different candidate conversational search systems, augmented with abstract subtopic
representations of questions and answers. The third example suggests that the second candidate system is not as successful at
answering questions when they proceed more naturally from each interaction. In our simulation, all three sequences of interactions
could occur, but the second would be significantly less likely than the other two.
We propose a simpler approach: we abstract queries and answers to a higher-level representation in a restricted
domain to make the simulation manageable. Our representation is based on the idea of subtopics, similar to those used
in diversity and novelty evaluation [17]. Given a sample user information need, we define a set of subtopics that cover
the space of possible things users may learn by interacting with the system. Any given user question is mapped to one
of these pre-defined subtopics. Similarly, each system response is mapped to a subtopic. Our evaluation is based on
modeling the transitions from the subtopic represented by a system’s response to the subtopic represented by possible
follow-up questions, and the relevance of the system’s responses to those questions.
Figure 2 demonstrates this using the same dialogues as in Figure 1, as well as one new dialogue. Question and answer
subtopics are shown above and below, respectively. Here we can identify the transitions in the second example (Fig. 2b) that
may be less likely: a user following up an answer about the subtopic yarrow with a question about the subtopic pansy; a
user following up an answer about the subtopic frost tolerance with a question about frost tolerance. Again, this is not to
say that the dialogue is “wrong”, only that this dialogue is less likely to occur than one in which the user follows up an
answer about yarrow with a question about yarrow and does not ask something that has just been answered. The
framework we propose in this work can capture these differences in likelihood and therefore provide a more robust
evaluation. Fig. 2c demonstrates this with a sequence of questions that are more likely, and that also retrieve responses
of lower quality. The answer to the second question is somewhat relevant, but hardly satisfying. It makes sense that a
user would follow up by asking if yarrow can survive frost. And the answer to that question is good if the user can
trust that the system’s “they” has the same referent as the “it” in the question. This example shows that when tested
using a more natural sequence of questions, the candidate system reveals its performance to be less satisfactory.
Since we do not want to generate natural language questions, our simulation will require that we have a set of
possible user questions to sample from, with each question associated with a subtopic. This requirement seems to
bring us back to the problem of obtaining many different dialogues for the same information need. But in fact it is
significantly lighter than that: since subtopics are treated independently of one another, we can manually develop or
crowdsource questions; they do not need to occur in a dialogue in order to be useful in the evaluation. We also need to
have a model of transitions from one subtopic to another. This is a heavier requirement, but still lighter than using
natural language techniques. We can design tools specifically to obtain these transitions through crowdsourcing.
Given the ideas outlined above, a full dataset for offline evaluation of conversational search would consist of the
following:
(1) a sample of user information needs or topics, high-level descriptions of things users want to accomplish using a
conversational search system;
(2) for each topic, a pre-defined set of subtopics that cover aspects of the need and that influence “turns” in the
conversation;
(3) transition probabilities between subtopics, capturing the likelihood that a user goes from one subtopic to another
during the course of their interaction;
(4) user queries that model the subtopics;
(5) a corpus of items (documents, passages, answers, etc.) that can be retrieved;
(6) relevance judgments between these items and the subtopics.
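As a rough illustration of how the six components listed above might be held together in code, the following Python sketch groups them into two containers. All class, field, and function names (`Topic`, `OfflineCollection`, `check_transitions`, `START`, `END`) are hypothetical and do not come from the paper, and the toy values are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

START, END = "start", "end"  # hypothetical labels for dummy start/end states


@dataclass
class Topic:
    description: str                          # (1) the information need
    subtopics: List[str]                      # (2) pre-defined subtopics
    transitions: Dict[str, Dict[str, float]]  # (3) from-state -> {to-state: probability}
    queries: Dict[str, List[str]]             # (4) subtopic -> candidate user queries
    qrels: Dict[Tuple[str, str], int]         # (6) (subtopic, item id) -> relevance label


@dataclass
class OfflineCollection:
    topics: List[Topic]
    corpus: Dict[str, str]                    # (5) item id -> retrievable text


def check_transitions(topic: Topic, tol: float = 1e-9) -> bool:
    """True if the outgoing probabilities of every state sum to one."""
    return all(abs(sum(row.values()) - 1.0) <= tol
               for row in topic.transitions.values())


# Toy values, invented for illustration only.
toy = Topic(
    description="flowering plants for cold climates",
    subtopics=["pansy", "yarrow"],
    transitions={
        START: {"pansy": 0.5, "yarrow": 0.5},
        "pansy": {"pansy": 0.25, "yarrow": 0.25, END: 0.5},
        "yarrow": {"pansy": 0.25, "yarrow": 0.25, END: 0.5},
    },
    queries={"pansy": ["how much cold can pansies tolerate?"],
             "yarrow": ["can yarrow survive frost?"]},
    qrels={("pansy", "d1"): 1, ("yarrow", "d1"): 0},
)
collection = OfflineCollection(topics=[toy], corpus={"d1": "a passage about pansies"})
print(check_transitions(toy))
```

Keeping judgments keyed by subtopic rather than by query reflects the point made above that questions need not occur in a recorded dialogue to be useful.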
An oine evaluation would use the subtopics, queries, transition probabilities, and judgments to simulate a user
and system together progressing through a dialogue. It would proceed as follows: an “evaluation agent” is set up to
interface with a conversational search system to be evaluated. This evaluation agent may submit any query provided
in the test collection and receive an answer from the system being tested. Based on the relevance of this answer, it
uses the transition probabilities to sample the next subtopic for which to pose a query. It continues in this way, using
an abandonment model to determine when done, at which point the relevance of the answers received along the
way are combined into one possible evaluation result. The process is repeated for the same topic over many trials,
to simulate many dierent possible dialogues, and the resulting evaluation scores indicate the eectiveness of the
conversational search system for that topic. Over the sample of topics in the collection, we can understand the variation
in its eectiveness.
We can now state our research questions. They are:
RQ1 Does a simulation-based offline evaluation accurately capture user satisfaction?
RQ2 Is our metric based on simulations a better fit to user behavior than other metrics for conversational search
evaluation?
RQ3 Can our framework detect differences in effectiveness in conversational search systems?
[Figure 3: flow chart with the steps Start Conversation, Select Subtopic, Query the System, Is answer relevant?, Ask more?, and End Conversation.]
Fig. 3. Flow-chart of the proposed user model.
Before we address those, the next two sections present implementations of the ideas above. First, in Section 4 we
describe in detail the user model and metric that we would like to use to evaluate conversational search systems. Then
Section 5 describes a specific dataset and a user study performed to gather the data.
4 USER MODEL AND METRICS
In the following we will define two components of a user model for conversational search interaction: a component
modeling the user persistence in performing the task, and a component modeling the gathering of information that
the user is trying to achieve through the dialogue. The former component is inspired by metrics like RBP for search
evaluation, as described in Section 2. The latter is the formalization of the subtopic transition model mentioned above.
The two combine into a metric for expected conversational satisfaction.
Given a topic with a set of subtopics $\mathcal{S}$ a user wants to learn about, we define a conversation $c \in \mathcal{C}$ as a list of
user interactions with the system. Each interaction consists of a query $q$ and an answer $a$:
\[ c = [(q_1, a_1), \dots, (q_m, a_m)], \]
where each $q$ can be abstracted to a subtopic $s \in \mathcal{S}$ ($q \in \mathcal{Q}_s$) and each pair $(q_i, a_i)$ has a relevance judgment $j \in \mathcal{J}$.
4.1 User Persistence
Figure 3 depicts the user model in a flow chart. When users have an information need, they begin a “dialogue” with
a conversational system by issuing a query. Based on the relevance of the result, they decide whether to continue
querying or not; moreover, each turn in the dialogue is modeled as the user trying to find information about a particular
subtopic. Thus their persistence in querying the system depends on what was observed for the previous query. We
model this using the following recursive definition:
\[ p(Q_1 = q) = 1, \]
\[ p(Q_m) = p(L_{m-1} \mid Q_{m-1}, J_{m-1}) \, p(Q_{m-1}). \]
This recursive definition uses three random variables, $Q$, $L$, and $J$, where $Q = \{q, \bar{q}\}$ indicates the act of querying
or not, $L = \{\ell, \bar{\ell}\}$ the act of leaving or not, and $J = \{j, \bar{j}\}$ the relevance or not of the system reply. The first
equation models the probability of starting a conversation when the user has an information need. The second equation
models the probability of continuing the conversation with the system, which is modeled as dependent on the previous
interaction with the system. Note that the second equation does not specify the outcomes of the random variables in
order to consider all possible combinations of the outcomes thereof.
[Figure 4: graph over the states $s_0, s_1, \dots, s_n, s_{n+1}$, with each edge from $s_i$ to $s_k$ labeled by its transition probability $a_{i,k}$.]
Fig. 4. Graphical model of subtopic transitions during information gathering. Note that $s_0$ is a “dummy” start state, and $s_{n+1}$ is an
end state.
We assume that the probability of continuing the conversation given that the user has not previously queried the
system is equal to 0:
\[ p(L_{m-1} = \bar{\ell} \mid Q_{m-1} = \bar{q}) = 0. \]
For the sake of clarity, we introduce parameters $\alpha_+$ and $\alpha_-$ to denote the probability of continuing the
conversation given that the user has previously queried the system and the previously returned result was relevant:
\[ p(L_{m-1} = \bar{\ell} \mid Q_{m-1} = q, J_{m-1} = j) = \alpha_+, \]
and when the previously returned result was not relevant:
\[ p(L_{m-1} = \bar{\ell} \mid Q_{m-1} = q, J_{m-1} = \bar{j}) = \alpha_-. \]
Both probabilities will be estimated from user logs; Section 5.2 provides more detail. Moreover, when these two $\alpha$s are
equal, the model is equivalent to that of RBP and sRBP, with $\alpha$ functioning as the persistence parameter $p$.
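To make the recursion concrete, the sketch below unrolls it for a fixed, known sequence of relevance outcomes: the probability of reaching each turn is the product of the continuation probabilities of the turns before it. Treating the relevance sequence as given is an assumption made for illustration, and the function name `reach_probabilities` is hypothetical.

```python
from typing import List, Sequence


def reach_probabilities(relevance: Sequence[int],
                        alpha_pos: float,
                        alpha_neg: float) -> List[float]:
    """Probability that the user issues query 1, 2, ..., len(relevance) + 1.

    relevance[k] is 1 if the answer at turn k + 1 was relevant, else 0.
    alpha_pos / alpha_neg are the continuation probabilities after a
    relevant / non-relevant answer.
    """
    probs = [1.0]  # p(Q_1 = q) = 1: the conversation always starts
    for rel in relevance:
        step = alpha_pos if rel else alpha_neg
        probs.append(probs[-1] * step)
    return probs


# Different persistence after relevant and non-relevant answers.
print(reach_probabilities([1, 0], alpha_pos=0.5, alpha_neg=0.25))

# Equal alphas give the geometric decay of RBP-style persistence.
print(reach_probabilities([1, 0, 1], alpha_pos=0.5, alpha_neg=0.5))
```

With equal parameters the relevance sequence no longer matters and the probability of reaching turn $m$ is simply the shared value raised to the power $m - 1$.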
4.2 User Information Gathering
The user’s task is to gather information about a topic by interacting with the conversational system. To describe the
user interaction we use the probabilistic graphical model shown in Figure 4. Given a topic, we dene the set of subtopics
the user wants to satisfy as follows:
𝑆={𝑠1, . . . , 𝑠𝑛}.
10 Lipani et al.
We model this by treating each subtopic as a state. To these states we add a start state and an end state, indicating the
initial (
𝑠0
=
start
) and nal states (
𝑠𝑛+1
=
end
) of the conversation. We dene the probability of transitioning to any
state given that we have started the conversation as:
𝑝(𝑆𝑚=𝑠1|𝑆𝑚1=𝑠0) = 𝑎0,1, 𝑝(𝑆𝑚=𝑠2|𝑆𝑚1=𝑠0) = 𝑎0,2, . . .
. . . , 𝑝 (𝑆𝑚=𝑠𝑛|𝑆𝑚1=𝑠0) = 𝑎0,𝑛, 𝑝(𝑆𝑚=𝑠𝑛+1|𝑆𝑚1=𝑠0)=0.
The nal probability guarantees that at least one search interaction needs to be performed.
To define the probability of going from subtopic $i$ to any other subtopic, including the end state, we use two
approaches. The first, called relevance independent (RI), assumes that these probabilities are independent of the relevance
of the system’s answers:
$$p(S_m = s_1 \mid S_{m-1} = s_i) = a_{i,1}, \quad \dots, \quad p(S_m = s_n \mid S_{m-1} = s_i) = a_{i,n}, \quad p(S_m = s_{n+1} \mid S_{m-1} = s_i) = a_{i,n+1},$$
where $i \in \{1, \dots, n\}$.
Our more advanced relevance dependent (RD) representation assumes that these probabilities depend on the relevance
of the system’s answers and estimates these probabilities conditioned also on relevance:
$$p(S_m = s_{i_1} \mid S_{m-1} = s_{i_2}, J_{m-1} = j), \quad p(S_m = s_{i_1} \mid S_{m-1} = s_{i_2}, J_{m-1} = \bar{j}),$$
where $i_1 \in \{1, \dots, n+1\}$ and $i_2 \in \{0, \dots, n\}$.
We indicate the act of sampling a state (subtopic) using these estimates as $s \sim \mathcal{S}_{j,s}$, where $j$ represents the
relevance of the previously retrieved document and $s$ is the subtopic to which the previously submitted
query belongs. This last relationship is formalized by the set $\mathcal{Q}_s$, which indicates the set of queries associated with the
subtopic $s$. The act of sampling a query is indicated as $q \sim \mathcal{Q}_s$. The relevance of an answer $a$ to a subtopic $s$ is obtained
using a qrels file and is indicated as $r \leftarrow \mathcal{J}_{s,a}$. This modeling also gives us the opportunity to capture a noisier
concept of relevance, where user factors like agreement, disturbance, and distractions are modeled by sampling using
the probability that $a$ is relevant to $s$.
4.3 Evaluating a Conversation
Based on the user model defined above, we now define our proposed evaluation metric, Expected Conversation Satisfaction
(ECS), which is an expectation over many dialogues simulated offline using the two models above. Algorithm 1
shows how we estimate Conversation Satisfaction with a single (simulated) dialogue, given a system and a series of
estimates. Over many trials of this algorithm, we obtain ECS. Later in the paper we will describe how to estimate the
probabilities needed to compute this metric.
Finally, to ensure that the metric is in the range of 0 to 1, we normalize the metric by dividing it by the metric value
for an ideal conversation in which every reply is correct. We call this nECS, and define this metric as:
$$\text{nECS}(c) = \frac{\text{ECS}(c)}{\text{IECS}(c)}.$$
This is similar to the way many information retrieval metrics such as RBP and ERR are normalized.
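Algorithm 1: Computation of ECS
Input: $\alpha_+$, $\alpha_-$, $\mathcal{S}$, $\mathcal{Q}$, $\mathcal{J}$, system()
Output: score
    score $\leftarrow$ 0
    $p(Q = q) \leftarrow 1$
    relevant $\leftarrow$ false
    subtopic $\leftarrow$ start
    subtopic $\sim \mathcal{S}_{\text{relevant},\text{subtopic}}$
    while subtopic $\neq$ end do
        query $\sim \mathcal{Q}_{\text{subtopic}}$
        answer $\leftarrow$ system(query)
        relevant $\leftarrow \mathcal{J}_{\text{subtopic},\text{answer}}$
        if relevant then
            score $\leftarrow$ score $+\ p(Q = q)$
            $p(Q = q) \leftarrow \alpha_+\, p(Q = q)$
        else
            $p(Q = q) \leftarrow \alpha_-\, p(Q = q)$
        end
        subtopic $\sim \mathcal{S}_{\text{relevant},\text{subtopic}}$
    end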
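The following is a minimal Python sketch of Algorithm 1, not the authors' implementation. It assumes a dictionary-based encoding of the model that is introduced here purely for illustration: `transitions[(relevant, subtopic)]` maps each next state to a probability, `queries[subtopic]` lists candidate queries, and `qrels[(subtopic, answer)]` holds a binary judgment. The names `simulate_dialogue`, `estimate_ecs`, and the toy model are hypothetical.

```python
import random

START, END = "start", "end"  # assumed labels for the start and end states


def sample_state(distribution, rng):
    """Draw the next state from a {state: probability} mapping."""
    states = list(distribution)
    weights = [distribution[s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def simulate_dialogue(alpha_pos, alpha_neg, transitions, queries, qrels, system, rng):
    """One trial of Algorithm 1: the score of a single simulated dialogue."""
    score = 0.0
    p_query = 1.0      # p(Q = q): probability that the user is still querying
    relevant = False
    subtopic = sample_state(transitions[(relevant, START)], rng)
    while subtopic != END:
        query = rng.choice(queries[subtopic])
        answer = system(query)
        relevant = qrels.get((subtopic, answer), False)
        if relevant:
            score += p_query
            p_query *= alpha_pos
        else:
            p_query *= alpha_neg
        subtopic = sample_state(transitions[(relevant, subtopic)], rng)
    return score


def estimate_ecs(n_trials, seed=0, **model):
    """Average the single-dialogue score over many simulated dialogues."""
    rng = random.Random(seed)
    total = sum(simulate_dialogue(rng=rng, **model) for _ in range(n_trials))
    return total / n_trials


# Toy, hand-made model (hypothetical values): two subtopics visited in order.
toy_model = dict(
    alpha_pos=0.8,
    alpha_neg=0.5,
    transitions={
        (False, START): {"s1": 1.0},
        (True, "s1"): {"s2": 1.0},
        (False, "s1"): {END: 1.0},
        (True, "s2"): {END: 1.0},
        (False, "s2"): {END: 1.0},
    },
    queries={"s1": ["q1"], "s2": ["q2"]},
    qrels={("s1", "a1"): True, ("s2", "a2"): True},
    system=lambda query: {"q1": "a1", "q2": "a2"}[query],
)
toy_ecs = estimate_ecs(n_trials=10, **toy_model)
```

In this toy model every dialogue is deterministic: both answers are relevant, so each trial scores $1 + \alpha_+$. A relevance-independent simulation could reuse the same code by storing identical distributions under the `True` and `False` keys.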
4.4 ECS in Practice
In order to compute ECS in practice, one would need to identify (annotate) the possible subtopics given a topic and
compute the transition probabilities between the different subtopics.
If usage logs of a real conversational system are available, the metric can be computed with respect to one fixed
conversation from the log, and the expected value of the metric could be obtained by averaging across all such
conversations that fall under the same topic. Given a real (non-simulated) conversation $c$, it can be shown that ECS can
be computed as:
$$\text{ECS}(c) = \sum_{m=1}^{|c|} j(c_m) \prod_{m'=1}^{m-1} \left(\alpha_+\, j(c_{m'}) + \alpha_- \left(1 - j(c_{m'})\right)\right),$$
where $j(c_m)$ returns the relevance of the system answer to the user query provided at step $m$.
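The closed form above can be evaluated in a single pass over a logged conversation. The sketch below is an illustration under stated assumptions: the conversation is given as a list of binary relevance labels, and the ideal conversation used for nECS is assumed to have the same length as the observed one with every reply relevant (the text does not fix this detail). The function names `ecs` and `necs` are hypothetical.

```python
def ecs(relevance, alpha_pos, alpha_neg):
    """ECS of one logged conversation, given per-turn binary relevance labels."""
    score = 0.0
    p_query = 1.0  # running product over the earlier turns
    for relevant in relevance:
        if relevant:
            score += p_query
            p_query *= alpha_pos
        else:
            p_query *= alpha_neg
    return score


def necs(relevance, alpha_pos, alpha_neg):
    """ECS divided by the ECS of an all-relevant conversation of equal length.

    Assumption: the ideal conversation has as many turns as the observed one.
    """
    ideal = ecs([True] * len(relevance), alpha_pos, alpha_neg)
    return ecs(relevance, alpha_pos, alpha_neg) / ideal if ideal > 0 else 0.0


# Hypothetical three-turn conversation: relevant, nonrelevant, relevant.
example_ecs = ecs([True, False, True], alpha_pos=0.8, alpha_neg=0.5)
example_necs = necs([True, False, True], alpha_pos=0.8, alpha_neg=0.5)
```

For the example, the first turn contributes 1, the second contributes nothing, and the third contributes $0.8 \times 0.5 = 0.4$, so ECS is 1.4; the ideal value is $1 + 0.8 + 0.64 = 2.44$.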
Note that this is not generally an option for offline evaluation of new systems, as we expect them to retrieve answers
that have not previously been seen in user dialogues, in which case the metric with the simulated dialogues (as described
in Algorithm 1) needs to be used. Similarly, the simulated dialogues also need to be used when there is no access to the
usage logs of a conversational system.
However, such logs are not necessarily always available in practice. In such cases, test collections need to be
constructed in order to estimate the parameters of the model and compute the value of the metric. In the next section we
describe a procedure that can be used to construct such a test collection and show how the parameters of Algorithm 1
can be estimated using such a test collection.
5 DATA COLLECTION
In this section we describe the work we did to operationalize the framework presented in Section 3 and collect data to
fit the user model presented in Section 4. We created a dataset based on SQuAD [34] for question answering. SQuAD
consists of topics defined from manually-chosen Wikipedia pages. Each selected page is broken down by paragraph,
and for each paragraph there are several associated questions that can be answered by that paragraph.
SQuAD is designed for evaluating one-off question answering, not conversations or dialogues. To simplify the use of
this data, we decided to focus on systems that respond to questions with full paragraphs. This is because the paragraphs
are straightforward to use as a unit of retrieval; we elaborate in Section 5.1 below. For future work we will look at
subdividing paragraphs into smaller units for assessment and retrieval within this framework.
An example paragraph from the SQuAD topic “Harvard University” is as follows:
Established originally by the Massachusetts legislature and soon thereafter named for John Harvard (its first
benefactor), Harvard is the United States’ oldest institution of higher learning, and the Harvard Corporation
(formally, the President and Fellows of Harvard College) is its first chartered corporation. Although never
formally affiliated with any denomination, the early College primarily trained Congregationalist and
Unitarian clergy. Its curriculum and student body were gradually secularized during the 18th century,
and by the 19th century Harvard had emerged as the central cultural establishment among Boston elites.
Following the American Civil War, President Charles W. Eliot’s long tenure (1869–1909) transformed the
college and affiliated professional schools into a modern research university; Harvard was a founding
member of the Association of American Universities in 1900. James Bryant Conant led the university
through the Great Depression and World War II and began to reform the curriculum and liberalize
admissions after the war. The undergraduate college became coeducational after its 1977 merger with
Radcliffe College.
Each of the following questions is provided as part of SQuAD and can be answered by the paragraph above:
(1) What individual is the school named after?
(2) When did the undergraduate program become coeducational?
(3) What was the name of the leader through the Great Depression and World War II?
(4) What organization did Harvard found in 1900?
(5) What president of the university transformed it into a modern research university?
SQuAD also provides questions that cannot be answered by the text. We chose to discard these for this study.
In order to use this data in our framework, we first need to define subtopics, then associate questions and paragraphs
with our subtopics. We selected 11 SQuAD topics to use in our study. They are listed in Table 1. For each topic, we
defined a set of subtopics by manually examining the SQuAD questions and the original Wikipedia page. We attempted
to develop a set of subtopics that were largely mutually exclusive with one another and that covered most of the
desired information reflected by the provided questions. Subtopics are represented as short keyphrases with a longer
text description explaining exactly what should and should not be considered relevant to the subtopic. Each topic has
between four and nine subtopics. Table 2 shows some example subtopics and relevant questions from the SQuAD data.
We then manually judged whether each question in the SQuAD dataset for that topic was relevant to one of the defined subtopics.
Each question could be relevant to at most one subtopic. Questions that were judged not relevant to any subtopic were
marked nonrelevant and excluded from the study. On average, the ratio of subtopic-relevant questions to topic-relevant
questions is 0.11, that is, each subtopic represents about 11% of the topic’s questions.
topic                                        subtopics   questions   paragraphs
-------------------------------------------------------------------------------
harvard university                           5           259         29
black death                                  5           108         23
intergovernmental panel on climate change    6           99          24
private school                               12          113         26
geology                                      6           116         25
economic inequality                          6           291         44
immune system                                6           214         49
oxygen                                       9           239         43
normans                                      4           95          40
amazon rainforest                            5           181         21
european union law                           11          231         40
Table 1. Eleven topics selected from the SQuAD data, with the number of questions and paragraphs contained in SQuAD as well as
the number of subtopics we manually developed for each topic.
topic                 subtopic                      example question
--------------------------------------------------------------------------------------------------------------------------------------
harvard university    harvard facts                 How many individual libraries make up the main school library?
                      harvard alumni                What famous conductor went to Harvard?
                      harvard finances              How much more land does the school own in Allston than Cambridge?
                      harvard academics             How many academic units make up the school?
                      harvard history               In what year did Harvard President Joseph Willard die?
economic inequality   historical inequality         During what time period did income inequality decrease in the United States?
                      economic theory               What does the marginal value added by an economic actor determine?
                      economists                    What organization is John Schmitt and Ben Zipperer members of?
                      current state of inequality   In U.S. states, what happens to the life expectancy in less economically equal ones?
                      causes of inequality          Why are there more poor people in the United States and Europe than China?
                      solutions to the problem      Who works to get workers higher compensation?
Table 2. For two of the topics we selected, our subtopics and, for each one, an example question from the SQuAD data. Each of these
questions has an accepted answer that can be extracted from a paragraph in the SQuAD data for the topic.
Based on these question-subtopic relevance judgments, paragraph relevance could then be automatically assessed by
mapping the relevance of the questions associated with that paragraph. A paragraph could be relevant to zero or more
subtopics, depending on the questions that it answered.
After completing this process, we have five of the six components required by the framework:
(1) topics: subject-based Wikipedia articles that have been used in the SQuAD dataset;
(2) subtopics: manually developed based on the SQuAD questions and Wikipedia pages;
(3) user queries: questions from the SQuAD dataset that have been judged relevant to the selected subtopics;
(4) retrievable items: paragraphs from the topical Wikipedia pages;
(5) relevance judgments: obtained from the question-subtopic relevance judgments.
This is enough to evaluate the relevance of paragraphs retrieved in response to the provided user questions. However,
this does not give much more than a standard question-answering system. To evaluate a conversation, we need more:
we need a model of how a user progresses through the conversation based on the responses they are provided with.
The work in Section 4 describes such a model; we now turn to collecting data to fit it.
5.1 Crowdsourcing Study
In this section we describe a crowdsourcing study to gather user queries and data to fit the user model. We designed
a prototype search system for the SQuAD-derived dataset described above. Users (Mechanical Turk workers), upon
accepting the work, were shown instructions to ask questions on a topic provided to them. They were given an interface
to input questions. The system responded to questions with a paragraph. The user was asked whether the paragraph
was relevant to their question, and to what subtopic their question related. They could end their search at any time, at
which point they were asked to indicate their satisfaction with the session.
Note that this is not meant to reflect a “real” search scenario. Users in a real search setting would not be asked to
select a subtopic to represent their question. They would likely not be asked about the relevance of each response.
Furthermore, we imposed a strong restriction on the set of candidates that could be retrieved for each question: the
system would only select paragraphs from a small, manually-selected set relevant to the subtopic the user specified.
Clearly this information would not be available to a real search engine. The reason for these decisions is that our goal is
not to evaluate our system with users, but to collect user data for our models described in Section 4.
Here we would like to discuss two important decisions regarding the retrieval of paragraphs in response to user
questions. First, we recognize that full paragraphs (like the one exemplified above) would not typically be a retrieval
unit in a real conversational search system. The reason we chose to use full paragraphs regardless is that paragraphs
often touch on several different subtopics or aspects of the topic. Our subtopic-based framework thus makes it possible
to extract more insight into how system responses affect user questions than if we had used shorter units of retrieval
such as sentences or passages that only answer the question posed. It is straightforward to use our framework to
evaluate a system that is retrieving shorter passages than a full paragraph; it is primarily a matter of assessing the
relevance of retrieval units to the defined subtopics.
Second, our system only retrieves paragraphs from a small, manually-curated set that are known to be relevant to
the query (by the question-subtopic judgments and their mapping to paragraphs). One reason for this is that many
of the paragraphs are difficult to grasp without the context of other paragraphs in the full document; for example, a
paragraph entirely focused on Radcliffe College has little meaning to a user who does not already know about the 1977
merger of Harvard University and Radcliffe College. Thus we specifically selected paragraphs that start new sections or
that can be easily understood without additional context. Furthermore, the paragraphs are selected to be “exemplary”
for the subtopic, so that if the paragraph is relevant to the subtopic, it is easy to understand why and to see why other
paragraphs (that require more context to understand) would be relevant to the same subtopic. We chose the example
paragraph above as being a straightforward introduction to the topic, being relevant to the subtopics “harvard history”
and “harvard academics”, and being exemplary for the former. One hypothesis raised by this paragraph is that users
may next query about academics, since the paragraph alludes to curriculum.
Since there are other candidate paragraphs relevant to one or the other of these example subtopics (“harvard history”
and “harvard academics”), we need a way for the system to select one to respond when given a user query. We use
a simple language-modeling approach, where paragraphs are modeled as a multinomial distribution of terms in the
vocabulary. Here the vocabulary is restricted to the terms used in the topic itself (not the full corpus). As a form of
smoothing, each paragraph is expanded using the terms in the subtopic label, so for example the paragraph above
would be expanded with repetitions of the terms “harvard”, “history”, and “academics”. Additional smoothing is done
using Dirichlet priors based on the prevalence of terms in all of the paragraphs relevant to the subtopic, plus all terms
in the topic. When a user enters a query and its subtopic, the relevant paragraphs are scored with this language model,
and the top-scoring paragraph is selected for retrieval.
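As a rough illustration of this kind of scoring, the sketch below ranks candidate paragraphs by Dirichlet-smoothed query likelihood after expanding each paragraph with its subtopic label. It is an assumption-laden sketch rather than the system used in the study: the whitespace tokenizer, the smoothing weight `mu`, the number of label repetitions `label_repeats`, and the way the background counts are built are all hypothetical choices.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; any tokenizer could be used)."""
    return text.lower().split()


def score_paragraph(query, paragraph, subtopic_label, background, mu=100.0, label_repeats=3):
    """Log query likelihood under a Dirichlet-smoothed unigram paragraph model."""
    counts = Counter(tokenize(paragraph))
    for term in tokenize(subtopic_label):
        counts[term] += label_repeats  # expand the paragraph with label terms
    length = sum(counts.values())
    background_total = sum(background.values())
    log_likelihood = 0.0
    for term in tokenize(query):
        if background.get(term, 0) == 0:
            continue  # term outside the restricted vocabulary: ignored
        p_background = background[term] / background_total
        log_likelihood += math.log((counts[term] + mu * p_background) / (length + mu))
    return log_likelihood


def retrieve(query, subtopic_label, candidates, background, mu=100.0, label_repeats=3):
    """Return the top-scoring candidate paragraph for the query."""
    return max(
        candidates,
        key=lambda p: score_paragraph(query, p, subtopic_label, background, mu, label_repeats),
    )


# Toy data (hypothetical): two candidate paragraphs for one subtopic.
label = "harvard facts"
candidates = [
    "harvard was established by the massachusetts legislature",
    "the harvard library system has many individual libraries",
]
background = Counter()
for paragraph in candidates:
    background.update(tokenize(paragraph))
background.update(tokenize(label))

best = retrieve("how many libraries", label, candidates, background)
```

Here the query terms "many" and "libraries" occur only in the second paragraph, so it receives the higher likelihood and is the one returned.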
Our users were Amazon MTurk workers. Each HIT involved completing a dialogue for one topic.
Workers were free to do as many HITs as they wished. In all, we collected 207 dialogues from 220 unique workers. The
average length of a dialogue was 5.43 turns. Users marked 816 out of 1123 responses relevant, and indicated satisfaction
with 72% of their sessions.
Since we specifically restricted retrieval to relevant paragraphs, why were nearly 30% marked nonrelevant? The
explanation boils down to disagreements between users and ourselves about the meaning of the subtopic labels as well
as the relevance of paragraphs to those labels. We trust that the users know best, so for the remainder of the study we
use the user-assessed judgments rather than the subtopic-question judgments.
5.2 Subtopic Transitions
The final component of the framework is the transition probabilities between subtopics in a topic. We compute these
from the user data collected as described above. We use a simple Bayesian prior and updating approach. Since a
relevance-independent (RI) transition probability $p(S_m = s_j \mid S_{m-1} = s_i)$ is multinomial, we initially assume a Dirichlet
prior with equal-valued parameters $a_{0,i}, \dots, a_{j,i}, \dots, a_{n+1,i}$, resulting in a uniform posterior. Each time we observe a
transition from $s_i$ to $s_j$ in a user dialogue, we simply increment the prior parameter $a_{j,i}$ by one, causing the posterior
probability to increase.
For the relevance-dependent (RD) model, we use a very similar approach. The only difference is that we additionally
condition each multinomial distribution on the user-assessed relevance of the answer received at turn $m$.
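The updating scheme can be sketched as pseudo-count bookkeeping. The code below is an illustrative sketch, not the authors' code: it assumes each dialogue is logged as a list of `(subtopic_index, relevant)` turns with subtopics numbered 1 to n, uses state 0 for start and n+1 for end, and gives every allowed transition the same pseudo-count `prior` (a hypothetical parameter). The start-to-end transition is given no mass, following Section 4.2.

```python
def estimate_transitions(dialogues, n_subtopics, prior=1.0, relevance_dependent=False):
    """Posterior-mean transition probabilities from logged dialogues.

    Returns {from_key: {to_state: probability}}, where from_key is the previous
    state (RI) or a (previous state, previous relevance) pair (RD).
    """
    start, end = 0, n_subtopics + 1
    counts = {}

    def row_for(state, relevant):
        key = (state, relevant) if relevance_dependent else state
        if key not in counts:
            counts[key] = {target: prior for target in range(1, end + 1)}
            if state == start:
                counts[key][end] = 0.0  # at least one query must be issued
        return counts[key]

    for dialogue in dialogues:
        if not dialogue:
            continue  # skip empty logs: start -> end is not allowed
        state, relevant = start, False
        for subtopic, turn_relevant in dialogue:
            row_for(state, relevant)[subtopic] += 1
            state, relevant = subtopic, turn_relevant
        row_for(state, relevant)[end] += 1  # the user leaves after the last turn

    return {
        key: {target: c / sum(row.values()) for target, c in row.items()}
        for key, row in counts.items()
    }


# Toy logs (hypothetical): two dialogues over a topic with two subtopics.
toy_dialogues = [[(1, True), (2, False), (2, True)], [(1, False)]]
ri = estimate_transitions(toy_dialogues, n_subtopics=2)
rd = estimate_transitions(toy_dialogues, n_subtopics=2, relevance_dependent=True)
```

With these toy logs, both dialogues open on subtopic 1, so the estimated probability of moving from start to subtopic 1 is $(1 + 2)/(3 + 1) = 0.75$.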
6 RESULTS AND ANALYSIS
In this section we use the experiment data from above to analyze our ability to perform reliable, reproducible offline
evaluation of conversational search. As we wrote in Section 3, our research questions are:
RQ1 Does a simulation-based offline evaluation accurately capture user satisfaction?
RQ2 Is our ECS metric a better fit to user behavior than other metrics for conversational search evaluation?
RQ3 Can our framework detect differences in effectiveness in conversational search systems?
The data we use to investigate these questions is as follows:
(1) Users’ self-reported satisfaction in their search, as described in Section 5.1.
(2) Three evaluation metrics computed for the recorded user dialogues (not simulated):
(a) Precision (P), the proportion of correct answers in the dialogue;
(b) Rank-biased precision (RBP), a geometric-weighted version of precision;
(c) Expected conversation satisfaction (ECS), the measure we propose in Section 4.
(3) The same three evaluation metrics averaged over $N$ simulated dialogues generated using Algorithm 1.
(4) A candidate conversational search system based on language modeling.
We will answer the research questions by showing that:
RQ1 The metrics computed over simulated dialogues (item 3) all correlate well with user satisfaction (item 1).
RQ2 ECS correlates better with user satisfaction and fits user querying behavior better than the other metrics, with
both non-simulated (item 2) and simulated (item 3) data.
RQ3 As system (item 4) effectiveness decreases in a controlled way, ECS decreases in an expected pattern.
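Metric   Sim.   Parameters                               ln(TSE)    ln(TAE)    KLD
----------------------------------------------------------------------------------
P        RI     -                                        -2.1201    0.5277     2.1544
P        RD     -                                        -2.0980    0.5572     2.3984
RBP      RI     $\alpha = 0.79$                          -5.2268    -1.3926    0.0734
RBP      RD     $\alpha = 0.79$                          -5.2268    -1.3926    0.0734
ECS      RI     $\alpha_+ = 0.82$, $\alpha_- = 0.70$     -5.2617    -1.4062    0.0718
ECS      RD     $\alpha_+ = 0.85$, $\alpha_- = 0.64$     -5.2774    -1.4174    0.0706
Table 3. Model parameters and errors.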
Metric   Sim.   Parameters                               $\tau$    $\rho$    $r$
--------------------------------------------------------------------------------
P        RI     -                                        0.3963    0.4184    0.4659
P        RD     -                                        0.6606    0.8200    0.8577
RBP      RI     $\alpha = 0.79$                          0.3963    0.4184    0.4660
RBP      RD     $\alpha = 0.79$                          0.6606    0.8200    0.8389
ECS      RI     $\alpha_+ = 0.82$, $\alpha_- = 0.70$     0.3963    0.4184    0.4515
ECS      RD     $\alpha_+ = 0.85$, $\alpha_- = 0.64$     0.6972    0.8383    0.8492
Table 4. Correlations (Kendall’s $\tau$, Spearman’s $\rho$, and Pearson’s $r$) between self-reported user satisfaction and metric values (with
two different dialogue simulation models over 100,000 trials per topic).
6.1 Results
We first investigate the correlation between conversational evaluation metrics and self-reported user satisfaction. The
primary results are based on simulating dialogues using Algorithm 1 with subtopics sampled according to one of the two
approaches described in Section 4: relevance independent (RI) transitions between subtopics, and relevance dependent
(RD) transitions. For each topic, we simulate 100,000 dialogues with the candidate system and compute the three metrics
for each, averaging over dialogues to obtain an expected value for the topic. We estimate expected satisfaction with a
topic by averaging the binary user-reported satisfaction responses. We then compute correlations between each of the
metrics and the expected satisfaction values.
In order to compute RBP and ECS, we first need to optimize the parameters of the metrics. To set the parameters we
performed a simple grid search to minimize total square error (TSE) between $p(Q_m)$ as estimated by their user models
and the actual probability of a user reaching the $m$th turn in a dialogue. In Table 3 we show the model hyperparameters
obtained by optimizing the fit of the model. In the table we also report the best achieved TSE (as well as the corresponding
total absolute error (TAE) and KL-divergence) between these two quantities for simulations based on all three metrics,
as well as the two types of transitions (RI and RD). As can be seen, the two variants of ECS tend to achieve lower
modeling error compared to the other metrics.
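A grid search of this kind can be sketched for the single-parameter (RBP-style) user model, where the predicted probability of reaching turn $m$ is $\alpha^{m-1}$. The sketch below is illustrative only: the grid resolution, the function names, and the synthetic observations are assumptions, and the two-parameter ECS case would additionally need the per-turn relevance labels of each dialogue.

```python
def reach_probabilities(dialogue_lengths, max_turn):
    """Empirical probability that a user reaches turn m, for m = 1..max_turn."""
    n = len(dialogue_lengths)
    return [
        sum(1 for length in dialogue_lengths if length >= m) / n
        for m in range(1, max_turn + 1)
    ]


def total_square_error(alpha, observed):
    """TSE between the geometric model alpha**(m-1) and observed reach probabilities."""
    return sum((alpha ** (m - 1) - p) ** 2 for m, p in enumerate(observed, start=1))


def grid_search_alpha(observed, steps=100):
    """Pick the alpha on a uniform grid over [0, 1] that minimizes TSE."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda alpha: total_square_error(alpha, observed))


# Synthetic check (hypothetical data): observations generated with alpha = 0.8.
synthetic = [0.8 ** (m - 1) for m in range(1, 11)]
best_alpha = grid_search_alpha(synthetic)
toy_reach = reach_probabilities([1, 2, 3], max_turn=3)
```

On the synthetic observations the search recovers the generating value, since the error is exactly zero at $\alpha = 0.8$ and positive elsewhere on the grid.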
Once the model parameters are computed, we can compute the correlations between the different metrics and the
user satisfaction labels obtained from our participants. Table 4 shows results for three different correlation coefficients:
Kendall’s $\tau$ rank correlation, based on ranking topics by expected satisfaction across all our participants and counting
the number of pairwise swaps needed to reach the ranking by the metric; Spearman’s $\rho$ rank correlation, a linear
correlation on rank positions; and Pearson’s $r$ linear correlation between the numeric values themselves. All three
correlation coefficients range from $-1$ to $1$, with 0 indicating a random relationship between the two rankings. A metric
that consistently scores higher on topics for which users self-report higher satisfaction, and vice-versa, will have a
higher correlation.
Metric   Parameters                               ln(TSE)    ln(TAE)    KLD
---------------------------------------------------------------------------
P        -                                        -2.7129    -0.1005    0.5317
RBP      $\alpha = 0.80$                          -5.3199    -1.5106    0.0496
ECS      $\alpha_+ = 1.00$, $\alpha_- = 0.53$     -7.5696    -2.6574    0.0054
Table 5. Model parameters and errors for the non-simulated case.
Metric   Parameters                               $\tau$    $\rho$    $r$
-------------------------------------------------------------------------
P        -                                        0.6972    0.8383    0.9178
RBP      $\alpha = 0.80$                          0.7340    0.8611    0.8952
ECS      $\alpha_+ = 1.00$, $\alpha_- = 0.53$     0.7340    0.8702    0.9088
Table 6. Correlations between self-reported user satisfaction and metrics computed on real sessions.
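To make the pairwise-swap idea behind Kendall's $\tau$ concrete, here is a small sketch of the tau-a variant, which counts concordant and discordant topic pairs and ignores ties. Which variant the study used is not stated, so tau-a is an assumption, and the per-topic numbers below are made-up illustrations, not results from the paper.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least two items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-topic values: mean satisfaction vs. a metric score.
satisfaction = [0.60, 0.75, 0.90]
metric_score = [0.40, 0.55, 0.50]
tau = kendall_tau_a(satisfaction, metric_score)
```

In this toy example two of the three topic pairs are ordered the same way by both lists and one pair is swapped, giving $\tau = (2 - 1)/3$.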
The maximum reported correlations in this table are very strong (and statistically significant) correlations, showing
that indeed our simulation-based framework can accurately capture user satisfaction, supporting a positive answer
to RQ1. Note that the simulation using relevance-dependent transition probabilities correlates far better than the
simulation using relevance-independent transition probabilities. This suggests that users do adjust their
questions in response to system answers, and an evaluation that fails to model this fails to model user satisfaction well.
Table 3 and Table 4 together demonstrate the strength of ECS as a metric: it achieves lower modeling error and
better correlations than the other two metrics (apart from the higher linear correlation that precision achieves). This
supports a positive answer to RQ2, that ECS is a better fit to user behavior than other metrics. The differences are
small but consistent, suggesting a real effect, though we must note that sample sizes are too small to detect statistical
significance in the differences of the closest results. We leave tests with larger datasets for future work.
To investigate correlations and model errors in more depth, consider Tables 5 and 6. These tables are similar to
Tables 3 and 4, but take as input a single real user dialogue; they are based on no simulation. They are thus not reflective
of offline evaluation scenarios for which real dialogues do not exist. But they could be thought of as a sort of ceiling (or
floor) for the correlation (or model error, respectively). Our RD simulation achieves correlations very close to those
reported in Table 6, suggesting that not only is it a good model of user satisfaction, it is approaching the best possible
performance for any model based on the same assumptions. Furthermore, the errors are substantially lower than with
simulated data. These tables reinforce RQ1, supporting the idea that simulation is an acceptable substitute for real user
dialogues, as well as RQ2, in that ECS fits user behavior better than either P or RBP on real user dialogues.
Finally, to answer RQ3, we simulated “real” retrieval systems by degrading the one used by our users. These systems
were progressively more likely to return irrelevant answers, by adding additional paragraphs to the candidate sets
for each subtopic. Recall that the system our users interacted with would only retrieve responses relevant to the subtopic. The same
system degraded by 10% could potentially retrieve an additional 10% of the full corpus of paragraphs, introducing
much more possibility for error in retrieval results. These additional answers would both be irrelevant to users and
redirect the dialogue in unexpected ways, thus testing both the ability of the metric to measure degraded performance
and the ability of the simulation to respond to such degradation. Figure 5 shows the result. As noise (irrelevant
responses) increases, effectiveness drops precipitously. This supports a positive answer to RQ3.
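One way to picture this degradation is as widening each subtopic's candidate set with randomly chosen extra paragraphs. The sketch below is a guess at such a procedure, not the one actually used: it assumes the noise level is the fraction of the non-candidate paragraphs added to each set, so that 0 leaves the curated sets unchanged and 1 exposes the whole corpus. The function name and toy corpus are hypothetical.

```python
import random


def degrade_candidates(relevant_by_subtopic, corpus, noise, seed=0):
    """Add a `noise` fraction of the remaining corpus to each subtopic's candidates."""
    if not 0.0 <= noise <= 1.0:
        raise ValueError("noise must be between 0 and 1")
    rng = random.Random(seed)
    degraded = {}
    for subtopic, relevant in relevant_by_subtopic.items():
        pool = [p for p in corpus if p not in relevant]  # paragraphs not yet retrievable
        n_extra = round(noise * len(pool))
        degraded[subtopic] = list(relevant) + rng.sample(pool, n_extra)
    return degraded


# Toy corpus (hypothetical): ten paragraph ids, two of them curated for subtopic s1.
toy_corpus = [f"p{i}" for i in range(10)]
curated = {"s1": ["p0", "p1"]}
clean = degrade_candidates(curated, toy_corpus, noise=0.0)
half = degrade_candidates(curated, toy_corpus, noise=0.5)
full = degrade_candidates(curated, toy_corpus, noise=1.0)
```

The degraded candidate sets could then be handed to the retrieval function of the candidate system, with the simulation of Algorithm 1 run once per noise level.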
Fig. 5. ECS computed across all topics (over 10,000 trials per topic) varying the system noise, from 0%, where only the relevant
answers to each subtopic can be retrieved, to 100%, where any answer in the dataset can be retrieved.
Fig. 6. Comparison of estimated $P(Q_m)$, the probability of reaching turn $m$, in relevance-dependent simulations (over 100,000 trials
per topic) vs. observed user behavior.
Conditional probability                                                                              Value
----------------------------------------------------------------------------------------------------------
$p$(querying the same subtopic at step $m$ | the answer was relevant at step $m-1$)                  0.121
$p$(querying the same subtopic at step $m$ | the answer was nonrelevant at step $m-1$)               0.340
$p$(querying another subtopic at step $m$ | the answer was relevant at step $m-1$)                   0.879
$p$(querying another subtopic at step $m$ | the answer was nonrelevant at step $m-1$)                0.660
Table 7. Marginal probabilities computed on the observed conversations.
Fig. 7. Transition probability tables for the ‘Harvard University’ topic. This topic has 5 subtopics; $s_0$ is the initial state and $s_6$ is
the end state. On the left we have the RI case; in the center and on the right we have the RD case: the former when the answer is not
relevant, the latter when the answer is relevant.
6.2 Additional Analysis
6.2.1 Modeling errors. Table 3 above summarized model errors. Figure 6 illustrates the errors in a more granular way,
showing how well the user model fits the observed user data, specifically in terms of the probability of a user reaching
turn $m$ in a dialogue. Note that RBP and ECS are about equal, which is also reflected in Table 3.
6.2.2 Conditional transition probabilities. Table 7 reports some marginal conditional probabilities from our user
experiment. In particular, users are more likely to switch to a different subtopic if they have just seen a response relevant
to their current subtopic than if they have not. As well, users are more likely to ask about the same subtopic if the
answer is not relevant than if it is. This shows that the system’s answers do influence user behavior, as we argued in
Section 3.
Figure 7 shows examples of empirical subtopic transition probabilities for the “Harvard University” topic with five
subtopics, plus the start and end states $s_0$ and $s_6$. Each row contains the probability of transitioning from the state
indicated by the row label to the state indicated by the column label. The first matrix is used in the RI case, while
the second and third matrices are used in the RD case; the second is conditional on non-relevance while the third is
conditional on relevance.
We make some observations based on these figures. The first matrix is fairly uniform: when transitions are not based on
relevance of responses, the simulation produces no strong tendency to move in any particular way through the subtopic
graph. The second is quite sparse: when users are provided with answers that they do not find relevant (recall that we
asked users to indicate relevance as well, and despite the system retrieving from a subset of relevant paragraphs, users
could disagree) there are typically only a few options they take. In some cases ($s_2$, $s_3$, and $s_5$), they are likely to issue
another query on the same subtopic. In one case ($s_1$) they give up immediately. The chance of switching subtopics for
this topic is relatively low, as we saw in aggregate in Table 7. The third is interesting in that the diagonal is all zeros:
users never follow up a relevant answer on one subtopic with another question on the same subtopic. This demonstrates
part of our original motivation for the work, that users’ questions are dependent on system responses.
7 CONCLUSION
We have introduced a novel approach for offline reproducible evaluation of conversational search systems. Our approach
models user queries and system responses with subtopics, an idea from novelty and diversity search evaluation. We
propose a metric based on simulating users transitioning between subtopics in the course of a dialogue. Our
simulation-based methodology correlates strongly with user satisfaction, and our metric correlates better than others.
Our approach has limitations. The label “conversational search” could be applied to a wide variety of problems and
search scenarios, and like any class of search problems, it is unlikely there is any one-size-fits-all solution to evaluation
for all possible settings. Ours is ideal for settings with relatively complex information needs that cannot be answered
in a single turn, but that do have factual answers; for which the desired information can be represented by a finite
set of discrete subtopics; for which information returned for one query may influence future queries; and when the
information returned is relatively long-form (sentence or paragraph length).
The proposed approach considers only conversational systems where the main initiative is provided by the user. In
fact, the notion of persistence is only modeled from the perspective of the user, who is the only one deciding when the
interaction should stop. In mixed-initiative conversational systems, where the initiative is also taken by the system,
the system could also decide when to stop the interaction. Hence, a possible extension of this approach could be the
introduction of a system’s persistence similar to the concept of user’s persistence. This would be in line with the notion
of pro-activity of conversational search systems as suggested by Trippas et al. [40].
We cannot infer from the collected data whether a user found a paragraph relevant to the rest of the subtopics. This is
because in the crowdsourcing experiment we only asked users to indicate whether a document is relevant to the submitted
query with which a subtopic is associated. Therefore we can only be certain about the relevance of the retrieved paragraph
to the queried subtopic. This limitation, although it makes the crowdsourcing task more realistic, does not allow us to
make stronger assumptions about which subtopics the user has already been exposed to in previous interactions with
the system.
Nevertheless, an offline evaluation framework that accurately captures user satisfaction, that can address the problem
of dialogues with new systems taking turns that are not seen in online systems, that is fully reproducible, and that is
relatively straightforward to implement will be an invaluable tool for developers of conversational search systems. As
our immediate next step, we intend to implement our framework to train and test real conversational search systems.
ACKNOWLEDGMENTS
This project was funded by the EPSRC Fellowship titled “Task Based Information Retrieval”, grant reference number
EP/P024289/1.
REFERENCES
[1] Azzah Al-Maskari, Mark Sanderson, and Paul Clough. 2007. The Relationship between IR Effectiveness Measures and User Satisfaction. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Amsterdam, The Netherlands) (SIGIR ’07). Association for Computing Machinery, New York, NY, USA, 773–774. https://doi.org/10.1145/1277741.1277902
[2] AlexaPrize. 2020. The Alexa Prize the socialbot challenge. https://developer.amazon.com/alexaprize/
[3] Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft. 2019. Asking Clarifying Questions in Open-Domain Information-Seeking Conversations. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Paris, France) (SIGIR ’19). Association for Computing Machinery, New York, NY, USA, 475–484. https://doi.org/10.1145/3331184.3331265
[4] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland UK) (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3290605.3300233
[5] Enrique Amigó, Damiano Spina, and Jorge Carrillo-de Albornoz. 2018. An Axiomatic Analysis of Diversity Evaluation Metrics: Introducing the Rank-Biased Utility Metric. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (Ann Arbor, MI, USA) (SIGIR ’18). Association for Computing Machinery, New York, NY, USA, 625–634. https://doi.org/10.1145/3209978.3210024
[6] Avishek Anand, Lawrence Cavedon, Hideo Joho, Mark Sanderson, and Benno Stein. 2020. Conversational Search (Dagstuhl Seminar 19461). Dagstuhl Reports 9, 11 (2020), 34–83. https://doi.org/10.4230/DagRep.9.11.34
[7] Joeran Beel, Marcel Genzmehr, Stefan Langer, Andreas Nürnberger, and Bela Gipp. 2013. A Comparative Analysis of Offline and Online Evaluations and Discussion of Research Paper Recommender System Evaluation. In Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation (Hong Kong, China) (RepSys ’13). Association for Computing Machinery, New York, NY, USA, 7–14. https://doi.org/10.1145/2532508.2532511
[8] Daniel Braun, Adrian Hernandez Mendez, Florian Matthes, and Manfred Langen. 2017. Evaluating Natural Language Understanding Services for Conversational Question Answering Systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. Association for Computational Linguistics, Saarbrücken, Germany, 174–185. https://doi.org/10.18653/v1/W17-5522
[9] H.C. Bunt. 1981. Conversational principles in question-answer dialogues. In Zur Theorie der Frage: Kolloquium, 1978, Bad Homburg: Vortraege / hrsg. von D. Krallman und G. Stickel (Forschungsberichte des Instituts fuer Deutsche Sprache Mannheim). Tübingen, 119–142.
[10] Ben Carterette, Evangelos Kanoulas, Mark M. Hall, and Paul D. Clough. 2014. Overview of the TREC 2014 Session Track. In TREC.
[11] Justine Cassell. 2001. Embodied Conversational Agents: Representation and Intelligence in User Interfaces. AI Magazine 22, 4 (Dec. 2001), 67. https://doi.org/10.1609/aimag.v22i4.1593
[12] Olivier Chapelle, Donald Metzler, Ya Zhang, and Pierre Grinspan. 2009. Expected Reciprocal Rank for Graded Relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (Hong Kong, China) (CIKM ’09). Association for Computing Machinery, New York, NY, USA, 621–630. https://doi.org/10.1145/1645953.1646033
[13] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question Answering in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 2174–2184. https://doi.org/10.18653/v1/D18-1241
[14] Jason Ingyu Choi, Ali Ahmadvand, and Eugene Agichtein. 2019. Offline and Online Satisfaction Prediction in Open-Domain Conversational Systems. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (Beijing, China) (CIKM ’19). Association for Computing Machinery, New York, NY, USA, 1281–1290. https://doi.org/10.1145/3357384.3358047
[15] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards Conversational Recommender Systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 815–824. https://doi.org/10.1145/2939672.2939746
[16] Charles L.A. Clarke, Maheedhar Kolla, Gordon V. Cormack, Olga Vechtomova, Azin Ashkan, Stefan Büttcher, and Ian MacKinnon. 2008. Novelty and Diversity in Information Retrieval Evaluation. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Singapore, Singapore) (SIGIR ’08). Association for Computing Machinery, New York, NY, USA, 659–666. https://doi.org/10.1145/1390334.1390446
[17] Charles L. A. Clarke, Nick Craswell, Ian Soboroff, and Ellen M. Voorhees. 2011. Overview of the TREC 2011 Web Track. In Proceedings of The Twentieth Text REtrieval Conference, TREC 2011, Gaithersburg, Maryland, USA, November 15-18, 2011. http://trec.nist.gov/pubs/trec20/papers/WEB.OVERVIEW.pdf
[18] Jerey Dalton, Chenyan Xiong, and Jamie Callan. 2020. TREC CAsT 2019: The Conversational Assistance Track Overview. In TREC.
[19]
GG Gebremeskel and AP de Vries. 2016. Recommender Systems Evaluations: Oine, Online, Time and A/A Test. In CLEF 2016: Working Notes of
CLEF 2016-Conference and Labs of the Evaluation forum Évora, Portugal, 5-8 September, 2016. [Sl]: CEUR, 642–656.
[20]
David Griol, Javier Carbó, and José M. Molina. 2013. An Automatic Dialog Simulation Technique To Develop and Evaluate Interactive Conversational
Agents. Appl. Artif. Intell. 27, 9 (Oct. 2013), 759–780.
[21]
David Griol, Javier Carbó, and José M. Molina. 2013. A Statistical Simulation Technique to Develop and Evaluate Conversational Agents. 26, 4
(2013), 355–371.
[22]
Fenfei Guo, Angeliki Metallinou, Chandra Khatri, Anirudh Raju, Anu Venkatesh, and Ashwin Ram. 2017. Topic-based Evaluation for Conversational
Bots. NIPS Conversational AI Workshop.
[23] Ben Hixon, Peter Clark, and Hannaneh Hajishirzi. 2015. Learning Knowledge Graphs for Question Answering through Conversational Dialog. In
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Association for Computational Linguistics, Denver, Colorado, 851–861. https://doi.org/10.3115/v1/N15-1086
[24] Evangelos Kanoulas, Ben Carterette, Paul D. Clough, and Mark Sanderson. 2011. Evaluating Multi-Query Sessions. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (Beijing, China) (SIGIR ’11). Association for Computing Machinery, New York, NY, USA, 1053–1062. https://doi.org/10.1145/2009916.2010056
[25] Aldo Lipani, Ben Carterette, and Emine Yilmaz. 2019. From a User Model for Query Sessions to Session Rank Biased Precision (sRBP). In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval (Santa Clara, CA, USA) (ICTIR ’19). Association for Computing Machinery, New York, NY, USA, 109–116. https://doi.org/10.1145/3341981.3344216
[26] Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 2122–2132. https://doi.org/10.18653/v1/D16-1230
[27] Fei Liu, Alistair Moffat, Timothy Baldwin, and Xiuzhen Zhang. 2016. Quit While Ahead: Evaluating Truncated Rankings (SIGIR ’16). Association for Computing Machinery, New York, NY, USA, 953–956. https://doi.org/10.1145/2911451.2914737
[28] Xingkun Liu, Arash Eshghi, Pawel Swietojanski, and Verena Rieser. 2019. Benchmarking Natural Language Understanding Services for building Conversational Agents. In 10th International Workshop on Spoken Dialogue Systems Technology 2019 (IWSDS ’19). https://iwsds2019.unikore.it/
[29] Alistair Moffat and Justin Zobel. 2008. Rank-Biased Precision for Measurement of Retrieval Effectiveness. ACM Trans. Inf. Syst. 27, 1, Article 2 (Dec. 2008), 27 pages. https://doi.org/10.1145/1416950.1416952
[30] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. In CoCo@NIPS, Vol. abs/1611.09268. http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf
[31] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318. https://doi.org/10.3115/1073083.1073135
[32] Filip Radlinski, Krisztian Balog, Bill Byrne, and Karthik Krishnamoorthi. 2019. Coached Conversational Preference Elicitation: A Case Study in Understanding Movie Preferences. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue. Association for Computational Linguistics, Stockholm, Sweden, 353–360. https://doi.org/10.18653/v1/W19-5941
[33] Filip Radlinski and Nick Craswell. 2017. A Theoretical Framework for Conversational Search. In Proceedings of the 2017 Conference on Conference Human Information Interaction and Retrieval (Oslo, Norway) (CHIIR ’17). Association for Computing Machinery, New York, NY, USA, 117–126. https://doi.org/10.1145/3020165.3020183
[34] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 2383–2392. https://doi.org/10.18653/v1/D16-1264
[35] Ehud Reiter. 2018. A Structured Review of the Validity of Bleu. Comput. Linguist. 44, 3 (Sept. 2018), 393–401. https://doi.org/10.1162/coli_a_00322
[36] Tetsuya Sakai and Zhaohao Zeng. 2019. Which Diversity Evaluation Measures Are "Good"?. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Paris, France) (SIGIR ’19). Association for Computing Machinery, New York, NY, USA, 595–604. https://doi.org/10.1145/3331184.3331215
[37] Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. 2018. A Survey of Available Corpora For Building Data-Driven Dialogue Systems: The Journal Version. Dialogue & Discourse 9, 1 (2018), 1–49.
[38] Zhiwen Tang and Grace Hui Yang. 2017. Investigating per Topic Upper Bound for Session Search Evaluation. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval (Amsterdam, The Netherlands) (ICTIR ’17). Association for Computing Machinery, New York, NY, USA, 185–192. https://doi.org/10.1145/3121050.3121069
[39] Zhiwen Tang and Grace Hui Yang. 2019. Dynamic Search–Optimizing the Game of Information Seeking. arXiv preprint arXiv:1909.12425 (2019).
[40] Johanne R. Trippas, Damiano Spina, Paul Thomas, Mark Sanderson, Hideo Joho, and Lawrence Cavedon. 2020. Towards a model for spoken conversational search. Information Processing & Management 57, 2 (2020), 102162. https://doi.org/10.1016/j.ipm.2019.102162
[41] Anu Venkatesh, Chandra Khatri, Ashwin Ram, Fenfei Guo, Raefer Gabriel, Ashish Nagar, Rohit Prasad, Ming Cheng, Behnam Hedayatnia, Angeliki Metallinou, Rahul Goel, Shaohua Yang, and Anirudh Raju. 2017. On Evaluating and Comparing Conversational Agents. In NIPS Conversational AI Workshop.
[42] Oriol Vinyals and Quoc V. Le. 2015. A Neural Conversational Model. In ICML Deep Learning Workshop. http://arxiv.org/pdf/1506.05869v3.pdf
[43] Grace Hui Yang and Ian Soboroff. 2016. TREC 2016 Dynamic Domain Track Overview. In TREC.
[44] Kangyan Zhou, Shrimai Prabhumoye, and Alan W Black. 2018. A Dataset for Document Grounded Conversations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 708–713.
... However, the offline evaluation of conversational search systems (CSSs) is more challenging [28]. Although we can decompose a conversation into turns and evaluate each turn independently, this assumption disregards the fact that a turn's relevance may be dependent on what happened in the previous turns. ...
... Understanding how much these assumptions hold will allow the extension of the Cranfield paradigm to CSSs. This is important also in order to inform the development of better user models [27,28] and user simulators [21], therefore better evaluation measures and training procedures. ...
... In this paper, we perform a crowdsourcing study to compare the online and offline evaluation of a CSS. For the online evaluation, we reuse the conversation logs collected by Lipani et al. [28]. For the offline evaluation, we perform the crowdsourcing study and collect new assessments. ...
Conference Paper
Full-text available
Due to the sequential and interactive nature of conversations, the application of traditional Information Retrieval (IR) methods like the Cranfield paradigm require stronger assumptions. When building a test collection for Ad Hoc search, it is fair to assume that the relevance judgments provided by an annotator correlate well with the relevance judgments perceived by an actual user of the search engine. However, when building a test collection for conversational search, we do not know if it is fair to assume the same. In this paper, we perform a crowdsourcing study to evaluate the applicability of the Cranfield paradigm to conversational search systems. Our main aim is to understand what is the agreement in terms of user satisfaction between the users performing a search task in a conversational search system (i.e., directly assessing the system) and the users observing the search task being performed (i.e., indirectly assessing the system). The results of this study are paramount because they underpin and guide 1) the development of more realistic user models and simulators, and 2) the design of more reliable and robust evaluation measures for conversational search systems. Our results show that there is a fair agreement between direct and indirect assessments in terms of user satisfaction and that these two kinds of assessments share similar conversational patterns. Indeed, by collecting relevance assessments for each system utterance, we tested several conversational patterns that show a promising ability to predict user satisfaction.
... To tackle these limitations, user simulators, mimicking users' behavior, are needed in order to train and evaluate dialogue systems [1,15]. For training, a user simulator can be used to generate a vast amount of synthetic dialogues, which then can be used for teaching dialogue strategies to the agent. ...
... For example, in reinforcement learning, a simulator can act as the environment that affects the agent with its own rewards [5,11]. For evaluation, a user simulator can be used to interact with dialogue systems while tracking the generated dialogues with predicted satisfaction scores [15]. ...
Conference Paper
Full-text available
A human-like user simulator that anticipates users' satisfaction scores, actions, and utterances can help goal-oriented dialogue systems in evaluating the conversation and refining their dialogue strategies. However, little work has experimented with user simulators which can generate users' utterances. In this paper, we propose a deep learning-based user simulator that predicts users' satisfaction scores and actions while also jointly generating users' utterances in a multi-task manner. In particular, we show that 1) the proposed deep text-to-text multi-task neural model achieves state-of-the-art performance in the users' satisfaction scores and actions prediction tasks, and 2) in an ablation analysis, user satisfaction score prediction, action prediction, and utterance generation tasks can boost the performance with each other via positive transfers across the tasks. The source code and model checkpoints used for the experiments run in this paper are available at the following weblink: https://github.com/kimdanny/user-simulation-t5.
... Recently, the Information Retrieval (IR) and Natural Language Processing (NLP) communities have been enjoying major performance enhancements in many core tasks including ad hoc retrieval, question answering, and conversational search [12,19,23,26,31]. Despite this recent progresses, even the best-performing systems are not able to perform smoothly and coherently on all requests [5,27,29]. For example, broad queries can have multiple interpretations and aspects [13,14,36,41], and satisfying such ambiguous ...
Preprint
Despite recent progress on conversational systems, they still do not perform smoothly and coherently when faced with ambiguous requests. When questions are unclear, conversational systems should have the ability to ask clarifying questions, rather than assuming a particular interpretation or simply responding that they do not understand. Previous studies have shown that users are more satisfied when asked a clarifying question, rather than receiving an unrelated response. While the research community has paid substantial attention to the problem of predicting query ambiguity in traditional search contexts, researchers have paid relatively little attention to predicting when this ambiguity is sufficient to warrant clarification in the context of conversational systems. In this paper, we propose an unsupervised method for predicting the need for clarification. This method is based on the measured coherency of results from an initial answer retrieval step, under the assumption that a less ambiguous query is more likely to retrieve more coherent results when compared to an ambiguous query. We build a graph from retrieved items based on their context similarity, treating measures of graph connectivity as indicators of ambiguity. We evaluate our approach on two recently released open-domain conversational question answering datasets, ClariQ and AmbigNQ, comparing it with neural and non-neural baselines. Our unsupervised approach performs as well as supervised approaches while providing better generalization.
... User simulation has been widely leveraged in the past for training the dialogue state tracking component of conversational agents using reinforcement learning algorithms, either via agenda-based or model-based simulation [19]. The highly interactive nature of conversational information access systems has also sparked renewed interest in evaluation using user simulation within the IR community [4,5,23,36,38,53]. Recently, Zhang and Balog [53] proposed a general framework for evaluating conversational recommender systems using user simulation. ...
Conference Paper
Full-text available
User simulation has been a cost-effective technique for evaluating conversational recommender systems. However, building a human-like simulator is still an open challenge. In this work, we focus on how users reformulate their utterances when a conversational agent fails to understand them. First, we perform a user study, involving five conversational agents across different domains, to identify common reformulation types and their transition relationships. A common pattern that emerges is that persistent users would first try to rephrase, then simplify, before giving up. Next, to incorporate the observed reformulation behavior in a user simulator, we introduce the task of reformulation sequence generation: to generate a sequence of reformulated utterances with a given intent (rephrase or simplify). We develop methods by extending transformer models guided by the reformulation type and perform further filtering based on estimated reading difficulty. We demonstrate the effectiveness of our approach using both automatic and human evaluation. CCS CONCEPTS • Information systems → Users and interactive retrieval.
... User simulation has been widely leveraged in the past for training the dialogue state tracking component of conversational agents using reinforcement learning algorithms, either via agenda-based or model-based simulation [19]. The highly interactive nature of conversational information access systems has also sparked renewed interest in evaluation using user simulation within the IR community [4,5,23,36,38,53]. Recently, Zhang and Balog [53] proposed a general framework for evaluating conversational recommender systems using user simulation. ...
Preprint
User simulation has been a cost-effective technique for evaluating conversational recommender systems. However, building a human-like simulator is still an open challenge. In this work, we focus on how users reformulate their utterances when a conversational agent fails to understand them. First, we perform a user study, involving five conversational agents across different domains, to identify common reformulation types and their transition relationships. A common pattern that emerges is that persistent users would first try to rephrase, then simplify, before giving up. Next, to incorporate the observed reformulation behavior in a user simulator, we introduce the task of reformulation sequence generation: to generate a sequence of reformulated utterances with a given intent (rephrase or simplify). We develop methods by extending transformer models guided by the reformulation type and perform further filtering based on estimated reading difficulty. We demonstrate the effectiveness of our approach using both automatic and human evaluation.
... For this reason, researchers adopt human-in-the-loop techniques to mimic human-computer interactions, and further perform human annotation to evaluate the whole system's performance (in response to human). Recent work of Lipani et al. [30] propose a metric for offline evaluation of conversational search systems based on user interaction model. ...
Preprint
Clarifying the underlying user information need by asking clarifying questions is an important feature of modern conversational search system. However, evaluation of such systems through answering prompted clarifying questions requires significant human effort, which can be time-consuming and expensive. In this paper, we propose a conversational User Simulator, called USi, for automatic evaluation of such conversational search systems. Given a description of an information need, USi is capable of automatically answering clarifying questions about the topic throughout the search session. Through a set of experiments, including automated natural language generation metrics and crowdsourcing studies, we show that responses generated by USi are both inline with the underlying information need and comparable to human-generated answers. Moreover, we make the first steps towards multi-turn interactions, where conversational search systems asks multiple questions to the (simulated) user with a goal of clarifying the user need. To this end, we expand on currently available datasets for studying clarifying questions, i.e., Qulac and ClariQ, by performing a crowdsourcing-based multi-turn data acquisition. We show that our generative, GPT2-based model, is capable of providing accurate and natural answers to unseen clarifying questions in the single-turn setting and discuss capabilities of our model in the multi-turn setting. We provide the code, data, and the pre-trained model to be used for further research on the topic.
Article
Online shopping platforms, such as Amazon and AliExpress, are increasingly prevalent in the society, helping customers purchase products conveniently. With recent progress on natural language processing, researchers and practitioners shift their focus from traditional product search to conversational product search. Conversational product search enables user-machine conversations and through them collects explicit user feedback that allows to actively clarify the users’ product preferences. Therefore, prospective research on an intelligent shopping assistant via conversations is indispensable. Existing publications on conversational product search either model conversations independently from users, queries, and products or lead to a vocabulary mismatch. In this work, we propose a new conversational product search model, ConvPS, to assist users to locate desirable items. The model is first trained to jointly learn the semantic representations of user, query, item, and conversation via a unified generative framework. After learning these representations, they are integrated to retrieve the target items in the latent semantic space. Meanwhile, we propose a set of greedy and explore-exploit strategies to learn to ask the user a sequence of high-performance questions for conversations. Our proposed ConvPS model can naturally integrate the representation learning of the user, query, item, and conversation into a unified generative framework, which provides a promising avenue for constructing accurate and robust conversational product search systems that are flexible and adaptive. Experimental results demonstrate that our ConvPS model significantly outperforms state-of-the-art baselines.
Article
Re-using research resources is essential for advancing knowledge and developing repeatable, empirically solid experiments in scientific fields, including interactive information retrieval (IIR). Despite recent efforts on standardizing research re-use and documentation, how to quantitatively measure the reusability of IIR resources still remains an open challenge. Inspired by the reusability evaluations on Cranfield experiments, our work proactively explores the problem of measuring IIR test collection reusability and makes threefold contributions: (1) constructing a novel usefulness-oriented framework with specific analytical methods for evaluating the reusability of IIR test collections consisting of query sets, document/page sets, and sets of task-document usefulness (tuse); (2) explaining the potential impacts of varying IIR-specific factors (e.g. search tasks, sessions, user characteristics) on test collection reusability; (3) proposing actionable methods for building reusable test collections in IIR and thereby amortizing the true cost of user-oriented evaluations. The Cranfield-inspired reusability assessment framework serves as an initial step towards accurately evaluating the reusability of IIR research resources and measuring the reproducibility of IIR evaluation results. It also demonstrates an innovative approach to integrating the insights from individual heterogeneous user studies with the evaluation techniques developed in standardized ad hoc retrieval experiments, which will facilitate the maturation of IIR fields and eventually benefits both sides of research.
Article
The advent of recent Natural Language Processing technology has led human and machine interactions more towards conversation. In Conversational Search Systems (CSS) like chatbots and Virtual Personal Assistants (VPA) such as Apple’s Siri, Amazon Alexa, Microsoft’s Cortana, and Google Assistant both user and device have limited platform to communicate through chatting or voice. In the information seeking process, often users do not know how to properly describe their information need in a machine understandable language. Consequently, it is hard for the assistant agent to predict the user’s intent and yield relevant results by only relying on the original query. Study has shown many unsatisfactory results can be enhanced with the benefit of CSS. Conversational search systems can dig deeper into the user’s query to reveal the real need. This survey intends to provide a comprehensive and comparative overview of ambiguous query clarification task in the context of conversational search technology. We investigate different approaches, their evaluation methods, and future work. We also address the importance of understanding a query for retrieving the most relevant document(s) and satisfying user’s need by predicting their potential request. This work provides a divine overview of characteristics of ambiguous queries and contributes to better understanding of the existing technologies and challenges in CSS focus on disambiguation of unclear queries from various dimensions.
Article
The use of clarifying questions (CQs) is a fairly new and useful technique to aid systems in recognizing the intent, context, and preferences behind user queries. Yet, understanding the extent of the effect of CQs on user behavior and the ability to identify relevant information remains relatively unexplored. In this work, we conduct a large user study to understand the interaction of users with CQs in various quality categories, and the effect of CQ quality on user search performance in terms of finding relevant information, search behavior, and user satisfaction. Analysis of implicit interaction data and explicit user feedback demonstrates that high-quality CQs improve user performance and satisfaction. By contrast, low- and mid-quality CQs are harmful, and thus allowing the users to complete their tasks without CQ support may be preferred in this case. We also observe that user engagement, and therefore the need for CQ support, is affected by several factors, such as search result quality or perceived task difficulty. The findings of this study can help researchers and system designers realize why, when, and how users interact with CQs, leading to a better understanding and design of search clarification systems.
Conference Paper
Full-text available
To satisfy their information needs, users usually carry out searches on retrieval systems by continuously trading off between the examination of search results retrieved by under-specified queries and the refinement of these queries through reformulation. In Information Retrieval (IR), a series of query reformulations is known as a query-session. Research in IR evaluation has traditionally been focused on the development of measures for the ad hoc task, for which a retrieval system aims to retrieve the best documents for a single query. Thus, most IR evaluation measures, with a few exceptions , are not suitable to evaluate retrieval scenarios that call for multiple refinements over a query-session. In this paper, by formally modeling a user's expected behaviour over query-sessions, we derive a session-based evaluation measure, which results in a generalization of the evaluation measure Rank Biased Precision (RBP). We demonstrate the quality of this new session-based evaluation measure, named Session RBP (sRBP), by evaluating its user model against the observed user behaviour over the query-sessions of the 2014 TREC Session track.
Conference Paper
Users often fail to formulate their complex information needs in a single query. As a consequence, they may need to scan multiple result pages or reformulate their queries, which may be a frustrating experience. Alternatively, systems can improve user satisfaction by proactively asking questions of the users to clarify their information needs. Asking clarifying questions is especially important in conversational systems since they can only return a limited number of (often only one) result(s). In this paper, we formulate the task of asking clarifying questions in open-domain information-seeking conversational systems. To this end, we propose an offline evaluation methodology for the task and collect a dataset, called Qulac, through crowdsourcing. Our dataset is built on top of the TREC Web Track 2009-2012 data and consists of over 10K question-answer pairs for 198 TREC topics with 762 facets. Our experiments on an oracle model demonstrate that asking only one good question leads to over 170% retrieval performance improvement in terms of P@1, which clearly demonstrates the potential impact of the task. We further propose a retrieval framework consisting of three components: question retrieval, question selection, and document retrieval. In particular, our question selection model takes into account the original query and previous question-answer interactions while selecting the next question. Our model significantly outperforms competitive baselines. To foster research in this area, we have made Qulac publicly available.
Conference Paper
Advances in artificial intelligence (AI) frame opportunities and challenges for user interface design. Principles for human-AI interaction have been discussed in the human-computer interaction community for over two decades, but more study and innovation are needed in light of advances in AI and the growing uses of AI technologies in human-facing applications. We propose 18 generally applicable design guidelines for human-AI interaction. These guidelines are validated through multiple rounds of evaluation including a user study with 49 design practitioners who tested the guidelines against 20 popular AI-infused products. The results verify the relevance of the guidelines over a spectrum of interaction scenarios and reveal gaps in our knowledge, highlighting opportunities for further research. Based on the evaluations, we believe the set of design guidelines can serve as a resource to practitioners working on the design of applications and features that harness AI technologies, and to researchers interested in the further development of human-AI interaction design principles.
Article
The BLEU metric has been widely used in NLP for over 15 years to evaluate NLP systems, especially in machine translation and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique—in other words, whether BLEU scores correlate with real-world utility and user-satisfaction of NLP systems; this review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (which is what it was originally proposed for), but does not support using BLEU outside of MT, for evaluation of individual texts, or for scientific hypothesis testing.
Conference Paper
Predicting user satisfaction in conversational systems has become critical, as spoken conversational assistants operate in increasingly complex domains. Online satisfaction prediction (i.e., predicting satisfaction of the user with the system after each turn) could be used as a new proxy for implicit user feedback, and offers promising opportunities to create more responsive and effective conversational agents, which adapt to the user's engagement with the agent. To accomplish this goal, we propose a conversational satisfaction prediction model specifically designed for open-domain spoken conversational agents, called ConvSAT. To operate robustly across domains, ConvSAT aggregates multiple representations of the conversation, namely the conversation history, utterance and response content, and system- and user-oriented behavioral signals. We first calibrate ConvSAT performance against state-of-the-art methods on a standard dataset (Dialogue Breakdown Detection Challenge) in an online regime, and then evaluate ConvSAT on a large dataset of conversations with real users, collected as part of the Alexa Prize competition. Our experimental results show that ConvSAT significantly improves satisfaction prediction in both offline and online settings on both datasets, compared to the previously reported state-of-the-art approaches. The insights from our study can enable more intelligent conversational systems, which could adapt in real-time to the inferred user satisfaction and engagement.
Conference Paper
This study evaluates 30 IR evaluation measures or their instances, of which nine are for adhoc IR and 21 are for diversified IR, primarily from the viewpoint of whether their preferences of one SERP (search engine result page) over another actually align with users' preferences. The gold preferences were constructed by hiring 15 assessors, who independently examined 1,127 SERP pairs and made preference assessments. Two sets of preference assessments were obtained: one based on a relevance question "Which SERP is more relevant?" and the other based on a diversity question "Which SERP is likely to satisfy a higher number of users?" To our knowledge, our study is the first to have collected diversity preference assessments in this way and evaluated diversity measures successfully. Our main results are that (a) popular adhoc IR measures such as nDCG actually align quite well with the gold relevance preferences; and that (b) while the ♯-measures align well with the gold diversity preferences, intent-aware measures perform relatively poorly. Moreover, as by-products of our analysis of existing evaluation measures, we define new adhoc measures called iRBU (intentwise Rank-Biased Utility) and EBR (Expected Blended Ratio); we demonstrate that an instance of iRBU performs as well as nDCG when compared to the gold relevance preferences. On the other hand, the original RBU, a recently-proposed diversity measure, underperforms the best ♯-measures when compared to the gold diversity preferences.