Search Clarification Selection via
Query-Intent-Clarification Graph Attention
Chang Gao and Wai Lam
The Chinese University of Hong Kong, Shatin, Hong Kong
{gaochang,wlam}@se.cuhk.edu.hk
Abstract. Proactively asking clarifications in response to search queries
is a useful technique for revealing the intent of the query. Search clarifi-
cation is important for both web and conversational search. This paper
focuses on the clarification selection task. Inspired by the fact that a
good clarification should clarify the query’s different intents, we propose
a graph attention-based clarification selection model that can exploit
the relations among a given query, its intents, and its clarifications via
constructing a query-intent-clarification attention graph. The compar-
ison with competitive baselines on large-scale search clarification data
demonstrates the effectiveness of our model.
Keywords: Search Clarification, Clarification Selection, Conversational
Search
1 Introduction
Search queries are often short, and users’ information needs are complex. This
makes it challenging for search engines to predict potential user intents and
give satisfactory retrieval results. As a consequence, users may need to browse
multiple result pages or reformulate their queries. Alternatively, search engines
can proactively ask clarifications to the user instead of just giving “ten blue
links” [2,4]. Fig. 1 shows an example of clarifications in the Bing search engine.
Each clarification consists of a clarifying question and a set of candidate answers
in response to the query. Users can click on one of the answers to indicate their
intents. Zamani et al. [25] show that users enjoy seeing clarifications due to their
functional and emotional benefits. Aliannejadi et al. [2] show that asking only
one good question can improve the retrieval performance significantly. Moreover,
search clarification has been recognized as a critical component of conversational
search systems [3, 7, 15].
Although there has been significant progress in exploring search clarification
[2, 8, 13, 19, 25, 27, 28], clarification selection is underexplored. In this paper, we
focus on the clarification selection task, which is a fundamental task in search
The work described in this paper is substantially supported by a grant from the Re-
search Grant Council of the Hong Kong Special Administrative Region, China (Project
Code: 14200620).
[Figure: the query "hair coloring" with three clarifications. Each clarification pairs a clarifying question (e.g., "Select one to refine your search", "What color do you like?") with candidate answers a1 to a7, such as "for men", "for women", "blonde", "grey", "red", "white", and "silver", and each clarification has an associated engagement level.]
Fig. 1: A query and its three clarifications. Each clarification is associated with
an engagement level which is an integer between 0 and 10 based on click-through
rates. Higher click-through rates correspond to higher engagement levels.
clarification because: (1) search engines can generate multiple clarifications and
then select one of them [2], and (2) search engines can generate new clarifica-
tions based on existing clarifications through some operations such as adding or
deleting answers, and judge whether the new clarifications are better. The aim
of clarification selection is to select the clarification with the highest engagement
level for each query.
A good clarification should clarify different intents of the query [27]. There
are two challenges for clarification selection: (1) how to estimate the query’s
intents; (2) how to utilize the query’s intents. To address the first challenge,
we observe that the candidate answers in the query’s clarifications can capture
some reasonable intents. For example, as shown in Fig. 1, the answers a1 to a7
can reflect the intents of "hair coloring" to a certain extent. To overcome the
second challenge, we propose a Graph Attention-based Clarification Selection
(GACS) model, which constructs a query-intent-clarification (QIC) attention
graph to exploit the relations among a given query, its intents, and its clarifi-
cations. Afterwards, it transforms the graph into a sequence, inputs it to the
Transformer [22], and outputs a score measuring how well the clarification reflects
the query's intents.
We design several different graphs and evaluate the model on two search
clarification datasets. Experimental results show that by properly designing the
graph structure, GACS can outperform competitive baselines, and the intent
can improve the model’s performance, especially in scenarios where there are
negative clarifications.
2 Related Work
2.1 Conversational Search
The conversational search paradigm aims to satisfy information needs within
a conversational format [3]. A key property of conversational search systems
that is different from traditional search systems is the mixed-initiative interac-
tion, which has the potential to increase user engagement and user satisfaction.
Radlinski and Craswell [15] propose a theoretical framework for conversational
search. They define a conversational search system as a system for retrieving
information that permits a mixed-initiative back and forth between a user and
agent, where the agent’s actions are chosen in response to a model of current
user needs within the current conversation, using both short-term and long-term
knowledge of the user. Qu et al. [14] introduce the MSDialog dataset and analyze
user intent distribution, co-occurrence, and flow patterns in information-seeking
conversations. In the Dagstuhl seminar report [3], a conversational search sys-
tem is defined as either an interactive information retrieval system with speech
and language processing capabilities, a retrieval-based chatbot with user task
modeling, or an information-seeking dialogue system with information retrieval
capabilities. Rosset et al. [17] study conversational question suggestion, which
aims to proactively engage the user by suggesting interesting, informative, and
useful follow-up questions. Ren et al. [16] develop a pipeline for conversational
search consisting of six sub-tasks: intent detection, keyphrase extraction, action
prediction, query selection, passage selection, and response generation.
In this paper, we focus on search clarification selection, which differs from
most existing work in conversational search. Therefore, we review the related
work on asking clarifications in the next section.
2.2 Asking Clarifications
Asking clarifications is important in conversational systems since they can only
return a limited number of results [2]. Kiesel et al. [9] focus on ambiguous voice
queries and conduct a user study for a better understanding of voice query
clarifications and their impact on user satisfaction. Aliannejadi et al. [2] propose a
workflow for asking clarifying questions in an open-domain conversational search
system. Moreover, they build a dataset called Qulac based on the TREC Web
Track 2009-2012 collections and develop an offline evaluation protocol. Hashemi
et al. [8] propose a Guided Transformer model that can use multiple information
sources for document retrieval and next clarifying question selection. Krasakis et
al. [10] investigate how different aspects of clarifying questions and user answers
affect the quality of document ranking. Previous work ignores the possibility
that conversational search systems may generate off-topic clarifying questions
that may reduce user satisfaction. Therefore, Wang and Ai [23] propose a risk-
aware model to balance the risk of answering user queries and asking clarifying
questions. Their system has a risk decision module which can decide whether
the system should ask the question or show the documents directly. Tavakoli
et al. [21] investigate the characteristics of useful clarifying questions and show
that useful clarifying questions are answered by the asker, have an informative
answer, and are valuable for the post and the accepted answer. Sekulić et al. [18]
propose a facet-driven approach for generating clarifying questions.
For the clarification in web search, Zamani et al. [25] develop a taxonomy of
clarification based on large-scale query reformulation data sampled from Bing
[Figure: the query "hair coloring", its intents (e.g., "for men", "for women", "blonde", "grey", "red"), and a clarification are organized into a QIC attention graph; the embedding layer and attention layer convert the graph into embeddings and an attention mask matrix, which are processed by the Mask-Transformer encoder, mean pooling, and a two-layer FNN to produce a score.]
Fig. 2: Overall architecture of GACS.
search logs. Then they propose a rule-based model, a weakly supervised model
trained using maximum likelihood, and a reinforcement learning model which
maximizes clarification utility for generating clarifications. Later, Zamani et
al. [27] analyze user engagements based on different attributes of the clarifi-
cation and propose a model called RLC for clarification selection. RLC encodes
each query-answer-intent triple separately and uses Transformer [22] to capture
the interaction between these representations. Compared with RLC, our GACS
model exploits the relations among the query, intent, and clarification by constructing
a QIC attention graph and is more efficient. To promote research in
search clarification, Zamani et al. [26] further introduce a large-scale search clar-
ification dataset called MIMICS, which can be used for training and evaluating
various tasks such as generating clarifications, user engagement prediction for
clarifications, and clarification selection. In a follow-up study [19], ELBERT is
proposed for user engagement prediction. It treats the user engagement predic-
tion task as supervised regression and jointly encodes the query, the clarification,
and the SERP elements for predicting the engagement levels. By comparison,
we use the intent and its interaction with the query and clarification and focus
on clarification selection.
3 Our Proposed Framework
3.1 Model Architecture
Fig. 2 depicts the overall structure of GACS. There are four main compo-
nents, namely, QIC attention graph, embedding layer, attention layer, and Mask-
Transformer. The graph is fed into the embedding layer and the attention layer to
Fig. 3: Illustration of QIC attention graph G1. In the QIC attention graph, a
directed edge from A to B means that A can attend to B when updating its
representation. All elements can attend to themselves.
obtain its embeddings and attention mask matrix, respectively. The aim of these
two layers is to transform the graph into a sequence while retaining its structural
information. The Mask-Transformer takes the embeddings and attention mask
matrix as input and uses the mean of output embeddings as the graph’s repre-
sentation, which will be fed into a two-layer feedforward neural network (FNN)
to compute the score.
QIC Attention Graph There are three types of nodes in the QIC attention
graph, i.e., query, intent, and clarification. We consider the following four differ-
ent graphs:
G1: When updating the representation, the query can attend to itself and its
intents, but not the clarification. Thus, the query can focus on its own represen-
tation that is independent of a specific clarification. The clarification only needs
to attend to itself and the query because the query can absorb its intents’ infor-
mation. For each intent, it can attend to itself, the query, and the clarification,
but not other intents. In this way, it can associate itself with the query and de-
termine whether it is reflected in the clarification. Fig. 3 provides an illustration
of G1.
G2: Based on G1, G2 considers the fact that each clarification usually only
covers some intents. Therefore, G2 adds edges from the clarification to multiple
intents, which allows GACS to model this fact explicitly.
G3: G3 is a fully connected graph. It does not impose any special restrictions
on the relations among the query, intent, and clarification and has the strongest
expressive power.
G4: Unlike the previous three graphs, G4 does not contain the intent, i.e., only
the query and clarification are available in this graph. The query and clarification
can attend to each other. G4 is a special case of the QIC attention graph, where
intents are masked. The purpose of using this graph is to explore whether the
intent is useful or not.
Generally, the more complex the graph is, the stronger its expressiveness
is, but the model is more difficult to train. In addition, by designing the graph
structure, we can introduce task-related prior knowledge to the model. Therefore,
different graphs may be suitable for different situations.
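The node-level connectivity described above can be made concrete in a few lines. The following sketch (a hypothetical helper, not code from the paper) enumerates the directed edge sets of the four graphs over node labels "q", "i0".."i{n-1}", and "c"; self-loops are left implicit since every node attends to itself:

```python
def qic_edges(num_intents, variant):
    """Directed edge sets of the four QIC attention graph variants.

    Nodes: "q" (query), "i0".."i{n-1}" (intents), "c" (clarification).
    Self-loops are implicit: every node attends to itself.
    """
    intents = [f"i{k}" for k in range(num_intents)]
    if variant == "G4":  # no intent nodes; query and clarification only
        return {("q", "c"), ("c", "q")}
    edges = set()
    for i in intents:
        edges |= {("q", i),            # query attends to its intents
                  (i, "q"), (i, "c")}  # each intent sees query + clarification
    edges.add(("c", "q"))              # clarification attends to the query
    if variant in ("G2", "G3"):        # G2 adds clarification -> intent edges
        edges |= {("c", i) for i in intents}
    if variant == "G3":                # G3 is fully connected
        edges.add(("q", "c"))
        for i in intents:
            for j in intents:
                if i != j:
                    edges.add((i, j))
    return edges
```

Note how G1 keeps the query independent of any specific clarification (no ("q", "c") edge), while G3 removes all restrictions.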
Fig. 4: Illustration of the embedding layer of GACS. (1) The tokens in the QIC
attention graph are flattened into a sequence by their hard-position indices. (2)
The soft-position embedding is used as the position embedding. (3) For the type
embedding, types "0", "1", and "2" represent the query, intent, and clarification,
respectively.
Embedding Layer The function of the embedding layer is to convert the QIC
attention graph into embeddings that can be fed into the Mask-Transformer.
As shown in Fig. 4, the input embedding is the sum of the token embedding,
soft-position embedding, and type embedding. The token embedding is consis-
tent with Google BERT [6]. Inspired by [11], we use soft-position embeddings
instead of hard-position embeddings because the soft position can reflect the
graph structure. As in [20], we use type embeddings to indicate the three types
of nodes in the graph. Precisely, types "0", "1", and "2" represent the query,
intent, and clarification, respectively.
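As a rough illustration of the flattening step, the sketch below (hypothetical function name; it assumes K-BERT-style soft positions in which each intent branch and the clarification continue from the end of the query, so parallel branches share position indices) produces the token, soft-position, and type-id sequences:

```python
def flatten_qic(query, intents, clarification):
    """Flatten a QIC attention graph into parallel sequences.

    Returns tokens, soft-position indices, and type ids
    (0 = query, 1 = intent, 2 = clarification). Each intent branch
    and the clarification restart their soft positions right after
    the query, so sibling branches overlap in position space.
    """
    tokens, soft_pos, type_ids = [], [], []
    # Query tokens occupy soft positions 0 .. len(query)-1.
    for i, tok in enumerate(query):
        tokens.append(tok); soft_pos.append(i); type_ids.append(0)
    q_end = len(query)
    # Each intent is a parallel branch attached to the query.
    for intent in intents:
        for i, tok in enumerate(intent):
            tokens.append(tok); soft_pos.append(q_end + i); type_ids.append(1)
    # The clarification also attaches to the query.
    for i, tok in enumerate(clarification):
        tokens.append(tok); soft_pos.append(q_end + i); type_ids.append(2)
    return tokens, soft_pos, type_ids
```

The hard-position index of each token is simply its index in the flattened `tokens` list.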
Attention Layer The attention layer preserves the structure information of
the QIC attention graph via constructing an attention mask matrix. Different
graphs have different attention mask matrices. The attention mask matrix of G1
is shown in Fig. 5. Given the embeddings of the tokens X ∈ R^(n×d), where n is
the number of tokens and d is the dimension of the embeddings, the attention mask
matrix M ∈ R^(n×n) is defined as

M_ij = 0 if x_i ~ x_j, and M_ij = −∞ otherwise, (1)

where x_i ~ x_j means that x_i and x_j are in the same graph node or there is a
directed edge from the node containing x_i to the node containing x_j; i and j are
hard-position indices.
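Eq. (1) can be realized directly from token-level node spans. The following NumPy sketch (hypothetical helper; it uses a large negative constant in place of −∞, a common practical substitution) builds M from a span table and a set of directed node edges:

```python
import numpy as np

NEG_INF = -1e9  # practical stand-in for -infinity in the mask

def build_attention_mask(node_spans, edges):
    """Build the attention mask matrix M of Eq. (1).

    node_spans: dict node -> (start, end) hard-position range (end exclusive)
    edges: set of directed (src, dst) node pairs; self-attention is implicit.
    M[i, j] = 0 if token i may attend to token j, else NEG_INF.
    """
    n = max(end for _, end in node_spans.values())
    M = np.full((n, n), NEG_INF)
    allowed = set(edges) | {(v, v) for v in node_spans}  # nodes see themselves
    for src, dst in allowed:
        s0, s1 = node_spans[src]
        d0, d1 = node_spans[dst]
        M[s0:s1, d0:d1] = 0.0  # all tokens of src attend to all tokens of dst
    return M
```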
Mask-Transformer The Mask-Transformer follows the Transformer [22] encoder
structure. Given the token embeddings X and the attention mask matrix
Fig. 5: Attention mask matrix of G1. A pink cell at row i, column j means
that token i can attend to token j.
M, the Mask-Transformer uses masked self-attention as shown below:

Q, K, V = X W_Q, X W_K, X W_V (2)

Attn(Q, K, V) = Softmax(Q K^T / √d_k + M) V (3)

where W_Q, W_K, W_V ∈ R^(d×d_k) are learnable parameters. The Mask-Transformer
can control the information flow according to the attention mask matrix.
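A minimal single-head NumPy sketch of Eqs. (2)–(3) follows (hypothetical function name; GACS itself uses multi-head Transformer layers initialized from BERT). Disallowed positions receive a large negative mask value, so their softmax weight vanishes:

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv, M):
    """Single-head masked self-attention per Eqs. (2)-(3).

    X: (n, d) token embeddings; Wq/Wk/Wv: (d, dk) projections;
    M: (n, n) additive mask, 0 for allowed pairs and a large
    negative value for disallowed pairs.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # Eq. (2)
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk) + M        # masked scaled dot-product
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                        # Eq. (3)
```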
3.2 Loss Function
We use a loss function similar to Attention Rank [1]. For each query q and its
k clarifications c1 to ck, we first compute the best attention allocation a^e and
the attention allocation a^s from the ranking scores s:

a^e = Softmax(e), a^s = Softmax(s) (4)

where e and s are k-dimensional vectors, e_i is the engagement level of c_i, and
s_i is the score of c_i. Then we use the cross entropy between a^e and a^s as the loss:

L = − Σ_{i=1}^{k} ( a^e_i log(a^s_i) + (1 − a^e_i) log(1 − a^s_i) ) (5)

The loss function is a listwise function. It does not predict the engagement level
of each clarification but focuses on the relative importance of each element in
the ranked list.
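For one query, the listwise loss of Eqs. (4)–(5) can be sketched in NumPy as follows (hypothetical helper names; the small epsilon for numerical safety is our addition, not specified in the paper):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())  # shift for numerical stability
    return z / z.sum()

def attention_rank_loss(engagements, scores):
    """Listwise loss of Eqs. (4)-(5) for one query's k clarifications.

    engagements: engagement levels e (length k)
    scores: predicted ranking scores s (length k)
    """
    a_e = softmax(np.asarray(engagements, dtype=float))  # best allocation
    a_s = softmax(np.asarray(scores, dtype=float))       # predicted allocation
    eps = 1e-12  # guards the logs against exact zeros
    return -np.sum(a_e * np.log(a_s + eps)
                   + (1 - a_e) * np.log(1 - a_s + eps))
```

Scores that rank clarifications in the same order as their engagement levels yield a lower loss than scores that reverse the order.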
4 Experiments
4.1 Dataset
We evaluate the model on MIMICS [26], a large-scale dataset collected from the
Bing search engine for search clarification. In MIMICS, the engagement level is
an integer between 0 and 10 based on click-through rates. The engagement level
0 means that there is no click on the clarification.
MIMICS consists of three subdatasets: MIMICS-Click, MIMICS-ClickExplore
and MIMICS-Manual. In this work, we mainly focus on MIMICS-ClickExplore
because each query has multiple clarifications in this dataset, but not in the
other two datasets. Thus, it can be directly used for evaluating clarification
selection models. However, some queries in MIMICS-ClickExplore have several
identical clarifications, but their engagement levels are different. We delete such
inconsistent data as follows:
– For each query q and its clarifications c1 to cn, if ci and cj (1 ≤ i, j ≤ n, i ≠ j)
are identical but their engagement levels are different, we delete both of
them;
– Afterwards, if the number of clarifications of q is less than 2 or the engage-
ment levels of all its clarifications are 0, we delete the query q.
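The two filtering rules above can be sketched as follows (hypothetical helper; we assume the data is grouped as query → list of (clarification, engagement level) pairs):

```python
from collections import defaultdict

def clean_queries(data):
    """Apply the two filtering rules to MIMICS-ClickExplore-style data.

    data: dict query -> list of (clarification_text, engagement_level).
    Rule 1 drops clarifications whose identical duplicates disagree on
    engagement level; rule 2 then drops queries left with fewer than 2
    clarifications or with all-zero engagement levels.
    """
    cleaned = {}
    for query, clars in data.items():
        levels_by_text = defaultdict(set)
        for text, level in clars:
            levels_by_text[text].add(level)
        # Rule 1: keep only clarifications whose duplicates all agree.
        kept = [(t, l) for t, l in clars if len(levels_by_text[t]) == 1]
        # Rule 2: the query survives only with >= 2 clarifications and
        # at least one non-zero engagement level.
        if len(kept) >= 2 and any(l > 0 for _, l in kept):
            cleaned[query] = kept
    return cleaned
```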
Finally, 2.44% of the queries in MIMICS-ClickExplore are deleted. We call the
processed dataset D1. There are 62,446 queries in D1, and each query is associated
with 2.57 clarifications on average. We divide D1 into training, development, and
test sets with a ratio of 8:1:1 based on the query.
Since the average number of clarifications per query in D1 is small, we construct a
dataset D2 based on D1. For each query in D1, we randomly sample 10 negative
clarifications (i.e., clarifications from other queries) from the set to which it
belongs. This ensures that the training, development, and test sets of D2 have no
intersection. Note that there are negative clarifications when testing the model
on D2, which is different from testing the model on D1. We set the engagement
levels of all negative clarifications to -1. Intuitively, it does not make sense to set
the click-based engagement level to -1. However, our purpose is to distinguish
between positive and negative clarifications. Because the engagement level of a
positive clarification may be 0, it is inappropriate to set the engagement level of
a negative clarification to 0.
4.2 Baselines
We use the following four baselines:
– Random. For each query, it selects one from the query’s clarifications ran-
domly.
– RankNet [13]. We use the BERT embedding of the query and clarification,
the number of characters in the query and question, the number of answers,
and the average number of characters in the answers as features and minimize
the loss function in Eq. 5.
Table 1: Experimental results on D1. Best results are in bold. Superscripts 1–4
indicate statistically significant improvements over Random, RankNet, RLC, and
ELBERT, respectively.

          Hits@1      MRR         nDCG@1      nDCG@2
Random    0.423       0.683       0.494       0.721
RankNet   0.460^1     0.704^1     0.523^1     0.737^1
RLC       0.453^1     0.701^1     0.527^1     0.742^1
ELBERT    0.480^123   0.716^123   0.547^123   0.751^123
GACS-G1   0.499^1234  0.726^1234  0.563^1234  0.759^1234
GACS-G2   0.485^123   0.719^123   0.555^123   0.757^1234
GACS-G3   0.485^123   0.719^123   0.552^123   0.753^123
GACS-G4   0.493^1234  0.724^1234  0.560^1234  0.759^1234
– RLC [27]. RLC is composed of an intent coverage encoder and an answer
consistency encoder. Because the answer consistency encoder requires the
answer entity type data which is not available in MIMICS, we implement
the intent coverage encoder of RLC as one baseline. In the original paper,
they use the query reformulation data and the click data to estimate the
intent. However, the two kinds of data are also not available in MIMICS.
Therefore, we use the candidate answers to estimate the query’s intents as
in GACS.
– ELBERT [19]. ELBERT shows state-of-the-art performance on the user
engagement prediction task. Since we can rank the clarifications according
to their predicted engagement levels and select the best one, we implement
ELBERT as one baseline. For a fair comparison, we use BERT to encode
the query and clarification.
4.3 Evaluation Metrics
For the query q, we denote its clarification with the highest engagement level as
c_best. We use the following three evaluation metrics:
– Hits@1: the percentage of test queries with c_best ranking first.
– MRR: the mean reciprocal rank of c_best over all the test queries.
– nDCG@p: the normalized discounted cumulative gain, which is computed
as:

nDCG@p = ( Σ_{i=1}^{p} e_i / log(i+1) ) / ( Σ_{i=1}^{|E|} e_i / log(i+1) ) (6)

where e_i is the engagement level of the i-th clarification in the ranked list and
E is the clarification list ordered by engagement level up to position p.
Higher Hits@1, MRR, or nDCG@p indicates better performance.
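Under the assumption that the ideal ordering sorts the engagement levels in decreasing order, Eq. (6) can be sketched as follows (hypothetical helper name; note that with the negative engagement levels used for D2, values can fall outside [0, 1], which matches the negative nDCG entries for Random in Table 2):

```python
import numpy as np

def ndcg_at_p(ranked_engagements, p):
    """nDCG@p over click-based engagement levels, per Eq. (6).

    ranked_engagements: engagement levels in the order the model
    ranked the clarifications; the ideal ranking sorts them in
    decreasing order of engagement.
    """
    e = np.asarray(ranked_engagements, dtype=float)
    discounts = np.log(np.arange(1, len(e) + 1) + 1)  # log(i + 1), i from 1
    dcg = np.sum(e[:p] / discounts[:p])
    ideal = np.sort(e)[::-1]                          # best possible ordering
    idcg = np.sum(ideal[:p] / discounts[:p])
    return dcg / idcg if idcg > 0 else 0.0
```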
Table 2: Experimental results on D2. Best results are in bold. Superscripts 1–4
indicate statistically significant improvements over Random, RankNet, RLC, and
ELBERT, respectively.

          Hits@1      MRR         nDCG@1      nDCG@2
Random    0.075       0.245       -0.119      -0.177
RankNet   0.441^1     0.673^1     0.488^1     0.654^1
RLC       0.445^1     0.694^12    0.516^12    0.731^12
ELBERT    0.465^123   0.703^123   0.518^12    0.717^12
GACS-G1   0.485^1234  0.718^1234  0.553^1234  0.752^1234
GACS-G2   0.489^1234  0.722^1234  0.556^1234  0.758^1234
GACS-G3   0.489^1234  0.721^1234  0.556^1234  0.754^1234
GACS-G4   0.466^123   0.703^123   0.520^12    0.714^12
Table 3: Cohen's d and its 95% confidence interval (CI), indicating the stan-
dardized difference between the performance of GACS-G1 and ELBERT on D1
and D2.

                Hits@1        MRR           nDCG@1        nDCG@2
D1  Cohen's d   0.04          0.04          0.03          0.03
    95% CI      [0.01, 0.07]  [0.01, 0.06]  [0.01, 0.06]  [0.01, 0.05]
D2  Cohen's d   0.04          0.05          0.07          0.11
    95% CI      [0.01, 0.07]  [0.03, 0.08]  [0.04, 0.10]  [0.08, 0.13]
4.4 Implementation Details
The Transformer encoders of GACS and the baselines are initialized with BERT-BASE
[6]. We use the implementation from HuggingFace's Transformers [24]. For training
these models, we use the AdamW [12] optimizer with an initial learning rate of
10^-5 and a linear learning rate decay scheduler. We fine-tune for 5 epochs and
choose the best hyperparameters according to MRR on the development set.
4.5 Experimental Results
Table 1 and Table 2 report the experimental results of our GACS using different
graphs and the baselines on D1 and D2, respectively. Statistical significance
is tested using the paired Student's t-test with p < 0.05. Due to the multiple
comparisons problem [5], we use the Benjamini–Hochberg procedure to adjust
p-values to control the false discovery rate. Moreover, we report the effect size
Cohen's d and its 95% confidence interval to compare the performance of GACS-G1
and the best baseline ELBERT, as shown in Table 3. We have the following observations:
– All other models perform significantly better than the Random model, which
shows that the engagement level is a reasonable proxy for the usefulness of
the clarification.
Table 4: Performance of GACS-G3 with different engagement levels of negative
clarifications on D2.

         0      -1     -2     -3     -4
Hits@1   0.477  0.489  0.489  0.483  0.483
MRR      0.715  0.721  0.720  0.717  0.718
Table 5: Experimental results of GACS trained on D2 and evaluated on D1.

          Hits@1  MRR    nDCG@1  nDCG@2
GACS-G1   0.486   0.720  0.555   0.755
GACS-G2   0.489   0.722  0.556   0.760
GACS-G3   0.489   0.721  0.556   0.756
GACS-G4   0.485   0.720  0.552   0.755
– Our framework generally outperforms all the baselines, which is statistically
significant and demonstrates the effectiveness of our framework.
– As shown in Table 1, when trained with only positive clarifications, GACS
using G2 or G3 performs worse than GACS using G1 or G4. Although G2
and G3 are reasonable and have stronger expressive power, they increase the
model's complexity and are more difficult to train, especially considering
that there are only 2.57 clarifications per query in D1. Moreover, GACS-G1
performs better than GACS-G4, showing that the intent can improve the
model's performance but requires that the graph structure is reasonable and
suitable for the dataset.
– As shown in Table 2, when introducing negative clarifications, GACS using
G2 or G3 can outperform GACS using G1 or G4, which is different from
the model's performance on D1. This is because negative clarifications can
bring some new information and help train the model. In addition, GACS
models that use the intent perform clearly better than those that do not,
indicating that the intent can help the model distinguish between positive
and negative clarifications.
– Combining the observations from Table 1 and Table 2, we can see that differ-
ent graphs are suitable for different scenarios. We expect that as the average
number of clarifications per query increases, the benefits of the intent will
become more obvious.
Effect of Negative Clarifications According to Eq. 4, the smaller the engagement
level of the negative clarifications, the lower their importance. Table 4
reports the performance of GACS-G3 with different engagement levels of negative
clarifications on D2. We can see that the model performs best when the
engagement level is -1 or -2. Moreover, setting the engagement level to 0
performs worse than setting it to a negative value.
Table 6: Experimental results of GACS on D1.

          Hits@1  MRR    nDCG@1  nDCG@2
GACS-G1   0.499   0.726  0.563   0.759
 w/o st   0.486   0.719  0.553   0.756
GACS-G2   0.485   0.719  0.555   0.757
 w/o st   0.481   0.717  0.546   0.752
GACS-G3   0.485   0.719  0.552   0.753
 w/o st   0.484   0.718  0.549   0.752
GACS-G4   0.493   0.724  0.560   0.759
 w/o st   0.486   0.719  0.554   0.756
To further investigate the effect of negative clarifications, we evaluate the
GACS models trained on D2 on D1. The results are shown in Table 5. First, we
can see that models trained on D2 have very similar performance on D1 and D2,
indicating that it is much easier to distinguish between positive and negative
clarifications than to rank the positive clarifications. Thus, although negative
clarifications can help train the model, the benefits they provide are limited
compared with positive clarifications. Second, after adding negative clarifications
for training, the performance of GACS using G1 and G4 does not get better, but
slightly worse. This is because the graphs they use are relatively simple, and
negative clarifications may also bring some incorrect information because they
are treated equally.
Effect of Soft-Position Embedding and Type Embedding In Table 6,
"w/o st" refers to using the hard-position embedding instead of the soft-position
embedding and no type embedding to distinguish between the three types of
nodes in the QIC attention graph. Experimental results show that removing
them reduces the performance of GACS with different graphs. This indicates
that the soft-position embedding and the type embedding are important for
preserving the structure information in the QIC attention graph, which is crucial,
especially for relatively simple graphs.
5 Conclusion
This paper proposes a graph attention-based model GACS for clarification se-
lection. It can effectively exploit the relations among the query, intent, and clar-
ification by constructing the QIC attention graph and outperform competitive
baselines. The graph structure information is critical to the model’s performance.
Moreover, we show that negative clarifications can help train GACS using com-
plex graphs but their benefits are limited compared with positive clarifications.
A better estimation of the intent may further improve the model’s performance.
In the future, we will explore how to better estimate the intent.
References
1. Ai, Q., Bi, K., Guo, J., Croft, W. B.: Learning a deep listwise context model for
ranking refinement. In: The 41st International ACM SIGIR Conference on Research
and Development in Information Retrieval, pp. 135-144 (2018)
2. Aliannejadi, M., Zamani, H., Crestani, F., Croft, W.B.: Asking clarifying questions
in open-domain information-seeking conversations. In: Proceedings of the 42nd In-
ternational ACM SIGIR Conference on Research and Development in Information
Retrieval, pp. 475–484 (2019)
3. Anand, A., Cavedon, L., Joho, H., Sanderson, M., Stein, B.: Conversational search
(Dagstuhl Seminar 19461). In: Dagstuhl Reports. vol. 9. Schloss Dagstuhl-Leibniz-
Zentrum für Informatik (2020)
4. Braslavski, P., Savenkov, D., Agichtein, E., Dubatovka, A.: What do you mean
exactly? Analyzing clarification questions in CQA. In: Proceedings of the 2017
Conference on Human Information Interaction and Retrieval, pp. 345–348 (2017)
5. Carterette, Benjamin A.: Multiple testing in statistical analysis of systems-based
information retrieval experiments. ACM Trans. Inf. Syst. 30(1), 1-34 (2012)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep
bidirectional transformers for language understanding. In: Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota
(2019).
7. Gao, J., Xiong, C., Bennett, P.: Recent Advances in Conversational Information
Retrieval. In: Proceedings of the 43rd International ACM SIGIR Conference on
Research and Development in Information Retrieval, pp. 2421-2424 (2020)
8. Hashemi, H., Zamani, H., Croft, W.B.: Guided transformer: Leveraging multiple
external sources for representation learning in conversational search. In: Proceed-
ings of the 43rd International ACM SIGIR Conference on Research and Develop-
ment in Information Retrieval, pp. 1131–1140 (2020)
9. Kiesel, J., Bahrami, A., Stein, B., Anand, A., Hagen, M.: Toward voice query
clarification. In: The 41st International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 1257–1260 (2018)
10. Krasakis, A.M., Aliannejadi, M., Voskarides, N., Kanoulas, E.: Analysing the effect
of clarifying questions on document ranking in conversational search. In: Proceed-
ings of the 2020 ACM SIGIR on International Conference on Theory of Information
Retrieval, pp. 129–132 (2020)
11. Liu, W., Zhou, P., Zhao, Z., Wang, Z., Ju, Q., Deng, H., Wang, P.: K-bert: En-
abling language representation with knowledge graph. In: Proceedings of the AAAI
Conference on Artificial Intelligence, pp. 2901-2908 (2020)
12. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International
Conference on Learning Representations (2017)
13. Lotze, T., Klut, S., Aliannejadi, M., Kanoulas, E.: Ranking Clarifying Questions
Based on Predicted User Engagement. arXiv preprint arXiv:2103.06192. (2021)
14. Qu, C., Yang, L., Croft, W. B., Trippas, J. R., Zhang, Y., Qiu, M.: Analyzing and
characterizing user intent in information-seeking conversations. In: The 41st In-
ternational ACM SIGIR Conference on Research and Development in Information
Retrieval, pp. 989-992 (2018)
15. Radlinski, F., Craswell, N.: A theoretical framework for conversational search.
In: Proceedings of the 2017 Conference on Human Information Interaction and
Retrieval, pp. 117–126 (2017)
16. Ren, P., Liu, Z., Song, X., Tian, H., Chen, Z., Ren, Z., de Rijke, M.: Wizard of
Search Engine: Access to Information Through Conversations with Search Engines.
In: Proceedings of the 44th International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 533–543 (2021)
17. Rosset, C., Xiong, C., Song, X., Campos, D., Craswell, N., Tiwary, S., Bennett,
P.: Leading conversational search by suggesting useful questions. In: Proceedings
of The Web Conference 2020, pp. 1160-1170 (2020)
18. Sekulić, I., Aliannejadi, M., Crestani, F.: Towards Facet-Driven Generation of Clar-
ifying Questions for Conversational Search. In: Proceedings of the 2021 ACM SI-
GIR International Conference on Theory of Information Retrieval, pp. 167-175
(2021)
19. Sekulić, I., Aliannejadi, M., Crestani, F.: User Engagement Prediction for Clarifica-
tion in Search. In: Advances in Information Retrieval - 43rd European Conference
on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Proceed-
ings, Part I, Springer, Lecture Notes in Computer Science, vol. 12656, pp. 619–633
(2021)
20. Sun, T., Shao, Y., Qiu, X., Guo, Q., Hu, Y., Huang, X., Zhang, Z.: Colake: Con-
textualized language and knowledge embedding. In: Proceedings of the 28th Inter-
national Conference on Computational Linguistics, pp. 3660-3670 (2020)
21. Tavakoli, L., Zamani, H., Scholer, F., Croft, W.B., Sanderson, M.: Analyzing clar-
ification in asynchronous information-seeking conversations. Journal of the Asso-
ciation for Information Science and Technology (2021)
22. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Polo-
sukhin, I.: Attention is all you need. In: Advances in neural information processing
systems, pp. 5998-6008 (2017)
23. Wang, Z., Ai, Q.: Controlling the Risk of Conversational Search via Reinforcement
Learning. In: Proceedings of the Web Conference 2021, pp. 1968-1977 (2021)
24. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P.,
Rault, T., Louf, R., Funtowicz, M., et al.: Huggingface’s transformers: State-of-
the-art natural language processing. In: Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing: System Demonstrations, pp.
38-45 (2020)
25. Zamani, H., Dumais, S., Craswell, N., Bennett, P., Lueck, G.: Generating clarifying
questions for information retrieval. In: Proceedings of The Web Conference 2020,
pp. 418–428 (2020)
26. Zamani, H., Lueck, G., Chen, E., Quispe, R., Luu, F., Craswell, N.: Mimics: A large-
scale data collection for search clarification. In: Proceedings of the 29th ACM Inter-
national Conference on Information and Knowledge Management, pp. 3189–3196
(2020)
27. Zamani, H., Mitra, B., Chen, E., Lueck, G., Diaz, F., Bennett, P.N., Craswell, N.,
Dumais, S.T.: Analyzing and learning from user interactions for search clarification.
In: Proceedings of the 43rd International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 1181-1190 (2020)
28. Zou, J., Kanoulas, E., Liu, Y.: An Empirical Study on Clarifying Question-Based
Systems. In: Proceedings of the 29th ACM International Conference on Information
and Knowledge Management, pp. 2361-2364 (2020)