Identifying Ambiguous Queries in Web Search
Ruihua Song (1,2), Zhenxiao Luo (3), Ji-Rong Wen (2), Yong Yu (1), and Hsiao-Wuen Hon (2)
(1) Shanghai Jiao Tong University, Shanghai, China
(2) Microsoft Research Asia, Beijing, China
(3) Fudan University, Shanghai, China
Contact: rsong@microsoft.com
ABSTRACT
It is widely believed that some queries submitted to search
engines are by nature ambiguous (e.g., java, apple). However, few
studies have investigated the questions of “how many queries are
ambiguous?” and “how can we automatically identify an
ambiguous query?” This paper deals with these issues. First, we
construct a taxonomy of query ambiguity and ask human
annotators to manually classify queries based upon it. From
manually labeled results, we find that query ambiguity is to some
extent predictable. We then use a supervised learning approach to
automatically classify queries as being ambiguous or not.
Experimental results show that we can correctly identify 87% of
labeled queries. Finally, we estimate that about 16% of queries in
a real search log are ambiguous.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval – query formulation; H.5.2 [Information
Interfaces and Presentation]: User Interfaces – Natural
Language
General Terms
Experimentation, Languages, Human Factors
Keywords
Ambiguous query, query classification, broad topics, Web user
study
1. INTRODUCTION
Some technologies, such as personalized Web search and search
results clustering, aim to improve users’ satisfaction with
ambiguous queries from different perspectives. However, there has
been little study of ambiguous query identification. Questions
like “what percentage of queries are ambiguous?” and “can we
automatically determine whether a query is ambiguous?” are still
open. If we can estimate the percentage of ambiguous queries, we
would know how many queries could potentially be influenced by
ambiguity-oriented technologies. If we can further identify
ambiguous queries automatically, it becomes possible to apply
such technologies to a particular kind of query, instead of to all
queries. We try to answer these questions in this paper.
Identifying ambiguous queries is challenging for three reasons.
First, there is no acknowledged definition and taxonomy of query
ambiguity. Many terms relate to this concept, such as
“ambiguous query,” “semi-ambiguous query,” “clear query,”
“general term,” “broad topic,” and “diffuse topic,” and these
terms are confusing in our investigation. Second, it is uncertain
whether most queries can be associated with a particular type in
terms of ambiguity. Cronen-Townsend et al. [1] proposed using
the relative entropy between a query and the collection to
quantify query clarity, but the score is not easily aligned with
concepts in the human mind. Third, even if ambiguous queries can
be recognized manually, it is not realistic to label the thousands of
queries sampled from query logs. So how can we identify them in
an automatic way?
In this paper, we first construct a taxonomy of query ambiguity
from the literature. We then assess human agreement on query
classification through a user study. Based on the findings, we take
a supervised learning approach to automatically identify
ambiguous queries. Experimental results show that our approach
achieves 85% precision and 81% recall in identifying ambiguous
queries. Finally, we estimate that about 16% of queries in the
sampled search log are ambiguous.
2. TAXONOMY OF QUERIES
By surveying the literature, we summarize the following three
types of queries, ordered from ambiguous to specific.
Type A (Ambiguous Query): a query that has more than
one meaning;
e.g. “giant,” which may refer to “Giant Company Software
Inc.” (an internet security software developer), “Giant” (a film
produced in 1956), “Giant Bike” (a bicycle manufacturer), or
the “San Francisco Giants” (a National League baseball team).
Type B (Broad Query): a query that covers a variety of
subtopics, where a user might look for just one of the subtopics
and would issue another query to reach it;
e.g. “songs,” which covers subtopics such as “song
lyrics,” “love songs,” “party songs,” and “download songs.” In
practice, a user often issues such a query first, and then narrows
down to a subtopic.
Type C (Clear Query): a query that has a specific meaning
and covers a narrow topic;
e.g. “University of Chicago” and “Billie Holiday.” A clear
query usually leads to a successful search, in which the user can
find several high-quality results on the first results
page.
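For concreteness, the three types can be written down as a simple enumeration. This is only an illustrative sketch; the Python names below are ours, not the paper's.

```python
from enum import Enum

class QueryType(Enum):
    """Three-way query taxonomy of Section 2 (class and member names are illustrative)."""
    AMBIGUOUS = "A"  # more than one meaning, e.g. "giant"
    BROAD = "B"      # covers many subtopics, e.g. "songs"
    CLEAR = "C"      # one specific, narrow topic, e.g. "Billie Holiday"
```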
3. USER STUDY
The purpose of the user study is to answer whether it is possible
to associate a query with a certain type by looking at Web search
results. Since it is difficult to find the different meanings of a query
by going through all the results, we use clustered search results
generated by Vivisimo [5] to facilitate understanding of the query.
[Figure 1. Projection of documents represented in categories for three example queries. (a) Type A: “giant”, plotted over the Library, Work & Money, and Computing categories; (b) Type B: “songs”, over the Personal, Entertainment, and Library categories; (c) Type C: “Billie Holiday”, over the Work & Money, Entertainment, and Library categories. Axis values range from 0 to 1.]
Queries used in our user study are sampled from 12 days of Live
Search [4] query logs in August 2006. We use a total of 60
queries and involve five human subjects. Each participant is asked
to judge whether a query is ambiguous (Type A) or not. If the
query is not ambiguous, the participant answers an
additional question: “Is it necessary to add some words to the
query in order to make it clearer?” This question aims to clarify
whether the query is broad (Type B) or clear (Type C).
The user study results indicate that participants are in general
agreement, at 90%, in judging whether a query is ambiguous or
not. However, it is difficult to distinguish Type B from Type C, as
the agreement is only 50%.
4. LEARNING A QUERY AMBIGUITY MODEL
In this paper, we utilize a query q and the set D of top-n search
results for the query in modeling query ambiguity.
We formulate the problem of identifying ambiguous queries as a
classification problem:

f(q, D) → {A, Ā}

Based on the findings in the user study, we aim to classify a query
as A (ambiguous queries) or Ā (broad or clear queries). A Support
Vector Machine (SVM), developed by Vapnik [3], with an RBF
kernel is used as our classifier.
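The paper does not name an implementation of its classifier. As a hedged sketch only, an RBF-kernel SVM could be trained with scikit-learn's SVC on per-query feature vectors; the data, hyperparameters, and the feature count of 12 (from later in this section) are placeholders, not the authors' setup.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: one row of ambiguity features per labeled query,
# y = 1 for ambiguous (A), 0 for broad or clear (A-bar).
X = np.random.rand(253, 12)
y = np.random.randint(0, 2, size=253)

# RBF-kernel SVM as named in the paper; C and gamma values are illustrative.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.predict(X[:5]))  # 1 = predicted ambiguous, 0 = broad or clear
```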
A text classifier similar to that used in [2] is applied to classify
each Web document in
D into predefined categories in
KDDCUP 2005. We represent a document by a vector of
categories, in which each dimension corresponds to the
confidence that the document belongs to a category.
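As a sketch of this representation (assuming a text classifier in the spirit of [2] is available; `classify` below is a hypothetical stand-in, and the category list is an illustrative subset of the KDDCUP 2005 taxonomy), each document becomes a vector of category confidences:

```python
from typing import Callable, Dict, List

# Illustrative subset of the KDDCUP 2005 categories shown in Figure 1.
CATEGORIES: List[str] = ["Computing", "Entertainment", "Library", "Work & Money", "Personal"]

def document_vector(doc_text: str,
                    classify: Callable[[str], Dict[str, float]]) -> List[float]:
    """Represent a document as a vector of per-category confidences.

    `classify` is a hypothetical stand-in for a text classifier such as the
    one in [2]; it returns {category_name: confidence} for the given text.
    """
    scores = classify(doc_text)
    return [scores.get(cat, 0.0) for cat in CATEGORIES]
```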
Our main idea for identifying an ambiguous query is that relevant
documents with different interpretations probably belong to
several different categories. To illustrate this assumption, we
project documents into a three-dimensional (3D) space and show
three example queries in Figure 1. The coordinates correspond to
the three categories that a query most likely belongs to. “Giant,” as
an ambiguous query, may refer to “giant squid” in the Library
category, “Giant Company Inc.” in the Computing category, and
“Giant Food supermarket” in the Work & Money category. Figure 1(a)
shows a scattered distribution among these three categories. “Billie
Holiday” is a clear query, and Figure 1(c) shows that almost all of its
documents are gathered in the Entertainment category. “Songs”
is a broad query; a pattern of documents between being scattered
and gathered is observed in Figure 1(b).
Twelve features are derived to quantify the distribution of D, such as
the maximum Euclidean distance between a document vector and
the centroid document vector in D.
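A minimal sketch of such distribution features follows, assuming the category vectors from the previous sketch. Only the maximum distance to the centroid is named in the paper; the mean distance is an assumed additional example of the same kind.

```python
import numpy as np

def dispersion_features(doc_vectors: np.ndarray) -> dict:
    """Quantify how scattered the category vectors of the top-n results are.

    doc_vectors: array of shape (n_docs, n_categories), one category-confidence
    vector per retrieved document in D.
    """
    centroid = doc_vectors.mean(axis=0)
    distances = np.linalg.norm(doc_vectors - centroid, axis=1)
    return {
        "max_dist_to_centroid": float(distances.max()),    # the feature named in the paper
        "mean_dist_to_centroid": float(distances.mean()),  # assumed, additional example
    }

# Example: a scattered (ambiguous-looking) distribution vs. a gathered one.
scattered = np.array([[0.9, 0.1, 0.0], [0.0, 0.8, 0.1], [0.1, 0.0, 0.9]])
gathered = np.array([[0.8, 0.1, 0.1], [0.9, 0.05, 0.05], [0.85, 0.1, 0.05]])
print(dispersion_features(scattered), dispersion_features(gathered))
```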
5. EXPERIMENTS
We conduct experiments on learning a query ambiguity model
with 253 labeled queries. Five-fold cross validation is performed.
The best classifier in our experiments achieves a precision of 85.4%,
a recall of 80.9%, and an accuracy of 87.4%. Such performance
verifies that ambiguous queries can be identified automatically.
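A hedged sketch of this evaluation protocol, using scikit-learn's cross_validate on placeholder data (not the authors' code; the printed numbers will not reproduce the paper's results):

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X = np.random.rand(253, 12)            # placeholder per-query feature matrix
y = np.random.randint(0, 2, size=253)  # placeholder labels (1 = ambiguous)

scores = cross_validate(SVC(kernel="rbf", gamma="scale"), X, y, cv=5,
                        scoring=["precision", "recall", "accuracy"])
for metric in ("test_precision", "test_recall", "test_accuracy"):
    print(metric, round(scores[metric].mean(), 3))
```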
We then estimate what percentage of queries is ambiguous in a
query set sampled from Live Search logs. The set consists of 989
queries. To achieve this goal, our newly learned query ambiguity
model is used to make predictions on the query set. When we increase
the size of the query set used for estimation from 1/10 to 10/10, the
percentage first fluctuates between 15% and 18% and finally
stabilizes at around 16%. Therefore, we estimate that about 16%
of all the queries are ambiguous.
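As a sketch of this estimation step, the trained classifier is applied to feature vectors of the sampled queries and the fraction predicted ambiguous is reported. Everything below is placeholder data, not the paper's log sample.

```python
import numpy as np
from sklearn.svm import SVC

# A fitted RBF-kernel SVM on placeholder training data (see earlier sketches).
X_train = np.random.rand(253, 12)
y_train = np.random.randint(0, 2, size=253)
clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

# Hypothetical feature vectors for the 989 sampled queries.
X_sample = np.random.rand(989, 12)
ambiguous_fraction = clf.predict(X_sample).mean()  # predictions are 0/1
print(f"Estimated share of ambiguous queries: {ambiguous_fraction:.1%}")
```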
6. CONCLUSION
In this paper, we find that people are in general agreement on whether
a query is ambiguous or not. We therefore propose a machine learning
model based on search results to identify ambiguous queries. The
best classifier achieves an accuracy as high as 87%. By applying the
classifier, we estimate that about 16% of queries in the sampled logs
are ambiguous.
7. REFERENCES
[1] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proceedings of the 25th ACM Conference on Research in Information Retrieval (SIGIR), pages 299-306, 2002.
[2] D. Shen, R. Pan, J.-T. Sun, J. J. Pan, K. Wu, J. Yin, and Q. Yang. Q2C@UST: our winning solution to query classification in KDDCUP 2005. SIGKDD Explorations, 7(2):100-110, 2005.
[3] V. Vapnik. Principles of risk minimization for learning theory. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 831-838. Morgan Kaufmann, 1992.
[4] Live Search. http://www.live.com/
[5] Vivisimo search engine. http://www.vivisimo.com