Query Association Surrogates for Web Search
Falk Scholer    Hugh E. Williams
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V
Melbourne, Australia, 3001.
{fscholer,hugh}@cs.rmit.edu.au
Andrew Turpin
Department of Computer Science
Curtin University of Technology
Perth, Australia, 6001.
andrew@cs.curtin.edu.au
Abstract
Collection sizes, query rates, and the number of users of web search engines are increasing.
Therefore, there is continued demand for innovation in providing search services that meet
user information needs. In this paper, we propose new techniques to add additional terms
to documents with the goal of providing more accurate search. Our techniques are based
on query association, where queries are stored with documents that are highly statistically
similar. We show that adding query associations to documents improves the accuracy of
web topic finding searches by up to 7% and provides an excellent complement to existing
supplement techniques for site finding. We conclude that using document surrogates derived
from query association is a valuable new technique for accurate web searching.
Keywords: Query Association, Past Queries, Web Search, Document Surrogates.
1 Introduction
Web search is the first step in many information discovery tasks. Amongst other activities, users
frequently search for web services to arrange travel, purchase products, obtain health advice, and
find documents. Indeed, the accessibility and usability of web search engines has been responsible
for significant changes in the way the Web is used: for example, most users no longer maintain
personal lists of frequently-accessed sites but instead use a search engine to re-discover the resource.
Efficiency and effectiveness are key to a successful search engine. Popular search engines
process hundreds of millions of queries each day that require answers from collections of over three
billion documents. The collections, number of queries, and number of users continue to grow at
staggering rates. Despite these loads, users are difficult to satisfy: each query must be resolved in
a few tenths of a second and the first ten responses must meet the information need. Therefore,
there is continual demand for innovative, accurate, and fast solutions to the search problem.
Over 80% of queries posed to search engines are ranked or bag of words queries. A recent
study [32] showed that these queries are typically very short, with a mean length of 2.6 terms.
To resolve them, the query terms are matched against terms extracted from the collection being
searched, and inverted index structures [39] are used to evaluate the query. The results of the query
process are typically summaries of documents that contain the query terms, ordered by decreasing
statistical similarity to the query.
It is unclear whether matching query terms with document terms is the optimal strategy for
effective retrieval. Indeed, the problem of vocabulary mismatch — where authors and searchers
choose different subsets of words when describing an information need — is one of the major
causes of poor retrieval effectiveness [9]. This problem is closely related to the consideration of
how the “aboutness” of a document should be represented, a key question in information retrieval
on which no consensus has yet been formed [2, 16].
Recently, it has been shown that document surrogates created from the text associated with
web hypertext links are an effective document representation for some web search tasks [12].
For example, a document describing a researcher’s publications may never include a succinct
description of its topic. However, a hypertext link to the document marked-up with the text
“Information retrieval research papers” is an excellent summary. Such surrogates can be used
instead of the document text for both fast and accurate retrieval or, alternatively, the surrogate
terms can be added to the document to improve the search process. Such alternative strategies
for representing the content of documents may therefore lead to improvements in the accuracy of
the retrieval process.
In this paper, we explore new surrogate techniques for effective retrieval. We consider two
approaches: first, document supplements, where the textual content of a document is supple-
mented by additional terms; and, second, document replacements, where the original content of
the document is replaced by (usually) a smaller set of terms that summarise the document. In our
experiments, we consider two techniques for creating such document surrogates: the well-known
technique of anchor text extraction described above, and the use of query association.
In preliminary work, we proposed query association as a novel method of generating summaries
of web documents [27]. With query association, each query that is posed to a search engine is
stored and associated with the set of documents that are ranked highly in response to the query.
In our previous work, the associated queries are then presented to the user as answer summaries.
In a user study, we found that these summaries are comparable to conventional query-biased
summaries [34] for judging the relevance of each answer to a user’s information need.
In this paper, we examine the use of associated queries as document surrogates, and propose
heuristics for determining how a query should be associated with documents. We explore the
optimal parameters for the maximum associations that should be stored for each document, the
number of associations that should be made per query, and techniques for dynamically choosing
these values. In addition, we explore metrics — such as the recently-proposed query clarity
technique [6] — to determine a priori whether a query should be included in the association
process. We investigate these techniques using the well-known WT10g collection [1] and around
1 million queries from two query logs of the Excite search engine [33].
Our results show that query association leads to more accurate web searching. For finding
documents on a topic, adding query association terms to documents improves “precision-at-10”
accuracy from 26.8% to 28.7% in web collection experiments; this is a statistically significant
improvement of 7%. Moreover, we show that accuracy may further improve if more queries
are available for the association process: after processing all 1 million queries, all three of our
effectiveness measures appear not to have reached a maximum. These results are particularly
significant, since anchor text is ineffective for topic finding [12] and even a small improvement in
web search accuracy has the potential to affect millions of users.
Our results also show that anchor text and query association are complementary. Anchor text
improves the effectiveness of site finding, while query association works well for topic finding, and
a combination of the two techniques affords accurate searching for both search tasks. We conclude
that query association is a valuable technique for accurate web searching.
This paper is structured as follows. In Section 2 we present a background of query and
surrogate techniques that have been used to improve the retrieval process. Section 3 outlines
our novel techniques for creating query association surrogates. Section 4 presents our results and
a detailed discussion, while Section 5 discusses alternative heuristics for the query association
process. Our conclusions and possible future work are discussed in Section 6.
2 Background
In this section, we describe previous approaches to using queries and document surrogates to
improve the retrieval process. In particular, we discuss approaches that overcome the vocabulary
mismatch problem including: document expansion and document surrogate creation; the use of
past queries; and, the estimation of query performance.
2.1 Document Surrogates
Document surrogates — representations of a document that differ from the full content — have
been examined for many purposes including summarisation [19, 34], graphical display of re-
sults [35], document browsing [11], and video and text retrieval [5]. In this section, we describe
their use in improving text searching.
Based on the assumption that the anchor text of a hypertext link describes its target, Craswell
et al. [5] have investigated constructing document replacements composed of the anchor text that
points at a page. Document replacements are surrogates where the original content of the docu-
ment is replaced by new content. In the approach of Craswell et al., the anchor text is treated
as though it occurs in the target document and is then used to replace its content. Experi-
ments demonstrated that the use of anchor text can significantly improve retrieval effectiveness
for named page finding tasks (where the subject of the search is a known resource), compared
to retrieval using the full text content of documents. However, anchor text is not as effective
for a topic finding task. It could therefore be expected that anchor text surrogates would be
beneficial for a query such as “world health organization” (where the target is the WHO website,
http://www.who.int/) but less so for “world health statistics” (where several resources may
contain relevant, topical information).
Adding content to documents, that is, creating document supplements has also been previously
investigated. Westerveld et al. [38] have studied adding information such as the number of in-
links (external hypertext links that point to the document) and anchor texts to a document
using a language model approach. Their results showed, for example, that taking the depth of a
document in the server document tree into account can significantly improve the rank at which
the first correct answer is retrieved. We show similar results in Section 4.
Document supplements have also been used in speech retrieval [29]. The motivation for this
work was to reduce the performance gap between retrieval from automatic speech transcriptions
and perfect text. The Rocchio method for pseudo-relevance feedback [24] was used to identify
frequently occurring words from documents that are related to the document under consideration.
These words were then added to the current document. Experiments showed that this technique
results in significantly more robust retrieval from automated speech transcripts.
Query expansion using pseudo-relevance feedback is also related to our work. In query ex-
pansion, the original query is supplemented with terms from highly-ranked documents after an
initial processing of the query. However, in contrast to the techniques discussed previously, query
expansion deals with the modification of the query space rather than the document space [25]. Re-
cent work by Lam-Adesina and Jones [18] considered the use of document summaries for choosing
query expansion terms. Both query-biased summaries — where document fragments are weighted
toward those that contain query terms — and context-independent summaries were investigated,
and results showed that expanding queries with terms from these sources can improve retrieval
effectiveness.
2.2 Using Queries to Improve Effectiveness
Past queries have been shown to be a useful tool to improve retrieval performance [7, 8, 21].
Furnas et al. examined the vocabulary mismatch problem by carrying out several motivational
studies in which the communication process between systems and users was considered [10]. Using
psychological tests, they demonstrated that the chance of two people using the same main content
word to describe a subject is as low as 10%–20%. The problem arises because of diversity in the
use of language and imprecision in its application.
In later work, Furnas [8] addressed this problem by exploring an adaptive indexing scheme
to improve keyword access to objects in an information system. The approach collects word use
interactively: a typical user engaging in a keyword retrieval task might enter several words that
fail to retrieve the desired information. Then, when the user enters a correct word, the system asks
whether the previous failed attempts should be stored as synonyms for the successful keyword.
For example, a user looking for a command to save a file might unsuccessfully try the keywords
“store”, “write” and “record” before entering the successful “save”. The system then offers to
record the failures as synonyms for the “save” command. These synonyms aid future users in
their search process.
The work of Raghavan and Sever [21] also considered the use of past queries, with the aim of
increasing the efficiency of retrieval by using the results of stored past optimal queries to satisfy
similar new queries. They demonstrated that the retrieval process can be accelerated by using the
past optimal query that is most similar to a new query as a starting point for a steepest descent
algorithm. Their work also showed that it is not appropriate to measure the similarity between
two queries directly, but rather that similarity should be measured by considering the relationship
between the sets of documents that they retrieve. The similarity between queries and documents
is also used in our query association technique discussed in the next section.
Recently, Huang et al. have applied past queries to the problem of suggesting relevant search
terms [14]. In their approach, past queries are clustered based on user sessions captured from a
web proxy server log. By making use of contextual information from past users’ query refinements,
their techniques are able to make effective term suggestions. Since our query association process
is based on all past queries, it does not need to consider the problem of how a user’s query session
should be segmented.
Pirkola and Jarvelin [20] examined automatic query improvement by considering methods to
identify which query term is the best discriminator of relevance in a collection. Identification
of keys is based on the distribution of terms within the collection and within documents. Their
results showed that the best key, once identified, can be used in a structured query to improve
performance without the use of relevance information. We consider similar measures in Section 5.
2.3 Query Association
In previous preliminary work, we have proposed query association as a technique to store the
relationship between past queries and documents [27]. The motivation of our approach was to
construct alternative document summaries that are shown to users after the query evaluation
process is complete.
Consider an example query “telecom australia” that has been posed to a search engine incor-
porating query association for summarisation. In response to this query, the list of ranked answers
includes both conventional query-biased summaries [34] and query associations for each document.
For example, the query associations for the top-ranked answer are: “cable tv carriers”, “telstra
share offer australia”, “telstra australia”, “australia telstra”, and “optus products nokia 1620”.
Each associated query is both a summary and a hypertext link that can be clicked-on to execute
that query.
For document summarisation, we found that each query should be associated with the three
top-ranked documents and that five queries should be stored per document. With these settings,
summaries are composed of only queries that have high similarity to each document, while the
cognitive load for the user is kept low. In addition, queries are only associated with documents
when all terms in the query occur in the document; this was a result of the observation that users
are misled by summaries that contain terms that are not in the document.
In a user study, we demonstrated that query association based summaries are competitive with
conventional document fragment based summaries for accurately assessing whether an answer
meets the user’s information need. In this paper, we use the association approach as the basis of
document modification techniques, as described in the next section.
3 Web Document Surrogates
In this section, we propose techniques for creating document surrogates. In our approach, we
develop document surrogates by applying the technique of query association, and compare this
with anchor text extraction. We apply our techniques to create two classes of surrogate: first,
document supplements where additional terms are added to documents; and, second, document
replacements where the search terms for a document consist only of the surrogate terms.
3.1 Query Association Surrogates
We consider in this section how the query association process described in Section 2 can be adapted
to derive surrogates for effective web searching. Our aim is to define a process that will result in
relevant terms being stored in surrogates, permits dynamic change in surrogates as temporal shift
occurs in queries, and is likely to be computationally inexpensive.
Perhaps the simplest approach to surrogate creation using query association is to associate
each query with a fixed number of documents and to limit the number of associations stored per
document. In this approach, for each query that is posed to a search engine, the query is associated
with the top N documents that are ranked in response to the query. In this work, the ranking
function is a variant of the Okapi BM25 similarity measure [23, 30, 31]; we describe our retrieval
process fully in Section 4.1. In addition to adding new associations, we impose an upper limit of
M associations per document. The association process for fixed M = α and N = β is detailed in
Figure 1. Surprisingly, as we show later in Section 4, this is a highly effective approach, and its
benefits outweigh those of the more complex, dynamic approaches described in Section 5.
The limit of M associations per document serves two purposes. First, it affords a predictable
space and time requirement for managing associations; this is an area we plan to further explore,
as we discuss in our conclusions. Second, once this maximum is reached, existing associations
can be replaced by new queries with greater statistical similarity to the document; this allows the
associations to evolve and adapt to any temporal shift in user queries, and it also allows dissimilar
queries to be unassociated.
Consider an example of creating a document surrogate using query association. In this example,
there is a collection of fifty documents that have no stored associations, and the association
parameters are set to N = 20 and M = 2. Suppose that a user submits the query q1 = “information
retrieval” to the search system, and that twenty-seven documents in the collection are identified
by the ranking function as having some statistical similarity to the query. Since N = 20, q1 is
stored as an association for the first twenty ranked documents. Suppose that the user then enters
another query, q2 = “search recall precision”, which returns ten documents, two of which are already
associated with q1. After q2 is associated with the answer documents, two documents have two
query associations, and twenty-eight have one. Finally, the user enters a third query, q3 = “search
engine resources”, which returns five answer documents, four with no existing associations and one
with two associations. Since the maximum number of associations per document is M = 2, the
association between q3 and the document with two associations is not automatic. Suppose that the
similarity scores between the queries and the document are such that q1 < q2 < q3. In this case,
q1 is unassociated with the document and replaced by q3. After the process is complete, eighteen
documents have no associations, thirty have one association, and two have two associations.
Of the two documents that have two associations, one has the associated queries q1 and q2,
while the other has q2 and q3. This illustrates that, although there can be some term overlap in
associations, this is not necessary. For example, there are no terms in common between q1 and q2.
The queries have become associated with the same document based on the document’s content.
In this way, topical relationships between queries can be established even when no common terms
are present.
From the previous example, it can be seen that the quality of a document surrogate is dependent
on the quality and number of past queries that are available for the association process. As we
show later in Section 4, we have found that settings of M = 19 and N = 39 work well in practice. If
fewer than M = 19 queries have been associated with the document, then the surrogate is affected
1. Let N be the number of associations per query and M the maximum associations
   per document. Di represents the ith document in the collection, and Ai is the
   set of queries associated with document Di. Let Si,k be the similarity score
   between the kth association in Ai and document Di. Note that associations in
   Ai are sorted by Si,k.
2. For each query q in the query collection Q
   (a) Calculate the similarity S(f, q, Di) of the query q to each document Di in
       the collection with the similarity function f
   (b) Create a max-heap of documents H using S(f, q, Di) as the key for Di
   (c) Repeat N times
       i. Let Dj be the result of removing the root of the heap H
       ii. If |Aj| = M and S(f, q, Dj) > Sj,1
           A. Remove the first item from the list Aj
       iii. If |Aj| < M, then Dj does not have M associations
           A. Choose the largest k < M so that S(f, q, Dj) ≥ Sj,k
           B. Insert q just before association k in Aj
           C. Set the new Sj,k ← S(f, q, Dj)
Figure 1: The query association process. In this scheme, we use static association parameters: M
is the maximum number of associations per document, and N is the number of associations that
are created per query.
in two ways: first, the number of queries and (usually) the number of distinct search terms is
lower; and, second, the quality of the surrogate is lower, since the dynamic replacement process
that occurs after the threshold M = 19 is reached has not been used to remove lower-quality
associations. However, the latter effect is also dependent on the number N of associations created
per query; with lower values of N it is possible that associations may be high-quality but that M
may never be exceeded.
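The fixed-parameter association process of Figure 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the similarity function sim(q, d) is a caller-supplied stand-in for the BM25 ranking described in Section 4.1, and the function and variable names are our own. Per-document association lists are kept sorted by ascending similarity, so the weakest stored query is the first candidate for replacement once a document holds M associations.

```python
import bisect

def associate(queries, docs, sim, N=39, M=19):
    """Associate each query with its top-N ranked documents, storing at
    most M associations per document, sorted by ascending similarity so
    the weakest association is replaced first when the limit is reached."""
    assoc = {i: [] for i in range(len(docs))}  # doc id -> sorted [(score, query)]
    for q in queries:
        # rank documents by similarity to the query, best first
        ranked = sorted(((sim(q, d), i) for i, d in enumerate(docs)), reverse=True)
        for score, i in ranked[:N]:
            if score <= 0:       # no statistical similarity: stop associating
                break
            stored = assoc[i]
            if len(stored) == M:
                if score > stored[0][0]:
                    stored.pop(0)   # evict the weakest existing association
                else:
                    continue        # new query is no better; skip this document
            bisect.insort(stored, (score, q))  # keep list sorted by score
    return assoc
```

With sim defined, for instance, as the number of shared terms, a query is attached only to documents that actually score against it, and dissimilar queries are progressively unassociated as better ones arrive.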
Consider an example document from our collection, concerning hypnosis, with M = 19 query
associations:
hypnotic regression, how to hypnotize people, information on phobias, phobias that
are not common, shock tv, how many people have phobias, hypnotic sensations, stage
hypnotists, hypnotherapist, how do i find a hypnotherapist, hypnotized, hypnosis,
depression physical causes, hypnosis during birth, addictions and hypnosis, how to
hypnotize hypnosis, hypnosis hypnotize, catalepsy hypnosis, deep intelligent hypnosis
The associated queries relate to different aspects of hypnosis, hypnotherapy, and related conditions.
Addition of the terms allows the document to be effectively retrieved in response to queries about
this topic.
For the creation of document replacements, it is important that the surrogate information
accurately reflects the original content of the document. In preliminary experiments not reported
here, we observed that in some cases queries can become associated with a document when a
single term strongly dominates the similarity function. If the other query terms do not occur
in the document, then the replacement surrogate can subsequently be retrieved based on these
spurious terms that are not a reflection of the original content, causing performance to deteriorate.
We therefore imposed an additional constraint on the association process, namely that all terms
in a query that is a candidate for association must also occur in the document content. We call
this Ranked and Boolean association.
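The Ranked and Boolean condition amounts to a set-containment check; a sketch (the function name is ours, not the paper's):

```python
def boolean_ok(query_terms, doc_terms):
    """Ranked and Boolean association: a query may only be associated
    with a document if every query term also occurs in the document."""
    return set(query_terms) <= set(doc_terms)
```

A ranked candidate that fails this check is discarded before association, so a replacement surrogate never accumulates terms that are absent from the original document.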
For document replacement surrogates, these associated queries would be used in place of the
document text. For indexing purposes, the terms in the associations would be stored, and subse-
quently made available for retrieval. For document supplements, the association terms would be
added to the full text of the document; both the full text and association terms would be indexed
and made available for retrieval. With the Ranked and Boolean condition, associations do not
add new terms to document supplements. Instead, they change the distribution of terms in the
collection, adding weight to those terms that have been considered important by past users.
In the next section, we consider how anchor text can be used to create alternative surrogates.
We further detail the parameter settings in query association and the effect of query log size in
Section 4. We discuss other, less successful techniques for creating query association surrogates in
Section 5.
3.2 Surrogates using Anchor Text
As discussed in Section 2, in-link anchor text is well-known as an alternative method of creating
document surrogates. In particular, anchor text surrogates work well for named page finding tasks
such as, for example, locating home pages. For home page finding, additional techniques are also
useful such as using the length of the web resource URL, the depth of a resource within a web
domain, and the document structure [5].
For our experiments, we have extracted in-link anchor text for our collection. We used two
anchor text surrogate collections: first, we extracted the in-link anchor text from within the
collection itself; and, second, we processed a larger 100 gigabyte collection and extracted in-link
anchors that reference documents within the smaller test collection. Details of our experiments
are presented in Section 4. In addition, we incorporated a URL length-based adjustment to the
similarity score for home page finding tasks, where the final similarity score is a product of the
score determined by the ranking function and the inverse of the URL length; similar approaches
have been shown to work well for this collection [12].
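The URL length-based adjustment described above can be sketched as a one-line scoring rule (the function name is ours): the final similarity is the ranking-function score multiplied by the inverse of the URL length.

```python
def home_page_score(rank_score, url):
    """Final similarity for home page finding: the ranking-function
    score times the inverse of the URL length, so that short URLs
    (typically site entry pages) are favoured over deep pages."""
    return rank_score * (1.0 / len(url))
```

Under this rule, a deeply nested page is penalised relative to a site root that achieves the same text score.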
4 Results
In this section, we describe our experimental framework and present the results of document sur-
rogate experiments. We show that query association surrogates are an effective tool for topic
finding tasks, a task for which anchor text is ineffective. Anchor text provides a complementary
performance boost in site finding tasks, a task for which query associations are less suited. Over-
all, we show that the addition of both anchors and associations for all querying tasks improves
performance over a document-only baseline.
4.1 Experimental Environment
The TREC WT10g collection is used for our experiments. This collection of 1.69 million documents
was created from a snapshot of the Web in 1997, and was designed to have a high degree of inter-
server connectivity to support experiments that exploit link data [1, 13]. Collection statistics are
reported in Table 1. The collection was primarily used as a testbed at the TREC-9 and TREC
2001 conferences [36, 37].
In addition to the test collection, the TREC environment also established a set of topic finding
queries, fifty each from TREC-9 and TREC 2001. The queries consist of four major components,
marked up as SGML fields: a number, a title, a description, and a narrative. The title is a short
string of text taken from an actual search engine query log, while the description and narrative
fields are engineered from these to expand on the perceived information need of a hypothetical
user. Only the title strings are used in our experiments, as these are representative of a web search
task.
The TREC queries have corresponding relevance judgements, that is, lists of documents that
meet the user information needs. Due to the large size of collections such as WT10g, it is not
possible to judge the relevance of each document in the collection. Instead, relevance assessments
are created by pooling, whereby participants in the TREC conference submit their ranked results
Number of documents 1.69 million
Median terms per document 207
Minimum terms per document 0
Mean terms per document 597
Maximum terms per document 649,998
Standard deviation of document size 2907
Minimum relevant documents per topic 1
Mean relevant documents per topic 60
Maximum relevant documents per topic 519
Table 1: Statistics for TREC WT10g collection and related TREC topics 451–550. Our retrieval
system indexes content terms only; all information within HTML markup is ignored. Relevance
statistics are based on the TREC relevance assessments for topics 451–550.
for each query [37]. Human assessors select a number of results, usually 100, from each submitted
run, to create a topic pool. Documents in the pool are assessed and judged as being relevant to
the query or not. Unjudged documents are assumed to be irrelevant. Statistics for the number of
relevant documents for TREC topics 451 to 550 are given in Table 1.
An experimental environment for a different type of web search activity, site finding, was
additionally introduced in TREC 2001. Also known as home page finding, this is a known-item
task, where the aim is to return the main entry-page to a known site [5]. A total of 145 queries
were devised for this task, and corresponding relevance judgements based on the WT10g collection
have been made available.
For our retrieval runs, we use a variant of the Okapi BM25 similarity measure [22]:
    BM25(q, d) = \sum_{t \in q} \log \left( \frac{N - f_t + 0.5}{f_t + 0.5} \right) \times \frac{(k_1 + 1) f_{d,t}}{K + f_{d,t}}
where:
    q is a query, containing terms t;
    d is a document;
    N is the number of documents in the collection;
    f_t is the number of documents containing term t;
    K is k_1((1 - b) + b × L_d / AL);
    k_1 and b are parameters, set to 1.2 and 0.75;
    f_{d,t} is the number of occurrences of t in d; and
    L_d and AL are the document length and average document length respectively.
The first term in the similarity measure reduces the impact of query terms that occur often
throughout the collection, while the second favours documents in which query terms occur fre-
quently. Sparck Jones et al. [30, 31] present a detailed explanation of the Okapi BM25 formulation.
For our experiments, we have omitted some additional parameters that are not used in this con-
text; for example, we have assumed that query terms are not repeated. We do not use relevance
feedback in our retrieval runs, but rank documents directly on their similarity score. In our ex-
periments, 1000 ranked documents are returned per query. We also do not use stemming (the
removal of common prefixes and suffixes), but use a stoplist of frequently occurring, closed-class,
and function words1.
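The scoring step of this retrieval setup can be sketched as follows; this scores one document against a query using the BM25 variant given above, with no relevance feedback and query terms assumed unrepeated. The helper names are ours, and a document is represented simply as a list of its content terms.

```python
import math

def bm25(query_terms, doc_terms, N, df, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 variant from Section 4.1. N is the number of documents
    in the collection, df maps a term t to f_t (the number of documents
    containing t), and avg_len is the average document length AL."""
    Ld = len(doc_terms)
    K = k1 * ((1 - b) + b * Ld / avg_len)
    freqs = {}
    for t in doc_terms:          # f_{d,t}: occurrences of t in d
        freqs[t] = freqs.get(t, 0) + 1
    score = 0.0
    for t in set(query_terms):   # query terms are not repeated
        f_dt = freqs.get(t, 0)
        if f_dt == 0 or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * (k1 + 1) * f_dt / (K + f_dt)
    return score
```

Ranking a collection then amounts to computing this score for each document and returning, as in our runs, the top 1000 documents in decreasing score order.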
Query associations for documents in WT10g were constructed from two Excite search engine
query logs: the first contained approximately 1 million queries from a single day in 1997; and,
the second contained approximately 1.7 million queries from a single day in 1999 [33]. Both logs
1The stoplist is available as part of Lucy, our public-domain search engine, which can be downloaded at
http://sourceforge.net/projects/lucy-search/
Statistic                                   Associations   Anchors      Anchors (VLC)
Mean Terms per Association/Anchor           2.7            3.7          3.7
Mean Associations/Anchors per Document      5.4            3.9          4.4
Mean Surrogate Terms per Document           14.5           14.3         16.2
Maximum Number of Associations/Anchors      19             56,135       72,315
Documents with Zero Associations/Anchors    416,358        1,048,893    619,961
Total Associations/Anchors                  9,099,343      6,588,516    7,484,664
Total Surrogate Terms                       24,533,012     24,279,011   27,483,911
Table 2: Distribution of surrogate information (associated queries and anchors) and terms for the
WT10g collection. The table shows: the mean number of terms per associated query/anchor text
string; the mean number of associated queries/anchor text strings stored per document; the mean
number of terms in the associated queries/anchor text strings stored per document; the maximum
number of associated queries/anchor text strings stored for a document; the number of documents which
have no associated queries/anchor text strings; the total number of associated queries/anchor text
strings for the whole collection; and the total number of surrogate terms for the whole collection.
were preprocessed to remove offensive terms — using the same technique used to prepare WT10g
— and duplicates were eliminated, leaving 917,455 unique queries. After processing the queries,
with association parameter settings of M = 19 and N = 39, the mean number of associations
per document was 5.4. Around 194,000 documents had the maximum of M = 19 associations,
and just over 416,000 documents had no associations. Detailed statistics for query association
surrogates are given in Table 2.
Anchor text was extracted from two sources: first, the WT10g collection itself; and, second, the
larger TREC VLC2 100 gigabyte collection (VLC). We extracted in-link anchor text, that is, text
from documents that pointed to a given document. Using our techniques, the WT10g extraction
resulted in a mean of 3.9 anchor text fragments per document and over 643,000 documents having
no anchor text. Detailed statistics are provided in Table 2.
For the topic finding task, accuracy or effectiveness is evaluated using the standard information
retrieval metrics of precision (the fraction of retrieved documents that are relevant to the query)
and recall (the proportion of relevant documents that have been retrieved) [37, 39]. First, we
report the mean average precision (MAP). For a single query, the average precision is the mean
of the precision obtained after each relevant document is retrieved; MAP is the average of these
values over each query in the query set. It therefore provides a single measure of performance for
a retrieval run, where documents that are retrieved earlier are weighted more heavily. Second, we
report the macro-averaged precision-at-10 answers (P@10). Last, we report the macro-averaged
R-precision (RP), that is, the precision after inspecting the number of answers that is equal to the
number of relevant documents for the query.
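These three measures can be sketched as follows. This is an illustrative implementation only; per the usual TREC convention, average precision is normalised by the total number of relevant documents for the query:

```python
def average_precision(ranked, relevant):
    """Mean of the precision values obtained at each relevant document retrieved."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank      # precision at this relevant document
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the first k answers that are relevant (P@10 when k = 10)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def r_precision(ranked, relevant):
    """Precision after inspecting as many answers as there are relevant documents."""
    R = len(relevant)
    return sum(1 for doc in ranked[:R] if doc in relevant) / R if R else 0.0
```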
For site finding, we use the three measures used at TREC 2001 [5]: first, mean reciprocal rank
(MRR), that is, the average of the inverse of the rank at which the first correct answer appears;
second, the percentage of queries where the correct answer is in the first ten answers (%top10);
and, last, the percentage of queries where the correct answer is not in the first one hundred answers
(%fail).
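The three site finding measures can likewise be computed from a ranked answer list per query; a minimal sketch, assuming each query has exactly one correct answer page:

```python
def site_finding_measures(runs):
    """Compute MRR, %top10 and %fail over a set of site finding queries.

    runs: list of (ranked_answers, correct_answer) pairs, one per query.
    """
    recip_ranks, top10, fail = [], 0, 0
    for ranked, answer in runs:
        rank = ranked.index(answer) + 1 if answer in ranked else None
        recip_ranks.append(1.0 / rank if rank else 0.0)   # unfound answers contribute 0
        top10 += 1 if rank and rank <= 10 else 0
        fail += 1 if rank is None or rank > 100 else 0
    n = len(runs)
    return sum(recip_ranks) / n, 100.0 * top10 / n, 100.0 * fail / n
```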
4.2 Significance Testing
In comparing the performance of retrieval algorithms, it is necessary to make a judgement about
the significance of differences in performance, that is, to determine whether the observed trends may have
arisen by chance or indicate a reliable difference in performance. As demonstrated by Zobel [40],
it is not safe to assume that two retrieval runs are not significantly different because they give
similar average performance, nor is it safe to assume that retrieval runs whose performance differs
by more than a few percent do differ significantly. Statistical testing can be used to give a measure
of confidence that an observed difference in performance is due to an actual improvement, rather
Document Type                               MAP      p       P@10     p       RP       p
Full text and Associations                  0.1754   0.015   0.2866   0.011   0.2084   0.019
Full text, Associations and Anchors (VLC)   0.1738   0.044   0.2876   0.014   0.2074   0.017
Full text, Associations and Anchors         0.1737   0.050   0.2876   0.014   0.2074   0.019
Full text                                   0.1681   –       0.2680   –       0.1993   –
Full text and Anchors (VLC)                 0.1659   0.011   0.2629   0.234   0.1919   0.043
Full text and Associations (no RnB)         0.1644   0.107   0.2649   0.378   0.1930   0.429
Anchors and Associations                    0.1221   <0.001  0.2281   0.044   0.1644   0.004
Associations                                0.1212   <0.001  0.2479   0.187   0.1613   0.001
Associations (no RnB)
Anchors (VLC)                               0.0222   <0.001  0.0781   <0.001  0.0440   <0.001
Anchors                                     0.0183   <0.001  0.0750   <0.001  0.0383   <0.001
Table 3: Accuracy of surrogate techniques for the topic finding task, ordered by decreasing mean
average precision (MAP). Accuracy is also shown as precision at 10 (P@10) and R-Precision (RP).
Associations are created with M = 19 and N = 39, and anchor text is extracted from WT10g.
p-values are reported for comparisons between runs and the baseline (full text) using a one-tailed
Wilcoxon signed rank test.
than random variation.
A variety of tests have been proposed for the evaluation of information retrieval experiments,
including the t-test, the sign test, the Wilcoxon signed rank test, and ANOVA (see, for example,
Hull [15]). We test the significance of our results using the Wilcoxon signed rank test, a nonpara-
metric alternative to the t-test [28]. This is preferred to the t-test and ANOVA because it does
not require an assumption that the underlying data is normally distributed, it is more powerful
than the sign test, and it has been shown to be effective for the evaluation of information retrieval
experiments [40].
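For illustration, a small exact version of the test can be written directly. This is a sketch only: it drops zero differences, does not handle ties in the absolute differences, and enumerates the null distribution exhaustively, so it is practical only for small query sets:

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact one-tailed Wilcoxon signed rank test of H1: x tends to exceed y.

    Returns (W_plus, p_value), where W_plus is the sum of ranks of the
    positive paired differences.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for r, i in enumerate(order, start=1):
        ranks[i] = r                          # rank by absolute difference
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    n = len(diffs)
    # Under the null, each rank joins W+ independently with probability 1/2,
    # so the p-value is the fraction of sign patterns with W+ at least as large.
    count = sum(1 for signs in product((0, 1), repeat=n)
                if sum(r for s, r in zip(signs, range(1, n + 1)) if s) >= w_plus)
    return w_plus, count / 2 ** n
```

In practice, the inputs would be the per-query average precision scores of two runs; a library routine with a normal approximation is the usual choice for the 100-query sets used here.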
4.3 Topic Finding
Table 3 shows the results of our topic finding experiments and an important outcome: query
association improves web topic retrieval. Overall, query association document supplements —
where query associations are added to the full text of the document — improve mean average
precision by 4.3% over the baseline of full text alone and, importantly for a web search task, they
improve precision-at-10 by 7%. Both results are significant at the 0.05 level using
the Wilcoxon signed rank test. Moreover, as shown later in this section, we would expect that
query association performance would improve if a larger query log were available for association.
In contrast, for topic finding, anchor text does not improve accuracy: addition of anchor text to
the full text reduces mean average precision by 1.3%.
Figure 2 shows the performance of the different document supplement schemes for individual
queries. Each line represents the difference between the MAP of a query under a selected document
supplement scheme compared to baseline full text retrieval. The queries have been sorted by the
difference in performance. A scheme is better than the baseline, on average, if the area between
the region of the curve where performance is positive (toward the right-hand side of the graph) and
the x-axis exceeds the area between the region of the curve where performance is negative (toward
the left-hand side) and the x-axis; from the graph it can be seen that for the associations-based
supplement approaches the positive area exceeds the negative area. Figure 2 also shows that the
MAP of fewer than five of the queries is negatively affected by more than 2%, while 18 or more
queries are improved by more than 2% for the association-based document supplement approaches.
Moreover, 3 to 4 queries are boosted by more than 10%, while none are affected negatively by
more than 8%.
The graph also shows an important detail about the addition of anchor text for a topic-finding
task: the performance of the scheme that combines full text, associations, and anchors and the
[Figure 2 appears here as a line plot of per-query MAP differences for the schemes Full text and Associations; Full text, Associations and Anchors (VLC); Full text, Associations and Anchors; and Full text and Anchors (VLC).]
Figure 2: Per-query performance of document supplements. The graph shows the difference in
effectiveness, measured by mean average precision, between the document surrogate and the
full text baseline. Queries have been sorted by the difference in effectiveness.
scheme that combines only full text and associations, is nearly the same for the positive parts of
the curve. In contrast, for the negative parts of the curve, the former scheme always lies below
the latter. While the addition of anchors into the mix of full text and associations has almost no
eect on those queries that already do well, for queries that already do worse than the baseline
case, anchors have a compounding eect and reduce performance further. Finally, we observe that
supplements using full text and anchors alone performs worse than any other document supplement
scheme for the positive parts of the curve.
Associations and anchors are most effective when combined with full text. However, query
association surrogates are also reasonably effective as document replacements: under the average
precision-at-10 measure, associations alone are only 6% less effective than full text alone, and the
result is not significant at the 0.05 level based on the Wilcoxon signed rank test. This suggests that
associations may be an effective tool for discovering the most-similar answers to a query and, with
careful consideration of how associations are stored and maintained, may be a useful component
of an efficient query evaluation scheme.
Our aim in these experiments is not to set new benchmarks for performance but, rather, to
illustrate the relative performance of different surrogate approaches. Ranking parameters can be
adjusted to improve particular combinations; for example, to favour short association replace-
ments. The Okapi similarity function incorporates an adjustment for document length, so that
long documents are not favoured because they contain more terms. The normalisation is governed
by a parameter, b, that is set empirically, and Chowdhury et al. recently demonstrated that differ-
ent collections require different b settings [4]. Indeed, in preliminary experiments, we found that
adjusting from the recommended value of b = 0.75 to b = 0.1 improves all query association docu-
ment replacement accuracy measures by 4%–8%. When finely tuned, therefore, query associations
may be a competitive replacement for full text for many topic finding tasks.
Document Type                                MRR     p       %top10   %fail
Full text, Associations, and Anchors (VLC)   0.663   0.417   79.3     11.7
Full text and Anchors (VLC)                  0.662   0.288   80.7     11.7
Full text and Anchors                        0.652   0.326   80.0     12.4
Full text, Associations, and Anchors         0.652   0.371   78.6     12.4
Full text                                    0.642   –       81.4     11.0
Full text and Associations                   0.638   0.033   80.0     11.7
Anchors and Associations                     0.350   <0.001  45.1     46.5
Associations                                 0.287   <0.001  37.6     51.1
Anchors (VLC)                                0.210   <0.001  25.0     72.2
Anchors                                      0.172   <0.001  21.5     75.7
Table 4: Accuracy of surrogate techniques for the site finding task, ordered by decreasing mean
reciprocal rank. Accuracy is also shown as the percentage of queries where the answer is in the top
ten documents (%top10) and as the percentage of queries that do not have the correct answer in
the top 100 (%fail). Associations are created with M = 19 and N = 39, and, except where noted,
anchor text is extracted from WT10g.
4.4 Site Finding
Table 4 shows the results of our site finding experiments. As expected, anchor text supplements
improve the results for this task: MRR is improved by 1.6% by adding WT10g anchors to the
full text, and by 3.1% by adding anchor text extracted from the larger 100 Gb VLC collection.
However, perhaps unexpectedly, these results are not significant under the Wilcoxon signed rank
test because only a few of the 145 queries actually find different answers with and without anchor
text. Improvements from using anchor text have been shown to be significant under the sign
test [5] but, to our knowledge, statistical significance has not been demonstrated by others for
improvements using the TREC query sets that we have used.
On a per-query basis, the four document supplement schemes are very similar in performance:
the MRR of around 120 of the 145 queries stays unchanged from the baseline full text case for
these schemes. For the best scheme — full text, associations, and anchors — nine queries are
improved by more than 10% compared to the baseline, while the performance of only five queries
fell by more than 10%. Both of the other two schemes that incorporate anchors have seven queries
that are improved by more than 10%, while three to five are reduced by more than this level. The
Full text and Associations scheme was less effective for the named page finding task: here, three
queries were improved by more than 10%, while the performance of the same number fell by over
10%. Therefore, while associations on their own are not beneficial for site finding, when they are
added to anchor text they do not harm performance.
Document replacements are overall less effective for named page finding: Anchors and Asso-
ciations — the best document replacement scheme — led to an MRR of 0.35, compared to the
baseline of 0.64. Replacements constructed only from query associations performed slightly worse,
and anchors on their own showed the poorest performance for this task.
To investigate why query associations are not helpful for named page finding, we examined
the query log more closely. On inspection of a sample, it was found that the majority of queries
from which the associations are built were posed for topic finding. They are therefore unlikely
to contain terms that are useful for a page-finding task, since this often relies on the presence of
proper nouns (for example, the names of institutions). If these do not occur frequently in the
queries that are used for building associations, then the document replacements will not contain
the terms that are required for the effective retrieval of a specifically named page. We would
expect that this effect would be mitigated substantially in a production system, where a much
larger number of queries, including a greater range of named page finding queries, is available.
4.5 Parameter Settings
In this section, we consider the effectiveness of different values of M and N for deriving surrogates
for web search tasks. In a recent analysis of user logs from the Excite search engine, Spink et
al. show that over 50% of users only look at the first page of results, usually corresponding to
10 answers, from their web search. We therefore investigated the role of the association param-
eters using precision-at-10 as a performance metric, as this measure is most representative
of web searching behaviour. Specifically, we identified document surrogates through the associ-
ation process for all combinations of M = 1 to M = 25 and N = 1 to N = 45, and created
document replacements that are searched instead of the original text. We then identified the
combination of parameter settings that achieved the highest precision-at-10, and used this setting
in our experiments described in the previous section.
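The sweep can be sketched as a simple exhaustive grid search; here build_surrogates and evaluate_p10 are hypothetical stand-ins for the association and evaluation steps described above:

```python
def grid_search(m_values, n_values, build_surrogates, evaluate_p10):
    """Sweep all (M, N) combinations, returning the best pair by precision-at-10."""
    best_p10, best_mn = -1.0, None
    for m in m_values:
        for n in n_values:
            p10 = evaluate_p10(build_surrogates(m, n))
            if p10 > best_p10:
                best_p10, best_mn = p10, (m, n)
    return best_mn, best_p10
```

For the 25 × 45 grid used here this is 1,125 surrogate collections, so in practice the surrogate construction step dominates the cost of the sweep.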
Since precision-at-10 is a less stable metric than mean average precision and R-precision [3],
we used all 100 queries for the WT10g collection to determine our parameter settings. Figure 3
shows the results of this experiment in varying M and N for a web search topic finding task. The
precision-at-10 achieved for each combination of M and N is shown as a shade of gray, where black
and white are the lowest and highest accuracy observed over all settings respectively. The figure
demonstrates that very low values of M and N lead to poor performance, while higher values lead
to improved effectiveness. As shown in Table 3, the best precision-at-10 for this task is 0.2479
with M = 19 and N = 39.
We validated our results by conducting a similar analysis based on mean average precision.
Based on this more stable precision metric, a region of good performance was identified. Best
performance is given in the area bounded by M = 17 to M = 24 and N = 36 to N = 44. As the
optimum parameter settings based on precision-at-10 lie within this region, we used M = 19 and
N = 39 for our experiments.
Our precision-at-10 results are based on post-hoc tuning of the association parameters for
one type of surrogate (document replacements), one type of web search task (topic finding), and
one performance metric (precision-at-10). We expect that different parameter settings would be
optimal for other tasks, such as named page finding, and other surrogates, such as document
supplements. However, from Figure 3 it can be concluded that, in general, large settings for M
and N are to be preferred to small settings for static query association.
4.6 Query Log Size
To investigate how the number of queries that are available impacts the performance of surro-
gates, we conducted experiments to investigate the effect of the size of the query log from which
associations are derived. In all, we carried out six experiments, in which we randomly extracted
queries from the query log in multiples of 150,000. For example, in our second experiment we ex-
tracted 300,000 queries from the log, constructed query association document replacements, and
then searched these with our topic finding queries.
Figure 4 shows mean average precision, precision-at-10, and R-precision as the number of
queries used for the creation of query association document replacements increases. With zero
queries, the replacements are empty, so precision is zero. Performance then improves as more
queries are used. Interestingly, none of the three accuracy measures appears to have reached a
maximum. This is encouraging, as it suggests that with the availability of even more queries,
the performance of retrieval using association-based surrogates would continue to improve. This
observation is significant to a production web search engine that might typically process tens or
hundreds of millions of queries each day.
5 Alternative Approaches
The query association process proposed in Section 3 and used in Section 4 uses static parameters
for the maximum M associations per document and the number N of possible associations per
query. In this section, we consider alternative techniques for dynamically setting the association
[Figure 3 appears here as a grayscale grid over M = 1 to 25 and N = 1 to 45.]
Figure 3: The effect on precision-at-10 of varying M and N in creating query association document
replacement surrogates; M is the maximum associations per document, and N is the number of
associations created per query. The lowest P@10 is shown shaded black (11.1%) and the highest
shaded white (24.8%).
parameters. Our motivation is to investigate whether retrieval accuracy can be improved by
allowing more associations for documents with diverse topic coverage and fewer for those that are
narrow, and by allowing more associations to be created when a query is appropriate to a collection
and fewer when the query has few matching answers.
5.1 Dynamic N
A static value of the number N of associations created per query allows each candidate query to
be associated with up to the same number of documents in the collection. Fewer than N associations
are only created if the maximum M associations per document has been reached and the current
query is dissimilar to the target document. However, a static value of N relies on an assumption of
equal query quality: in practice, for some queries, there will be many relevant answer documents
in a collection, while for other queries there will be only a few (or perhaps none). It is therefore
intuitive that some queries should instead be associated with many documents, and others with
few or none. To investigate this effect, we have experimented with dynamically choosing N for
each query.
[Figure 4 appears here as a line plot of precision against the number of queries (×100,000), with curves for Precision at 10, Average Precision, and R-Precision.]
Figure 4: The effect of varying the number of queries used to build associations on surrogate
quality. Performance is shown for precision-at-10, mean average precision, and R-precision; these
measures, the collections, and queries are discussed in Section 4. Associations are built with the
optimal static parameter settings of M = 19 and N = 39.
An obvious method for choosing N is to make use of the similarity between the query and
answer documents, as determined by a score from the ranking function. Using parameters that
we found to work well for the static case as a guide, we evaluated the similarity between 10,000
queries that were randomly chosen from our query log and the collection, and calculated the mean
similarity score at the optimal static rank of 39. We then used this value as a threshold, so that all
documents that are more similar to a query than this average value are associated with the query,
and those with lower values are not. In practice, the number of documents that are associated
with a query then varies between 0 and 10,000. We call this scheme threshold.
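A sketch of the threshold scheme, assuming an arbitrary query-document similarity function sim (such as the BM25 variant of Section 4) and a precomputed threshold:

```python
def associate_threshold(query, collection, sim, threshold, cap=10000):
    """Associate `query` with every document scoring above `threshold`.

    `threshold` is assumed to be the mean similarity at the optimal static
    rank of 39, estimated from a sample of queries; at most `cap` documents
    are considered, so the effective N varies between 0 and `cap`.
    """
    scored = sorted(((sim(query, d), i) for i, d in enumerate(collection)),
                    reverse=True)
    return [collection[i] for s, i in scored[:cap] if s > threshold]
```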
An alternative method to dynamically determine N for each query is to estimate the quality of
the query itself prior to its execution. Document frequency, that is, the number of documents in
the collection that contain a term, is a fundamental metric used in computing similarity scores [26].
Intuitively, therefore, we might expect that it is also a useful technique for determining the likely
benefit of associating a given query. For example, we might expect that the query “algorithms”
— which has a high inverse document frequency (IDF) — is a better association than “software”
— which has a low IDF — for a document describing data structures and algorithms.
In preliminary experiments that we do not describe in detail here, we found that the average
IDF of the query terms is weakly correlated with the performance of the query; this is consistent
with observations made elsewhere [6]. However, we found a strong correlation between query
performance and the IDF of the query term that has the maximum IDF; this was significant
at the 0.05 level using the Spearman rank correlation test over 100 queries for three precision
measures. We therefore experimented further with this approach in creating surrogates using
query association.
Similarly to the threshold scheme, we again used the optimal static value of N = 39 to guide
our development of a dynamic approach. However, for this scheme, we used the maximum IDF of
the query to determine an N value. By considering the mean maximum IDF of terms from 10,000
randomly chosen queries in the query log, we found that rank 39 can be achieved by multiplying
the maximum IDF by 12.15. In practice, this results in dynamic values of N between 0 and 110.
We call this scheme maxidf.
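A sketch of maxidf, assuming the common log(N/f_t) form of IDF (the exact IDF formulation is not spelled out above, so this choice is illustrative):

```python
import math

def dynamic_n_maxidf(query_terms, doc_freqs, num_docs, multiplier=12.15):
    """Choose N for a query from its highest-IDF term (the maxidf scheme).

    The multiplier of 12.15 is the calibration reported above, chosen so the
    mean maximum IDF over sampled queries maps to the static optimum N = 39.
    """
    idfs = [math.log(num_docs / doc_freqs[t])
            for t in query_terms if doc_freqs.get(t)]
    if not idfs:
        return 0  # no query term occurs in the collection: do not associate
    return int(round(multiplier * max(idfs)))
```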
Surrogate Type                            Total        Mean   Max      Zero
Static M (19), Static N (39)              9,099,343    5.4    19       416,358
Static M (19), Dynamic N (maxidf)         8,143,910    4.8    19       463,536
Static M (19), Dynamic N (threshold)      11,011,562   6.5    19       337,350
Dynamic M (doc), Static N (39)            12,225,417   7.2    425      416,683
Dynamic M (doc), Dynamic N (threshold)    14,567,011   8.6    425      337,350
Anchors                                   6,588,516    3.9    56,135   643,437
Anchors (VLC)                             7,484,664    4.4    72,315   619,961
Table 5: Distributions of query associations for static and dynamic association schemes and an-
chor text. Total is the number of associated queries or pieces of anchor text in the surrogate
collection; Mean indicates the average number of associated queries or pieces of anchor text stored
per document; Max shows the maximum number of associated queries or pieces of anchor text that
are stored for a single document; Zero indicates how many documents in the original collection
have zero associated queries or pieces of anchor text.
A comparison of the distributions of associations for both dynamic N schemes and a static N
approach is shown in the first three rows of Table 5. For comparison, the distribution of anchor
text extracted from the WT10g and the WT100g (VLC) collections is also shown in the final two
rows. The maxidf scheme creates around 11% fewer associations than the static N = 39 approach,
while threshold creates around 21% more. Similar trends occur with the number of documents
that have no associations.
We also considered an alternative to maxidf that made use of query clarity. Cronen-Townsend
et al. [6] proposed query clarity as an alternative technique to estimate query performance. In
query clarity, the ambiguity of a query with respect to a collection is estimated using the divergence
between the language model of the query and the background language model of the collection. In
experiments on collections with controlled vocabularies, query clarity was shown to have a strong
correlation with query performance and to be a more reliable performance predictor than one based
on the number of documents containing each query term (average inverse document frequency).
Surprisingly, we found only a weak correlation between query clarity and query performance for
our experiments with WT10g; the Spearman rank correlation test was significant at the 0.1 level
for only R-precision. However, despite this, we investigated the use of clarity in our large-scale
query association task.
We precomputed the clarity of the Excite queries with respect to our collection, and then
investigated association where queries with low clarity were not associated. In all, around 7.5% of
our queries have a clarity score of zero, but omitting only these queries has no effect on accuracy.
Indeed, omitting those with the lowest 10% query clarity still had no effect, and omitting more
was detrimental. This finding may be due to the large, less-controlled vocabulary of the WT10g
collection, and may also be influenced by the choice of parameters in query clarity. We chose not
to experiment with query clarity further.
Surrogate Type                            MAP      P@10     RP
Dynamic M (doc), Dynamic N (threshold)    0.1321   0.2344   0.1738
Dynamic M (doc), Static N (39)            0.1239   0.2531   0.1707
Static M (19), Static N (39)              0.1212   0.2479   0.1613
Static M (19), Dynamic N (maxidf)         0.1148   0.2229   0.1636
Static M (19), Dynamic N (threshold)      0.1147   0.2292   0.1488
Table 6: Accuracy of dynamic surrogate schemes. Schemes are ordered by decreasing mean average
precision (MAP). Accuracy is also shown as precision at 10 (P@10) and R-Precision (RP).
5.2 Dynamic M
The association parameter M limits the number of queries that can become associated with a
single document. When set statically, the same value is applied to all documents in the collection;
that is, each document, regardless of its size, has the opportunity to store the same number of
associations. However, it seems intuitive that a large document might require a greater number
of associations to adequately describe its content, while a small document would need only a few.
One method by which M can be determined dynamically is to choose a value proportional to
the size of the full-text document. Similarly to our experiments with N, we examined the size of
10,000 randomly chosen documents in the collection, and calculated a multiplier so that the mean
document size resulted in a value of M = 19. In practice, the range of values of M was from 0
to 425. We call this scheme doc; selected statistics are shown in the fourth and fifth rows of
Table 5.
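The doc scheme reduces to a one-line scaling rule; a minimal sketch, with the multiplier derived so that a mean-length document receives the static optimum M = 19:

```python
def dynamic_m(doc_length, mean_doc_length, target_m=19):
    """Scale the per-document association cap M with document size (the doc scheme)."""
    return int(round(target_m * doc_length / mean_doc_length))
```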
5.3 Results
The results of our experiments using dynamic association are shown in Table 6. The combination
of doc and threshold offers a 9% improvement in mean average precision over the static-only
approach, but is 5% worse under the precision-at-10 measure. In all other cases, the improvements
are small or the dynamic scheme is less accurate than the static approach. Indeed, for the doc and
static N scheme, the cost of the small improvement is substantially more storage: an additional 3.1
million, or 34%, more associations are stored compared to the static approach. Moreover, when the
value of M is allowed to vary, it becomes more difficult to construct an efficient implementation
of the querying engine, an issue that we intend to explore further in future work. We therefore
conclude that static association is preferred over dynamic association.
6 Conclusions
Document surrogates are an effective technique for improving retrieval effectiveness. For web home
page or site finding tasks, the addition of in-link hypertext — or anchor text — is a well-known
method for improving accuracy. Similarly, using additional information, such as document struc-
ture and URL text, can further improve accuracy for such search tasks. However, to our knowledge,
effective document surrogates for topic finding tasks have not previously been discovered.
We have proposed a new method for supplementing documents based on query association.
Query associations are a link between user queries — the text that users use to describe information
needs — and documents in a collection. As such, query associations are an excellent tool for
describing the content of documents that are highly statistically similar to a query.
In this paper, we have proposed and investigated techniques for associating queries with doc-
uments to provide accurate web search for topic finding. We have shown that query association
supplements improve the accuracy of topic finding searches by 4%–7%. We conclude that query
association is a valuable technique for improving topic finding searches and, moreover, is comple-
mentary rather than competitive with existing anchor text approaches for site finding.
We will further investigate query association surrogates. At present, terms from documents and
query associations are treated equally during query evaluation. We expect that higher weightings
for query association terms and additional experimentation with ranking parameters would further
improve our results in a production environment; for example, static parameters work well but
may need to be chosen based on the collection and queries. A larger study with a recently obtained
query log of over 10 million queries and a larger collection is planned.
We also plan to investigate efficiency issues next. An association-based document replacement
collection is only a small fraction of the size of a full-text collection; indeed, the inverted index
structures are only 3% of the size of those used for full-text retrieval. The size difference permits
very fast searching. However, the drawback is that associations must be created and updated
during the query evaluation process and, for large collections, may need to be maintained in an
on-disk structure. In addition, web pages change [17] and associated queries need to be updated
as these changes occur. We plan to investigate these trade-offs, and determine how associations
can be used as part of a fast and accurate web search strategy.
Acknowledgments
This research was supported by the Australian Research Council.
References
[1] P. Bailey, N. Craswell, and D. Hawking. Engineering a multi-purpose test collection for web retrieval
experiments. Information Processing and Management, 2001. In revision. Available from
www.ted.cmis.csiro.au/dave/cwc.ps.gz.
[2] P. D. Bruza, D. W. Song, and K. F. Wong. Aboutness from a commonsense perspective. Journal of
the American Society for Information Science, 51(12):1090–1105, 2000.
[3] C. Buckley and E. Voorhees. Evaluating evaluation measure stability. In E. Yannakoudakis, N. J.
Belkin, M.-K. Leong, and P. Ingwersen, editors, Proceedings of the ACM SIGIR International
Conference on Research and Development in Information Retrieval, pages 33–40, Athens, Greece, 2000.
[4] A. Chowdhury, M. C. McCabe, D. Grossman, and O. Frieder. Document normalization revisited. In
M. Beaulieu, R. Baeza-Yates, S. H. Myaeng, and K. Järvelin, editors, Proceedings of the ACM SIGIR
International Conference on Research and Development in Information Retrieval, pages 381–382,
Tampere, Finland, 2002.
[5] N. Craswell, D. Hawking, and S. Robertson. Effective site finding using link anchor information.
In D. H. Kraft, W. B. Croft, D. J. Harper, and J. Zobel, editors, Proceedings of the ACM SIGIR
International Conference on Research and Development in Information Retrieval, pages 250–257,
New Orleans, LA, 2001.
[6] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In M. Beaulieu,
R. Baeza-Yates, S. H. Myaeng, and K. Järvelin, editors, Proceedings of the ACM SIGIR International
Conference on Research and Development in Information Retrieval, pages 299–306, Tampere, Finland,
2002.
[7] L. Fitzpatrick and M. Dent. Automatic feedback using past queries: Social searching? In N. J.
Belkin, A. D. Narasimhalu, P. Willett, W. Hersh, F. Can, and E. Voorhees, editors, Proceedings of
the ACM SIGIR International Conference on Research and Development in Information Retrieval,
pages 306–313, Philadelphia, PA, 1997.
[8] G. W. Furnas. Experience with an adaptive indexing scheme. In L. Borman and R. Smith, editors,
Proceedings of the ACM CHI Conference on Human Factors in Computing Systems, pages 131–135,
San Francisco, CA, 1985.
[9] G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L. A. Streeter, and
K. E. Lochbaum. Information retrieval using a singular value decomposition model of latent semantic
structure. In Y. Chiaramella, editor, Proceedings of the ACM SIGIR International Conference on
Research and Development in Information Retrieval, pages 465–480, Grenoble, France, 1988.
[10] G. W. Furnas, L. M. Gomez, T. K. Landauer, and S. T. Dumais. Statistical semantics: How can a
computer use what people name things to guess what things people mean when they name things.
In J. A. Nichols and M. L. Schneider, editors, Proceedings of the First Major Conference on Human
Factors in Computer Systems, pages 251–253, Gaithersburg, MD, 1982.
[11] D. J. Harper, S. Coulthard, and S. Yixing. A language modelling approach to relevance profiling for
document browsing. In W. Hersh and G. Marchionini, editors, Proceedings of the Second ACM/IEEE-CS
Joint Conference on Digital Libraries, pages 76–83, Portland, OR, 2002.
[12] D. Hawking and N. Craswell. Overview of the TREC 2001 web track. In E. M. Voorhees and D. K.
Harman, editors, The Tenth Text REtrieval Conference (TREC 2001), pages 61–67, Gaithersburg,
MD, 2001. National Institute of Standards and Technology Special Publication 500-250.
[13] D. Hawking, N. Craswell, and P. Thistlewaite. Overview of TREC-7 very large collection track. In
E. M. Voorhees and D. K. Harman, editors, The Seventh Text REtrieval Conference (TREC-7), pages
91–104, Gaithersburg, MD, 1998. National Institute of Standards and Technology Special Publication
500-242.
[14] C.-K. Huang, L.-F. Chien, and Y.-J. Oyang. Relevant term suggestion in interactive web search based
on contextual information in query session logs. Journal of the American Society for Information
Science and Technology, 54(7):638–649, 2003.
[15] D. Hull. Using statistical testing in the evaluation of retrieval experiments. In R. Korfhage, E. Ras-
mussen, and P. Willett, editors, Proceedings of the ACM SIGIR International Conference on Research
and Development in Information Retrieval, pages 329–338, Pittsburgh, PA, 1993.
[16] W. J. Hutchins. The concept of ’aboutness’ in subject indexing. In K. Sparck Jones and P. Willett,
editors, Readings in Information Retrieval, pages 93–97. Morgan Kaufmann Publishers Inc., 1997.
[17] W. Koehler. Web page change and persistence - a four-year longitudinal study. Journal of the
American Society for Information Science and Technology, 53(2):162–171, 2002.
[18] A. M. Lam-Adesina and G. J. F. Jones. Applying summarization techniques for term selection in
relevance feedback. In D. H. Kraft, W. B. Croft, D. J. Harper, and J. Zobel, editors, Proceedings of
the ACM SIGIR International Conference on Research and Development in Information Retrieval,
pages 1–9, New Orleans, LA, 2001.
[19] C. Paice. Constructing literature abstracts by computer: techniques and prospects. Information
Processing & Management, 26(1):171–186, 1990.
[20] A. Pirkola and K. Järvelin. Employing the resolution power of search keys. Journal of the American
Society for Information Science and Technology, 52(7):575–583, 2001.
[21] V. V. Raghavan and H. Sever. On the reuse of past optimal queries. In E. A. Fox, P. Ingwersen,
and R. Fidel, editors, Proceedings of the ACM SIGIR International Conference on Research and
Development in Information Retrieval, pages 344–350, Seattle, WA, 1995.
[22] S. E. Robertson and S. Walker. Okapi/Keenbow at TREC-8. In E. M. Voorhees and D. K. Harman,
editors, The Eighth Text REtrieval Conference (TREC-8), pages 151–162, Gaithersburg, MD, 1999.
National Institute of Standards and Technology Special Publication 500-246.
[23] S. E. Robertson, S. Walker, and M. M. Hancock-Beaulieu. Large test collection experiments on an
operational, interactive system: Okapi at TREC. Information Processing and Management,
31(3):345–360, 1995.
[24] J. J. Rocchio. Relevance feedback in information retrieval. In E. Ide and G. Salton, editors, The
Smart Retrieval System — Experiments in Automatic Document Processing, pages 313–323. Prentice
Hall, Englewood Cliffs, New Jersey, 1971.
[25] G. Salton and C. Buckley. Improving retrieval performance by relevance feedback. Journal of the
American Society for Information Science, 41(4):288–297, 1990.
[26] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York,
1983.
[27] F. Scholer and H. E. Williams. Query association for effective retrieval. In C. Nicholas, D. Grossman,
K. Kalpakis, S. Qureshi, H. van Dissel, and L. Seligman, editors, Proceedings of the ACM CIKM
International Conference on Information and Knowledge Management, pages 324–331, McLean, VA,
2002.
[28] D. Sheskin. Handbook of parametric and nonparametric statistical procedures. CRC Press, Boca
Raton, FL, 1997.
[29] A. Singhal and F. Pereira. Document expansion for speech retrieval. In M. Hearst, F. Gey, and
R. Tong, editors, Proceedings of the ACM SIGIR International Conference on Research and
Development in Information Retrieval, pages 34–41, Berkeley, CA, 1999.
[30] K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval:
development and comparative experiments. Part 1. Information Processing and Management,
36(6):779–808, 2000.
[31] K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval:
development and comparative experiments. Part 2. Information Processing and Management,
36(6):809–840, 2000.
[32] A. Spink, B. J. Jansen, D. Wolfram, and T. Saracevic. From e-sex to e-commerce: Web search
changes. IEEE Computer, 35(3):107–109, 2002.
[33] A. Spink, D. Wolfram, B. J. Jansen, and T. Saracevic. Searching the web: the public and their
queries. Journal of the American Society for Information Science and Technology, 52(3):226–234,
2001.
[34] A. Tombros and M. Sanderson. Advantages of query biased summaries in information retrieval. In
W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, editors, Proceedings of the
ACM SIGIR International Conference on Research and Development in Information Retrieval, pages
2–10, Melbourne, Australia, 1998.
[35] A. Veerasamy and R. Heikes. Effectiveness of a graphical display of retrieval results. In N. J. Belkin,
A. D. Narasimhalu, P. Willett, W. Hersh, F. Can, and E. Voorhees, editors, Proceedings of the
ACM SIGIR International Conference on Research and Development in Information Retrieval, pages
236–245, Philadelphia, PA, 1997.
[36] E. M. Voorhees and D. K. Harman. Overview of the Ninth Text REtrieval Conference (TREC-9). In
E. M. Voorhees and D. K. Harman, editors, The Ninth Text REtrieval Conference (TREC-9), pages
1–14, Gaithersburg, MD, 2000. National Institute of Standards and Technology Special Publication
500-249.
[37] E. M. Voorhees and D. K. Harman. Overview of TREC 2001. In E. M. Voorhees and D. K. Harman,
editors, The Tenth Text REtrieval Conference (TREC 2001), pages 1–15, Gaithersburg, MD, 2001.
National Institute of Standards and Technology Special Publication 500-250.
[38] T. Westerveld, W. Kraaij, and D. Hiemstra. Retrieving web pages using content, links, URLs and
anchors. In E. M. Voorhees and D. K. Harman, editors, The Tenth Text REtrieval Conference (TREC
2001), pages 663–672, Gaithersburg, MD, 2001. National Institute of Standards and Technology
Special Publication 500-250.
[39] I. Witten, A. Moffat, and T. Bell. Managing Gigabytes: Compressing and Indexing Documents and
Images. Morgan Kaufmann Publishers, Los Altos, CA, 2nd edition, 1999.
[40] J. Zobel. How reliable are the results of large-scale information retrieval experiments? In W. B. Croft,
A. Moffat, C. J. van Rijsbergen, R. Wilkinson, and J. Zobel, editors, Proceedings of the ACM SIGIR
International Conference on Research and Development in Information Retrieval, pages 307–314,
Melbourne, Australia, 1998.