Query Association for Effective Retrieval
Falk Scholer Hugh E. Williams
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V
Melbourne, Australia, 3001.
{fscholer,hugh}@cs.rmit.edu.au
ABSTRACT
We introduce a novel technique for document summarisa-
tion which we call query association. Query association is
based on the notion that a query that is highly similar to a
document is a good descriptor of that document. For exam-
ple, the user query “richmond football club” is likely to be a
good summary of the content of a document that is ranked
highly in response to the query. We describe this process
of defining, maintaining, and presenting the relationship be-
tween a user query and the documents that are retrieved
in response to that query. We show that associated queries
are an excellent technique for describing a document: for
relevance judgement, associated queries are as effective as a
simple online query-biased summarisation technique. As fu-
ture work, we suggest additional uses for query association
including relevance feedback and query expansion.
Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Information
Search and Retrieval—Search Process; H.3.4 [Information
Storage and Retrieval]: Information Search and Retrieval—
Retrieval Models
General Terms
Design, Experimentation
Keywords
Query Association, Summarisation, Past Queries, Relevance
Assessment, Web Search
1. INTRODUCTION
Web search is a common first step for information discov-
ery in universities, commercial institutions, the home, and
in other facets of daily life. Users typically pose natural
language queries — usually consisting of two or three words
— and expect accurate answers that meet their information
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CIKM’02, November 4–9, 2002, McLean, Virginia, USA.
Copyright 2002 ACM 1-58113-492-4/02/0011 ...$5.00.
needs. Answers are typically a ranked list of summaries and
hypertext links, ordered from most- to least-similar to the
query. When users find a useful answer to a query, they usu-
ally click on the hypertext link associated with the answer
and visit the relevant web resource.
Despite its familiarity, users face a difficult task when us-
ing a web search service. Indeed, 71% of surveyed users
report “that they become frustrated whether they
are successful or not” [7]. They must develop their own
search strategy and, in doing so, learn how to use the in-
terface, decide on query terms, understand the structure of
the vast underlying collection, use advanced search facilities,
and interpret answers. The task can be simplified through
two paradigms: first, information can be presented more ef-
fectively; and, second, strategic help can be provided to the
user during the search process.
When answering queries, most popular web search engines1
provide users with both additional information and
strategic help. Such resources include additional search fea-
tures and query-biased summaries: additional search fea-
tures include relevance feedback mechanisms — usually im-
plemented through the “more like this” paradigm — and
query-biased summaries are short, automatically selected
document fragments that usually contain words from the
query. Recent work — such as query reformulation using
phrases derived from answer documents [7] and thumbnail
sketch summaries [25] — has further shown the advantages
of provision of more information and additional help.
In this paper, we propose a novel method that can be
used to provide additional information and strategic help to
users. We call this two-step technique query association and
it works as follows: first, for each query that is posed to our
search engine, we identify the N most statistically-similar
documents in the collection according to the ranking
formula; and, second, we attempt to associate the query with
each of the top N documents (subject to constraints that
we discuss later). Each document has a maximum number
of associations and, when this limit is reached, we employ
rules so that associations can be dynamically replaced with
others.
Associated queries can then be used to aid the querying
process in several ways:
- As an alternative summarisation technique. As we
  show later, queries that are closely associated with
  a document are excellent content descriptors for that
  document
1 For example, Google at http://www.google.com/
- For query expansion. Associated queries can be used
  as a source of additional query terms, and we discuss
  previous approaches to this in the next section
- For interactive retrieval. Through a Web interface,
  an initial query can be executed and then associated
  queries used to further explore the collection
- To improve retrieval efficiency and effectiveness. As
  we discuss in the next section, answers to past
  high-quality queries can be used as the answers to new
  queries (or to seed the result sets of new queries)
An example of the application of query association to sum-
marisation and interactive retrieval is shown in Figure 1. In
this example, the user has posed the query “telecom aus-
tralia” to our search engine. The figure shows the first
answer returned in response to the query: a query-biased
document summary is shown at the top of the screen, while
five associated queries are shown as hypertext links at the
bottom. The associated queries are used to both assess the
content and relevance of each document, and as an inter-
active retrieval mechanism where a user can click on and
execute an associated query.
We present results of our initial experiments with query
association. We show that query association is an excellent
alternative to a conventional query-biased summarisation
technique: overall, users achieve the same accuracy when
using associated queries to judge the relevance of answers as
when using simple summaries. Indeed, we have found that
query association is more useful for assessing which answers
are irrelevant.
2. BACKGROUND
In this section, we present a short background of related
summarisation, past query, and query refinement techniques.
Query association also has other applications — such as
query expansion and interactive retrieval — but for com-
pactness we omit detailed discussion of other related back-
ground here.
Users express their information needs to a search engine or
text retrieval system in the form of a query. In response, the
system presents a list of matching documents or, more par-
ticularly, abstracts or summaries of those documents. Using
the summaries, the user judges the relevance of each answer
to their information need and, in the case of web search,
usually follows a hypertext link to relevant resources. In
the case where the summaries are uninformative or contain
insufficient information, the user may need to visit the web
resource in order to ascertain its relevance. A good sum-
marisation technique should provide an accurate summary
that allows the user to avoid visiting the related web re-
source, while also minimising the cognitive load in process-
ing the summary.
Text summarisation has traditionally followed one of three
paradigms: sentence extraction [5, 17], passage retrieval [4,
11, 15], and language generation [16]. While the latter two
techniques have shown promise, neither is used in practice
by web search engines. Instead, the focus of web summarisa-
tion has been towards sentence extraction and, in particular,
towards query-biased summaries [1, 21]. Query-biased sum-
maries are those where the extracted document fragments
usually contain one or more query terms.
Tombros and Sanderson [21] proposed and experimentally
evaluated two basic summarisation techniques: first, a static
summarisation technique containing document titles and the
opening sentences of each document; and, second, weighted
fragments that may contain frequently-occurring document
terms, query terms, or are located at the beginning of the
document. They found that query-biased summaries —
where summaries are dynamically generated — permitted
users to more accurately and quickly judge the relevance of
documents than static summaries.
Most popular web search engines exclusively use textual
summaries and, since the work of Tombros and Sanderson,
most are query-biased. An alternative approach is the use of
thumbnails [25] or other graphical methods [14] that graphi-
cally summarise the content of a web page. Thumbnails are
small images that highlight features of a page including im-
ages, title or large heading phrases, and colour. It is argued
that thumbnails provide additional information that is not
readily accessible in a textual summary including the type,
style, and layout of the page. Under selected metrics, a form
of enhanced thumbnail has been shown to offer accurate and
consistent performance as a summarisation technique [25].
Using past queries to improve retrieval performance has
previously been considered elsewhere [9, 10, 18]. The ap-
proach of Fitzpatrick and Dent [9] is perhaps the most sim-
ilar to our query association techniques. In their approach,
the answer documents to past queries are grouped together
to form an affinity pool when the similarity between the past
queries exceeds a threshold. Then, when a new query ex-
ceeds a similarity threshold to an affinity pool, terms from
top-ranking documents in the pool are used to expand the
new query. The aim, therefore, is to improve retrieval per-
formance through query expansion. Results show this ap-
proach is successful: precision was substantially improved
compared to unexpanded queries, and was also better than
expansion using terms from top-ranked documents using the
approach of Buckley et al. [3].
Prior to the work of Fitzpatrick and Dent, Furnas [10] de-
scribed a frequency-based technique to gather information
concerning the association of keywords. In this approach, a
user searches with keywords using an initially untrained sys-
tem and the system accumulates information about keyword
associations. For example, a user might search unsuccess-
fully using the keywords “develop”, “script”, and “instruc-
tions”. Then, the user might find the information they re-
quire using a fourth keyword, “program”. The system then
asks if the user was satisfied with the result of the search us-
ing “program” and, if so, whether the initial three keywords
should be used as synonyms for the term “program”. Agree-
ment results in an initial association between the keywords
and, in later retrieval using the same synonyms,
incrementing of a counter that reinforces the association. Frequent
associations between keywords are then used to provide fu-
ture users with their answers when synonyms are entered.
Our query association approach — which we discuss in the
next section — exploits relationships between documents
and queries, does not require user feedback, and measures
similarity not frequency.
Raghavan and Sever [18] have also considered the use of
past queries in retrieval with two somewhat different aims:
first, they show how the stored results of past optimal queries
can be used as answers to new queries; and, second, they
show how answers to a new query can be seeded using an ini-
tial result vector from a past optimal query. The significant
result that impacts our work is that query-to-query similar-
Figure 1: An example of query association. A user has posed the query “telecom australia” and the top
ranked answers have been displayed. Each answer includes a query-biased summary and the top five associated
queries. Associated queries provide both an alternative summary and an interactive retrieval mechanism.
ity cannot be successfully measured using conventional sim-
ilarity measures, but rather that query-to-document mea-
sures can be used to determine regions of similarity between
queries. We apply this idea in our query association tech-
niques.
Query clustering is another area with similarity to our
work. Wen, Nie and Zhang [22] describe a technique where
the selection of answer documents by users is used to provide
evidence of similarity between queries, and these relation-
ships are exploited to assist in the identification of frequently
asked questions or FAQs. In agreement with Raghavan and
Sever [18], they show that using documents to identify the
relationship between queries is more effective than identi-
fying the similarity between queries alone. Our approach
differs from query clustering in that our aim is not to find
similarity between queries; rather, we exploit the statistical
correspondence between queries and documents to provide
summaries, as we describe later.
The Hyperindex Browser (HiB) [6, 7, 8] provides a novel
method for query refinement that has synergies with sum-
marisation. After executing an initial search using an aux-
iliary Web-based search engine, phrases are extracted from
answer documents using a shallow natural language pars-
ing technique and shown to the user. For example, if a user
poses the initial query “footy”, the HiB mechanism may sug-
gest alternative queries such as “australian rules football” or
“australian football league”. The advantage of this approach
is that the user can reformulate their initial query or explore
the collection through a query-by-navigation paradigm [2]
that uses document phrases. The disadvantage is that the
phrases are document fragments that may not be successful
as queries; query association, which we propose in the next
section, offers the query refinement capability of HiB, while
also being oriented around user queries.
3. QUERY ASSOCIATION
In this section, we describe our novel query association
technique. Query association works as follows: for each
query that is posed to our search engine2 we associate the
N top ranked documents with that query, subject to the
following conditions:
1. Each document has a maximum of M associated queries
(this limit is imposed for efficiency in our
implementation)
2. Queries that are identical to an already associated
query are ignored
3. When M queries have been associated with a document,
the similarity scores of each associated query
are recorded. If a new query exceeds the similarity
score of the least-similar associated query, that query
is disassociated and replaced by the new query
The process of association is attempted for each query that
is posed to the system.
The similarity scores of associated queries are normalised
by query length. Document length normalisation is com-
monly used in ranking functions to negate the effect of longer
documents — which contain more terms — appearing to be
more similar to a query than shorter ones [24]. We divide
the query association similarity score by log(1 + |Q|), where
|Q| is the number of terms in the query.
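The association rules and the length normalisation above can be sketched as follows. This is a minimal illustrative sketch, not our actual implementation: the in-memory dictionary store, the whitespace tokenisation, and the raw ranking scores passed in are all stand-ins for the corresponding parts of a real search engine.

```python
import math
from collections import defaultdict

N = 3  # documents considered for association per query
M = 5  # maximum associated queries per document

# associations[doc_id] -> list of (normalised_score, query) pairs
associations = defaultdict(list)

def normalised_score(raw_score, query_terms):
    """Divide the ranking score by log(1 + |Q|) so that longer
    queries do not appear more similar simply by containing
    more terms."""
    return raw_score / math.log(1 + len(query_terms))

def associate(query, ranked_answers):
    """Attempt to associate `query` with its top-N ranked answers.

    `ranked_answers` is a list of (doc_id, raw_score) pairs in
    decreasing score order, as produced by the ranking function.
    """
    terms = query.split()
    for doc_id, raw_score in ranked_answers[:N]:
        slots = associations[doc_id]
        # Rule 2: ignore queries identical to an existing association.
        if any(q == query for _, q in slots):
            continue
        score = normalised_score(raw_score, terms)
        if len(slots) < M:
            slots.append((score, query))
        else:
            # Rule 3: replace the least-similar association when the
            # new query's normalised score exceeds it.
            weakest = min(range(M), key=lambda i: slots[i][0])
            if score > slots[weakest][0]:
                slots[weakest] = (score, query)
```

Association is attempted for every query processed, so the stored associations drift towards the highest-scoring queries seen for each document.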
Consider an example of query association where a user
poses the query “richmond football club”. In this example,
2 Our search engine — which we call SEG — is a public
domain testbed for information retrieval research. It
will soon be available under the GNU public license from
http://www.seg.rmit.edu.au/
more than one hundred thousand documents in a fictional
collection have some statistical similarity to the query. With
an arbitrary value of N= 10, we identify the top ten doc-
uments that match the query and attempt to associate the
query with each of the documents; for simplicity here, we
assume that each attempted association is successful. Con-
sider now a second query, “richardson richmond football”,
that also identifies several hundred thousand answers, but
includes three documents in the top ten that are the same
documents as identified by the first query. In the case of
these three documents, both queries are now associated.
To continue our example, consider a third query, “foot-
ball richmond” that a user poses to express their informa-
tion need for documents that discuss the misfortunes of the
Richmond Australian rules football club. In response to the
query, the system identifies as the top ranked answer one
of the documents with the two prior associations “richmond
football club” and “richardson richmond football”. The user
is shown both prior associations, and can use these asso-
ciations to assess the relevance of the document. In this
example, the user decides the document is relevant based
on the associations, and clicks on a hypertext link to visit
the original web resource. As part of this process, the user
query is again used to update the associations of the top
ten ranked documents. Indeed, with our association man-
agement techniques, associations are continually updated,
added, and deleted as queries are processed.
In our experiments that we describe in the next section,
around one million queries were used to develop query asso-
ciations. As we show, in many cases the associations form
good summaries of each answer document. For example, in
response to the query “chevrolet trucks”, a highly ranked
answer document in the collection has the query associa-
tions “buying used chevy trucks”, “antique chevy trucks”,
“dodge vs chevy”, “mazda trucks truck”, and “chevrolet
trucks truck chevy”.
We have experimented with different values of M and N,
and different methods of restricting query associations. We
found in our initial user experiments — which, for
compactness, we do not discuss in detail here — that N = 3
works well when sufficient queries are available to develop
initial associations; we also experimented with values of
N = 2, N = 5, and N = 10. Low values of N restrict
associations to only those documents that are highly similar
to a query, while higher values of N permit more associations.
We also found that a value of M = 5 works well in practice,
and offers a reasonable trade-off between permitting sufficient
numbers of associations and a compact implementation; as
we discuss later, we plan to experiment further with different
methods for choosing M.
We also experimented with different methods for restrict-
ing query associations. Our initial experiment showed that
association based solely on our ranking measure — which is
a variant of that used in the Okapi basic search system [19]
— resulted in errors in relevance assessment that were due
to the presence of some terms in highly-weighted associated
queries that were not present in the corresponding docu-
ment. For example, the query “chevrolet trucks” may be
associated with a highly ranked document that contains the
term “chevrolet” but not the term “trucks”; this may con-
tribute to an incorrect relevance judgement if the user is
presented with the association and is asked to assess if the
document discusses Chevrolet trucks. In the experiments we
describe next, we address this problem by only permitting
associations where all terms in the query occur in the doc-
ument. We call this ranked and Boolean (RnB) association.
4. AN EXPERIMENTAL COMPARISON OF
SUMMARY TECHNIQUES
We have proposed query association as a method for im-
proving the effectiveness of search engines. In this section,
we consider how query association can be used as an alter-
native and supplementary summarisation technique and, in
particular, we consider its effect on the accuracy of relevance
judgements. In our initial work, we have not considered the
use of query association in interactive retrieval, query ex-
pansion, and other tasks.
Participants
Data was collected from thirty students and staff of RMIT
University. All participants were undergraduate computer
science students, graduate researchers, or staff. Twenty-two
people participated in our initial experiment to determine
the relative merits of our basic association technique; the re-
maining eight participated in the experiment to evaluate the
refined RnB version of association that is described in this
section. All participants were daily users of web browsers
and experienced in using the web as an information discov-
ery tool. While our participants are not typical web users,
our experimental setup is used to compare performance be-
tween different techniques, not to derive an absolute measure
of performance. We plan future experiments to examine the
use of query association by novice users.
Design
We used the TREC ten gigabyte Web track collection for
our experiments [13]. This collection contains around 1.69
million web documents from a 1997 web crawl and has been
used in the ongoing TREC collaborative evaluation of in-
formation retrieval techniques [12]. We chose this data for
several reasons: first, it is a realistic data set that has suc-
cessfully been used in retrieval experiments; second, queries
and relevance judgements are available as we discuss below;
and, last, our query association training data that we discuss
next was collected in 1997 and 1999.
We created initial query associations by attempting to
associate 917,457 unique queries drawn from two query logs
of the Excite search engine [20]. The first log was from a
single day in 1997 and contained around 1 million queries, while
the second log, drawn from a single day in 1999, contained 1.7
million queries. Both logs were pre-processed using the same
filtering script used to remove offensive documents from the
TREC web collections and duplicate queries were removed.
Figure 2 shows how the number of query associations per
document changes as more queries from the Excite logs are
processed; as discussed in the previous section, we use
values of a maximum of N = 3 associations per query and
a maximum of M = 5 associations per document. The
number of documents that have at least one association grows
rapidly when processing of the query log starts. However, as
processing continues, the rate at which new documents are
retrieved diminishes, and appears to approach a limit; this
is a similar trend to that of the occurrence of new words in
web documents [23]. When processing has completed, the
number of documents with no associations is 1.22 million,
[Plot: documents (’000,000) against number of queries processed (’000); curves for zero associations, 1-5 associations, and 5 associations]
Figure 2: The change in the number of associations per document after processing queries from the Excite
query log. In all, after the training phase, around one third of the documents in the collection have at least
one association.
and the number with at least one is nearly 470,000. Overall,
this suggests that only around one third of the documents
are ranked highly in response to the queries in the Excite
log. However, given that the queries are from both 1997 and
1999, and the data is from 1997, the number of documents
with associations in a production system may be somewhat
higher. We plan to investigate these phenomena further in
future work.
After training the system, we asked participants to judge
the relevance of answers to selected queries from the TREC-
9 web track topics 451–500; we discuss this experimental
procedure in the next section. An example web track query
topic is shown in Figure 3. The TREC web track queries
have the form of a <title>, which is posed as a query to our
search engine, a <desc> or description of the information
need, and a <narr> or narrative that describes how to assess
the relevance of each document; we use the titles as queries
since they were originally drawn from queries posed to the Excite
search engine. In all, we used 21 of the 50 queries available
through the web track; we chose the queries through our
initial query association experiment by selecting the first 21
queries that returned at least ten answers, and had at least
two query associations for each of the ten answers. Rele-
vance judgements have been performed for each query, that
is, the set of relevant documents for each information need
is known and provided as part of the TREC experiments.
We use query-biased summaries as a benchmark for evalu-
ating the effectiveness of query association as a summarisa-
tion technique. We generate summaries as follows: we parse
each answer document and, when a query word is encoun-
tered, we extract a sentence fragment of up to nine words
centred on the query word and add that to the summary for
that document. A maximum of five such sentence fragments
are obtained — a total of at most 45 words from each doc-
ument — after which the complete summary is displayed to
the user. While this summarisation technique is relatively
simple, it addresses two important criteria commonly iden-
tified in the summarisation literature [21]: first, sentence
fragments are centred around query terms, so that relevant
sections of a document are identified; and, second, sentences
near the beginning of a document are considered more im-
portant since the first-occurring query terms are extracted.
Further improvements to our query-biased summaries are
possible, however our current implementation is likely to be
more effective than simple static summaries that include the
title and first few sentences of each document [21].
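This summariser can be sketched as follows. The sketch uses a word-window approximation: our actual implementation extracts sentence fragments, and the tokenisation and punctuation handling here are illustrative simplifications.

```python
def query_biased_summary(document_text, query, window=9, max_fragments=5):
    """Build a simple query-biased summary: scan the document in
    order and, at each occurrence of a query term, extract a
    fragment of up to `window` words centred on that term,
    stopping after `max_fragments` fragments (at most 45 words
    with the settings used in our experiments)."""
    words = document_text.split()
    query_terms = set(query.lower().split())
    fragments = []
    half = window // 2
    i = 0
    while i < len(words) and len(fragments) < max_fragments:
        if words[i].lower().strip('.,') in query_terms:
            start = max(0, i - half)
            fragments.append(" ".join(words[start:i + half + 1]))
            i += half + 1  # continue scanning past this fragment
        else:
            i += 1
    return " ... ".join(fragments)
```

Because the document is scanned in order, fragments near the beginning of the document are extracted first, which matches the second criterion above.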
Procedure
In our experiment, we aimed to test three hypotheses:
1. Query associations allow users to better judge the rel-
evance or irrelevance of a document than query-biased
summaries
2. A combination of query associations and query-biased
summaries allows users to better judge relevance or
irrelevance of a document than query-biased summaries
3. A combination of query associations and query-biased
summaries allows users to better judge relevance or
irrelevance of a document than query associations
We are therefore concerned with whether users are better
able to judge relevance using different types of summaries,
and not with the absolute performance of the underlying
search techniques.
To investigate our hypotheses, we provided each of the
participants with a folder containing twenty-one queries,
and twenty-one corresponding answer sets, each showing
the summaries of the top ten answer documents for that
query. Summaries were one of three types: query associa-
tions, query-biased summaries, and the composite of both
query associations and query-biased summaries. The fold-
ers were divided so that each contained seven answer sets
based on each of these three summary types. Thus each user
made relevance assessments for seven queries using each of
<num>Number: 453
<title>hunger
<desc>Description: Find documents that discuss organizations/groups that are aiding in
the eradication of the worldwide hunger problem.
<narr>Narrative: Relevant documents contain the name of any organization or group that
is attempting to relieve the hunger problem in the world. Documents that address the
problem only without providing names of organizations/groups that are working on hunger
are irrelevant.
Figure 3: An example TREC query used in our experiments.
                         Total                Relevant             Irrelevant
                         Correct  Incorrect   Correct  Incorrect   Correct  Incorrect
Query Associations       75%      25%         42%      58%         83%      17%
Query-Biased Summaries   76%      24%         60%      40%         79%      21%
Composite (both)         75%      25%         62%      38%         79%      21%
Table 1: Results of user experiment comparing summarisation techniques. Users were asked to assess the
likely relevance of answer documents to a specified information need using three types of summary informa-
tion: query associations, query-biased summaries or a combination of both.
the summary types, while across all folders, each query was
answered with each summary type.
Participants were asked to read the set of twenty-one
TREC queries, each specifying a particular information need.
For each query, they were then asked to consider the cor-
responding result set. Then, for each of the summarised
answers in the set, subjects were asked to judge whether the
document was likely to be relevant or irrelevant to the given
information need. User responses were collected through a
web interface. We then collated the results and compared
them to the TREC relevance judgements.
5. RESULTS
The results of our user experiment are shown in Table 1.
For each of the three types of summary information — query
associations, query-biased summaries, and a composite of
both — the proportion of users who were able to correctly
identify a document as being relevant or irrelevant to the
information need is shown.
The first column in Table 1 shows the overall performance
of users, averaged over their ability to correctly identify the
relevance or irrelevance of a document. For example, us-
ing query associations, users were able to correctly identify
a document as being relevant or irrelevant to an informa-
tion need at an average of 7.5 out of every 10 answers they
inspected. Overall, the three summarisation techniques we
tested are comparable. Indeed, statistical analysis [26] using
a signed t-test shows that there is no difference in perfor-
mance at the 0.05 or 0.1 significance levels; the Wilcoxon
signed rank test for paired difference experiments confirms
this result. The critical value for the signed t-test is 1.73
for the 0.1 significance level, and t = 0.00 for the
associations to summaries comparison, t = 0.72 for associations
to composite, and t = 0.82 for summaries to composite.
The second and third columns of Table 1 divide the to-
tal results into two categories: relevance and irrelevance as-
sessments. When judging summaries of documents that are
known to be relevant, users performed worst when presented
with query associations, and performed best when provided
with composite summary information. Both the t-test and
the Wilcoxon signed rank test show a difference in perfor-
mance significant at the 0.1 level between these two tech-
niques; the difference in using only query associations and
only query-biased summaries is not statistically significant.
With a 0.1 significance level, the critical value is t = 1.78
for a t-test, and the three values were t = 1.36 for
associations to summaries, t = 1.93 for associations to
composite, and t = 0.51 for summaries to composite. It is possible that
the improved performance with the composite summaries
may be because there is more data from which to make a
judgement, however we have not investigated this in detail.
When judging summaries of documents that are known to
be irrelevant to an information need, query associations are
more effective than the other two summary types in assisting
users. However, the differences in performance between the
three summary types are not statistically significant at the
0.1 level when tested with the t-test or the Wilcoxon signed
rank test. With a 0.1 significance level, the critical value
is t=1.73 for a t-test, and the three values were t=0.96
for associations to summaries, t=0.34 for associations to
composite, and t=0.80 for summaries to composite.
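The statistic underlying these comparisons is a paired t over per-query accuracy differences, with the critical values taken from standard t tables at n - 1 degrees of freedom. It can be computed with a short sketch; the accuracy figures below are purely illustrative and are not the paper's per-query data.

```python
import math

def paired_t(xs, ys):
    """Paired (signed) t-test statistic over matched per-query
    accuracies: the mean of the per-query differences divided by
    the standard error of those differences."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Illustrative per-query accuracies for two summary types.
assoc_acc = [0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9]
qbs_acc = [0.7, 0.7, 0.8, 0.7, 0.8, 0.6, 0.9]
t = paired_t(assoc_acc, qbs_acc)
# Compare |t| against the critical value for n - 1 degrees of freedom.
```

A nonparametric check such as the Wilcoxon signed rank test would be applied to the same per-query differences.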
We investigated the lack of statistically significant differ-
ence in performance between query association and query-
biased summaries. On a query-by-query basis, there were
several cases where query associations do poorly, while query-
biased summaries do well, and several cases where the op-
posite is true. For example, for a query to find informa-
tion on the topic of “antique appliance restoration”, associ-
ated queries were able to provide a broader overview of the
content of some documents than was obtained from query-
biased summaries, making it easier to find those answers
that were relevant to the information need. Conversely,
when asked to judge which documents satisfy an informa-
tion need regarding when Jackie Robinson appeared in his
first game, there was an average of only 2.5 associations
available per answer, and these appeared to be of low quality,
while query-biased summaries always consist of five sentence
fragments. We hope to obtain more queries for training to
increase the average number of associations per document
and the quality of associations overall.
For many queries, query associations and query-biased
summaries appear to be complementary: when one tech-
nique does well, the other does poorly. It might therefore
be expected that composite summary information would be
uniformly superior, but this is not the case: when either
of the individual summarisation techniques does badly, it
appears to provide misleading cues.
There were also several cases — for example, the query
concerning “chevrolet trucks” — where both query associ-
ations and query-biased summaries seemed to suggest that
a document was relevant, while the TREC relevance judge-
ments did not. On further examination, these documents
were often web pages containing long lists of hypertext links
that were generally not considered to be relevant answers
by the TREC assessors. While it might be argued that
link pages could still be a valuable answer, the fact that
all three summarisation techniques were judged on an equal
basis means that this should not distort the comparative
results.
6. CONCLUSIONS
We have proposed a novel technique that we call query
association, where queries that are highly similar to a docu-
ment are stored as descriptors for that document. We have
shown how relationships between documents and associated
queries can be created and maintained, and described sev-
eral uses of this technique including providing a summary
of document content, query expansion, and interactive re-
trieval for intuitive exploration of a collection.
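The association process can be sketched as follows. This is a minimal illustration, assuming a toy term-overlap score in place of the actual ranking function, and illustrative values for the two tunable parameters discussed later (candidate documents per query and maximum associations per document).

```python
from collections import defaultdict

TOP_DOCS = 3          # candidate documents per query (illustrative value)
MAX_ASSOCIATIONS = 5  # maximum associations kept per document (illustrative)

def score(query, doc):
    """Toy similarity: fraction of query terms occurring in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def associate(queries, docs):
    """Attach each query to its top-ranked documents; each document
    retains only its highest-scoring associated queries."""
    assoc = defaultdict(list)
    for query in queries:
        ranked = sorted(docs, key=lambda doc_id: score(query, docs[doc_id]),
                        reverse=True)
        for doc_id in ranked[:TOP_DOCS]:
            s = score(query, docs[doc_id])
            if s == 0.0:
                continue  # the query did not match this document at all
            assoc[doc_id].append((s, query))
            # keep only the best-matching queries for this document
            assoc[doc_id] = sorted(assoc[doc_id], reverse=True)[:MAX_ASSOCIATIONS]
    return assoc

docs = {"d1": "richmond football club season news",
        "d2": "antique appliance restoration guide"}
assoc = associate(["richmond football club", "appliance restoration"], docs)
print(assoc["d1"])  # the query associated with d1, with its score
```

Because each document keeps a bounded list of its best-scoring queries, the store can be maintained incrementally as new queries arrive, with a low-scoring association displaced when the cap is reached.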
Through a user experiment we have demonstrated that
query associations are as effective a summarisation tech-
nique as simple query-biased summaries. We have also ob-
served that associated queries have advantages over sum-
maries: they are usually shorter than summary sentences,
and often completely embody a specific concept or informa-
tion need. Indeed, summary sentences can be more difficult
to interpret outside the context of the document in which
they occur, and are therefore likely to place a higher cogni-
tive burden on the user. We plan to quantify this difference
by examining the time taken for user relevance assessments.
An analysis of those queries for which query associations
were effective or ineffective has identified avenues for future
work. For example, one method to improve performance
would be to disambiguate associations: for some queries,
we found that a term that was highly weighted by the
ranking function could dominate all associations and
prevent broad content coverage. We
also intend to experiment further with the number of docu-
ments that become candidates for association to each query
and the maximum number of associations per document.
Associations also have potential system efficiency benefits:
query-biased summaries require document parsing, which is
a known performance bottleneck for some search engines,
whereas query associations avoid this cost at the price of
additional storage space. Another application that we will investigate
fully is the use of query associations as a basis for query
expansion; preliminary experiments have shown promising
results.
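As a hint of how such expansion might work (the preliminary experiments are not described here), one hypothetical scheme is to augment a query with the most frequent novel terms drawn from the queries associated with its top-ranked documents. The function and data below are illustrative assumptions, not the method actually evaluated.

```python
from collections import Counter

def expand_query(query, top_doc_ids, assoc, max_terms=3):
    """Hypothetical expansion: add the most frequent novel terms from
    queries associated with the query's top-ranked documents."""
    original = set(query.lower().split())
    counts = Counter()
    for doc_id in top_doc_ids:
        for _score, past_query in assoc.get(doc_id, []):
            counts.update(t for t in past_query.lower().split()
                          if t not in original)
    extra = [term for term, _ in counts.most_common(max_terms)]
    return query + " " + " ".join(extra) if extra else query

# Illustrative association store: (score, query) pairs per document.
assoc = {"d1": [(1.0, "richmond football club"),
                (0.8, "richmond tigers afl")]}
print(expand_query("richmond football", ["d1"], assoc))
```

The appeal of this approach is that the expansion terms come from real user vocabulary rather than document text, which may better match the wording of future queries.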
Acknowledgements
We are grateful to the anonymous referees for their com-
ments.
7. REFERENCES
[1] N. Alexander, C. Brown, J. Jose, I. Ruthven, and
A. Tombros. Question answering, relevance feedback
and summarisation: TREC-9 interactive track report.
In E. Voorhees and D. Harman, editors, Proc. Text
Retrieval Conference (TREC), pages 523–550,
Washington, 2000. National Institute of Standards
and Technology Special Publication 500-249.
[2] P. Bruza and T.H. van der Weide. Stratified
information disclosure. Computer Journal,
35(3):208–220, 1992.
[3] C. Buckley, G. Salton, and J. Allan. The effect of
adding relevance information in a relevance feedback
environment. In W.B. Croft and C.J. van Rijsbergen,
editors, Proc. ACM-SIGIR International Conference
on Research and Development in Information
Retrieval, pages 292–300, Dublin, Ireland, 1994.
[4] J.P. Callan. Passage-level evidence in document
retrieval. In W.B. Croft and C.J. van Rijsbergen,
editors, Proc. ACM-SIGIR International Conference
on Research and Development in Information
Retrieval, pages 302–309, Dublin, Ireland, 1994.
[5] W.T. Chuang and J. Yang. Extracting sentence
segments for text summarization: A machine learning
approach. In N. Belkin, P. Ingwersen, and M-K.
Leong, editors, Proc. ACM-SIGIR International
Conference on Research and Development in
Information Retrieval, pages 152–159, Athens, Greece,
2000.
[6] S. Dennis and P. Bruza. Query re-formulation on the
Internet: Empirical data and the hyperindex search
engine. In Recherche D’Information Assistee par
Ordinateur sur Internet, pages 488–499, Montreal,
Quebec, 1997.
[7] S. Dennis, P. Bruza, and R. McArthur. Web searching:
A process-oriented experimental study of three
interactive search paradigms. Journal of the American
Society for Information Science and Technology,
52(2):120–133, 2002.
[8] S. Dennis, R. McArthur, and P. Bruza. Searching the
world wide web made easy? the cognitive load
imposed by query refinement mechanisms. In J. Kay
and M. Milosavljevic, editors, Proc. Australian
Document Computing Conference, pages 65–71,
Sydney, Australia, 1998. University of Sydney.
[9] L. Fitzpatrick and M. Dent. Automatic feedback using
past queries: Social searching? In N.J. Belkin, A.D.
Narasimhalu, and P. Willett, editors, Proc.
ACM-SIGIR International Conference on Research
and Development in Information Retrieval, pages
306–313, Philadelphia, PA, 1997.
[10] G.W. Furnas. Experience with an adaptive indexing
scheme. In L. Borman and W. Curtis, editors, ACM
Conference on Human Factors in Computing Systems,
pages 131–135, San Francisco, CA, 1985.
[11] H. Hardy, N. Shimizu, T. Strzalkowski, L. Ting, G.B.
Wise, and X. Zhang. Cross-document summarization
by concept classification. In M. Beaulieu,
R. Baeza-Yates, S.H. Myaeng, and K. Järvelin,
editors, Proc. ACM-SIGIR International Conference
on Research and Development in Information
Retrieval, pages 121–128, Tampere, Finland, 2002.
[12] D. Harman. Overview of the second text retrieval
conference (TREC-2). Information Processing &
Management, 31(3):271–289, 1995.
[13] D. Hawking, N. Creswell, and P. Thistlewaite.
Overview of TREC-7 very large collection track. In
E. Voorhees and D.K. Harman, editors, Proc. Text
Retrieval Conference (TREC), pages 91–104,
Washington, 1999. National Institute of Standards
and Technology Special Publication 500-242.
[14] M. Hearst. TileBars: Visualization of term
distribution information in full text information
access. In I.R. Katz, R.L. Mack, L. Marks, M.B.
Rosson, and J. Nielsen, editors, ACM Conference on
Human Factors in Computing Systems, pages 59–66,
Denver, CO, 1995.
[15] M. Kaszkiel and J. Zobel. Effective ranking with
arbitrary passages. Journal of the American Society
for Information Science and Technology,
54(4):344–364, 2001.
[16] K. McKeown and D.R. Radev. Generating summaries
of multiple news articles. In E.A. Fox, P. Ingwersen,
and R. Fidel, editors, Proc. ACM-SIGIR International
Conference on Research and Development in
Information Retrieval, pages 74–82, Seattle,
Washington, July 1995.
[17] C.D. Paice. Constructing literature abstracts by
computer: techniques and prospects. Information
Processing & Management, 26(1):171–186, 1990.
[18] V.V. Raghavan and H. Sever. On the reuse of past
optimal queries. In E.A. Fox, P. Ingwersen, and
R. Fidel, editors, Proc. ACM-SIGIR International
Conference on Research and Development in
Information Retrieval, pages 344–350, Seattle, WA,
1995.
[19] S.E. Robertson and S. Walker. Okapi/keenbow at
TREC-8. In E.M. Voorhees and D. Harman, editors,
Proc. Text Retrieval Conference (TREC), pages
151–162, Washington, 1999. National Institute of
Standards and Technology.
[20] A. Spink, D. Wolfram, B. J. Jansen, and T. Saracevic.
Searching the web: The public and their queries.
Journal of the American Society for Information
Science, 52(3):226–234, 2001.
[21] A. Tombros and M. Sanderson. Advantages of query
biased summaries in information retrieval. In
R. Wilkinson, B. Croft, K. van Rijsbergen, A. Moffat,
and J. Zobel, editors, Proc. ACM-SIGIR International
Conference on Research and Development in
Information Retrieval, pages 2–10, Melbourne,
Australia, July 1998.
[22] J-R. Wen, J-Y. Nie, and H-J. Zhang. Query clustering
using user logs. ACM Transactions on Information
Systems, 20(1):59–81, 2002.
[23] H.E. Williams and J. Zobel. Searchable words on the
web. International Journal of Digital Libraries. To
appear.
[24] I.H. Witten, A. Moffat, and T.C. Bell. Managing
Gigabytes: Compressing and Indexing Documents and
Images. Morgan Kaufmann Publishers, Los Altos, CA
94022, USA, second edition, 1999.
[25] A. Woodruff, R. Rosenholtz, J.B. Morrison,
A. Faulring, and P. Pirolli. A comparison of the use of
text summaries, plain thumbnails, and enhanced
thumbnails for web search tasks. Journal of the
American Society for Information Science and
Technology, 52(2):172–185, 2002.
[26] J. Zobel. How reliable are the results of large-scale
information retrieval experiments? In R. Wilkinson,
B. Croft, K. van Rijsbergen, A. Moffat, and J. Zobel,
editors, Proc. ACM-SIGIR International Conference
on Research and Development in Information
Retrieval, pages 307–314, Melbourne, Australia, July
1998.
... In Section 5, we further reduce the cost of query-based fusion and introduce a second new approach, this one based on re-ranking. The key idea is to amortize the cost of fusion and use a pre-processing phase that computes query centroids, as illustrated in Figure 1, where a group of related queries are found using manual or automatic methods [11], and then their runs fused together into a single list [66,77,87]. Those centroid rankings are then held in a searchable cache. ...
... Relevance feedback approaches use documents as a source to add terms to the original query. As an alternative, in experiments using the TREC WT10g collection, Billerbeck et al. [14] show that forming query associations using the method described by Scholer and Williams [77] is more effective than using documents. Scholer and Williams [77] had found that a group of related queries can be formed by finding the set of top scoring queries for a document. ...
... As an alternative, in experiments using the TREC WT10g collection, Billerbeck et al. [14] show that forming query associations using the method described by Scholer and Williams [77] is more effective than using documents. Scholer and Williams [77] had found that a group of related queries can be formed by finding the set of top scoring queries for a document. This type of association has parallels in the initial stage of pseudo-relevance feedback and assumes that the high-scoring queries for a document are related and hence can act as a surrogate for forming query associations when click-logs are not available. ...
Article
Rank fusion is a powerful technique that allows multiple sources of information to be combined into a single result set. Query variations covering the same information need represent one way in which different sources of information might arise. However, when implemented in the obvious manner, fusion over query variations is not cost-effective, at odds with the usual web-search requirement for strict per-query efficiency guarantees. In this work, we propose a novel solution to query fusion by splitting the computation into two parts: one phase that is carried out offline, to generate pre-computed centroid answers for queries addressing broadly similar information needs, and then a second online phase that uses the corresponding topic centroid to compute a result page for each query. To achieve this, we make use of score-based fusion algorithms whose costs can be amortized via the pre-processing step and that can then be efficiently combined during subsequent per-query re-ranking operations. Experimental results using the ClueWeb12B collection and the UQV100 query variations demonstrate that centroid-based approaches allow improved retrieval effectiveness at little or no loss in query throughput or latency and within reasonable pre-processing requirements. We additionally show that queries that do not match any of the pre-computed clusters can be accurately identified and efficiently processed in our proposed ranking pipeline.
... The idea to use queries for social indexing is a natural one, since queries are, by definition, an expression of a user's information needs, albeit a partial one that uses a particular vocabulary. It is therefore an obvious step to enhance a document representation using query terms [170,114,171,209,14]; or the terms could be used to create an alternative query-based representation [159]. ...
... The main problem with this idea is that it is not always straightforward to create a reliable association between documents and queries by using query logs alone. First-generation research on query-document association focused on two natural ideas: associate a query with each of the top-N retrieved documents [163,170], or with each of the selected documents on a SERP [114,197,209,159,82,151]. Both approaches were found to be satisfactory for such tasks as document labeling or clustering, but were not able to match the quality of anchor-based document expansion. ...
... The idea of using social data to generate page snippets was first suggested in [170]. The authors proposed to use past queries as a component of page snippets by arguing that they offered the best characterization of a document from an end-user perspective. ...
Book
Full-text available
Springer International Publishing AG, part of Springer Nature 2018. Today, most people find what they are looking for online by using search engines such as Google, Bing, or Baidu. Modern web search engines have evolved from their roots in information retrieval to developing new ways to cope with the unique nature of web search. In this chapter, we review recent research that aims to make search a more social activity by combining readily available social signals with various strategies for using these signals to influence or adapt more conventional search results. The chapter begins by framing the social search landscape in terms of the sources of data available and the ways in which this can be leveraged before, during, and after search. This includes a number of detailed case studies that serve to mark important milestones in the evolution of social search research and practice.
... The idea to use queries for social indexing is a natural one, since queries are, by definition, an expression of a user's information needs, albeit a partial one that uses a particular vocabulary. It is therefore an obvious step to enhance a document representation using query terms [170,114,171,209,14]; or the terms could be used to create an alternative query-based representation [159]. ...
... The main problem with this idea is that it is not always straightforward to create a reliable association between documents and queries by using query logs alone. First-generation research on query-document association focused on two natural ideas: associate a query with each of the top-N retrieved documents [163,170], or with each of the selected documents on a SERP [114,197,209,159,82,151]. Both approaches were found to be satisfactory for such tasks as document labeling or clustering, but were not able to match the quality of anchor-based document expansion. ...
... The idea of using social data to generate page snippets was first suggested in [170]. The authors proposed to use past queries as a component of page snippets by arguing that they offered the best characterization of a document from an end-user perspective. ...
Chapter
Full-text available
Today, most people find what they are looking for online by using search engines such as Google, Bing, or Baidu. Modern web search engines have evolved from their roots in information retrieval to developing new ways to cope with the unique nature of web search. In this chapter, we review recent research that aims to make search a more social activity by combining readily available social signals with various strategies for using these signals to influence or adapt more conventional search results. The chapter begins by framing the social search landscape in terms of the sources of data available and the ways in which this can be leveraged before, during, and after search. This includes a number of detailed case studies that serve to mark important milestones in the evolution of social search research and practice.
... Durante los últimos años se han propuesto muchas técnicas para generar consultas desde el contexto del usuario [BHB01,KCMK06]. Otros métodos realizan el proceso de expansión y refinamiento de consultas con la explícita intervención del usuario [SW02,BSWZ03]. Sin embargo poco se ha hecho en el campo de los métodos semisupervisados que saquen ventaja simultáneamente del contexto del usuario y de los resultados que obtienen del proceso de búsqueda. ...
... Además, estas técnicas no distinguen las nociones de descriptores y discriminadores de tópicos. Las técnicas para la elección de los términos de las consultas propuestas en este trabajo están inspiradas y motivadas sobre la misma base de otros métodos de expansión y refinamiento de consultas[SW02,BSWZ03]. Sin embargo, los sistemas que aplican estos métodos se diferencian de la plataforma propuesta en que el proceso se realiza a través de consultar o navegar en interfaces que necesitan la intervención explícita del usuario, en lugar de formular consultas automáticamente.En los sistemas de recuperación proactivos, el uso del contexto juega un rol vital a la hora de seleccionar y filtrar información. ...
Preprint
Full-text available
The Web has become a potentially infinite information resource, turning into an essential tool for many daily activities. This resulted in an increase in the amount of information available in users' contexts that is not taken into account by current information retrieval systems. This thesis proposes a semisupervised information retrieval technique that helps users to recover context relevant information. The objective of the proposed technique is to reduce the vocabulary gap existing between the knowledge a user has about a specific topic and the relevant documents available in the Web. This thesis presents a method for learning novel terms associated with a thematic context. This is achieved by identifying those terms that are good descriptors and good discriminators of the user's current thematic context. In order to evaluate the proposed method, a theoretical framework for the evaluation of search mechanisms was developed. This served as a guide for the implementation of an evaluation framework that allowed to compare the techniques proposed in this thesis with other techniques existing in the literature. The experimental evidence indicates that the methods proposed in this thesis present significant improvements over previously published techniques. In addition, the evaluation framework was equipped with novel evaluation metrics that favor the exploration of novel material and incorporates a semantic relationship metric between documents. The algorithms developed in this thesis evolve high quality queries, which have the capability of retrieving results that are relevant to the user context. These results have a positive impact on the way users interact with available resources.
... In particular, contextsensitive query auto-completion algorithms output the completions of the user's input that are most similar to the user context or search history [59,60,61]. Other methods support query expansion and refinement processes through a query or browsing interface requiring explicit user intervention [62,63]. In [64] a method is proposed for the automatic refinement of ambiguous queries based on a context list that stores potential context keywords for a large number of topic relevant query n-grams. ...
... Table 1 presents a classification of the methods reviewed in this section based on how context is mainly exploited (context as a source of queries, context for filtering and ranking results, context for query refinement and context as an indicator of user intents). Query expansion and refinement with explicit user intervention [62,63] Query generation, augmentation and/or refinement from context without user intervention [24,27,36,37,49,111,112] Context sensitive query autocompletion [59,60,61] Query generation and rank-biasing based on context [49,50,113] Query understanding or disambiguation based on context [58,64,114] Implicit feedback from cursor movement, vertical scrolling, interactions in the areas of interest and/or eyetracking [71,72,73] Touch interaction data on mobile devices [74,75,76] ...
Article
Full-text available
Contextual information extracted from the user task can help to better target retrieval to task-relevant content. In particular, topical context can be exploited to identify the subject of the information needs, contributing to reduce the information overload problem. A great number of methods exist to extract raw context data and contextual interaction patterns from the user task and to model this information using higher-level representations. Context can then be used as a source for automatic query generation, or as a means to refine or disambiguate user-generated queries. It can also be used to filter and rank results as well as to select domain-specific search engines with better capabilities to satisfy specific information requests. This article reviews methods that have been applied to deal with the problem of reflecting the current and long-term interests of a user in the search process. It discusses major difficulties encountered in the research area of context-based information retrieval and presents an overview of tools proposed since the mid-nineties to deal with the problem of context-based search.
... Specifically, we investigate whether query association can be used for query expansion (Scholer, 2004). Given a query log containing a large number of queries, it is straightforward to build a surrogate for each document in a collection, consisting of the queries that were a close match to that document. ...
... This technique was proposed for the creation of document summaries, to aid users in judging the relevance of answers returned by a search system. It has been successfully used to increase the weighting of terms that encapsulate the "aboutness" of a document (Scholer andWilliams, 2002, Scholer, 2004). ...
Thesis
Full-text available
Hundreds of millions of users each day search the web and other repositories to meet their information needs. However, queries can fail to find documents due to a mismatch in terminology. Query expansion seeks to address this problem by automatically adding terms from highly ranked documents to the query. While query expansion has been shown to be effective at improving query performance, the gain in effectiveness comes at a cost: expansion is slow and resource-intensive. Current techniques for query expansion use fixed values for key parameters, determined by tuning on test collections. We show that these parameters may not be generally applicable, and, more significantly, that the assumption that the same parameter settings can be used for all queries is invalid. Using detailed experiments, we demonstrate that new methods for choosing parameters must be found. In conventional approaches to query expansion, the additional terms are selected from highly ranked documents returned from an initial retrieval run. We demonstrate a new method of obtaining expansion terms, based on past user queries that are associated with documents in the collection. The most effective query expansion methods rely on costly retrieval and processing of feedback documents. We explore alternative methods for reducing query-evaluation costs, and propose a new method based on keeping a brief summary of each document in memory. This method allows query expansion to proceed three times faster than previously, while approximating the effectiveness of standard expansion. We investigate the use of document expansion, in which documents are augmented with related terms extracted from the corpus during indexing, as an alternative to query expansion. The overheads at query time are small. We propose and explore a range of corpus-based document expansion techniques and compare them to corpus-based query expansion on TREC data. 
These experiments show that document expansion delivers at best limited benefits, while query expansion - including standard techniques and efficient approaches described in recent work - usually delivers good gains. We conclude that document expansion is unpromising, but it is likely that the efficiency of query expansion can be further improved.
... Specifically, we investigate whether query association can be used for query expansion (Scholer, 2004). Given a query log containing a large number of queries, it is straightforward to build a surrogate for each document in a collection, consisting of the queries that were a close match to that document. ...
... This technique was proposed for the creation of document summaries, to aid users in judging the relevance of answers returned by a search system. It has been successfully used to increase the weighting of terms that encapsulate the "aboutness" of a document (Scholer andWilliams, 2002, Scholer, 2004). ...
Article
Full-text available
... In addition, to efficiently compute offline clusters, we introduce a cost-sensitive fusion technique that employs a single heap to simultaneously evaluate all query variations. Similar queries can be mined or generated [11], then aggregated together using a variety of well-known techniques [57,64,73]. For reproducibility, the experiments we report here employ the publicly available UQV100 test collection [5]. Figure 1 summarizes the proposed architecture. ...
Preprint
Rank fusion is a powerful technique that allows multiple sources of information to be combined into a single result set. However, to date fusion has not been regarded as being cost-effective in cases where strict per-query efficiency guarantees are required, such as in web search. In this work we propose a novel solution to rank fusion by splitting the computation into two parts -- one phase that is carried out offline to generate pre-computed centroid answers for queries with broadly similar information needs, and then a second online phase that uses the corresponding topic centroid to compute a result page for each query. We explore efficiency improvements to classic fusion algorithms whose costs can be amortized as a pre-processing step, and can then be combined with re-ranking approaches to dramatically improve effectiveness in multi-stage retrieval systems with little efficiency overhead at query time. Experimental results using the ClueWeb12B collection and the UQV100 query variations demonstrate that centroid-based approaches allow improved retrieval effectiveness at little or no loss in query throughput or latency, and with reasonable pre-processing requirements. We additionally show that queries that do not match any of the pre-computed clusters can be accurately identified and efficiently processed in our proposed ranking pipeline.
Chapter
Performance evaluation plays a crucial role in the development and improvement of search systems in general and context-based systems in particular. In order to evaluate search systems, test collections are needed. These test collections typically involve a corpus of documents, a set of queries and a series of relevance assessments. In traditional approaches users or hired evaluators provide manual assessments of relevance. However this is difficult and expensive, and does not scale with the complexity and heterogeneity of available digital information. This chapter proposes a semantic evaluation framework that takes advantages of topic ontologies and semantic similarity data derived from these ontologies. The structure and content of the Open Directory Project topic ontology is used to derive semantic relations among a massive number of topics and to implement classical and ad hoc retrieval performance evaluation metrics. In addition, this chapter describes an incremental method for context-based retrieval, which is based on the notions of topic descriptors and topic discriminators. The incremental context-based retrieval method is used to illustrate the application of the proposed semantic evaluation framework. Finally, the chapter discusses the advantages of applying the proposed framework.
Article
This paper addresses deficiencies in current information retrieval models by integrating the concept of relevance into the generation model using various topical aspects of the query. The models are adapted from the latent Dirichlet allocation model, but differ in the way that the notation of query-document relevance is introduced in the modeling framework. In the first method, query terms are added to relevant documents in the training of the latent Dirichlet allocation model. In the second method, the latent Dirichlet allocation model is expanded to deal with relevant query terms. The topic of each term within a given document may be sampled using either the normal document-specific mixture weights in LDA using query-specific mixture weights. We also developed an efficient method based on the Gibbs sampling technique for parameter estimation. Experiment results based on the Text REtrieval Conference Corpus (TREC) demonstrate the superiority of the proposed models.
Conference Paper
Full-text available
In this paper we describe a Cross Document Summarizer XDoX designed specifically to summarize large document sets (50-500 documents and more). Such sets of documents are typically obtained from routing or filtering systems run against a continuous stream of data, such as a newswire. XDoX works by identifying the most salient themes within the set (at the granularity level that is regulated by the user) and composing an extraction summary, which reflects these main themes. In the current version, XDoX is not optimized to produce a summary based on a few unrelated documents; indeed, such summaries are best obtained simply by concatenating summaries of individual documents. We show examples of summaries obtained in our tests as well as from our participation in the first Document Understanding Conference (DUC).
Conference Paper
Full-text available
www.dcs.gla.ac.uk/-tombrosa / www-ciir.cs.umass.edu/-sanderso/ Abstract This paper presents an investigation into the utility of document summarisation in the context of information retrieval, more specifically in the application of so called query biased (or user directed) summaries: summaries customised to reflect the information need expressed in a query. Employed in the retrieved document list displayed after a retrieval took place, the summaries ’ utility was evaluated in a task-based environment by measuring users ’ speed and accuracy in identifying relevant documents. This was compared to the performance achieved when users were presented with the more typical output of an IR system: a static predefined summary composed of the title and first few sentences of retrieved documents. The results from the evaluation indicate that the use of query biased summaries significantly improves both the accuracy and speed of user relevance judgements. 1
Article
Text retrieval systems store a great variety of documents, from abstracts, newspaper articles, and Web pages to journal articles, books, court transcripts, and legislation. Collections of diverse types of documents expose shortcomings in current approaches to ranking. Use of short fragments of documents, called passages, instead of whole documents can overcome these shortcomings: passage ranking provides convenient units of text to return to the user, can avoid the difficulties of comparing documents of different length, and enables identification of short blocks of relevant material among otherwise irrelevant text. In this article, we compare several kinds of passage in an extensive series of experiments. We introduce a new type of passage, overlapping fragments of either fixed or variable length. We show that ranking with these arbitrary passages gives substantial improvements in retrieval effectiveness over traditional document ranking schemes, particularly for queries on collections of long documents. Ranking with arbitrary passages shows consistent improvements compared to ranking with whole documents, and to ranking with previous passage types that depend on document structure or topic shifts in documents.
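The "arbitrary passages" this abstract introduces, overlapping fragments of fixed length, can be sketched as a windowing step over a document's tokens. The window and step sizes below are illustrative choices, not values taken from the paper.

```python
def overlapping_passages(tokens, length=50, step=25):
    """Split a token list into fixed-length passages whose start
    positions advance by `step`, so consecutive passages overlap by
    (length - step) tokens. Each passage can then be ranked like a
    short document in place of the whole document."""
    if len(tokens) <= length:
        return [tokens]
    passages = []
    # Stop once a window's start would leave fewer than `step` tokens,
    # since the previous window already covers that tail.
    for start in range(0, len(tokens) - step, step):
        passages.append(tokens[start:start + length])
    return passages
```

At query time, a document would typically be scored by the similarity of its best-matching passage rather than of its full text.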
Article
In studying actual Web searching by the public at large, we analyzed over one million Web queries by users of the Excite search engine. We found that most people use few search terms, submit few modified queries, view few Web pages, and rarely use advanced search features. A small number of search terms are used with high frequency, and a great many terms are unique; the language of Web queries is distinctive. Queries about recreation and entertainment rank highest. Findings are compared to data from two other large studies of Web queries. This study provides an insight into the public practices and choices in Web searching.
Conference Paper
The effect of using past queries to improve automatic query expansion was examined in the TREC environment. Automatic feedback of documents identified from similar past queries was compared with standard top-document feedback and with no feedback. A new query similarity metric was used, based on comparing result lists and using probability of relevance. Our top-document feedback method showed small improvements over the no-feedback method, consistent with past studies. On recall-precision and average precision measures, past-query feedback yielded performance superior to that of top-document feedback. The past-query feedback method also lends itself to tunable thresholds, such that better performance can be obtained by automatically deciding when, and when not, to apply the expansion. Automatic past-query feedback actually improved top-document precision in this experiment.
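The query similarity metric this abstract mentions compares result lists; a heavily simplified version is the overlap of the two queries' top-k retrieved document identifiers. The paper's actual metric also weights by probability of relevance, which is omitted here; the Jaccard formulation and the parameter `k` are assumptions for illustration.

```python
def result_list_similarity(results_a, results_b, k=20):
    """Approximate query-query similarity as the Jaccard overlap of
    the two queries' top-k result lists: queries that retrieve mostly
    the same documents are treated as similar."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    if not top_a or not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)
```

A new query whose result list scores above a threshold against a stored past query could then borrow that past query's feedback documents for expansion.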
Conference Paper
With the proliferation of the Internet and the huge amount of data it transfers, text summarization is becoming more important. We present an approach to the design of an automatic text summarizer that generates a summary by extracting sentence segments. First, sentences are broken into segments by special cue markers. Each segment is represented by a set of predefined features (e.g. location of the segment, average term frequencies of the words occurring in the segment, number of title words in the segment, and the like). Then a supervised learning algorithm is used to train the summarizer to extract important sentence segments, based on the feature vector. Results of experiments on U.S. patents indicate that the performance of the proposed approach compares very favorably with other approaches (including Microsoft Word summarizer) in terms of precision, recall, and classification accuracy.
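The feature representation this abstract describes for sentence segments can be sketched as a small feature-extraction function. Only the three example features named in the abstract (segment location, average term frequency, title-word count) are modelled; the paper's full feature set, and the supervised learner trained on these vectors, are not reproduced here.

```python
def segment_features(segment, position, title_words, doc_term_freqs):
    """Build a feature vector for one sentence segment:
    - position: the segment's location (index) in the document,
    - avg_tf: mean document-level frequency of the segment's terms,
    - title_words: how many of the document's title words it contains.

    segment:        list of tokens in the segment
    title_words:    set of tokens appearing in the document title
    doc_term_freqs: {token: count} over the whole document
    """
    freqs = [doc_term_freqs.get(w, 0) for w in segment]
    avg_tf = sum(freqs) / len(freqs) if freqs else 0.0
    title_count = sum(1 for w in segment if w in title_words)
    return {"position": position, "avg_tf": avg_tf, "title_words": title_count}
```

Vectors like these, labelled with human judgements of segment importance, would form the training data for the supervised extraction step.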