Automatic Control of Multiple Pseudo-Relevance Feedback Pass
Chakir Mokhtari
Computer Science Department
University Mustapha Stambouli of
Mascara, Mascara, Algeria
Mohammed Debakla
Computer Science Department
University Mustapha Stambouli of
Mascara, Mascara, Algeria
Boudjelal Meftah
Computer Science Department
University Mustapha Stambouli of
Mascara, Mascara, Algeria
Abstract: Automatic query expansion (AQE) based on pseudo-relevance feedback (PRF) is a useful technique for enhancing the effectiveness of information retrieval (IR) systems, which aim to provide relevant information for a user's information need. In this article, we study the behavior of a set of features, based on the query and the set of feedback documents, with the aim of automatically controlling the convergence of the query expansion process toward the pertinent set of documents. This is a new look at what is commonly known in the IR community as prediction of query performance, or query ambiguity. The results obtained show that these features serve as good predictors for any PRF method.
Index Terms: Information retrieval, automatic query expansion, pseudo-relevance feedback, query performance.
I. INTRODUCTION
An information retrieval (IR) system aims to find documents that are relevant to a user's information need. This need is typically expressed as a query consisting of a small number of words, because users find it difficult to formulate it more fully. Such a short query is often too sparse to describe the information need accurately, which makes answering it difficult and leads to poor coverage of the relevant documents [1].
To overcome this limitation, the user's original query is augmented with new features of similar meaning, a process commonly known in the information retrieval community as automatic query expansion (AQE). The main purpose of AQE is to improve the effectiveness of document ranking by automatically interpreting the query and producing related terms that are close to the original query [2].
AQE is based on Feedback documents that can come from
user relevance judgments (relevance feedback, RF), the top
retrieved documents for the initial query (pseudo-relevance
feedback, PRF), or even the whole collection of documents
or external resources such as thesaurus or ontology (global
feedback) [3]. One of the most popular AQE techniques, inspired directly by relevance feedback, is PRF.
PRF [4, 5, 6, 7, 8, 9, 10] aims to learn the user's requirements during a search session and, through implicit user involvement, adapt the behavior of the retrieval process. The basic idea is to reformulate the query so that it gets closer to the relevant documents, by using the first documents retrieved in response to the original query as a more detailed description of the underlying query topic, from which the most important terms are extracted as expansion features. Since PRF is an important technique for improving the effectiveness of IR systems, it has been widely proposed and studied for various retrieval models [4, 5, 11, 12, 13] and still attracts the attention of the IR community [14, 15, 6, 16, 17, 8, 18].
The implementation of PRF always requires two rounds of retrieval. In the first round, an initial set of documents is retrieved with respect to the original query. In the second round, another set of documents is retrieved based on an expanded query, generated by expanding the original query with relevant terms selected from the previously retrieved document set. This second round can be repeated more than once, but the challenge is how to ensure the effectiveness of the IR system at each repetition. In other words, we should keep performing further feedback rounds only while the IR system retrieves more and more relevant documents. On the other hand, there has been interest in applying machine learning and clustering methods to select, from the feedback documents, a subset deemed representative of the subject of the query [16, 19, 20, 21, 22]. Although it is well observed that cluster-based PRF improves effectiveness, it is also interesting to reuse feedback documents more than once in order to bring more relevant and novel information to the user.
For this purpose, in this paper, we estimate the effectiveness of any PRF method by studying the behavior of a set of features. These features can help us predict the effectiveness of successive PRF rounds and avoid query drift. This is a new look at what is commonly known in the IR community as prediction of query performance, or query ambiguity [23]. The experimental results show that the proposed features are significantly correlated with IR effectiveness in terms of mean average precision. One of the major issues in PRF is selecting the most appropriate feedback documents to be used in the expansion phase; in our work, we consider PRF based on document clustering.
The remainder of this paper is organized as follows. We first give background on pseudo-relevance feedback and query performance prediction in Section II. Section III formulates the problem. Sections IV and V present the FPF clustering algorithm and the set of proposed features. Section VI presents our study of how to predict the effectiveness of any PRF method. The evaluations are presented in Section VII, and we conclude with a look at future work in Section VIII.
II. BACKGROUND
Pseudo-relevance feedback learns from the top retrieved documents to add expansion terms to the original query, assuming this set of documents to be relevant when no user relevance judgments are available. It is a significant technique in IR and has been shown over the past decades to improve the average effectiveness of a number of retrieval models [14, 24, 25, 26, 27, 28]. The PRF algorithm works in the following steps:
Step 1. User submits his information needs as a query: Q.
Step 2. IR system examines the query and returns a ranked
list of the retrieval documents: L to the user.
Step 3. From L, IR system considers the top N documents as
feedback documents.
Step 4. Add terms from feedback documents to expand the
query Q.
Step 5. Repeat steps 2-4 until the user is satisfied with the results.
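As a sketch, the loop above can be written as follows; `search` and `extract_terms` are hypothetical placeholders for the underlying retrieval engine and term-selection method, not components defined in this paper:

```python
def pseudo_relevance_feedback(search, extract_terms, query,
                              n_docs=10, n_terms=5, max_passes=3):
    """Generic PRF loop (steps 2-4 repeated): expand the query with terms
    extracted from the top-N documents of the previous retrieval round."""
    for _ in range(max_passes):
        ranked = search(query)                        # step 2: retrieve
        feedback = ranked[:n_docs]                    # step 3: top-N as pseudo-relevant
        new_terms = extract_terms(feedback, n_terms)  # step 4: expansion terms
        query = query + [t for t in new_terms if t not in query]
    return search(query)                              # final retrieval round
```

Here a query is simply a list of terms and `search` returns a ranked document list; a real system would also weight the expansion terms rather than appending them unweighted.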
But this mechanism can also fail, particularly for ambiguous queries [29, 30]. Moreover, it has been observed in several works [19, 16] that the quality of the expansion terms selected from the top-ranked documents (which may contain irrelevant information) has a strong effect on the effectiveness of PRF. Thus, the major issue is selecting the most appropriate documents from the feedback set. Clustering is one of the techniques adopted to improve PRF and to overcome this problem, and it has been well studied [21, 20, 31, 32]. The main idea of clustering methods is to exploit the relationships between documents in the top-ranked list and to detect clusters whose topics are similar to the query.
On the other hand, estimating the quality of the set of documents retrieved for a query is known as query performance prediction. It is an important and challenging task in IR research [23, 33]. The central goal of this field is to identify features that can be used to predict query performance; these features are categorized into pre-retrieval and post-retrieval methods [34].
Our work sits between the field of PRF methods and the field of query performance prediction. Through this study, we want to identify a number of post-retrieval features to guide the evolution of any PRF method when it is performed more than once: if the first PRF pass retrieves more relevant documents (according to the features), we perform another pass, and so on; once no more relevant documents are retrieved, we stop the PRF process.
III. PROBLEM FORMULATION
Let R be an IR system with any retrieval model. For a given query Q and a document collection C, R returns a ranked list of documents L. Any PRF method assumes that the set S containing the N top-ranked documents of L is relevant, and uses it to enhance IR effectiveness by adding to the original query terms derived from these first N documents. Since PRF has proven to be an effective strategy for improving retrieval accuracy across retrieval models, why not repeat this strategy more than once to retrieve more and more relevant documents?
In Figure 1, the circle on the left represents the retrieved documents and the circle on the right the set of relevant documents for the given query Q. After the first PRF pass we face two situations. In situation A, the system R retrieves more relevant documents; in situation B, no more relevant documents are retrieved. So, as long as the IR system detects that it is in situation A, it keeps performing further PRF passes until it reaches situation B. In this case, the system stops the feedback process.
Fig 1: Cases of PRF pass.
Our goal is to estimate the contribution of any PRF method by studying the quality of the N top-ranked documents of L in relation to the query, which will then be reformulated in order to retrieve more relevant documents. Since we use PRF as the AQE method, the top-ranked documents contain both relevant and irrelevant documents. So, in order to detect the most useful subset of S for query expansion, we use clustering.
Since in PRF the main unit is the expansion term, we represent each document d in S by a vector of term weights over S. We rank the unique terms in S according to their Kullback-Leibler divergence (KLD) weights. It has been shown that KLD is an effective weighting mechanism in many state-of-the-art AQE methods [35, 36]. Terms with higher KLD scores are the more informative ones and are therefore well suited to represent documents; terms with lower scores are ignored.
KLD measures the divergence of a term's distribution in the set of feedback documents S from its distribution in the whole collection. The KLD weight of a term t is given by:

KLD(t) = p(t|S) log( p(t|S) / p(t|C) )   (1)

where p(t|S) = c(t, S) / c(S) is the probability of term t in the set S of the N top-retrieved documents; c(t, S) is the frequency of t in S and c(S) is the total number of terms in S. Likewise, p(t|C) = c(t, C) / c(C) is the probability of t in the entire document collection C, where c(t, C) is the frequency of t in C and c(C) is the total number of terms in C.
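A minimal sketch of this weighting, assuming raw term counts are available for the feedback set and for the collection:

```python
import math
from collections import Counter

def kld_weights(feedback_docs, coll_counts, coll_size):
    """Rank the unique terms of the feedback set S by equation (1):
    KLD(t) = p(t|S) * log(p(t|S) / p(t|C))."""
    s_counts = Counter(t for doc in feedback_docs for t in doc)
    c_s = sum(s_counts.values())              # c(S): total terms in S
    weights = {}
    for t, c_ts in s_counts.items():
        p_ts = c_ts / c_s                     # p(t|S) = c(t, S) / c(S)
        p_tc = coll_counts[t] / coll_size     # p(t|C) = c(t, C) / c(C)
        weights[t] = p_ts * math.log(p_ts / p_tc)
    # highest-KLD (most informative) terms first
    return sorted(weights.items(), key=lambda kv: -kv[1])
```

Terms that are frequent in S but rare in the collection receive the highest weights; terms as common in S as in the collection score near or below zero.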
IV. FPF CLUSTERING ALGORITHM
One of the major issues in PRF is selecting the most appropriate feedback documents to be used in the expansion phase, and document clustering is one way to do so. Many clustering techniques have been developed that can be applied to documents. The document clustering used in this study is carried out with the FPF (Furthest Point First) algorithm [37], presented below. This algorithm has a low computational cost of O(nk) (n: number of feedback documents, k: number of clusters) and is parameter-free except for the number of centers, so it is advantageous in applications such as document clustering.
The FPF algorithm is a polynomial-time clustering algorithm introduced by Gonzalez [37]. Unlike the K-means algorithm, FPF does not choose the k centroids randomly but builds them iteratively. At each iteration, it computes the cosine similarity, defined below, between each document and the centroid of its cluster; the document furthest from the current set of centroids is chosen as the new centroid and added to that set. The FPF algorithm can be summarized in an initialization step and an iteration step, as follows:
Initialization:
- Pick an arbitrary document as centroid_1, the centroid of cluster 1;
- Assign all documents of S to cluster 1.
Iteration (i = 2, ..., K):
- Designate the document furthest from the current set of centroids as centroid_i;
- Assign to cluster i each document closer to centroid_i than to its current centroid.
The similarity used in this work is the cosine measure. The cosine similarity between a document d and a center c is defined as:

cos(d, c) = (d · c) / (||d|| ||c||)   (2)
V. THE PROPOSED FEATURES
In order to predict the robustness of any PRF method and to perform it more than once, we propose features that capture intrinsic characteristics of the query, of the feedback documents selected from the N top-ranked documents via the clustering algorithm, and of the relationship between them. The proposed list of features is inspired by previous works [38, 39]; it is intuitive, simple to implement, and has a very low computational cost.
Query Clarity (QC). Query clarity is an intrinsic feature of a query. It has been shown to predict query ambiguity well [39, 40], so we expect it to predict the effectiveness of a PRF method. By definition, the clarity of a query is the Kullback-Leibler divergence of the query model from the collection model. To avoid the computational cost of this definition, we follow the work of [40], which gives a simplified formula for clarity that does not require computing relevance scores for the query model:

QC = Σ_{t∈Q} p(t|Q) log( p(t|Q) / p(t|C) )   (3)

where p(t|Q) = c(t, Q) / |Q|, c(t, Q) is the count of term t in the query Q, and |Q| is the total number of terms in Q.
Feedback Documents Clarity (FDC). Similar to the query clarity above, we define FDC as the Kullback-Leibler divergence of the feedback documents model from the collection model. Intuitively, a large value of this feature suggests that the feedback documents are on the same topic and far from noise; we therefore expect this feature to indicate whether another PRF pass will be beneficial. It is defined as:

FDC = Σ_{t∈S} p(t|S) log( p(t|S) / p(t|C) )   (4)
Query Entropy (QE). As mentioned in [38], the query is often very short, so we compute the query entropy based on the set of feedback documents S. It is defined as:

QE = - Σ_{t∈Q} p(t|S) log p(t|S)   (5)
Selected Documents Entropy (SDE). In our work, the PRF process is augmented with a clustering phase in order to detect the most useful documents for query expansion. At this stage, each cluster is treated as one large document constructed by concatenating all the documents belonging to it. After the clusters are formed, they are ranked by the query-likelihood retrieval model [41], which scores a document d with respect to a query Q as:

score(d, Q) = Σ_{t∈Q} log( (1 - λ) p(t|d) + λ p(t|C) )   (6)

where p(t|d) = c(t, d) / |d|, c(t, d) is the count of the term t within the document d, and |d| is the total number of terms in the document d. λ is a smoothing parameter used to give non-zero probabilities to query terms that do not occur in a document.
The documents belonging to the top-ranked cluster constitute the set of documents selected for query expansion, which we call SD. The entropy of the selected documents is then:

SDE = - Σ_{t∈SD} p(t|SD) log p(t|SD)   (7)
Feedback Documents Broadness (FDB). To compute this feature, we use the set S of the N top-retrieved documents. Intuitively, if the documents in this set are more discriminating, it may mean that they focus more on the same topic and away from the noise. The broadness of the set S is defined as the average distance between each document and the centroid of S:

FDB = (1/N) Σ_{d∈S} cos(d, c_S)   (8)

where c_S = (1/N) Σ_{d∈S} d is the centroid of S.
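A sketch of FDB under the same sparse-vector representation, assuming the centroid is the mean term-weight vector of S and distance is measured by cosine:

```python
import math

def broadness(vectors):
    """FDB: average cosine between each feedback document in S and the
    centroid (mean term-weight vector) of S."""
    n = len(vectors)
    centroid = {}
    for v in vectors:                      # c_S = (1/N) * sum of the vectors
        for t, w in v.items():
            centroid[t] = centroid.get(t, 0.0) + w / n
    def cos(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    return sum(cos(v, centroid) for v in vectors) / n
```

Identical documents give a value of 1.0; the more the documents spread over distinct terms, the lower the value.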
VI. PREDICTING THE EFFECTIVENESS OF PRF
The central problem in pseudo-relevance feedback is how to select appropriate documents from the top-retrieved set, since not all of these documents are relevant and there is a high chance of using irrelevant documents for expansion. To overcome this issue, we use clustering, which has proved to be an effective technique; we construct the clusters using the FPF algorithm [42].
After the most useful documents are selected, the query is expanded with new terms and another set of documents is retrieved. If we suppose that the expanded query represents the user's information need well, then this new set of retrieved documents must contain more relevant documents, and in that case another PRF pass will be beneficial. For this purpose we study the behavior of the set of features described above (Section V). After each PRF pass we compute the features QC, FDC, QE, SDE and FDB.
Our study procedure follows the five steps below:
Step 1: For a given query Q, retrieve the set S of the N top-ranked documents according to their retrieval scores.
Step 2: Split the set S into K clusters using the FPF clustering algorithm.
Step 3: Rank the resulting K clusters according to their scores using formula (6).
Step 4: Use the K' top-ranked clusters as the set of documents for the feedback pass.
Step 5: Compute the set of features QC, FDC, QE, SDE and FDB.
These steps are performed several times in order to observe the features' behavior with respect to the IR system's performance. In our experiments, we performed five PRF passes.
VII. EVALUATION
In this section, we report our experimental investigation of the proposed scheme. The objective of our experiments is to investigate how we can predict, in an unsupervised manner, the positive contribution of any PRF method by studying the set of features described above.
A. Experimental Setup
We use the TREC ad-hoc test collection (Disks 4 and 5). The data set contains about 528,000 news documents from various sources such as the Federal Register and the Financial Times, which are considered high-quality text data with minimal noise. On average, each document contains 557 terms. The topic set consists of the TREC 6 ad-hoc task topics, numbered 301 to 350. Each topic consists of three fields: the title, which includes the most related words (1-3 words); the description, which gives a wider explanation of the query; and a narrative, which states specific conditions for accepting or rejecting documents. The 50 ad-hoc queries were taken from the title field only, because this is a realistic setting that reflects the concise nature of real users' queries [36]. The average query length in TREC 6 is 2.64.
We use Lucene, a full-featured open-source text search engine library, as the IR framework. For all queries and documents in the test collection, each term is stemmed using Porter's English stemmer [43], and stop words are removed.
To measure the correlation of each feature's behavior with query performance, we use Mean Average Precision (MAP) as the reference measure of query performance. MAP is defined as:

MAP = (1 / |NQ|) Σ_{j=1}^{|NQ|} AP_j   (10)

where |NQ| is the number of queries and AP_j, the average precision for a given query Q_j, is defined as:

AP_j = (1 / n_j) Σ_{k=1}^{|L|} P@k × rel_j(k)   (11)

where n_j is the number of relevant documents for Q_j, rel_j(k) is 1 if the document at rank k is relevant to Q_j and 0 otherwise, and P@k, the precision at k, is the fraction of the top k retrieved documents that are relevant.
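These two measures follow directly from the definitions; in the sketch below a run is encoded as a 0/1 relevance list over the ranked results:

```python
def average_precision(rel_list, n_rel):
    """AP for one query (eq. 11): average of P@k over the ranks k at which
    a relevant document appears, normalised by the n_j relevant documents."""
    hits, total = 0, 0.0
    for k, rel in enumerate(rel_list, start=1):
        if rel:
            hits += 1
            total += hits / k            # P@k at this relevant rank
    return total / n_rel if n_rel else 0.0

def mean_average_precision(runs):
    """MAP (eq. 10): mean AP over all queries; each run is a pair
    (0/1 relevance list, number of relevant documents for the query)."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)
```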
B. Results
The main aim of this work is to study the correlation
between the set of features described above and the IR
system effectiveness.
A well-known problem of typical pseudo-relevance feedback algorithms is their instability with respect to two major parameters, namely the number of feedback documents and the number of expansion terms [1]. Since our purpose is not to show the contribution of the clustering step to the retrieval phase, we kept the same number of clusters in all experiments and varied the numbers of documents and terms. Figures 2, 3 and 4 present results when the number of feedback documents is fixed to 20, 50 and 100, respectively. In each figure, the two graphs show the correlation between the features and MAP when the number of terms is 5 and 10, respectively.
All experiments show that the features form two classes: one class formed by FDC and QC, and the other by FDB, QE and SDE. When the MAP increases, the features of the first class increase proportionally while the features of the second class decrease. We note that FDB is the feature most sensitive to MAP variation. This is logical: when the feedback documents form a high-quality cluster, the IR system retrieves more relevant documents and consequently the MAP generally increases.
VIII. CONCLUSION AND FUTURE WORK
This paper presented the results of an experimental study of a set of features intended to automatically control any PRF method. We have observed that this PRF scheme can be controlled simply by monitoring the behavior of these features. We believe the main reason for this result is that when the PRF method is effective, the set of feedback documents contributes positively to the IR system's performance and has lower entropy, as does the expanded query.
We plan to pursue several directions in future work. We will investigate the robustness of this proposition through more extensive experiments. Further work will also focus on developing more efficient features to predict the positive contribution of any PRF method, and on developing a single function that incorporates all the features and exhibits the same behavior.
(a): #terms=5
(b): #terms=10
Fig 2: behavior of FDB, SDE, FDC, QE and QC according to MAP, #doc=20.
(a): #terms=5
(b): #terms=10
Fig 3: behavior of FDB, SDE, FDC, QE and QC according to MAP, #doc=50.
(a): #terms=5
(b): #terms=10
Fig 4: behavior of FDB, SDE, FDC, QE and QC according to MAP, #doc=100.
[1] Carpineto C., and Romano G., A Survey of Automatic Query
Expansion in Information Retrieval, ACM Computing Surveys,
44, 1 (2012), 1-50.
[2] Manning C. D., Raghavan P., and Schütze H., Introduction to Information Retrieval, Cambridge University Press, England (2008).
[3] Baeza-Yates R. A., and Ribeiro-Neto B., Modern Information Retrieval, Pearson Education Ltd., 2nd edition (2011).
[4] Lavrenko V., and Croft B., Relevance-based language models, In Proceedings of SIGIR'01 (Sept 2001), 120-127.
[5] Rocchio J., Relevance feedback in information retrieval, In The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall Inc (1971), 313-323.
[6] Lv Y., Zhai C., and Chen W., A boosting approach to improving
pseudo-relevance feedback, In Proceedings of SIGIR’11 (2011),
[7] Miao J., Huang J., and Ye Z., Proximity-based rocchio’s model for
pseudo relevance. In Proceedings of SIGIR’12 (2012), 535-544.
[8] Singh J., and Sharan A., A new fuzzy logic-based query expansion
model for efficient information retrieval using relevance feedback
approach, Neural Computing and Applications (2016), 1-24.
[9] Aly A., Using a query expansion technique to improve document
retrieval, International Journal Information Technologies and
Knowledge, 2(2008), 4, 343-348.
[10] Vaidyanathan R., Das S., and Srivastava N., Query Expansion
Strategy based on Pseudo Relevance Feedback and Term Weight
Scheme for Monolingual Retrieval, arXiv preprint arXiv (2015)
[11] Salton G., and Buckley C., Improving retrieval performance by relevance feedback, Journal of the American Society for Information Science, 41(1990), 4, 288-297.
[12] Zhai C., and Lafferty J., Model-based feedback in the KL-divergence retrieval model, In Tenth International Conference on Information and Knowledge Management (CIKM 2001), (2001).
[13] Xu J., and Croft W. B., Improving the effectiveness of information retrieval with local context analysis, ACM Transactions on Information Systems, 18(2000), 79-112.
[14] Lv Y., and Zhai C., Positional relevance model for pseudo-
relevance feedback. In Proceedings of SIGIR’10 (2010), 579-596.
[15] He B., and Ounis I., Finding good feedback documents, In CIKM '09 (2009), 2011-2014.
[16] Huang J. X., Miao J., and He B., High performance query expansion using adaptive co-training, Information Processing & Management, 49(2013), 2, 441-453.
[17] Pal D., Mitra M., Bhattacharya S., Improving Pseudo Relevance
Feedback in the Divergence from Randomness
Model. ICTIR2015: Proceedings of the 2015 International
Conference on The Theory of Information Retrieval (2015),
Northampton, MA, USA, 325-328.
[18] Xu J., and Croft W. B., Query expansion using local and global document analysis, In ACM SIGIR Forum, 51(2017), 168-175.
[19] Cao G., Nie J., Y., Gao J., and Robertson S., Selecting good
expansion terms for pseudo-relevance feedback, In Proceedings of
31st annual international ACM conference on research and
development in information retrieval, Singapore (2008), 243-250.
[20] Lee K., S., and Croft W., B., A deterministic resampling method
using overlapping document clusters for pseudo-relevance
feedback, Information Processing and Management, 49(2013),
[21] Gelfer K. I., and Kurland O., Cluster-based query expansion, In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09 (2009), Boston, MA, USA, 646-647.
[22] Levi O., Raiber F., Kurland O., and Guy I., Selective Cluster-
Based Document retrieval. CIKM 2016(2016), 1473-1482.
[23] Lang H., Wang B., Jones J., Li J., T., Ding F., and Liu Y., X.,
Query Performance Prediction for Information Retrieval Based on
Covering Topic Score. Journal Of Computer Science And
Technology, 4(2008), 23, 590-601.
[24] Tao T., and Zhai C., Regularized estimation of mixture models for robust pseudo-relevance feedback, In SIGIR 2006, Seattle, Washington, USA (2006), 162-169.
[25] Buckley C., Salton G., and Allan J., Automatic retrieval with locality information using SMART, In TREC (1992), 59-72.
[26] Udupa R., Bhole A., and Bhattacharya P., A term is known by the
company it keeps: On selecting a good expansion set in pseudo-
relevance feedback, In Proceedings of ICTIR 2009, Advances in
IR Theory. Springer (2009), 104-115.
[27] Dang E. K. F., Luk R. W. P., and Allan J., Fast forward index methods for pseudo-relevance feedback retrieval, ACM Trans. Inf. Syst., 4(2015), 33, 1-33.
[28] He B., and Ounis I., Finding good feedback documents, In Proceedings of CIKM (2009), 2011-2014.
[29] Amati G., Carpineto C., and Romano G., Query difficulty, robustness and selective application of query expansion, In Proc. of ECIR (2004), 127-137.
[30] Singh J., and Sharan A., Relevance Feedback Based Query
Expansion Model Using Borda Count and Semantic Similarity
Approach, Computational intelligence and neuroscience, 2015.
[31] Sakai T., Manabe T., and Koyama M., Flexible pseudo-relevance feedback via selective sampling, ACM Transactions on Asian Language Information Processing (TALIP), 4(2005), 2, 111-135.
[32] Huang Q., Song D., and Ruger S., Robust query-specific pseudo feedback document selection for query expansion, In Proceedings of the IR Research, 30th European Conference on Advances in Information Retrieval, ECIR'08, Glasgow, UK (2008), 547-554.
[33] Raiber F., and Kurland O., Query-performance prediction: setting the expectations straight, In Proceedings of SIGIR 2014 (2014), 13-22.
[34] Hauff C., Hiemstra D., and de Jong F., A survey of pre-retrieval query performance predictors, In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, New York, NY, USA (2008), ACM, 1419-1420.
[35] Parapar J., and Barreiro A., Promoting divergent terms in the estimation of relevance models, In Amati G., Crestani F. (eds.) ICTIR 2011, LNCS, Springer, Heidelberg, 6931 (2011), 77-88.
[36] Ye Z., Huang X., He B., and Lin H., York university at TREC
2009: relevance feedback track. In Proceedings of TREC 2009
[37] Gonzalez T., F., Clustering to minimize the maximum intercluster
distance. Theoretical Computation. Sci., 38(1985), 293-306.
[38] Lv Y., and Zhai C., Adaptive relevance feedback in information retrieval, In CIKM '09 (2009).
[39] Cronen-Townsend S., Zhou Y., and Croft W. B., Predicting query performance, In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland (2002), 299-306.
[40] He B., and Ounis I., Query performance prediction, Inf. Syst., 31(2006), 7, 585-594.
[41] Liu X., and Croft W. B., Cluster-based retrieval using language models, In Proceedings of the 27th ACM SIGIR Conference on Research and Development in Information Retrieval (2004), 186-193.
[42] Filippo G., Marco P., Paolo P., and Fabrizio S., A scalable algorithm for high-quality clustering of Web snippets, In Proceedings of SAC-06, 21st ACM Symposium on Applied Computing, Dijon, FR (2006), 1058-1062.
[43] Porter M. F., An algorithm for suffix stripping, In Readings in Information Retrieval (1997), 313-316.
... Pseudo-Relevance Feedback (PRF) is an automatic query expansion method that uses two search queries to find relevant documents [2,3,4,5,6]. The first query is used to find top relevant documents which are used to reformulate the initial query. ...
Conference Paper
Users have specific information needs which are expressed in short queries to information retrieval systems. The queries are unstructured, and they tend to be short and ambiguous in most cases. Using the shallow language statistics including probabilistic or language models such as BM25 or Indri respectively can enhance the retrieval system metrics like Mean Average Precision (MAP). However, such methods depend on query terms and their presence in the retrieved document to define relevance. Query expansion is a technique that can be used to overcome this problem by expanding the query with terms from an initial top few relevant documents. The question that we try to answer is whether the quality of the corpus used for expansion produce a significant improvement MAP and precision at top 30 retrieved documents. We show that the quality and the selection criteria of expansion documents are important factors in query expansion performance.
Full-text available
Efficient query expansion (QE) terms selection methods are really very important for improving the accuracy and efficiency of the system by removing the irrelevant and redundant terms from the top-retrieved feedback documents corpus with respect to a user query. Each individual QE term selection method has its weaknesses and strengths. To overcome the weaknesses and to utilize the strengths of the individual method, we used multiple terms selection methods together. In this paper, we present a new method for QE based on fuzzy logic considering the top-retrieved document as relevance feedback documents for mining additional QE terms. Different QE terms selection methods calculate the degrees of importance of all unique terms of top-retrieved documents collection for mining additional expansion terms. These methods give different relevance scores for each term. The proposed method combines different weights of each term by using fuzzy rules to infer the weights of the additional query terms. Then, the weights of the additional query terms and the weights of the original query terms are used to form the new query vector, and we use this new query vector to retrieve documents. All the experiments are performed on TREC and FIRE benchmark datasets. The proposed QE method increases the precision rates and the recall rates of information retrieval systems for dealing with document retrieval. It gets a significant higher average recall rate, average precision rate and F measure on both datasets.
Pseudo-Relevance Feedback (PRF) is a well-known query expansion method for improving the performance of information retrieval systems. Not all terms in the PRF documents are useful for expanding the user query, so selecting proper expansion terms is important for improving system performance. Individual query expansion term selection methods have been widely investigated, and each has its own weaknesses and strengths. To overcome the weaknesses and exploit the strengths of individual methods, we use multiple term selection methods together. In this paper, we first explore the possibility of improving overall performance using individual query expansion term selection methods. Second, the Borda count rank aggregation approach is used to combine multiple query expansion term selection methods. Third, after the Borda count combination, a semantic similarity approach selects the terms most semantically similar to the query. Our experimental results demonstrate that the proposed approaches achieve a significant improvement over individual term selection methods and related state-of-the-art methods.
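Borda count fusion itself is simple to state in code. A sketch, with hypothetical term rankings standing in for the output of three real term-selection methods:

```python
def borda_fuse(rankings):
    """Borda count fusion: each ranked list awards (list length - rank)
    points to a term; terms are re-ranked by total points."""
    scores = {}
    for ranking in rankings:
        for rank, term in enumerate(ranking):
            scores[term] = scores.get(term, 0) + len(ranking) - rank
    return sorted(scores, key=lambda t: -scores[t])

# Three hypothetical term-selection methods propose different rankings
fused = borda_fuse([
    ["index", "ranking", "stemming"],
    ["ranking", "index", "corpus"],
    ["ranking", "stemming", "index"],
])
```

Terms ranked consistently high across methods rise to the top of the fused list, which is exactly why rank aggregation compensates for the weaknesses of any single method.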
Query expansion (QE) is a potentially useful technique to help searchers formulate improved query statements, and ultimately retrieve better search results. The objective of our query expansion technique is to find a suitable additional term. Two query expansion methods are applied in sequence to reformulate the query. Experiments on test collections show that the retrieval effectiveness is considerably higher when the query expansion technique is applied.
In an earlier analysis of Pseudo Relevance Feedback (PRF) models by Clinchant and Gaussier (2013), five desirable properties that PRF models should satisfy were formalised. Also, modifications to two PRF models were proposed in order to improve compliance with the desirable properties. These resulted in improved retrieval effectiveness. In this study, we introduce a sixth property that we believe PRF models should satisfy. We also extend the earlier exercise to Bo1, a standard PRF model. Experimental results on the robust, wt10g and gov2 datasets show that the proposed modifications yield improvements in effectiveness.
Query expansion using pseudo-relevance feedback is a useful and popular technique for reformulating the query. In our proposed query expansion method, we assume that relevant information can be found within a document near its central idea. A document is normally divided into sections, paragraphs and lines, and the proposed method tries to extract keywords that are close to the document's central theme. The expansion terms are obtained by equi-frequency partition of the documents returned by pseudo-relevance feedback and by using tf-idf scores, where the idf factor is calculated over the number of partitions in a document. The group of words for query expansion is selected using three approaches: the highest score, the average score, and the group containing the maximum number of keywords. As each query behaved differently under the different methods, we investigate the effect of each method on the selection of expansion words. In future work, we will extend this initial study to develop a rule-based statistical model that automatically selects the best group of words, incorporating the tf-idf scoring and the three approaches explained here. The experiments were performed on the FIRE 2011 Adhoc Hindi and English test collections, with 50 queries each, using Terrier as the retrieval engine.
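One plausible reading of the partition-level idf idea can be sketched as follows. This is my own interpretation, not the paper's exact formula: the document is split into equal slices, and the collection-wide document frequency is replaced by pf, the number of slices containing the term.

```python
import math
from collections import Counter

def partition_scores(doc_terms, n_parts=3):
    """Score terms by tf * log((n_parts + 1) / pf), where pf is the
    number of document partitions the term appears in. Terms that are
    frequent but concentrated in few partitions score highest."""
    size = -(-len(doc_terms) // n_parts)  # ceiling division
    parts = [doc_terms[i:i + size] for i in range(0, len(doc_terms), size)]
    tf = Counter(doc_terms)
    pf = Counter(t for part in parts for t in set(part))
    return {t: tf[t] * math.log((len(parts) + 1) / pf[t]) for t in tf}

doc = "ranking ranking ranking model query model query retrieval index index"
scores = partition_scores(doc.split(), n_parts=3)
```

Under this reading, "ranking" (frequent, concentrated in one partition) outscores "model" (spread across partitions); whether the paper favors concentrated or spread terms would depend on its exact formulation.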
The query-performance prediction task has been described as estimating retrieval effectiveness in the absence of relevance judgments. The expectations throughout the years were that improved prediction techniques would translate to improved retrieval approaches. However, this has not yet happened. Herein we provide an in-depth analysis of why this is the case. To this end, we formalize the prediction task in the most general probabilistic terms. Using this formalism we draw novel connections between tasks --- and methods used to address these tasks --- in federated search, fusion-based retrieval, and query-performance prediction. Furthermore, using formal arguments we show that the ability to estimate the probability of effective retrieval with no relevance judgments (i.e., to predict performance) implies knowledge of how to perform effective retrieval. We also explain why the expectation that using previously proposed query-performance predictors would help to improve retrieval effectiveness was not realized. This is due to a misalignment with the actual goal for which these predictors were devised: ranking queries based on the presumed effectiveness of using them for retrieval over a corpus with a specific retrieval method. Focusing on this specific prediction task, namely query ranking by presumed effectiveness, we present a novel learning-to-rank-based approach that uses Markov Random Fields. The resultant prediction quality substantially transcends that of state-of-the-art predictors.
The quality of feedback documents is crucial to the effectiveness of query expansion (QE) in ad hoc retrieval. Recently, machine learning methods have been adopted to tackle this issue by training classifiers from feedback documents. However, the lack of proper training data has prevented these methods from selecting good feedback documents. In this paper, we propose a new method, called AdapCOT, which applies co-training in an adaptive manner to select feedback documents for boosting QE’s effectiveness. Co-training is an effective technique for classification over limited training data, which is particularly suitable for selecting feedback documents. The proposed AdapCOT method makes use of a small set of training documents, and labels the feedback documents according to their quality through an iterative process. Two exclusive sets of term-based features are selected to train the classifiers. Finally, QE is performed on the labeled positive documents. Our extensive experiments show that the proposed method improves QE’s effectiveness, and outperforms strong baselines on various standard TREC collections.
Typical pseudo-relevance feedback methods assume the top-retrieved documents are relevant and use these pseudo-relevant documents to expand terms. The initial retrieval set can, however, contain a great deal of noise. In this paper, we present a cluster-based resampling method to select novel pseudo-relevant documents based on Lavrenko’s relevance model approach. The main idea is to use overlapping clusters to find dominant documents for the initial retrieval set, and to repeatedly use these documents to emphasize the core topics of a query. The proposed resampling method can skip some documents in the initial high-ranked documents and deterministically construct overlapping clusters as sampling units. The hypothesis behind using overlapping clusters is that a good representative document for a query may have several nearest neighbors with high similarities, participating in several different clusters. Experimental results on large-scale web TREC collections show significant improvements over the baseline relevance model. To justify the proposed approach, we examine the relevance density and redundancy ratio of feedback documents. A higher relevance density will result in greater retrieval accuracy, ultimately approaching true relevance feedback. The resampling approach shows higher relevance density than the baseline relevance model on all collections, resulting in better retrieval accuracy in pseudo-relevance feedback.
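The overlapping-cluster intuition can be illustrated with a toy version (a sketch under my own assumptions, not the paper's exact clustering procedure): every document seeds a cluster of its k nearest neighbors, and documents that land in many clusters are treated as dominant and kept as resampled feedback documents.

```python
def dominant_docs(sim, k=2):
    """Count, for each document, how many overlapping k-nearest-neighbor
    clusters it belongs to. sim is a symmetric similarity matrix over the
    initially retrieved documents; high counts indicate dominant documents."""
    n = len(sim)
    clusters = [set(sorted(range(n), key=lambda j: -sim[i][j])[:k]) | {i}
                for i in range(n)]
    return {d: sum(d in c for c in clusters) for d in range(n)}

# Toy symmetric similarity matrix: document 0 is close to everything
sim = [[1.0, 0.9, 0.8, 0.7],
       [0.9, 1.0, 0.2, 0.1],
       [0.8, 0.2, 1.0, 0.1],
       [0.7, 0.1, 0.1, 1.0]]
membership = dominant_docs(sim, k=2)
```

Document 0 appears in every cluster, matching the hypothesis that a good representative document has high-similarity neighbors in several different clusters.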
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
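The stepwise idea — peeling a complex suffix as a compound of simple suffixes, with each removal guarded by the form of the remaining stem — can be sketched as follows. The rule list and the minimum stem length of 3 are illustrative assumptions, not the actual Porter rule set or its syllable-based measure.

```python
def strip_step(word):
    """One stripping step: remove the longest matching simple suffix,
    but only if the remaining stem keeps a minimum length (a crude
    stand-in for the syllable-length condition on the stem)."""
    for suffix in ("ational", "ation", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def stem(word):
    """Treat a complex suffix as a compound of simple suffixes and peel
    them off one step at a time until no rule fires."""
    prev = None
    while word != prev:
        prev, word = word, strip_step(word)
    return word
```

For example, "relations" is reduced in two steps, first dropping the plural "s" and then "ation"; the real algorithm applies a much richer, carefully ordered rule set in the same spirit.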