Automatic Control of Multiple Pseudo-Relevance Feedback Pass
Chakir Mokhtari
Computer Science Department
University Mustapha Stambouli of
Mascara, Mascara, Algeria
mokhtari.c@univ-mascara.dz
Mohammed Debakla
Computer Science Department
University Mustapha Stambouli of
Mascara, Mascara, Algeria
debakla_med@yahoo.fr
Boudjelal Meftah
Computer Science Department
University Mustapha Stambouli of
Mascara, Mascara, Algeria
meftahb@yahoo.fr
Abstract—Automatic query expansion (AQE) based on Pseudo-Relevance Feedback (PRF) is a useful technique for enhancing the effectiveness of information retrieval (IR) systems, which aim to provide relevant information for a user's information need. In this article, we study the behavior of a set of features, based on the query and the set of feedback documents, with the aim of automatically controlling the convergence of the query expansion process toward the set of relevant documents. This offers a new perspective on what is commonly known in the IR community as query performance prediction or query ambiguity. The results obtained show that these features serve as good predictors for any PRF method.
Index Terms— Information retrieval, Automatic query
expansion, pseudo-relevance feedback, query performance.
I. INTRODUCTION
An information retrieval (IR) system aims to find documents that are relevant to a user's information need. This need is typically expressed as a query consisting of a small number of words, because it is difficult for the user to formulate it precisely. In fact, the user's original query, consisting of just a few terms, is often too short to describe the information need accurately, which makes answering it difficult and leads to poor coverage of the relevant documents [1].
To overcome this limitation, the user's original query is augmented with new terms of similar meaning, a process commonly known in the information retrieval community as automatic query expansion (AQE). The main purpose of AQE is to improve the retrieval effectiveness of document ranking by automatically interpreting the query and producing related terms that are close to the original query [2].
AQE is based on feedback documents, which can come from user relevance judgments (relevance feedback, RF), from the top retrieved documents for the initial query (pseudo-relevance feedback, PRF), or even from the whole collection of documents or from external resources such as a thesaurus or an ontology (global feedback) [3]. One of the most popular AQE techniques, inspired directly by relevance feedback, is the PRF method. PRF [4, 5, 6, 7, 8, 9, 10] aims to learn the user's requirements during a search session and, through implicit user involvement, adapt the behavior of the retrieval process. The
basic idea is to reformulate the query such that it gets closer
to the relevant documents by using the first documents
retrieved in response to the original query as a more detailed
description of the underlying query topic, from which to
extract the most important terms to be used as expansion
features. Since PRF is an important technique for improving the effectiveness of IR systems, it has been widely proposed and studied for various retrieval models [4, 5, 11, 12, 13] and still attracts the attention of the IR community [14, 15, 6, 16, 17, 8, 18].
The implementation of PRF methods always requires two
rounds of retrieval. In the first round, an initial set of
documents is retrieved with respect to an original query.
In the second round, another set of documents is retrieved based on an expanded query, generated by expanding the original query with relevant terms selected from the previously retrieved document set. This second round can be repeated more than once, but the challenge is how to ensure the effectiveness of the IR system each time; in other words, we keep performing feedback rounds only as long as the IR system retrieves more and more relevant documents. On the other hand, there has been interest in applying machine learning and clustering methods to select, from the feedback documents, a subset deemed representative of the subject of the query [16, 19, 20, 21, 22]. Although it is well observed that cluster-based PRF improves effectiveness, it remains interesting to reuse feedback documents more than once in order to bring more relevant and novel information to the user.
For this purpose, in this paper, we estimate the effectiveness of any PRF method by studying the behavior of a set of features. These features can help us predict the effectiveness of successive PRF rounds and avoid query drift. This offers a new perspective on what is commonly known in the IR community as query performance prediction or query ambiguity [23]. The experimental results show that the proposed features correlate significantly with IR effectiveness in terms of mean average precision. One of the major issues in PRF is selecting the most appropriate feedback documents to be used in the expansion phase; in our work, we consider PRF based on document clustering.
The remainder of this paper is organized as follows. We first give background on pseudo-relevance feedback and query performance prediction in Section II. Section III formulates the problem. Sections IV and V present the FPF clustering algorithm and the set of proposed features. Section VI presents our study of how to predict the effectiveness of any PRF method. The evaluation is presented in Section VII, and we conclude with a look at future work in Section VIII.
II. BACKGROUND
Pseudo-relevance feedback learns from the set of top retrieved documents to add expansion terms to the original query, treating these documents as relevant when no user relevance judgments are available. It is a significant technique in IR and has been shown over the past decades to improve the average effectiveness of a number of retrieval models [14, 24, 25, 26, 27, 28]. The PRF algorithm works in the following steps:
Step 1. The user submits his information need as a query Q.
Step 2. The IR system evaluates the query and returns a ranked list L of retrieved documents to the user.
Step 3. From L, the IR system takes the top N documents as feedback documents.
Step 4. Terms from the feedback documents are added to expand the query Q.
Step 5. Steps 2-4 are repeated until the user is satisfied with the result.
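To make the loop concrete, the following minimal sketch shows Steps 1-5 in Python. The helpers search and extract_expansion_terms are hypothetical stand-ins for the underlying retrieval engine and the term-weighting scheme (for instance, the KLD weights of Eq. (1) below); this is a sketch under those assumptions, not a prescribed implementation.

def pseudo_relevance_feedback(query, search, extract_expansion_terms,
                              n_docs=20, n_terms=10, max_passes=1):
    # Step 2: retrieve a ranked list L for the current query.
    # Step 3: keep the top-N documents as pseudo-relevant feedback.
    # Step 4: expand the query with terms mined from the feedback set.
    # Step 5: repeat (here, for a fixed number of passes).
    for _ in range(max_passes):
        ranked_list = search(query)
        feedback_docs = ranked_list[:n_docs]
        for term in extract_expansion_terms(feedback_docs, n_terms):
            if term not in query:
                query = query + [term]
    return search(query), query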
But this mechanism can also fail, particularly for ambiguous queries [29, 30]. Moreover, several works [19, 16] have observed that the quality of the expansion terms selected from the top-ranked documents (which may contain irrelevant information) has a strong effect on the effectiveness of PRF methods. Thus, a major issue is selecting the most appropriate documents from the feedback documents. Clustering is one of the techniques adopted to improve PRF and overcome this problem, and it has been well studied [21, 20, 31, 32]. The main idea of clustering methods is to exploit the relationships between documents in the top-ranked list and to detect clusters whose topics are similar to the query.
On the other hand, estimating the quality of the retrieved document set for a query is known as query performance prediction. It is an important and challenging task in IR research [23, 33]. The central goal of this field is to identify features that can be used to predict query performance; these features are categorized into pre-retrieval and post-retrieval methods [34].
Our work lies at the intersection of PRF methods and query performance prediction. Through this study, we want to identify a number of post-retrieval features to guide the evolution of any PRF method when it is performed more than once: if a PRF pass retrieves more relevant documents (according to the features), we perform another pass, and so on; once no more relevant documents are retrieved, we stop the PRF process.
III. PROBLEM FORMULATION
Let R be an IR system with any retrieval model. For a given query Q and a document collection C, R returns a ranked list of documents L. Any PRF method assumes that the set S containing the N top-ranked documents of L is relevant and uses it to enhance IR effectiveness by adding to the original query terms derived from these first N documents. Since PRF has proven to be an effective strategy for improving retrieval accuracy across retrieval models, it is natural to ask whether this strategy can be repeated more than once to retrieve more and more relevant documents.
In Figure 1, the circle on the left represents the retrieved documents and the circle on the right the set of relevant documents for the given query Q. After the first PRF pass, we face two situations. In situation A, the system R retrieves more relevant documents; in situation B, no more relevant documents are retrieved. So, if the IR system detects that it is in situation A, it keeps performing PRF passes until it reaches situation B, at which point it stops.
Fig 1: Cases of PRF pass.
Our goal is to estimate the contribution of any PRF method by studying the quality of the N top-ranked documents of L in relation to the query, which will then be rephrased in order to retrieve more relevant documents. Since we use PRF as the AQE method, the top-ranked documents contain both relevant and irrelevant documents. So, in order to detect the subset of S most useful for query expansion, we use clustering methods.
Since in PRF the main unit is the expansion term, we represent each document d in S by a vector of term weights. We rank the unique terms in S according to their Kullback-Leibler divergence (KLD) weights; KLD has been shown to be an effective weighting mechanism in many state-of-the-art AQE methods [35, 36]. Terms with higher KLD scores are the more informative ones and will therefore
represent the documents well; terms with lower scores are ignored.
KLD measures the divergence of a term's distribution in the set of feedback documents S from its distribution in the whole collection. The KLD weight is given by:

$KLD(t) = p(t|S) \log \frac{p(t|S)}{p(t|C)}$   (1)

where $p(t|S) = \frac{c(t,S)}{c(S)}$ is the probability of term t in the set S of the N top-retrieved documents; c(t, S) is the frequency of the term t in S, and c(S) is the total count of terms in S. Likewise, $p(t|C) = \frac{c(t,C)}{c(C)}$ is the probability of the term t in the entire document collection C, where c(t, C) is the frequency of the term t in C and c(C) is the total count of terms in the whole collection.
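As an illustration, here is a short sketch of this weighting under the definitions above; the collection statistics c(t, C) and c(C) are assumed to be available from the index.

import math
from collections import Counter

def kld_term_ranking(feedback_docs, coll_counts, coll_len):
    """Rank the unique terms of S by the KLD weight of Eq. (1).

    feedback_docs -- the N top-retrieved documents, each a list of tokens
    coll_counts   -- c(t, C): term frequencies over the whole collection
    coll_len      -- c(C): total number of term occurrences in the collection
    """
    s_counts = Counter(t for doc in feedback_docs for t in doc)
    s_len = sum(s_counts.values())                        # c(S)
    weights = {}
    for t, c_ts in s_counts.items():
        p_ts = c_ts / s_len                               # p(t|S)
        p_tc = coll_counts.get(t, 0) / coll_len           # p(t|C)
        if p_tc > 0:
            weights[t] = p_ts * math.log(p_ts / p_tc)     # Eq. (1)
    # higher KLD score = more informative expansion term
    return sorted(weights, key=weights.get, reverse=True)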
One of the major issues in PRF is selecting the most appropriate feedback documents to be used in the expansion phase. Document clustering is one way to select appropriate documents for AQE. Many clustering techniques have been developed and can be applied to documents. The document clustering used in this study is carried out with the FPF (Furthest Point First) algorithm, presented below [37]. This algorithm has a low computational cost of O(nk) (n: number of feedback documents, k: number of clusters) and is parameter-free except for the number of centers, which makes it advantageous in many applications such as document clustering.
IV. FPF ALGORITHM
The FPF algorithm is a polynomial-time clustering algorithm introduced by Gonzalez [37]. Unlike the K-means algorithm, FPF does not choose the k centroids randomly but builds them iteratively. At each iteration, it computes the cosine similarity defined below between the documents and the centroid of their cluster. The document furthest from the current set of centroids is chosen as a new centroid and added to this set. The FPF algorithm can be summarized in an initialization step and an iteration step, as follows:
Initialization
- Pick an arbitrary document as centroid_1, the centroid of cluster 1,
- Assign all documents of S to cluster 1.
Iteration (i = 2, ..., K)
- Designate the document furthest from the current set of centroids as centroid_i,
- Assign to cluster i each document closer to centroid_i than to its current centroid.
The similarity used in this work is the cosine measure. The cosine distance between a document d and a center c is defined as:

$dist(d, c) = 1 - \frac{\sum_{t} w_{t,d} \, w_{t,c}}{\sqrt{\sum_{t} w_{t,d}^{2}} \sqrt{\sum_{t} w_{t,c}^{2}}}$   (2)

where $w_{t,d}$ and $w_{t,c}$ are the weights of term t in d and c, respectively.
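The following is a compact sketch of the two steps above, using sparse term-weight vectors (Python dicts) and the cosine distance of Eq. (2); the dict-based representation is an implementation assumption.

import math

def cosine_distance(u, v):
    # Eq. (2): 1 minus the cosine of the angle between two term-weight vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return 1.0 if nu == 0.0 or nv == 0.0 else 1.0 - dot / (nu * nv)

def fpf_cluster(docs, k):
    """Furthest Point First clustering of document vectors into k clusters."""
    assign = [0] * len(docs)                   # initialization: all docs in cluster 1
    centroids = [0]                            # arbitrary first centroid (doc 0)
    dist = [cosine_distance(d, docs[0]) for d in docs]  # distance to own centroid
    for i in range(1, k):                      # iteration i = 2..K
        far = max(range(len(docs)), key=lambda j: dist[j])  # furthest document
        centroids.append(far)
        for j, d in enumerate(docs):           # re-assign docs closer to the new centroid
            dj = cosine_distance(d, docs[far])
            if dj < dist[j]:
                dist[j], assign[j] = dj, i
    return assign, centroids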
V. FEATURES
In order to predict the robustness of any PRF method when it is performed more than once, we propose some features that capture intrinsic characteristics of the query, of the feedback documents selected from the N top-ranked documents via a document clustering algorithm, and of their relationship. The proposed list of features is inspired by previous works [38, 39]; it is intuitive, simple to implement, and has a very low computational cost. These features are mainly based on the query, the retrieved documents, and the relationship between them.
Query Clarity (QC). Query clarity is an intrinsic feature of a query. It has been shown to predict query ambiguity well [39, 40], so we expect it to also predict the effectiveness of a PRF method. By definition, the clarity of a query is the Kullback-Leibler divergence of the query model from the collection model. To avoid the computational cost of this definition, we follow the simplified clarity score of [40], which does not require computing relevance scores for the query model:

$QC = \sum_{t \in Q} p(t|Q) \log_2 \frac{p(t|Q)}{p(t|C)}$   (3)

where $p(t|Q) = \frac{c(t,Q)}{|Q|}$, c(t, Q) is the count of term t in the query Q, and |Q| is the total number of terms in Q.
Feedback Documents Clarity (FDC). Similar to the query clarity above, we define FDC as the Kullback-Leibler divergence of the feedback documents model from the collection model. Intuitively, a large value of this feature suggests that the set of feedback documents is on the same topic and far from noise. We therefore expect this feature to tell us whether another PRF pass can be performed beneficially. It is defined as:

$FDC = \sum_{t \in S} p(t|S) \log_2 \frac{p(t|S)}{p(t|C)}$   (4)
Query Entropy (QE). As mentioned in [38], the query is often very short, so we compute the query entropy based on the set of feedback documents S. It is defined as:

$QE = -\sum_{t \in Q} p(t|S) \log_2 p(t|S)$   (5)
Selected Documents Entropy (SDE). In our work, the PRF process is augmented with a clustering phase in order to detect the most useful documents for query expansion. At this stage, each cluster is treated as one large document constructed by concatenating all the documents belonging to it. Once formed, the clusters are ranked by the query-likelihood retrieval model [41], which scores a document d with respect to a query Q as:

$score(d, Q) = \prod_{t \in Q} p(t|d)$   (6)

where $p(t|d) = \lambda \frac{c(t,d)}{|d|} + (1 - \lambda) p(t|C)$; c(t, d) is the count of the term t within the document d, |d| is the total number of terms in the document d, and $\lambda$ is a smoothing parameter used to obtain non-zero values for query terms that do not occur in the document.
The documents belonging to the top-ranked cluster constitute the set of documents selected for query expansion, which we call SD. The entropy of the selected documents is then:

$SDE = -\sum_{t \in SD} p(t|SD) \log_2 p(t|SD)$   (7)
Feedback Documents Broadness (FDB). To compute this feature, we use the set S of the N top-retrieved documents. Intuitively, if the documents in this set are close to one another, it may mean that they focus on the same topic and are away from the noise. The broadness of the set S is defined as the average distance between each document and the centroid of the set S:

$FDB = \frac{1}{N} \sum_{d \in S} dist(d, \bar{c})$   (8)

where dist is the cosine distance of Eq. (2) and the centroid is given by:

$\bar{c} = \frac{1}{N} \sum_{d \in S} \vec{d}$   (9)
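The five features reduce to three reusable computations: a KL divergence against the collection model (QC, FDC), a term-distribution entropy (QE, SDE), and an average distance to a centroid (FDB). The following is a minimal sketch under the definitions above, reusing cosine_distance from the FPF sketch; the smoothed query-likelihood scorer for ranking clusters (Eq. (6)) is included with an assumed Jelinek-Mercer interpolation and an illustrative default for lambda.

import math
from collections import Counter

def _probs(counts):
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def clarity(local_counts, coll_probs):
    # Eqs. (3)-(4): KL divergence of a local model from the collection model.
    return sum(p * math.log2(p / coll_probs[t])
               for t, p in _probs(local_counts).items()
               if coll_probs.get(t, 0.0) > 0.0)

def entropy(counts):
    # Eqs. (5) and (7): Shannon entropy of a term distribution.
    return -sum(p * math.log2(p) for p in _probs(counts).values())

def ql_score(query_terms, doc_counts, coll_probs, lam=0.5):
    # Eq. (6): smoothed query likelihood of a (concatenated) cluster document.
    d_len = sum(doc_counts.values())
    score = 1.0
    for t in query_terms:
        score *= lam * doc_counts.get(t, 0) / d_len + (1 - lam) * coll_probs.get(t, 0.0)
    return score

def broadness(doc_vectors):
    # FDB, Eqs. (8)-(9): average cosine distance of each document of S to the centroid.
    centroid = Counter()
    for d in doc_vectors:
        centroid.update(d)
    centroid = {t: w / len(doc_vectors) for t, w in centroid.items()}
    return sum(cosine_distance(d, centroid) for d in doc_vectors) / len(doc_vectors)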
VI. STUDY OF THE DEPENDENCE BETWEEN THE
FEATURES AND PRF PASS
The central problem in pseudo-relevance feedback is how to select the appropriate documents from the top-retrieved documents, since not all of them are relevant and there is a high chance of using irrelevant documents for expansion. To overcome this issue, we use clustering techniques, which have proven effective. We construct clusters using the FPF algorithm [42].
After the selection of the most useful documents, the query is expanded with new terms and another set of documents is retrieved. At this stage, if we assume that the expanded query represents the user's information need well, then the new set of retrieved documents must contain more relevant documents, in which case another PRF pass will be beneficial. For this purpose, we study the behavior of the set of features described above (Section V): after each PRF pass, we compute the features QC, FDC, QE, SDE and FDB.
Our study procedure follows the five steps below:
Step 1: For a given query Q, retrieve the set S of the N top-ranked documents according to their system relevance.
Step 2: Split the set S into K clusters using the FPF clustering algorithm.
Step 3: Rank the resulting K clusters according to their score using formula (6).
Step 4: Use the K' top-ranked clusters as the set of documents for the feedback pass.
Step 5: Compute the set of features QC, FDC, QE, SDE and FDB.
These steps are performed several times in order to observe the behavior of the features according to IR system performance. In our experiments, we performed five passes; a sketch of the resulting control loop is given below.
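The decision rule in this sketch (continue while FDC rises and FDB falls, mirroring the two feature classes observed in Section VII) is one plausible instantiation, not the only one; search, expand and compute_features are hypothetical helpers wired to the components sketched earlier.

def controlled_prf(query, search, expand, compute_features,
                   n_docs=20, max_passes=5):
    """Perform PRF passes while the features indicate situation A; stop at B."""
    prev = None
    ranked = search(query)
    for _ in range(max_passes):
        feedback_docs = ranked[:n_docs]               # the set S
        cur = compute_features(query, feedback_docs)  # QC, FDC, QE, SDE, FDB
        if prev is not None:
            # situation B: clarity no longer rising, broadness no longer falling
            if cur["FDC"] <= prev["FDC"] and cur["FDB"] >= prev["FDB"]:
                break
        prev = cur
        query = expand(query, feedback_docs)          # situation A: another pass
        ranked = search(query)
    return ranked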
VII. EVALUATION
In this section, we report our experimental investigation of the proposed scheme. The objective of our experiments is to investigate how the positive contribution of any PRF method can be predicted in an unsupervised manner by studying the set of features described above.
A. Experimental Setup
We use in our study the TREC ad-hoc test collection (Disks 4 and 5). The data set contains about 528,000 news documents from various sources such as the Federal Register, the Financial Times, etc., which are considered high-quality text data with minimal noise. On average, each document includes 557 terms. The topic set consists of the TREC 6 ad-hoc task topics, numbered 301 to 350. Each topic has three fields: the title, which includes the most relevant words (1-3 words); the description, which gives a wider explanation of the query; and a narrative, which contains specific conditions for accepting or rejecting documents. The 50 ad-hoc queries were taken from the title field only, a realistic setting that reflects the concise nature of real users' queries [36]. The average query length in TREC 6 is 2.64 terms.
We use the full-featured open-source text search engine library Lucene (https://lucene.apache.org). For all queries and documents in the test collection, each term is stemmed using Porter's English stemmer [43], and stop words are removed.
To measure the correlation of each feature and its behavior with query performance, we use the Mean Average Precision (MAP) as the measure of query performance. MAP is defined as:

$MAP = \frac{1}{|NQ|} \sum_{j=1}^{|NQ|} AP_j$   (10)

where |NQ| is the number of queries and $AP_j$, the average precision for a given query $Q_j$, is defined as:

$AP_j = \frac{1}{n_j} \sum_{k} P@k \cdot rel_j(k)$   (11)

where $n_j$ is the number of relevant documents for $Q_j$, $rel_j(k)$ is 1 if the document at rank k is relevant to $Q_j$ and 0 otherwise, and P@k, the precision at k, is the fraction of the top k retrieved documents that are relevant.
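For completeness, here is a direct transcription of Eqs. (10)-(11); relevance judgments are assumed to be given as a set of relevant document ids per query.

def average_precision(ranked_ids, relevant_ids):
    # Eq. (11): average of P@k over the ranks k that hold a relevant document.
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / k        # P@k at this relevant rank
    return total / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    # Eq. (10): runs is a list of (ranked_ids, relevant_ids) pairs, one per query.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)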
B. Results
The main aim of this work is to study the correlation
between the set of features described above and the IR
system effectiveness.
A well-known problem of typical pseudo-relevance feedback algorithms is their instability with respect to the variation of two major parameters, namely the number of feedback documents and the number of expansion terms [1]. Since our purpose is not to show the contribution of the clustering step to the retrieval phase, we kept the same number of clusters for all experiments and varied the number of documents and terms. Figures 2, 3 and 4 present results when the number of feedback documents is fixed to 20, 50 and 100, respectively. In each figure, the two graphs show the correlation between the features and MAP when the number of terms is 5 and 10, respectively.
All experiments show that the features form two classes: one class is formed by FDC and QC, and the other by FDB, QE and SDE. When the MAP increases,
the features of the first class increase proportionally while the features of the second class decrease. We note that FDB is the most sensitive feature to MAP variation. This is logical, since when the feedback documents form a high-quality cluster, the IR system retrieves more relevant documents and consequently the MAP generally increases.
VIII. CONCLUSIONS
This paper presented the results of an experimental study of a set of features aimed at automatically controlling any PRF method. We have observed that this PRF scheme can be controlled by simply monitoring the behavior of these features. We believe that the main reason for this result is that when the PRF method is effective, the set of feedback documents contributes positively to the IR system performance and has lower entropy, as does the expanded query.
We plan to pursue several directions in future work. We will investigate the robustness of this proposition through more extensive experiments in different settings. Moreover, further work will focus on developing more efficient features to predict the positive contribution of any PRF method to the IR system performance. We also plan to develop a single function that combines all the features and exhibits the same behavior.
(a): #terms=5
(b): #terms=10
Fig 2: Behavior of FDB, SDE, FDC, QE and QC according to MAP, #doc=20.
(a): #terms=5
(b): #terms=10
Fig 3: Behavior of FDB, SDE, FDC, QE and QC according to MAP, #doc=50.
(a): #terms=5
(b): #terms=10
Fig 4: Behavior of FDB, SDE, FDC, QE and QC according to MAP, #doc=100.
REFERENCES
[1] Carpineto C., and Romano G., A Survey of Automatic Query
Expansion in Information Retrieval, ACM Computing Surveys,
44, 1 (2012), 1-50.
[2] Manning C., D., Raghavan P., and Schütze H., Introduction to Information Retrieval, Cambridge University Press, England (2008).
[3] Baeza-Yates R., A., and Ribeiro-Neto B., Modern Information Retrieval. Pearson Education Ltd., 2nd edition (2011).
[4] Lavrenko V., and Croft B., Relevance-based language models, In
Proceedings of SIGIR’01, (Sept 2001), 120–127.
[5] Rocchio J., Relevance feedback in information retrieval, In The
SMART Retrieval System: Experiments in Automatic Document
Processing, Prentice-Hall Inc (1971), 313–323.
[6] Lv Y., Zhai C., and Chen W., A boosting approach to improving
pseudo-relevance feedback, In Proceedings of SIGIR’11 (2011),
165-174.
[7] Miao J., Huang J., and Ye Z., Proximity-based Rocchio's model for pseudo relevance feedback. In Proceedings of SIGIR'12 (2012), 535-544.
[8] Singh J., and Sharan A., A new fuzzy logic-based query expansion
model for efficient information retrieval using relevance feedback
approach, Neural Computing and Applications (2016), 1-24.
[9] Aly A., Using a query expansion technique to improve document
retrieval, International Journal Information Technologies and
Knowledge, 2(2008), 4, 343-348.
[10] Vaidyanathan R., Das S., and Srivastava N., Query Expansion Strategy based on Pseudo Relevance Feedback and Term Weight Scheme for Monolingual Retrieval, arXiv preprint arXiv:1502.05168 (2015).
[11] Salton G., and Buckley C., Improving retrieval performance by relevance feedback, Journal of the American Society for Information Science, 41(1990), 4, 288–297.
[12] Zhai C., and Lafferty J., Model-based feedback in the KL-
divergence retrieval model, In Tenth International Conference on
Information and Knowledge Management (CIKM 2001), (2001)
403–410.
[13] Xu J., and Croft W., B., Improving the effectiveness of
information retrieval with local context analysis, ACM
Transactions on Information Systems, 18(2000), 79–112.
[14] Lv Y., and Zhai C., Positional relevance model for pseudo-
relevance feedback. In Proceedings of SIGIR’10 (2010), 579-596.
[15] He B., and Ounis I., Finding good feedback documents. In CIKM
'09 (2009), 2011–2014.
[16] Huang J., X., Miao J., and He B., High performance query
expansion using adaptive co-training. Information Processing &
Management, 49(2013), 2, 441-453.
[17] Pal D., Mitra M., Bhattacharya S., Improving Pseudo Relevance
Feedback in the Divergence from Randomness
Model. ICTIR2015: Proceedings of the 2015 International
Conference on The Theory of Information Retrieval (2015),
Northampton, MA, USA, 325-328.
[18] Xu J., Croft W., B., Query expansion using local and global document analysis. In: ACM SIGIR Forum, 51(2017), 168–175.
[19] Cao G., Nie J., Y., Gao J., and Robertson S., Selecting good
expansion terms for pseudo-relevance feedback, In Proceedings of
31st annual international ACM conference on research and
development in information retrieval, Singapore (2008), 243-250.
[20] Lee K., S., and Croft W., B., A deterministic resampling method
using overlapping document clusters for pseudo-relevance
feedback, Information Processing and Management, 49(2013),
792–806.
[21] Gelfer Kalmanovich I., and Kurland O., Cluster-based query expansion, In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09 (2009), Boston, MA, USA, 646–647.
[22] Levi O., Raiber F., Kurland O., and Guy I., Selective Cluster-
Based Document retrieval. CIKM 2016(2016), 1473-1482.
[23] Lang H., Wang B., Jones J., Li J., T., Ding F., and Liu Y., X.,
Query Performance Prediction for Information Retrieval Based on
Covering Topic Score. Journal Of Computer Science And
Technology, 4(2008), 23, 590-601.
[24] Tao T., and Zhai C., Regularized estimation of mixture models for
robust pseudo-relevance feedback, In SIGIR 2006, Seattle,
Washington, USA (2006), 162–169.
[25] Buckley C., Salton G., and Allan J., Automatic retrieval with locality information using SMART. In: TREC (1992), 59–72.
[26] Udupa R., Bhole A., and Bhattacharya P., A term is known by the
company it keeps: On selecting a good expansion set in pseudo-
relevance feedback, In Proceedings of ICTIR 2009, Advances in
IR Theory. Springer (2009), 104-115.
[27] Dang E., K., F., Luk R., W., P., Allan J., Fast forward index methods for pseudo-relevance feedback retrieval, ACM Transactions on Information Systems, 33(2015), 4, 1–33.
[28] He B., and Ounis I., Finding good feedback documents, In Proceedings of CIKM (2009), 2011–2014.
[29] Amati G., Carpineto C., and Romano G., Query difficulty, robustness and selective application of query expansion, In Proc. of ECIR (2004), 127-137.
[30] Singh J., and Sharan A., Relevance Feedback Based Query
Expansion Model Using Borda Count and Semantic Similarity
Approach, Computational intelligence and neuroscience, 2015.
[31] Sakai T., Manabe T., and Koyama M., Flexible pseudo-relevance feedback via selective sampling, ACM Transactions on Asian Language Information Processing (TALIP), 4(2005), 2, 111–135.
[32] Huang Q., Song D., and Rüger S., Robust query-specific pseudo feedback document selection for query expansion, In Proceedings of the 30th European Conference on Advances in Information Retrieval, ECIR'08, Glasgow, UK (2008), 547–554.
[33] Raiber F., and Kurland O., Query-performance prediction: setting the expectations straight, In Proceedings of SIGIR 2014 (2014), 13-22.
[34] Hauff C., Hiemstra D., and de Jong F., A survey of pre-retrieval query performance predictors. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, New York, NY, USA (2008). ACM, 1419–1420.
[35] Parapar J., Barreiro A., Promoting Divergent Terms in the
Estimation of Relevance Models. In: Amati, G., Crestani, F. (eds.)
ICTIR 2011. LNCS, Springer, Heidelberg, 6931 (2011), 77–88.
[36] Ye Z., Huang X., He B., and Lin H., York university at TREC
2009: relevance feedback track. In Proceedings of TREC 2009
(2009).
[37] Gonzalez T., F., Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38(1985), 293-306.
[38] Lv Y., and Zhai C., Adaptive relevance feedback in information retrieval. In CIKM '09 (2009).
[39] Cronen-Townsend S., Zhou Y., and Croft W., B., Predicting query performance, In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland (2002), 299-306.
[40] He B., and Ounis I., Query performance prediction, Information Systems, 31(2006), 7, 585-594.
[41] Liu X., Croft W., B., Cluster-based retrieval using language
models. In Proc. 27th ACM, SIGIR conference on research and
development in information retrieval (2004), 186–193.
[42] Geraci F., Pellegrini M., Pisati P., and Sebastiani F., A scalable algorithm for high-quality clustering of Web snippets, In Proceedings of SAC-06, 21st ACM Symposium on Applied Computing, Dijon, FR (2006), 1058–1062.
[43] Porter M., F., An algorithm for suffix stripping, In Readings in
Information Retrieval (1997), 313–316.