Article

Abstract

This paper gives a brief description of the Okapi projects and the work of the Centre for Interactive Systems research, as an introduction to this special issue of the Journal of Documentation. Okapi is the name given to an experimental text retrieval system (or rather, family of systems, as will be discussed below), based at City University, London. The current systems and their predecessors have been used as the basis for a series of projects, generally addressing aspects of user information-seeking behaviour and user-system interaction, as well as system design. The projects have been supported extensively by the British Library, and to some degree by a number of other funders. They have been at City since 1989; for the previous seven years they were based at the Polytechnic of Central London (now the University of Westminster). In order to give a picture of the system(s) that now constitute Okapi, it is appropriate to describe one version containing some of the features that have become central to the Okapi projects, and then to indicate the variety of systems now implemented or implementable within the present setup, as well as the directions it may go in the future. In what follows, papers in this issue are referred to by brief titles.


... Document-Based Scoring Using Local Collection. Common retrieval models such as BM25 [38], language models [34], and DFR [2] are in this cell. Among these widely used models, the Sequential Dependence Model (SDM) has consistently demonstrated strong effectiveness [29]. ...
... The first two functions are based on commonly used retrieval models, Query Likelihood (QL) [34,53] and BM25 [38]. For QL, µ controls the degree of Dirichlet smoothing and p(t|C) is the background (collection) language model. ...
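Both scoring functions are compact enough to sketch directly. The following is a minimal, illustrative Python version; the parameter defaults k1=1.2, b=0.75 and mu=2000 are common textbook choices and an assumption here, not values taken from the cited work.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a bag-of-words query."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        df = doc_freqs.get(t, 0)
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)   # smoothed idf
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avg_doc_len))
        score += idf * norm
    return score

def ql_dirichlet_score(query_terms, doc_terms, collection_prob, mu=2000.0):
    """Query-likelihood score with Dirichlet smoothing: log p(q | d)."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_t_c = collection_prob.get(t, 1e-9)          # background model p(t|C)
        p_t_d = (tf.get(t, 0) + mu * p_t_c) / (dl + mu)
        score += math.log(p_t_d)
    return score
```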
Conference Paper
Full-text available
Evidence derived from passages that closely represent likely answers to a posed query can be useful input to the ranking process. Based on a novel use of Community Question Answering data, we present an approach for the creation of such passages. A general framework for extracting answer passages and estimating their quality is proposed, and this evidence is integrated into ranking models. Our experiments on two web collections show that such quality estimates from answer passages provide a strong indication of document relevance and compare favorably to previous passage-based methods. Combining such evidence can significantly improve over a set of state-of-the-art ranking models, including Quality-Biased Ranking, External Expansion, and a combination of both. A final ranking model that incorporates all quality estimates achieves further improvements on both collections.
... For each instance (qi, dj), a feature extractor produces a vector of features that describes the match between qi and dj. Such features can be classical IR models (e.g., term frequency, inverse document frequency and Okapi BM25 [40]) or newly developed models (e.g., Feature Propagation [37] and Topical PageRank [33]). The inputs to the learning algorithm comprise training instances, their feature vectors and the corresponding relevance judgments. ...
... [47] exploits GP to develop new general-purpose ranking functions based on primitive atomic features. Four baseline ranking functions, namely inner product, cosine measure, probability measure and Okapi BM25 [40], are added to the initial population to guarantee that the worst performance of an individual is as good as the baselines. [14] uses GP to evolve term-weighting schemes in an ad-hoc IR model. ...
... In each fold, there are on average 9,127 instances in the training set, 3,042 instances in the validation set and 3,042 instances in the testing set. The features cover most of the standard IR features, such as low-level content features (e.g., term frequency, inverse document frequency and document length), high-level content features (e.g., Okapi BM25 [40] and LMIR [62] with different smoothing methods), and others, such as the in-link number of a webpage and the length of the URL. The relevance judgments are quantified on three levels, namely, 2 for definitely relevant, 1 for possibly relevant and 0 for not relevant. ...
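As a rough illustration of how such instances are assembled, the sketch below pairs a feature vector with a graded judgment. It reuses the bm25_score and ql_dirichlet_score helpers from the earlier snippet, and the feature choices are illustrative rather than the actual LETOR feature set.

```python
import math

def extract_features(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, collection_prob):
    """Illustrative feature vector for one (query, document) pair."""
    tf_sum = sum(doc_terms.count(t) for t in query_terms)                                # low-level: term frequency
    idf_sum = sum(math.log(num_docs / (1 + doc_freqs.get(t, 0))) for t in query_terms)   # low-level: idf
    return [
        tf_sum,
        idf_sum,
        len(doc_terms),                                                        # low-level: document length
        bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len),  # high-level: BM25
        ql_dirichlet_score(query_terms, doc_terms, collection_prob),           # high-level: LMIR
    ]

# A training instance pairs the feature vector with a graded judgment:
# 2 = definitely relevant, 1 = possibly relevant, 0 = not relevant.
# training_set = [(extract_features(q, d, ...), label) for q, d, label in judged_pairs]
```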
Article
Full-text available
Ranking plays a key role in many applications, such as document retrieval, recommendation, question answering, and machine translation. In practice, a ranking function (or model) is exploited to determine the rank-order relations between objects, with respect to a particular criterion. In this paper, a layered multipopulation genetic programming based method, known as RankMGP, is proposed to learn ranking functions for document retrieval by incorporating various types of retrieval models into a single, highly effective one. RankMGP represents a potential solution (i.e., a ranking function) as an individual in a population of genetic programming and aims to directly optimize information retrieval evaluation measures in the evolution process. Overall, RankMGP consists of a set of layers and a sequential workflow running through the layers. In one layer, multiple populations evolve independently to generate a set of the best individuals. When the evolution process is completed, a new training dataset is created using the best individuals and the input training set of the layer. Then, the populations in the next layer evolve with the new training dataset. In the final layer, the best individual is obtained as the output ranking function. The proposed method is evaluated using the LETOR datasets and is found to be superior to classical information retrieval models, such as Okapi BM25. It is also statistically competitive with the state-of-the-art methods, including Ranking SVM, ListNet, AdaRank and RankBoost.
... In the past, many IR models such as the Vector Space Model (VSM) [13,14], the Probabilistic Model [12,1] and the Language Model [10,18] have been proposed, all based on the term intersection approach. Term intersection is the approach in which the document and the query must share the same terms. ...
... P_ml(w|d) = c(w, d) / |d| (12), where |d| is the total length of the document d. Due to the data sparseness problem, the maximum likelihood estimator directly assigns zero probability to unseen words in a document. ...
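The usual remedy for this zero-probability problem is to smooth the maximum-likelihood estimate with the collection language model; Dirichlet smoothing is one standard form, shown here in its common reference formulation (not necessarily the exact variant of the cited work):

\[
P_{ml}(w \mid d) = \frac{c(w,d)}{|d|},
\qquad
P_{\mu}(w \mid d) = \frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu}
\]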
... They are generally grouped under the notion of terminological or ontological resources [2]. [12] proposes a first definition: "a computational resource describing the vocabulary and the concepts specific to a domain or a community for information processing purposes". This notion is then given concrete form in the work of [21]. ...
Book
We are pleased to organize in Lille the fifth edition of the workshop on Semantic Information Retrieval, RISE 2013, held in conjunction with the Plate-forme d'Intelligence Artificielle and with the support of ARIA (Association francophone de Recherche d'Information et Applications). The aim of the workshop is to provide a forum for discussing the synergy between the acquisition and management of semantic resources (ontologies, terminologies, thesauri, ...) and Information Retrieval. These topics lie at the crossroads of the Semantic Web, Knowledge Engineering, Natural Language Processing and Information Retrieval. The themes covered by the contributions accepted at RISE are the following: • Semantic Information Retrieval models • Ontologies and semantic annotation • Semantic similarity measures. The proposals covered a variety of domains: agriculture, medicine, botany, cultural heritage and law.
... In our work, we assume a feature extractor Φ : D × Q → R^s is given, which generates an s-dimensional feature vector x = (x_1, x_2, ..., x_s) from a query q and its associated document d. Of particular interest to us are features that involve both q and d, such as term frequency (TF), BM25 [33] and LMIR [34]. The labels are drawn from 0, 1 and 2, representing "irrelevant", "relevant" and "highly relevant". ...
... Thus the key to effectively training LTR models is to generate sufficient, high-quality training data from the raw data of the different parties. To generate useful and widely recognized LTR features such as BM25 [33] and LMIR [34], the term frequency (TF) query is necessary. We formally define the cross-party TF as below. ...
... For each instance, i.e., (q i , d j ), a feature extractor produces a vector of features that describes the match between q i and d j . Such features can be classical IR models (e.g., term frequency, inverse document frequency, and Okapi BM25 [32]) or newly developed models (e.g., HostRank [51], Feature Propagation [29,36], and Topical PageRank [26]). The inputs to the learning algorithm comprise training instances, their feature vectors, and the corresponding relevance judgments. ...
... However, empirical observations have found that features, i.e., the ranking models in this study, are not always independent. For example, TF-IDF [1] and Okapi BM25 [32] are considered somewhat correlated since both are designed based on term frequency and inverse document frequency. In such cases, the feature vector model neglects the correlations between features and treats the features as independent coordinate axes. ...
Article
As a crucial task in information retrieval, ranking defines the preferential order among the retrieved documents for a given query. Supervised learning has recently been dedicated to automatically learning ranking models by incorporating various models into one effective model. This paper proposes a novel supervised learning method in which instances are represented as bags of contexts of features, instead of bags of features. The method applies rank-order correlations to measure the correlation relationships between features. The feature vectors of instances, i.e., the 1st-order raw feature vectors, are then mapped into the feature correlation space via projection to derive the context-level feature vectors, i.e., the 2nd-order context feature vectors. As for ranking model learning, Ranking SVM is employed with the 2nd-order context feature vectors as the input. The proposed method is evaluated using the LETOR benchmark datasets and is found to perform well with competitive results. The results suggest that the learning method benefits from the rank-order-correlation-based feature vector context transformation.
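One plausible way to realize the mapping described above is sketched below, under the assumption that the projection is simply multiplication by the matrix of pairwise rank-order (Spearman) correlations between features; the cited paper may use a different correlation or projection.

```python
import numpy as np
from scipy.stats import spearmanr

def context_feature_vectors(X):
    """Project 1st-order feature vectors (rows of X, shape n x s) into a
    feature-correlation space to obtain 2nd-order context feature vectors.

    Assumption: the projection is X @ C, where C is the s x s matrix of
    pairwise Spearman rank-order correlations between feature columns.
    """
    X = np.asarray(X, dtype=float)
    n, s = X.shape
    corr = np.eye(s)
    for i in range(s):
        for j in range(i + 1, s):
            rho, _ = spearmanr(X[:, i], X[:, j])
            corr[i, j] = corr[j, i] = rho
    return X @ corr   # each row is now a 2nd-order context feature vector
```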
... In IR, stopwords are considered ineffective for distinguishing relevant documents from non-relevant ones. Search engines such as Okapi (Robertson, 1997), Terrier (Ounis et al., 2005), Lemur and Lucene4IR use stopword lists before processing documents and queries to improve system performance, memory utilization and processing time (Feng et al., 2006; Gerlach et al., 2019; Ibrahim, 2006; Sadeghi & Vegas, 2014). Considering the complexity of languages, stopwords can be eliminated before ...
Article
Full-text available
The development of information retrieval systems and natural language processing tools has been made possible for many natural languages because of the availability of natural language resources and corpora. Although Amharic is the working language of Ethiopia, it is still an under-resourced language. There are no adequate resources and corpora for Amharic ad-hoc retrieval evaluation to date. The existing ones are not publicly accessible and are not suitable for making scientific evaluation of information retrieval systems. To promote the development of Amharic ad-hoc retrieval, we build an ad-hoc retrieval test collection that consists of raw text, morphologically annotated stem-based and root-based corpora, a stopword list, stem-based and root-based lexicons, and WordNet-like resources. We also created word embeddings using the raw text and morphologically segmented forms of the corpora. When building these resources and corpora, we heavily consider the morphological characteristics of the language. The aim of this paper is to present these Amharic resources and corpora that we made available to the research community for information retrieval tasks. These resources and corpora are also evaluated experimentally and by linguists.
... Text ranking is a central problem for an information retrieval system. Traditional approaches rank text mostly with vector-based methods (Baeza-Yates et al., 1999; Deerwester et al., 1990) and probabilistic methods (Maron and Kuhns, 1960; Robertson, 1997). Later, learning-to-rank methods were developed by applying supervised machine learning to ranking problems using manually engineered features (Cao et al., 2007; Li, 2011; Liu, 2011). ...
... The mixed state thus created shows that the global and the local are accepted together (Berger J. O., 2003). Robertson S. E. (1997) explains glocal marketing as an effort by global brands to position their products in the target market by using motifs specific to a certain country only for that country. This approach requires integration with the local market and the establishment of a flexible management approach. ...
Article
This study aims to investigate the role of human-machine interaction and automatic analysis concepts, which are a product of digitalization, on glocalization. In recent years, great attention has been paid to artificial intelligence applications in marketing, as in every other area of business. With the application of artificial intelligence in marketing, marketing strategies have also changed. The concept of glocalization, one of the current marketing strategies, is formed from the combination of the words global and local. Glocalization means implementing marketing strategies compatible with the cultural characteristics of the target market while presenting global products to local markets. New marketing behaviour has been created through digital developments such as glocalization, human-machine interaction, and automatic analysis.
... We used the features provided by LETOR 4.0. For each document, there are 6 hyperlink features (PageRank value, inlink number, outlink number, number of slashes in URL, length of URL, and number of child pages) and 40 content features consisting of 20 classical features such as document length and term frequency, and 20 high level features such as results of BM25 (Robertson 1997) and LMIR (Zhai and Lafferty 2001) algorithms. Evaluation measures. ...
Article
We propose CCRank, the first parallel algorithm for learning to rank, targeting simultaneous improvement in learning accuracy and efficiency. CCRank is based on cooperative coevolution (CC), a divide-and-conquer framework that has demonstrated high promise in function optimization for problems with large search space and complex structures. Moreover, CC naturally allows parallelization of sub-solutions to the decomposed subproblems, which can substantially boost learning efficiency. With CCRank, we investigate parallel CC in the context of learning to rank. Extensive experiments on benchmarks in comparison with the state-of-the-art algorithms show that CCRank gains in both accuracy and efficiency.
... There are different approaches used to generate document embeddings, such as TF-IDF [80] and BM25 [81]. Contextual word embeddings can also be used to vectorize documents [78,82]. A feature is an individual measurable property, characteristic, or behavior observed. ...
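For the TF-IDF variant, a document embedding can be produced in a few lines with scikit-learn; this is a generic example, not code from the cited work, and a BM25-weighted vector would be built analogously with the weighting shown in the earlier sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "okapi bm25 is a probabilistic ranking function",
    "tf idf weights terms by frequency and rarity",
    "contextual word embeddings capture meaning in context",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)  # sparse matrix: one TF-IDF vector per document
print(doc_vectors.shape)                      # (3, vocabulary size)
```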
Article
Full-text available
In general, Natural Language Processing (NLP) algorithms exhibit black-box behavior. Users input text and output are provided with no explanation of how the results are obtained. In order to increase understanding and trust, users value transparent processing which may explain derived results and enable understanding of the underlying routines. Many approaches take an opaque approach by default when designing NLP tools and do not incorporate a means to steer and manipulate the intermediate NLP steps. We present an interactive, customizable, visual framework that enables users to observe and participate in the NLP pipeline processes, explicitly manipulate the parameters of each step, and explore the result visually based on user preferences. The visible NLP (VNLP) pipeline design is then applied to a text similarity application to demonstrate the utility and advantages of a visible and transparent NLP pipeline in supporting users to understand and justify both the process and results. We also report feedback on our framework from a modern languages expert.
... The boolean match (Feature 1), e.g., simply states whether a term is contained in the query match or not and the text relevancy (Feature 4) measures how many words in the query match a term. The original matching features followed in LOV (Feature 2 and 3) are based on a standard BM25 matching score [34]. Feature 2 further assigns weights depending on which property of a term matches the query and Feature 3 is based on properties that describe the ontology [41]. ...
... We used the topic distillation tasks in the TREC web tracks in 2003 (50 queries) and 2004 (75 queries) to evaluate our approach. We compare our approach with Okapi BM25 [8], PageRank (PR), GlobalHITS (GHITS), Topic-Sensitive PageRank (TSPR) [2], and Topical PageRank (TPR) [5]. All these link-based ranking models are combined with Okapi BM25 linearly by ranks. ...
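One simple reading of combining a link-based model with Okapi BM25 "linearly by ranks" is a weighted sum of the two rank positions; the sketch below illustrates that reading, with the weight alpha being an assumption rather than a value from the cited work.

```python
def combine_by_ranks(bm25_ranking, link_ranking, alpha=0.5):
    """Fuse two rankings (lists of doc ids, best first) by a weighted sum of rank positions."""
    bm25_rank = {d: r for r, d in enumerate(bm25_ranking)}
    link_rank = {d: r for r, d in enumerate(link_ranking)}
    docs = bm25_rank.keys() | link_rank.keys()
    worst = len(docs)  # documents missing from one list get the worst rank
    return sorted(docs, key=lambda d: alpha * bm25_rank.get(d, worst)
                                      + (1 - alpha) * link_rank.get(d, worst))
```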
Article
The authority of web pages can be better estimated by controlling the authority flows among pages. Due to the complexity of calculating the authority flow, current systems only use precomputed authority flows at runtime. This limitation prevents authority flow from being used more effectively as a ranking factor. Proximity search has been used extensively in measuring the association between entities in data graphs. A key advantage of proximity search is the existence of efficient execution algorithms. However, it is desirable to be able to compute the authority flow on the fly at query time. Web pages are often recognized by others through contexts. In this work, we use proximity by examining the topicality relationship between associated pages to determine the authority distribution. In addition, we consider the influence of authority propagation from diverse types of neighbours, since web pages, like people, are influenced by diverse types of neighbours within the same network. We propose a probabilistic method to model authority flows from different sources of neighbour pages. Experiments on the 2003 and 2004 TREC Web tracks demonstrate that this approach outperforms other competitive topical ranking models and produces more than 10% improvement over PageRank in the quality of the top ten search results. When increasing the types of incorporated neighbour sources, the performance shows stable improvements.
... Over the past several decades, many open source software systems have been built to facilitate information retrieval research, such as Okapi [7][8][15], Indri, Galago, Terrier, and Anserini [18]. These systems greatly improve the efficiency of evaluating retrieval models on standard test collections. ...
Conference Paper
Open source software plays an important role in information retrieval research. Most of the existing open source information retrieval systems are implemented in the Java or C++ programming languages. In this paper, we propose Parrot, a Python-based interactive platform for information retrieval research. The proposed platform has mainly three advantages in comparison with the existing retrieval systems: (1) It is integrated with Jupyter Notebook, an interactive programming platform which has proved to be effective for data scientists tackling big data and AI problems. As a result, users can interactively visualize and diagnose a retrieval model; (2) As an application written in Python, it can be easily used in combination with popular deep learning frameworks such as TensorFlow and PyTorch; (3) It is designed especially for researchers. Less code is needed to create a new retrieval model or to modify an existing one. Our efforts have focused on three functionalities: good usability, interactive programming, and good interoperability with the popular deep learning frameworks. To confirm the performance of the proposed system, we conduct comparative experiments on a number of standard test collections. The experimental results show that the proposed system is both efficient and effective, providing a practical framework for researchers in information retrieval.
... Query-dependent models try to fetch documents based on whether the query terms appear in the respective documents. They include different ranking models such as the Boolean model [2], the vector space model (VSM) [2] and probabilistic ranking models like BM25 [3] and the Language Model for IR [4]. The Boolean model decides the relevance of a document to the query based on whether the document is to be included in the result or not, but it fails to determine the degree of relevance. ...
Article
Full-text available
Today, the amount of information on the web, such as the number of publicly accessible web pages, hosts and web data, is increasing rapidly and growing at an exponential rate. Thus, information retrieval on the web is becoming more difficult. Conventional methods of information retrieval are not very effective in ranking since they rank the results without automatically learning a model. The machine learning subfield called learning-to-rank comes to the aid of ranking the obtained results. Different state-of-the-art methodologies have been developed for learning-to-rank to date. This paper focuses on finding the best algorithm for web search by implementing different state-of-the-art learning-to-rank algorithms. Our work marks the implementation of learning-to-rank algorithms and analyses the effect of the top-performing algorithms on the respective datasets. It presents an overall review of the approaches designed under learning-to-rank and their evaluation strategies.
... The Okapi Best Match 25 (BM25) model evolved from a series of previous models: BM1, BM11, BM12, and BM15 (Robertson, 1997). BM25 builds on Eq. (9) and uses BM25 term weighting (Section 8.5), inverse document frequency (Section 8.2), and document length normalization (Section 8.4). ...
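For reference, the commonly cited textbook form of the BM25 weight for a query term t in document d, combining term-frequency saturation, inverse document frequency and document length normalization with the usual tuning constants k1 and b, is:

\[
w(t, d) = \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}
               {k_1\!\left((1 - b) + b\,\frac{|d|}{\mathrm{avgdl}}\right) + \mathrm{tf}_{t,d}}
          \cdot \log \frac{N - n_t + 0.5}{n_t + 0.5}
\]

where N is the number of documents in the collection and n_t the number of documents containing t.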
Chapter
This chapter presents a tutorial introduction to modern information retrieval concepts, models, and systems. It begins with a reference architecture for the current Information Retrieval (IR) systems, which provides a backdrop for rest of the chapter. Text preprocessing is discussed using a mini Gutenberg corpus. Next, a categorization of IR models is presented followed by Boolean IR model description. Positional index is introduced, and execution of phrase and proximity queries is discussed. Various term weighting schemes are discussed next followed by descriptions of three IR models—Vector Space, Probabilistic, and Language models. Approaches to evaluating IR systems are presented. Relevance feedback techniques as a means to improving retrieval effectiveness are described. Various IR libraries, frameworks, and test collections are indicated. The chapter concludes by outlining facets of IR research and indicating additional reading.
... In our case, we used the class descriptions in place of the set of documents and the text extracted from the URL as the query. Relevance is computed through: cosine similarity, generalized Jaccard coefficient [46], Dice coefficient, overlap coefficient, Weighted Coordination Match Similarity [47], BM25 [48], and LMIRDIR [49]. As the label we chose the class with the highest ranking, while as the degree of membership we used the average of the relevance measures normalized to the range [0,1]. ...
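The set-based coefficients listed above are straightforward to compute over token sets; a minimal sketch follows. Note that the generalized Jaccard coefficient of the cited work operates on weighted vectors, so the plain set version here is a simplification.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def dice(a, b):
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def overlap(a, b):
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b)) if (a and b) else 0.0

# Example: similarity between a class description and text extracted from a URL
print(jaccard("museum of modern art".split(), "modern art gallery".split()))
```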
... Given a user query described in terms of one or more keywords, an information retrieval system must find the most relevant documents to the user query from a potentially large collection of documents. Word salience scores based on term frequency, document frequency, and document length have been proposed such as tfidf and BM25 (Robertson, 1997). ...
... Strategy II assigns a weight (…). In order to measure the retrieval performance of different ranking algorithms, we use precision at 10 documents (P@10) and Mean Average Precision (MAP) as our evaluation metrics. As the baseline, BM2500 [15] is adopted as the relevance weighting function. According to the BM2500 score, we choose the top 2000 results and thus obtain two ranking lists: one for relevance and the other for importance (PR or WPR). ...
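Both evaluation metrics are easy to state precisely; a minimal, generic sketch:

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """P@k: fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            hits += 1
            total += hits / i
    return total / len(relevant_ids) if relevant_ids else 0.0

# MAP is the mean of average_precision over all queries:
# map_score = sum(average_precision(run[q], qrels[q]) for q in queries) / len(queries)
```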
... In the past, a number of IR models such as the Vector Space Model (VSM) [22,23], the Probabilistic Model [20,1] and the LM [17,28] have been proposed, all based on the term intersection approach. Term intersection is the approach in which the document and the query must share the same terms. ...
Conference Paper
With the explosive growth of online information such as email messages, news articles, and scientific literature, many institutions and museums are converting their cultural collections from physical data to digital format. However, this conversion has resulted in issues of inconsistency and incompleteness. Besides, the usage of inaccurate keywords has also resulted in the short query problem. Most of the time, the inconsistency and incompleteness are caused by aggregation faults in annotating a document itself, while the short query problem is caused by naive users without prior knowledge and experience in the cultural heritage domain. In this paper, we present an approach to solve the problems of inconsistency, incompleteness and short queries by incorporating a Term Similarity Matrix into the Language Model. Our approach is tested on the Cultural Heritage in CLEF (CHiC) collection, which consists of short queries and documents. The results show that the proposed approach is effective and has improved the accuracy of retrieval.
... Given a user query described in terms of one or more keywords, an information retrieval system must find the most relevant documents to the user query from a potentially large collection of documents. Word salience scores based on term frequency, document frequency, and document length have been proposed such as tfidf and BM25 [32]. ...
Article
Full-text available
Measuring the salience of a word is an essential step in numerous NLP tasks. Heuristic approaches such as tfidf have been used so far to estimate the salience of words. We propose Neural Word Salience (NWS) scores which, unlike heuristics, are learnt from a corpus. Specifically, we learn word salience scores such that, using pre-trained word embeddings as the input, they can accurately predict the words that appear in a sentence, given the words that appear in the sentences preceding or succeeding that sentence. Experimental results on sentence similarity prediction show that the learnt word salience scores perform comparably or better than some of the state-of-the-art approaches for representing sentences on benchmark datasets for sentence similarity, while using only a fraction of the training and prediction times required by prior methods. Moreover, our NWS scores positively correlate with psycholinguistic measures such as concreteness and imageability, implying a close connection to salience as perceived by humans.
... Traditionally, a ranking model is built without training. Items are ranked by either relevance or importance, such as Boolean model [2], BM25 [28] and PageRank [27]. These conventional ranking models require parameter tuning and face over-fitting problems. ...
Article
Full-text available
In this study, we apply learning-to-rank algorithms to design trading strategies using relative performance of a group of stocks based on investors' sentiment toward these stocks. We show that learning-to-rank algorithms are effective in producing reliable rankings of the best and the worst performing stocks based on investors' sentiment. More specifically, we use the sentiment shock and trend indicators introduced in the previous studies, and we design stock selection rules of holding long positions of the top 25% stocks and short positions of the bottom 25% stocks according to rankings produced by learning-to-rank algorithms. We then apply two learning-to-rank algorithms, ListNet and RankNet, in stock selection processes and test long-only and long-short portfolio selection strategies using 10 years of market and news sentiment data. Through backtesting of these strategies from 2006 to 2014, we demonstrate that our portfolio strategies produce risk-adjusted returns superior to the S&P500 index return, the hedge fund industry average performance - HFRIEMN, and some sentiment-based approaches without learning-to-rank algorithm during the same period.
... The Okapi project, based at City University London [13], involves the testing of various text-based information retrieval techniques, including the use of thesauri. Beaulieu [2] concluded that thesaurus use, both explicit and implicit, was beneficial to the retrieval process. ...
Conference Paper
Full-text available
Traditional hypermedia systems can be extended to allow content based matching to give more flexibility for user navigation, but this approach is still limited by the capabilities of multimedia matching technology. The addition of a multimedia thesaurus can overcome some of these limitations by allowing multimedia representations of concepts to act like synonyms in the query process. In addition, relationships between concepts allow navigation within the context of a semantic scope. The use of agents that independently examine the information in the system can also provide alternative methods for query evaluation. This paper presents a flexible architecture that supports such a system and describes initial work on implementation.
... Standard features for learning to rank include different query-document features, e.g., BM25 [76], as well as query-independent features, e.g., PageRank [6]. Our feature space consists of both these standard monolingual features and cross-lingual similarities among documents. ...
Thesis
Full-text available
This work studies something new in Web search to cater for users’ cross-lingual information needs by using the common search interests found across different languages. We assume a generic scenario for monolingual users who are interested to find their relevant information under three general settings: (1) find relevant information in a foreign language, which needs machine to translate search results into the user’s own language; (2) find relevant information in multiple languages including the source language, which also requires machine translation for back translating search results; (3) find relevant information only in the user’s language, but due to the intrinsic cross-lingual nature of many queries, monolingual search can be done with the assistance of cross-lingual information from another language. We approach the problem by substantially extending two core mechanics of information retrieval for Web search across languages, namely, query formulation and relevance ranking. First, unlike traditional cross-lingual methods such as query translation and expansion, we propose a novel Cross-Lingual Query Suggestion model by leveraging large-scale query logs of search engine to learn to suggest closely related queries in the target language for a given source language query. The rationale behind our approach is the ever-increasing common search interests across Web users in different languages. Second, we generalize the usefulness of common search interests to enhance relevance ranking of documents by exploiting the correlation among the search results derived from bilingual queries, and overcome the weakness of traditional relevance estimation that only uses information of a single language or that of different languages separately. To this end, we attempt to learn a ranking function that incorporates various similarity measures among the retrieved documents in different languages. By modeling the commonality or similarity of search results, relevant documents in one language may help the relevance estimation of documents in a different language, and hence can improve the overall relevance estimation. This similar intuition is applicable to all the three settings described above.
... Furthermore, the original probabilistic retrieval models did not include tf weighting, and the attempt to do so led to the development of the BM25 ranking function (better known as Okapi BM25) as part of the Okapi experimental system (Robertson, 1997). Advances in the basic vector model were also made in this era; the best known is Latent Semantic Indexing (LSI), where the effective dimension of the vector space of a collection can be reduced using Singular Value Decomposition (SVD) (Deerwester et al., 1990). ...
Thesis
The need to estimate the size of a piece of software in order to estimate the cost and effort required for its development is a consequence of the growing use of software in almost all human activities. Moreover, the competitive nature of the software development industry makes it common practice to use accurate size estimates as early as possible in the development process. Traditionally, software size estimation was carried out a posteriori from various measures applied to the source code. However, as the software engineering community became aware that estimating code size is crucial for controlling development and costs, early estimation of software size has become a widespread concern. Once the code is written, estimating its size and cost makes it possible to carry out comparative studies and possibly to monitor productivity. On the other hand, the benefits of size estimation are all the greater when the estimation is performed early in development. Furthermore, if size estimation can be performed periodically as design and development progress, it can provide valuable information to project managers to better track development progress and refine resource allocation accordingly. Our research is positioned around functional size estimation measures, commonly called Function Point Analysis, which estimate the size of a piece of software from the functionality it must provide to the end user, expressed solely from the user's point of view and excluding, in particular, any consideration specific to the development. A significant problem with the use of function points is the need to rely on human experts to perform the counting according to a set of counting rules. The estimation process therefore represents a substantial workload and a significant cost. In addition, the fact that function point counting rules necessarily involve a degree of human interpretation introduces a factor of imprecision into the estimates and makes the measurements harder to reproduce. Currently, the estimation process is entirely manual and forces human experts to read the entire specifications in detail, a long and tedious task. We propose to provide human experts with automatic assistance in the estimation process by identifying, in the text of the specifications, the places most likely to contain function points. This automatic assistance should allow a significant reduction in reading time and reduce the cost of estimation, without loss of precision. Finally, the unambiguous identification of function points will facilitate and improve the reproducibility of the measurements. To our knowledge, the work presented in this thesis is the first to rely solely on the analysis of the textual content of specifications, applicable as soon as preliminary specifications are available, and based on a generic approach relying on established practices of automatic natural language processing.
... The features are mainly classified as content features and hyperlink features. The content features can be further classified as low-level features (some basic statistical information about the collection, documents and queries, such as term frequency tf and inverse document frequency idf) and high-level features (the outputs of some classic approaches such as BM25 [49] and LMIR [50]). The hyperlink features usually include the number of hyperlinks to the documents, the output of the PageRank algorithm, etc. Constants serve as coefficients of features in f. ...
Article
Full-text available
We propose CCRank, the first parallel framework for evolutionary algorithm (EA) based learning to rank, aiming to significantly improve learning efficiency while maintaining accuracy. CCRank is based on cooperative coevolution (CC), a divide-and-conquer framework that has demonstrated high promise in function optimization for problems with large search spaces and complex structures. Moreover, CC naturally allows parallelization of sub-solutions to the decomposed sub-problems, which can substantially boost learning efficiency. With CCRank, we investigate parallel CC in the context of learning to rank. We implement CCRank with three EA-based learning to rank algorithms for demonstration. Extensive experiments on benchmarks in comparison with the state-of-the-art algorithms show the performance gains of CCRank in efficiency and accuracy.
... Similar to the work by Amati and van Rijsbergen [2002], this approach uses the Laplace law of succession [Feller, 1968] to derive the weighted term frequency (e.g., [Huang et al., 2003]) as the term frequency factor of BM term weights of the Okapi system [Robertson, 1997]. This approach derives the BM term weights in a way different from their original conception [Robertson and Walker, 1994]. ...
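For context, Laplace's law of succession estimates the probability of a further success after observing r successes in n trials as the standard form below; how this estimate is turned into a weighted term frequency is specific to the cited derivations.

\[
P(\text{success} \mid r, n) = \frac{r + 1}{n + 2}
\]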
Thesis
In this thesis we study new retrieval models which simulate the "local" relevance decision-making for every term location in a document; these local relevance decisions are then combined into a "document-wide" relevance decision for the document. The local relevance decision for a term t occurring at the k-th location in a document is made by considering the document-context, which is the window of terms centred at the term t at the k-th location. Therefore, different relevance scores (preferences) are obtained for the same term t at different locations in a document depending on its document-contexts. This differs from traditional models, in which term t receives the same score regardless of its location in a document. A hybrid document-context model is studied which combines various existing effective models and techniques. It estimates the relevance decision preference of document-contexts as the log-odds and combines the estimated preferences using different types of aggregation operators that comply with the relevance decision principles. The model is evaluated using retrospective experiments to reveal its potential. Besides retrospective experiments, we also use the top 20 documents from the initial ranked list to perform relevance feedback experiments with a probabilistic document-context model, and the results are promising. We also show that when the size of the document-contexts is shrunk to unity, the document-context model simplifies to a basic ranking formula that directly corresponds to the TF-IDF term weights. Thus TF-IDF term weights can be interpreted as making relevance decisions. This helps to establish a unifying perspective on information retrieval as relevance decision-making and to develop advanced TF-IDF-related term weights for future elaborate retrieval models. Lastly, we develop a new relevance feedback algorithm by splitting the ranked document list into multiple lists of document-contexts. The judgement of relevance of the documents is not done sequentially. This is called active feedback, and we show that our new relevance feedback algorithm obtains better results than the conventional relevance feedback algorithm, and does so more reliably than a maximal marginal relevance (MMR) method which does not use document-contexts.
... Only text pages are reserved in the final dataset, which contains 2,998,821 visit records to 1,773,718 web pages by 38,887 users. All the 1.7 million web pages are crawled on a local machine, and a text-based web search engine (using Okapi BM2500 [116] ranking scheme) is developed over these web pages. This search engine will be used to evaluate the effectiveness of the Link Fusion algorithm in 4.3.1. ...
Article
Regardless of their domains, level, or expertise, students consider video lectures one of the most popular learning media while engaged in self-study sessions on any e-learning platform. In the absence of experts/teachers in a self-study session, students often need to browse the Internet to avail themselves of additional information on the relevant topics. Hence, it would be helpful for such motivated students if we augment the video lectures with such supplementary references. In this article, we present a video lecture augmentation system leveraging question-answer (QA) pairs offering supplementary references on the course-relevant concepts. We also designed a user interface to present these augmented video lectures categorically so that the students can readily opt for the augmentations of their choice. While we qualitatively surveyed the personalization of the augmentations and usability aspects of the user interface, we quantitatively evaluated our proposed video lecture augmentation system in terms of the performances of two primary underlying modules: augmentation retrieval and tag recommendation. We quantified the pedagogical effectiveness of the augmentations following an equivalent pretest-posttest setup. All these experiments indicate that the proposed augmentations are relevant and pedagogically effective, the categorical representation helps the students choose the necessary resources readily, and the designed interface is easy to use.
Chapter
The World Wide Web allows users and organizations to publish information and documents, which are instantly available for all other users of the Web. The data published to the Web continuously increases, providing the users with a vast amount of information on any topic imaginable. However, navigating the Web and identifying the relevant pieces of information in the abundance of data is not trivial. To cope with this problem, Web mining approaches are being used. Web mining includes the application of information retrieval, data mining, and machine learning approaches on Web data and the Web structure. This chapter provides a brief summary of Web mining approaches, including Web content mining, Web structure mining, Web usage mining, and Semantic Web mining.
Article
Full-text available
This paper addresses the feature selection problem in learning to rank (LTR). We propose a graph-based feature selection method, named FS-SCPR, which comprises four steps: (i) use ranking information to assess the similarity between features and construct an undirected feature similarity graph; (ii) apply spectral clustering to cluster features using eigenvectors of matrices extracted from the graph; (iii) utilize biased PageRank to assign a relevance score with respect to the ranking problem to each feature by incorporating each feature's ranking performance as preference to bias the PageRank computation; and (iv) apply optimization to select the feature from each cluster with both the highest relevance score and most information of the features in the cluster. We also develop a new LTR for information retrieval (IR) approach that first exploits FS-SCPR as a preprocessor to determine discriminative and useful features and then employs Ranking SVM to derive a ranking model with the selected features. An evaluation, conducted using the LETOR benchmark datasets, demonstrated the competitive performance of our approach compared to representative feature selection methods and state-of-the-art LTR methods.
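The biased PageRank step in (iii) can be sketched compactly. The following is a generic personalized-PageRank routine over a feature-similarity graph, where the restart vector encodes each feature's individual ranking performance; the normalization and damping details here are assumptions, not taken from the cited paper.

```python
import numpy as np

def biased_pagerank(similarity, preference, damping=0.85, iters=100):
    """Personalized PageRank over a feature-similarity graph.

    similarity: (s, s) nonnegative feature-similarity matrix.
    preference: length-s restart vector, e.g. each feature's individual
                ranking performance (normalized below to sum to 1).
    """
    A = np.asarray(similarity, dtype=float)
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    M = A / col_sums                      # column-stochastic transition matrix
    p = np.asarray(preference, dtype=float)
    p = p / p.sum()
    r = np.full(len(p), 1.0 / len(p))
    for _ in range(iters):
        r = damping * (M @ r) + (1 - damping) * p
    return r                              # relevance score per feature
```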
Article
Context: Security bug reports (SBRs) usually contain security-related vulnerabilities in software products, which could be exploited by malicious attackers. Hence, it is important to identify SBRs quickly and accurately among bug reports (BRs) that have been disclosed in bug tracking systems. Although a few methods have been already proposed for the detection of SBRs, challenging issues still remain due to noisy samples, class imbalance and data scarcity. Object: This motivates us to reveal the potential challenges faced by the state-of-the-art SBRs prediction methods from the viewpoint of data filtering and representation. Furthermore, the purpose of this paper is also to provide a general framework and new solutions to solve these problems. Method: In this study, we propose a novel approach LTRWES that incorporates learning to rank and word embedding into the identification of SBRs. Unlike previous keyword-based approaches, LTRWES is a content-based data filtering and representation framework that has several desirable properties not shared in other methods. Firstly, it exploits ranking model to efficiently filter non-security bug reports (NSBRs) that have higher content similarity with respect to SBRs. Secondly, it applies word embedding technology to transform the rest of NSBRs, together with SBRs, into low-dimensional real-value vectors. Result: Experiment results on benchmark and large real-world datasets show that our proposed method outperforms the state-of-the-art method. Conclusion: Overall, the LTRWES is valid with high performance. It will help security engineers to identify SBRs from thousands of NSBRs more accurately than existing algorithms. Therefore, this will positively encourage the research and development of the content-based methods for security bug report detection.
Chapter
Efficient ontology reuse is a key factor in the Semantic Web to enable and enhance the interoperability of computing systems. One important aspect of ontology reuse is concerned with ranking most relevant ontologies based on a keyword query. Apart from the semantic match of query and ontology, the state-of-the-art often relies on ontologies’ occurrences in the Linked Open Data (LOD) cloud to determine relevance. We observe that ontologies of some application domains, in particular those related to Web of Things (WoT), often do not appear in the underlying LOD datasets used to define ontologies’ popularity, resulting in ineffective ranking scores. This motivated us to investigate – based on the problematic WoT case – whether the scope of ranking models can be extended by relying on qualitative attributes instead of an explicit popularity feature. We propose a novel approach to ontology ranking by (i) selecting a range of relevant qualitative features, (ii) proposing a popularity measure for ontologies based on scholarly data, (iii) training a ranking model that uses ontologies’ popularity as prediction target for the relevance degree, and (iv) confirming its validity by testing it on independent datasets derived from the state-of-the-art. We find that qualitative features help to improve the prediction of the relevance degree in terms of popularity. We further discuss the influence of these features on the ranking model.
Chapter
Text classification is among the most broadly used machine learning tools in computational linguistics. Web information retrieval is one of the most important sectors that has taken advantage of this technique. Applications range from page classification, used by search engines, to URL classification used for focused crawling and on-line time-sensitive applications [2]. Due to the pressing need for the highest possible accuracy, a supervised learning approach is always preferred when an adequately large set of training examples is available. Nonetheless, since building such an accurate and representative training set often becomes impractical when the number of classes increases beyond a few units, alternative unsupervised or semi-supervised approaches have emerged. The use of standard web directories as a source of examples can be prone to undesired effects due, for example, to the presence of maliciously misclassified web pages. In addition, this option is subject to the existence of all the desired classes in the directory hierarchy.
Article
Plagiarism source retrieval is the core task of plagiarism detection. It has become the standard for plagiarism detection to use the queries extracted from suspicious documents to retrieve the plagiarism sources. Generating queries from a suspicious document is one of the most important steps in plagiarism source retrieval. Heuristic-based query generation methods are widely used in the current research. Each heuristic-based method has its own advantages, and no one statistically outperforms the others on all suspicious document segments when generating queries for source retrieval. Further improvements on heuristic methods for source retrieval rely mainly on the experience of experts. This leads to difficulties in putting forward new heuristic methods that can overcome the shortcomings of the existing ones. This paper paves the way for a new statistical machine learning approach to select the best queries from the candidates. The statistical machine learning approach to query generation for source retrieval is formulated as a ranking framework. Specifically, it aims to achieve the optimal source retrieval performance for each suspicious document segment. The proposed method exploits learning to rank to generate queries from the candidates. To our knowledge, our work is the first research to apply machine learning methods to resolve the problem of query generation for source retrieval. To solve the essential problem of an absence of training data for learning to rank, the building of training samples for source retrieval is also conducted. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. With respect to the established baselines, the experimental results show that applying our proposed query generation method based on machine learning yields statistically significant improvements over baselines in source retrieval effectiveness.
Article
Full-text available
Learning to rank has attracted much attention in the domains of information retrieval and machine learning. Prior studies on learning to rank mainly focused on three types of methods, namely pointwise, pairwise and listwise. Each of these paradigms focuses on a different aspect of the input instances sampled from the training dataset. This paper explores how to combine them to improve ranking performance. The basic idea is to incorporate the different loss functions and enrich the objective loss function. We present a flexible framework for incorporating multiple loss functions, based on which three loss-weighting schemes are given. Moreover, in order to obtain good performance, we define several candidate loss functions and select among them experimentally. The performance of the three types of weighting schemes is compared on the LETOR 3.0 dataset, which demonstrates that with a good weighting scheme, our method significantly outperforms the baselines that use a single loss function, and it is at least comparable to the state-of-the-art algorithms in most cases.
Chapter
In this chapter, we introduce the LETOR benchmark datasets, including the following aspects: document corpora (together with query sets), document sampling, feature extraction, meta information, cross validation, and major ranking tasks supported.
Chapter
In this chapter, we introduce semi-supervised learning for ranking. The motivation of this topic comes from the fact that we can always collect a large number of unlabeled documents or queries at a low cost. It would be very helpful if one can leverage such unlabeled data in the learning-to-rank process. In this chapter, we mainly review a transductive approach and an inductive approach to this task, and discuss how to improve these approaches by taking the unique properties of ranking into consideration.
Conference Paper
Carrying out experiments in Information Retrieval is a demanding activity requiring fast tools to process collections of significant size and, at the same time, flexible tools that leave as much freedom as possible during experimentation. The X-IOTA system was developed to meet this flexibility criterion and thus to support the quick setup of various experiments using automatic natural language processing. The architecture is designed to allow computations to be distributed among distributed servers. We use this framework to test different weighting schemes, in particular the new Deviation from Randomness model against Okapi.
Article
An efficient and effective ranking mechanism in search engines remains a challenging problem. In recent years, a few relevance propagation models, such as hyperlink-based score propagation, hyperlink-based term propagation, and popularity-based propagation models, have been proposed. In this paper, we give a comprehensive study of relevance propagation technologies for Web information retrieval and conduct experimental evaluations over these models to determine which model is more effective and efficient. We also propose a new relevance propagation model based on content, link structure (the web graph), and the number of slashes in the URL. It propagates content scores and the number of slashes through the link structure. The goal is to find web pages more relevant to the user query. To compare relevance propagation models, LETOR 3.0, a standard web test collection, was used in the experiments. We conclude that using the number of slashes in the propagation process provides an improvement in Web information retrieval accuracy.
Article
Ranking plays an important role in information retrieval, aiming to sort the documents retrieved for a given query in descending order of relevance. Recently, many approaches based on the idea of "learning to rank" have been proposed for ranking. Most of them consider all the documents of the training queries to build a static, query-independent ranking model. In this paper, we propose an adaptive, query-dependent framework for learning to rank based on a distributional similarity measure for gauging the similarity between queries. For each training query, one individual ranking model is learned from its associated set of documents. When a new query is issued, the individual trained models of those training queries most similar to the new query are obtained and combined into a joint model, which is then used to rank the documents retrieved for the new query. Experimental results show that our proposed approach works very well compared with other methods.
Article
Full-text available
This paper examines statistical techniques for exploiting relevance information to weight search terms. These techniques are presented as a natural extension of weighting methods using information about the distribution of index terms in documents in general. A series of relevance weighting functions is derived and is justified by theoretical considerations. In particular, it is shown that specific weighted search methods are implied by a general probabilistic theory of retrieval. Different applications of relevance weighting are illustrated by experimental results for test collections.
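The best-known weighting function associated with this line of work is the Robertson/Sparck Jones relevance weight, usually written with 0.5 correction terms as:

\[
w = \log \frac{(r + 0.5)\,(N - n - R + r + 0.5)}{(n - r + 0.5)\,(R - r + 0.5)}
\]

where N is the collection size, n the number of documents containing the term, R the number of known relevant documents, and r the number of relevant documents containing the term.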