Article

Relevance Feedback in Information Retrieval. The SMART Retrieval System: Experiments in Automatic Document Processing

Authors:
J. J. Rocchio
... 1. Embeddings are generated for all tags assigned to POIs. 2. For a particular POI, the embeddings of the tags assigned to it are aggregated to obtain a representation of the POI in the embedding space. 3. Using Rocchio's algorithm (Rocchio, 1971) with the positive, negative and neutral ratings assigned by a traveller to various POIs, a profile is created for that traveller. This profile, also a vector in the embedding space, is expected to constitute an abstract but high-level description of the traveller's 'type': whether the user prefers outdoor activities, is inclined towards the arts and literature, is of the 'food-drink-party' type, etc. 4. The profile of a traveller is matched against the vector representations of candidate POIs to assign recommendation scores. ...
... We employ word2vec (Mikolov et al., 2013), a word embedding model that captures the relationship between words. While representing the users and ranking the candidate POIs, we use the embedding based approach in the Rocchio feedback framework (Rocchio, 1971). ...
... Based on these preferences, positive, negative and neutral profile vectors can be constructed for each user in the way discussed earlier in this section. Considering these different signals, we formalize a user-specific profile vector using the idea of the Rocchio model (Rocchio, 1971). In the document retrieval scenario, the Rocchio feedback method works in the Vector Space Model (VSM), where the initial query vector is modified based on the centroids of the sets of relevant and non-relevant documents. ...
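A minimal sketch of the Rocchio-style profile construction and cosine ranking described in these excerpts, assuming POI vectors have already been obtained by aggregating tag embeddings; the function names and the weights alpha, beta, gamma (including the small positive weight for neutral ratings) are illustrative assumptions, not the cited paper's settings:

import numpy as np

def rocchio_profile(pos, neg, neu, alpha=1.0, beta=0.75, gamma=0.25):
    # pos, neg, neu: arrays of shape (n, dim) with embeddings of POIs the traveller
    # rated positively, negatively and neutrally (pos is assumed non-empty).
    def centroid(vectors):
        return vectors.mean(axis=0) if len(vectors) else np.zeros(pos.shape[1])
    # Move the profile towards liked POIs, slightly towards neutral ones,
    # and away from disliked ones.
    return alpha * centroid(pos) + gamma * centroid(neu) - beta * centroid(neg)

def rank_pois(profile, candidates):
    # candidates: dict mapping POI id -> embedding vector; rank by cosine similarity.
    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return sorted(candidates, key=lambda poi: cosine(profile, candidates[poi]), reverse=True)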
Article
E-tourism websites such as Foursquare, Tripadvisor, Yelp etc. allow users to rate the places they have visited. Along with ratings, the services allow users to provide reviews on social media platforms. As hashtags are popular on social media, users may also provide hashtag-like tags to express their opinion regarding some places. In this article, we propose an embedding based venue recommendation framework that represents a Point Of Interest (POI) based on tag embedding and models the users (user profile) based on the POIs rated by them. We rank a set of candidate POIs to be recommended to the user based on the cosine similarity between the respective user profile and the embedded representation of the POIs. Experiments on TREC Contextual Suggestion data empirically confirm the effectiveness of the proposed model. We achieve significant improvement over PK-Boosting and CS-L2Rank, two state-of-the-art baseline methods. The proposed methods improve [email protected] by 12.8%, [email protected] by 4.4%, and MRR by 7.8% over CS-L2Rank. The proposed methods also minimize the risk of privacy leakage. To verify the overall robustness of the models, we tune the model parameters by discrete optimization over different measures (such as AP, NDCG, MRR, recall, etc.). The experiments show that the proposed methods are overall superior to the baseline models.
... (c) Reformulation of the search query: Once the candidate keywords are collected, they are ranked and the top few keywords are used to expand a given query (Step 7, Fig. 3). Less important keywords and keywords that do not exist in the corpus are discarded from the given query [56,141]. A few studies [56,114,135] argue that the same term weighting algorithm might not always work. ...
... However, Gay et al. [47] were the first to use term weighting and relevance feedback to reformulate search queries in the context of code search (e.g., concept location). Since then, many studies have adopted term weighting and relevance feedback in their query reformulation approaches [56,135,156,159,170]. Haiduc et al. [56] employ Rocchio's expansion [141], Dice similarity [29], and Robertson Selection Value (RSV) [140] to reformulate a query, where they use TF-IDF as a proxy of term importance. Their query difficulty models also make use of several variants of TF-IDF (e.g., avgIDF, maxIDF) [57,114]. ...
... Mills et al. [114] later extend this model with seven post-retrieval metrics (i.e., 28 in total), and evaluate the model performance extensively. Given a search query, Haiduc et al. [56] generate a list of four reformulation candidates using Rocchio's expansion [141], Robertson Selection Value (RSV) [140], Dice similarity [29] and query reduction [29]. Then they train a query difficulty model to suggest the best reformulated query. ...
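A rough sketch of the Rocchio-style expansion step referred to above, using TF-IDF as a proxy for term importance; the corpus handling, parameter values and helper names are assumptions for illustration, not the cited implementations:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def rocchio_expand(query, docs, relevant_ids, k=5, alpha=1.0, beta=0.75):
    # Build TF-IDF vectors for the corpus, form a Rocchio-updated query vector from
    # the (pseudo-)relevant documents, and append its top-k new terms to the query.
    vec = TfidfVectorizer()
    doc_matrix = vec.fit_transform(docs)
    q_vec = vec.transform([query]).toarray()[0]
    rel_centroid = doc_matrix[relevant_ids].toarray().mean(axis=0)
    updated = alpha * q_vec + beta * rel_centroid          # positive feedback only
    terms = np.array(vec.get_feature_names_out())
    ranked = terms[np.argsort(updated)[::-1]]
    new_terms = [t for t in ranked if t not in query.lower().split()][:k]
    return query + " " + " ".join(new_terms)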
Preprint
Full-text available
Software developers often fix critical bugs to ensure the reliability of their software. They might also need to add new features to their software at a regular interval to stay competitive in the market. These bugs and features are reported as change requests (i.e., technical documents written by software users). Developers consult these documents to implement the required changes in the software code. As a part of change implementation, they often choose a few important keywords from a change request as an ad hoc query. Then they execute the query with a code search engine (e.g., Lucene) and attempt to find out the exact locations within the software code that need to be changed. Unfortunately, even experienced developers often fail to choose the right queries. As a consequence, the developers often experience difficulties in detecting the appropriate locations within the code and spend the majority of their time in numerous trials and errors. There have been many studies that attempt to support developers in constructing queries by automatically reformulating their ad hoc queries. In this systematic literature review, we carefully select 70 primary studies on query reformulations from 2,970 candidate studies, perform an in-depth qualitative analysis using the Grounded Theory approach, and then answer six important research questions. Our investigation has reported several major findings. First, to date, eight major methodologies (e.g., term weighting, query-term co-occurrence analysis, thesaurus lookup) have been adopted in query reformulation. Second, the existing studies suffer from several major limitations (e.g., lack of generalizability, vocabulary mismatch problem, weak evaluation, the extra burden on the developers) that might prevent their wide adoption. Finally, we discuss several open issues in search query reformulations and suggest multiple future research opportunities.
... We then iteratively refine the query representation at test time using a stochastic gradient descent method. We theoretically show that our framework can be viewed as a generalized version of Rocchio's algorithm for pseudo relevance feedback (Rocchio, 1971), which is a common technique in information retrieval for improving query representations with retrieval results. ...
... In fact, pseudo relevance feedback (PRF) techniques in information retrieval (Rocchio, 1971;Lavrenko and Croft, 2001) share a similar motivation with ours in that they refine query representations for single testing query. While most previous work utilized PRF for sparse retrieval (Zamani et al., 2018;Croft et al., 2010), recent work also started to apply PRF on dense retrieval (Yu et al., 2021;Wang et al., 2021;Li et al., 2021). ...
... where g is an update function and q_t denotes the query representation after the t-th update of q_0. The classical Rocchio algorithm for PRF (Rocchio, 1971) updates the query representation as: ...
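The excerpt truncates the update rule; for reference, the classical Rocchio update is commonly written (with pseudo-relevant set D+, non-relevant set D-, and interpolation weights alpha, beta, gamma) as

q_{t+1} = \alpha\, q_t + \frac{\beta}{|D^{+}|}\sum_{d \in D^{+}} d \;-\; \frac{\gamma}{|D^{-}|}\sum_{d \in D^{-}} d

where, in pseudo relevance feedback, D+ is the set of top-ranked document vectors from the previous retrieval round and the negative term is often dropped.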
Preprint
Dense retrieval uses a contrastive learning framework to learn dense representations of queries and contexts. Trained encoders are directly used for each test query, but they often fail to accurately represent out-of-domain queries. In this paper, we introduce a framework that refines instance-level query representations at test time, with only the signals coming from the intermediate retrieval results. We optimize the query representation based on the retrieval result similar to pseudo relevance feedback (PRF) in information retrieval. Specifically, we adopt a cross-encoder labeler to provide pseudo labels over the retrieval result and iteratively refine the query representation with a gradient descent method, treating each test query as a single data point to train on. Our theoretical analysis reveals that our framework can be viewed as a generalization of the classical Rocchio's algorithm for PRF, which leads us to propose interesting variants of our method. We show that our test-time query refinement strategy improves the performance of phrase retrieval (+8.1% Acc@1) and passage retrieval (+3.7% Acc@20) for open-domain QA with large improvements on out-of-domain queries.
... In this article, we conduct an empirical study using 2,320 bug reports (939 natural language-only + 1,381 natural language texts and localization hints), and ten existing approaches on search query construction [13,31,32,56,57,63,64,70], and critically examine the state-of-the-art query construction practices in the IR-based bug localization. We perform three different analyses in our empirical study. ...
... Kevic and Fritz [32] use TF-IDF and three heuristics (e.g., POS tags, notation, position) to identify the important search terms from a bug report. Gay et al. [19] employ Rocchio's expansion [64] where they use TF-IDF in reformulating search queries for concept location. Haiduc et al. [25] later employ three frequency-based term weighting methods (Rocchio [64], RSV [63] and Dice [13]), and deliver the best performing query keywords from the source code for concept location using machine learning. ...
... Gay et al. [19] employ Rocchio's expansion [64] where they use TF-IDF in reformulating search queries for concept location. Haiduc et al. [25] later employ three frequency-based term weighting methods (Rocchio [64], RSV [63] and Dice [13]), and deliver the best performing query keywords from the source code for concept location using machine learning. Sisman and Kak [70] leverage spatial proximity between query keywords and candidate keywords within the source code (a.k.a. spatial code proximity (SCP)), and return such candidates that frequently co-occur with the query. ...
Preprint
Full-text available
Being light-weight and cost-effective, IR-based approaches for bug localization have shown promise in finding software bugs. However, the accuracy of these approaches heavily depends on their used bug reports. A significant number of bug reports contain only plain natural language texts. According to existing studies, IR-based approaches cannot perform well when they use these bug reports as search queries. On the other hand, there is a piece of recent evidence that suggests that even these natural language-only reports contain enough good keywords that could help localize the bugs successfully. On one hand, these findings suggest that natural language-only bug reports might be a sufficient source for good query keywords. On the other hand, they cast serious doubt on the query selection practices in the IR-based bug localization. In this article, we attempted to clear the sky on this aspect by conducting an in-depth empirical study that critically examines the state-of-the-art query selection practices in IR-based bug localization. In particular, we use a dataset of 2,320 bug reports, employ ten existing approaches from the literature, exploit the Genetic Algorithm-based approach to construct optimal, near-optimal search queries from these bug reports, and then answer three research questions. We confirmed that the state-of-the-art query construction approaches are indeed not sufficient for constructing appropriate queries (for bug localization) from certain natural language-only bug reports although they contain such queries. We also demonstrate that optimal queries and non-optimal queries chosen from bug report texts are significantly different in terms of several keyword characteristics, which has led us to actionable insights. Furthermore, we demonstrate 27%--34% improvement in the performance of non-optimal queries through the application of our actionable insights to them.
... Specifically, Hayes et al. (2006) proposed a pioneering approach that asks users to iteratively verify candidate links and uses the standard Rocchio algorithm (Rocchio 1971) to modify the weights of the terms in the text of both requirements and code based on user feedback. However, a follow-on work (De Lucia et al. 2006) reported the improvements brought by this approach to be both limited and not always evident. ...
... Meanwhile, Hayes et al. (2006) proposed to ask users to iteratively verify each candidate link generated by IR techniques, rather than prepare a workable training set in advance. This approach then applies the feedback to the standard Rocchio algorithm (Rocchio 1971) on VSM to modify the weights of the terms in requirements and code in the same iteration when the user verifies a candidate link. Unfortunately, follow-on work (De Lucia et al. 2006) demonstrated that the benefits provided by Hayes et al.'s approach are both limited and not always evident. ...
... We used three mainstream IR models, i.e., VSM, LSI, and JS, to compare CLUSTER' with the nine baseline approaches. According to the experiment results, we set up the baseline approach ARF to apply the Rocchio algorithm (Rocchio 1971) after the creation of the LSI subspace, i.e., after decomposing the term-by-document matrix. We also apply the Rocchio algorithm to JS to set up a more complete experiment, although existing work (Salton and Buckley 1990) has reported that the Rocchio algorithm is not as competitive on the probabilistic model as on the Vector Space Model. To find out whether CLUSTER' is able to improve IR-based traceability recovery with a small amount of user feedback, for UD-CSTI and ARF we assume that their users will verify all candidate links to reach the best performance of the two approaches (default settings according to the related papers (Panichella et al., 2015)). ...
Article
Full-text available
Traceability recovery captures trace links among different software artifacts (e.g., requirements and code) when two artifacts cover the same part of system functionalities. These trace links provide important support for developers in software maintenance and evolution tasks. Information Retrieval (IR) is now the mainstream technique for semi-automatic approaches to recover candidate trace links based on textual similarities among artifacts. The performance of IR-based traceability recovery is evaluated by the ranking of relevant traces in the generated lists of candidate links. Unfortunately, this performance is greatly hindered by the vocabulary mismatch problem between different software artifacts. To address this issue, a growing body of enhancing strategies based on user feedback is proposed to adjust the calculated IR values of candidate links after the user verifies part of these links. However, the improvement brought by this kind of strategies requires a large amount of user feedback, which could be infeasible in practice. In this paper, we propose to improve IR-based traceability recovery by propagating a small amount of user feedback through the closeness analysis on call and data dependencies in the code. Specifically, our approach first iteratively asks users to verify a small set of candidate links. The collected frugal feedback is then composed with the quantified functional similarity for each code dependency (called closeness) and the generated IR values to improve the ranking of unverified links. An empirical evaluation based on nine real-world systems with three mainstream IR models shows that our approach can outperform five baseline approaches by using only a small amount of user feedback.
... An example is shown in Figure 1, where the first document introduces the synonymous term 'COVID-19' to clarify the original query, which contains the ambiguous 'Omicron'. Early PRF was widely studied for sparse retrieval such as vector space models [50], probabilistic models [49], and language modeling methods [21,25,26,35,53,61]. Recently, some work has shifted to applying PRF in dense retrieval with single representations [28,29,59] and multiple representations [54]. ...
... The first category of methods typically expands the query based on global resources, such as WordNet [18], thesauri [52], Wikipedia [1], Freebase [55], and Word2Vec [15]. The second category, so-called relevance feedback [50], is usually more popular. It leverages local relevance feedback for the original query to reformulate a query revision. ...
... It leverages local relevance feedback for the original query to reformulate a query revision. Relevance feedback information can be obtained through explicit feedback (e.g., document relevance judgments [50]), implicit feedback (e.g., clickthrough data [22]), or pseudo-relevance feedback (assuming the top-retrieved documents contain information relevant to the user's information need [3,11]). Of these, pseudo-relevance feedback is the most common, since no user intervention is required. ...
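A compact sketch of the pseudo-relevance feedback loop described above, assuming a generic search(query, k) function that returns the texts of the top-k documents; the term-selection heuristic (document frequency within the feedback set) is an illustrative choice rather than any specific cited model:

from collections import Counter

def pseudo_relevance_feedback(query, search, k=10, n_terms=5):
    # Assume the top-k results of the initial retrieval are relevant,
    # then expand the query with terms frequent in that feedback set.
    feedback_docs = search(query, k)
    doc_freq = Counter()
    for doc in feedback_docs:
        doc_freq.update(set(doc.lower().split()))
    expansion = [t for t, _ in doc_freq.most_common() if t not in query.lower().split()]
    return query + " " + " ".join(expansion[:n_terms])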
Preprint
Full-text available
Pseudo-relevance feedback (PRF) has proven to be an effective query reformulation technique to improve retrieval accuracy. It aims to alleviate the mismatch of linguistic expressions between a query and its potential relevant documents. Existing PRF methods independently treat revised queries originating from the same query but using different numbers of feedback documents, resulting in severe query drift. Without comparing the effects of two different revisions of the same query, a PRF model may incorrectly focus on the additional irrelevant information introduced by the larger feedback set, and thus reformulate a query that is less effective than the revision using less feedback. Ideally, if a PRF model can distinguish between irrelevant and relevant information in the feedback, the more feedback documents there are, the better the revised query will be. To bridge this gap, we propose the Loss-over-Loss (LoL) framework to compare the reformulation losses between different revisions of the same query during training. Concretely, we revise an original query multiple times in parallel using different amounts of feedback and compute their reformulation losses. Then, we introduce an additional regularization loss on these reformulation losses to penalize revisions that use more feedback but incur larger losses. With such comparative regularization, the PRF model is expected to learn to suppress the extra irrelevant information by comparing the effects of different revised queries. Further, we present a differentiable query reformulation method to implement this framework. This method revises queries in the vector space and directly optimizes the retrieval performance of query vectors, and it is applicable to both sparse and dense retrieval models. Empirical evaluation demonstrates the effectiveness and robustness of our method for two typical sparse and dense retrieval models.
... These are used to initialize an iterative pool-based active learning workflow [50]. Reviewed documents are used to train a predictive model, which in turn is used to select further documents based on predicted relevance [51], uncertainty [52], or composite factors. Workflows may be batch-oriented (mimicking pre-machine learning manual workflows common in the law) or a stream of documents may be presented through an interactive interface with training done in the background. ...
... At the end of each round, a logistic regression model was trained and applied to the unlabeled documents. The training batch for the next round was then selected by one of three methods: a random sampling baseline, uncertainty sampling [52], or relevance feedback (top scoring documents) [51]. Variants of the latter two are widely used in eDiscovery [57]. ...
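A simplified sketch of one such round with scikit-learn; the three selection strategies mirror those listed above (random sampling, uncertainty sampling, relevance feedback), while the batch size, features and function name are placeholders rather than the cited study's setup:

import numpy as np
from sklearn.linear_model import LogisticRegression

def next_batch(X_labeled, y_labeled, X_unlabeled, strategy="relevance", batch=100, seed=0):
    # Train on the documents reviewed so far and score the unlabeled pool.
    model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    scores = model.predict_proba(X_unlabeled)[:, 1]           # predicted P(relevant)
    if strategy == "random":
        return np.random.default_rng(seed).choice(len(scores), size=batch, replace=False)
    if strategy == "uncertainty":
        return np.argsort(np.abs(scores - 0.5))[:batch]       # closest to the decision boundary
    return np.argsort(scores)[::-1][:batch]                   # relevance feedback: top-scoring docs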
Preprint
Full-text available
Content moderation (removing or limiting the distribution of posts based on their contents) is one tool social networks use to fight problems such as harassment and disinformation. Manually screening all content is usually impractical given the scale of social media data, and the need for nuanced human interpretations makes fully automated approaches infeasible. We consider content moderation from the perspective of technology-assisted review (TAR): a human-in-the-loop active learning approach developed for high recall retrieval problems in civil litigation and other fields. We show how TAR workflows, and a TAR cost model, can be adapted to the content moderation problem. We then demonstrate on two publicly available content moderation data sets that a TAR workflow can reduce moderation costs by 20% to 55% across a variety of conditions.
... The blind relevance feedback (pseudo relevance feedback) method is derived from the relevance feedback method (Rocchio, 1971), which consists of modifying the query using terms from the set of documents judged relevant, denoted D_p. In practice, the relevance feedback loop consists of adding weighted terms drawn from D_p to the query. ...
... As described in Section 2.2.3, query expansion methods consist of adding terms to the query with the aim of improving retrieval. The query can be expanded using terms from documents considered relevant (Rocchio, 1971), from knowledge bases (Dalton et al., 2014), or from word vectors learned with machine learning models (Almasri et al., 2016). ...
Thesis
This thesis work lies in the fields of textual information retrieval (IR) and deep learning using neural networks. The work carried out in this thesis is motivated by the fact that the use of neural networks in textual IR has proven effective under certain conditions, but that their use nevertheless presents several limitations which can greatly restrict their application in practice. In this thesis, we propose to study the incorporation of prior knowledge to address three limitations of the use of neural networks for textual IR: (1) the need for large amounts of labelled data; (2) text representations based solely on statistical analyses; (3) the lack of efficiency. We consider three types of prior knowledge to address the limitations mentioned above: (1) knowledge from a semi-structured resource: Wikipedia; (2) knowledge from structured resources in the form of semantic resources such as ontologies or thesauri; (3) knowledge from unstructured text. First, we propose WIKIR: an open-access tool for automatically building IR collections from Wikipedia. Neural networks trained on the automatically created collections subsequently need less labelled data to reach good performance. Second, we developed neural networks for IR that use semantic resources. Integrating semantic resources into the neural networks allows them to reach better performance for information retrieval in the medical domain. Finally, we present neural networks that use knowledge from unstructured text to improve the performance and efficiency of baseline IR models that do not use learning.
... Wang et al. [31] proposed to model non-relevant documents as a mixture of a negative topic model and the background corpus language model, and to extract the negative topic model from this mixture. The Rocchio model [25] considers both positive and negative feedback and can be used when only negative feedback is available. Wang et al. [32] compared various negative feedback methods in the framework of the language model or the vector space model. ...
... BERT-NeuQS and BERT-GT are state-of-the-art neural models for clarifying question selection. We omit the results of other negative feedback methods, such as MultiNeg [15] and Rocchio [25], due to their inferior performance. BERT-NeuQS uses the query performance prediction scores of a candidate question for document retrieval to enrich the question representation. ...
Preprint
Users often need to look through multiple search result pages or reformulate queries when they have complex information-seeking needs. Conversational search systems make it possible to improve user satisfaction by asking questions to clarify users' search intents. Answering a series of questions starting with "what/why/how", however, can take significant effort. To quickly identify user intent and reduce effort during interactions, we propose an intent clarification task based on yes/no questions where the system needs to ask the correct question about intents within the fewest conversation turns. In this task, it is essential to use negative feedback about the previous questions in the conversation history. To this end, we propose a Maximum-Marginal-Relevance (MMR) based BERT model (MMR-BERT) to leverage negative feedback based on the MMR principle for the next clarifying question selection. Experiments on the Qulac dataset show that MMR-BERT outperforms state-of-the-art baselines significantly on the intent identification task, and the selected questions also achieve significantly better performance in the associated document retrieval tasks.
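The MMR principle used for question selection can be sketched in its textbook form as follows; sim_to_query and sim_between are placeholder similarity functions and lam is the usual relevance/redundancy trade-off (this is the generic greedy MMR procedure, not the MMR-BERT model itself):

def mmr_select(candidates, sim_to_query, sim_between, n=5, lam=0.7):
    # Greedy Maximal Marginal Relevance: repeatedly pick the candidate that is
    # relevant to the query but not redundant with what was already selected.
    selected, pool = [], list(candidates)
    while pool and len(selected) < n:
        def mmr_score(c):
            redundancy = max((sim_between(c, s) for s in selected), default=0.0)
            return lam * sim_to_query(c) - (1.0 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected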
... Relevance feedback-based methods simply ask users to rate individual search results as relevant or irrelevant. Gay et al. [34] and Wang et al. [18] implement relevance feedback using the Rocchio algorithm [41] to update a vector representation of the search query and rerank the results. ...
... We note that SCS results may not necessarily have comments or function names from which tasks can be extracted, but may still be relevant to a query; therefore, ZaCQ does not directly filter non-candidates from the results. Instead, it promotes all functions similar to candidate functions and demotes those similar to rejected functions using the Rocchio algorithm [41]. This mechanism can be used for SCS engines that embed functions and queries in the same vector space; it works by creating an updated vector representation of the query shifted towards candidate result vectors and away from rejected ones (according to predefined hyperparameters), and then reranking each result by cosine similarity. ...
Preprint
In source code search, a common information-seeking strategy involves providing a short initial query with a broad meaning, and then iteratively refining the query using terms gleaned from the results of subsequent searches. This strategy requires programmers to spend time reading search results that are irrelevant to their development needs. In contrast, when programmers seek information from other humans, they typically refine queries by asking and answering clarifying questions. Clarifying questions have been shown to benefit general-purpose search engines, but have not been examined in the context of code search. We present a method for generating natural-sounding clarifying questions using information extracted from function names and comments. Our method outperformed a keyword-based method for single-turn refinement in synthetic studies, and was associated with shorter search duration in human studies.
... The core innovation in the Epistemic AI platforms is what we call knowledge mapping. Knowledge mapping uses a knowledge graph in combination with biomedical natural language processing (bioNLP) (Devlin et al, 2018; Przybyła et al, 2018; Afentenos et al, 2005; Elhadad et al, 2005; Yetisgen-Yildiz and Pratt, 2005; Goldstein, 2007; Roberts et al, 2007; Kipper-Schuler et al, 2008; Fiszman et al, 2009; Savova et al, 2010; Luther et al, 2011; see Bretonnel Cohen and Demner-Fushman, 2014 for review), relevance feedback (Agichtein, Brill and Dumais, 2006; States et al, 2009; Yu et al, 2009; Alatrash et al, 2012; Ji et al, 2016; Rocchio, 1971), and network analysis (Zhong et al, 2006; Mostafavi et al, 2008; Suthram et al, 2010; Cokol et al, 2011; Green et al, 2011; Ciofani et al, 2012; Marbach et al, 2012; Guimera and Sales-Pardo, 2013; Kurts et al, 2015; Shi et al, 2016; Suresh et al, 2016; Wong et al, 2016; Wang et al, 2017; Gligorijević et al, 2018; Castro et al, 2019; Chasman et al, 2019; Cramer et al, 2019; Miraldi, 2019; Siahpirani et al, 2019), for an interactive knowledge mapping platform. ...
... Precision and recall (measures of relevancy) increase through the interaction as the investigator adds more landmarks to the map, making it easier to identify and consolidate all of the relevant knowledge. This process is similar to "relevance feedback", which combines search with explicit supervision from users to indicate relevant or useful results (Rocchio, 1971;Agichtein, Brill and Dumais, 2006;States et al, 2009;Yu et al, 2009;Alatrash et al, 2012;Ji et al, 2016). Explicit relevance feedback was not widely adopted in traditional text search applications (Spink, Jansen and Ozmultu, 2000;Anick, 2003), but we are tackling a different problem (knowledge mapping, not search) that inherently involves user interaction and exploration in a manner that is more organic than previous attempts with text search. ...
Preprint
Full-text available
Epistemic AI accelerates biomedical discovery by finding hidden connections in the network of biomedical knowledge. The Epistemic AI web-based software platform embodies the concept of knowledge mapping, an interactive process that relies on a knowledge graph in combination with natural language processing (NLP), information retrieval, relevance feedback, and network analysis. Knowledge mapping reduces information overload, prevents costly mistakes, and minimizes missed opportunities in the research process. The platform combines state-of-the-art methods for information extraction with machine learning, artificial intelligence and network analysis. Starting from a single biological entity, such as a gene or disease, users may: a) construct a map of connections to that entity, b) map an entire domain of interest, and c) gain insight into large biological networks of knowledge. Knowledge maps provide clarity and organization, simplifying the day-to-day research processes.
... Query expansion approaches, which rewrite the user's query, have been shown to be an effective approach to alleviate the vocabulary discrepancies between the user query and the relevant documents, by modifying the user's original query to improve the retrieval effectiveness. Many approaches follow the pseudo-relevance feedback (PRF) paradigm -- such as Rocchio's algorithm [28], the RM3 relevance language model [1], or the DFR query expansion models [4] -- where terms appearing in the top-ranked documents for the initial query are used to expand it. Query expansion (QE) approaches have also found a useful role when integrated with effective BERT-based neural reranking models, by providing a high quality set of candidate documents obtained using the expanded query, which can then be reranked [27,32,35]. ...
... Pseudo-relevance feedback approaches have a long history in Information Retrieval (IR) going back to Rocchio [28] who generated refined query reformulations through linear combinations of the sparse vectors (e.g. containing term frequency information) representing the query and the top-ranked feedback documents. Refined classical PRF models, such as Divergence from Randomness's Bo1 [4], KL [2], and RM3 relevance models [1] have demonstrated their effectiveness on many test collections. ...
Preprint
Pseudo-relevance feedback mechanisms, from Rocchio to the relevance models, have shown the usefulness of expanding and reweighting the users' initial queries using information occurring in an initial set of retrieved documents, known as the pseudo-relevant set. Recently, dense retrieval -- through the use of neural contextual language models such as BERT for analysing the documents' and queries' contents and computing their relevance scores -- has shown a promising performance on several information retrieval tasks still relying on the traditional inverted index for identifying documents relevant to a query. Two different dense retrieval families have emerged: the use of single embedded representations for each passage and query (e.g. using BERT's [CLS] token), or via multiple representations (e.g. using an embedding for each token of the query and document). In this work, we conduct the first study into the potential for multiple representation dense retrieval to be enhanced using pseudo-relevance feedback. In particular, based on the pseudo-relevant set of documents identified using a first-pass dense retrieval, we extract representative feedback embeddings -- while ensuring that these embeddings discriminate among passages -- which are then added to the query representation. These additional feedback embeddings are shown to enhance the effectiveness of both a reranking and an additional dense retrieval operation. Indeed, experiments on the MSMARCO passage ranking dataset show that MAP can be improved by up to 26% on the TREC 2019 query set and 10% on the TREC 2020 query set by the application of our proposed ColBERT-PRF method on a ColBERT dense retrieval approach.
... It is being used in many fields of NLP such as text summarization [1], [2], machine translation [3], paraphrase detection [4], [5], question-answering [6], dialog and conversational systems, sentiment analysis, and clinical information extraction [7]. There are some other applications such as relevance feedback, text classification [8], word sense disambiguation [9], subtopic mining, web search [10] and so on [11]-[13]. STS can be defined as a process that takes two sentences as input and returns a similarity score in the range [0,1] based on their meaning. ...
... In every iteration, the maximum similarity of one role of a sentence to all other roles in another sentence is stored (statement 14). Statements 5-15 summarize the above processes. There might be some roles that are no longer similar between the sentence pair, and as a result their similarity score can be smaller. ...
Article
Full-text available
Semantic similarity between texts can be defined based on their meaning. Assessing textual similarity is a prerequisite in almost all applications in the field of language processing and information retrieval. However, the diversity in sentence structure makes it difficult to estimate the similarity. Some sentence pairs are lexicographically similar but semantically dissimilar. That is why trivial lexical overlap is not enough for measuring the similarity. To capture the semantics of sentences, the context of the words and the structure of the sentence should be considered. In this paper, we propose a new method for capturing the semantic similarity between sentences based on their grammatical roles through word semantics. First, the sentences are divided grammatically into different parts, where each part is considered as a grammatical role. Then multiple new measures are introduced to estimate the role-based similarity exploiting word semantics while considering the sentence structure. The proposed similarity measures focus on inter-role and intra-role similarity between the sentence pair. The word-level semantic information is extracted from a pre-trained word-embedding model. The performance of the proposed method was verified by conducting a wide range of experiments on the SemEval STS dataset. The experimental results indicated the effectiveness of the proposed method in terms of different standard evaluation metrics, and it outperformed some known related works.
... This class of QE approaches relies on the use of a local analysis of top-ranked or identified relevant documents resulting from an initial retrieval round in response to the original (non-reformulated) query. One of the earliest QE methods was proposed by Rocchio [21], who introduced the use of 'relevance feedback' for QE. In this technique, the user manually examines the results of the initial query and provides a judgement on the relevance of the retrieved documents. ...
... The collected feedback information is then used to reformulate the original query. Indeed, most of the relevance feedback methods are inspired by Rocchio's algorithm [21]. However, a major drawback of the relevance feedback technique is that it requires time and effort on the part of the user. ...
Article
This article presents a new query expansion (QE) method aiming to tackle term mismatch in information retrieval (IR). Previous research showed that selecting good expansion terms which do not hurt retrieval effectiveness remains an open and challenging research question. Our method investigates how global statistics of term co-occurrence can be used effectively to enhance expansion term selection and reweighting. Indeed, we build a co-occurrence graph using a context window approach over the entire collection, thus adopting a global QE approach. Then, we employ a semantic similarity measure inspired by the Okapi BM25 model, which allows us to evaluate the discriminative power of words and to select relevant expansion terms based on their similarity to the query as a whole. The proposed method includes a reweighting step where selected terms are assigned weights according to their relevance to the query. What’s more, our method does not require matrix factorisation or complex text mining processes. It only requires simple co-occurrence statistics about terms, which reduces complexity and ensures scalability. Finally, it has two free parameters that may be tuned to adapt the model to the context of a given collection and control co-occurrence normalisation. Extensive experiments on four standard datasets of English (TREC Robust04 and Washington Post) and French (CLEF2000 and CLEF2003) show that our method improves both retrieval effectiveness and robustness in terms of various evaluation metrics and outperforms competitive state-of-the-art baselines with significantly better results. We also investigate the impact of varying the number of expansion terms on retrieval results.
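As a rough illustration of window-based co-occurrence statistics for expansion-term selection, the sketch below counts co-occurrences within a sliding window and ranks candidates by raw co-occurrence with the query terms; the scoring here is a plain count, not the BM25-inspired similarity or the reweighting step described in the abstract:

from collections import defaultdict

def cooccurrence_counts(docs, window=5):
    # Count term co-occurrences within a sliding context window over the collection.
    counts = defaultdict(int)
    for doc in docs:
        tokens = doc.lower().split()
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens))):
                counts[tuple(sorted((tokens[i], tokens[j])))] += 1
    return counts

def expansion_candidates(query_terms, counts, k=5):
    # Rank non-query terms by how often they co-occur with any query term.
    scores = defaultdict(int)
    for (a, b), c in counts.items():
        if a in query_terms and b not in query_terms:
            scores[b] += c
        elif b in query_terms and a not in query_terms:
            scores[a] += c
    return sorted(scores, key=scores.get, reverse=True)[:k]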
... TechTube allows a developer to express a query, representing the task at hand, in natural language. To account for missing terms and improve recall, our approach automatically augments the query by reformulating it with popular reformulation techniques (e.g., Rocchio [7]). The reformulated query is then matched against a repository of online technical videos. ...
... 2) Query Reformulation Engine: The textual description of a task (i.e., the query) provided by a developer may not contain all the necessary keywords to describe the task properly. To address this problem, TechTube reformulates the query using a standard query expansion technique -Rocchio's expansion [7]. TechTube uses the technique as follows. ...
Conference Paper
Full-text available
Software developers frequently watch technical videos and tutorials online as solutions to their problems. However, the audiovisual explanations of the videos might also claim more time from the developers than text-only materials (e.g., programming Q&A threads). Thus, pinpointing and summarizing the relevant fragments from these videos could save the developers valuable time and effort. In this paper, we propose a novel technique -- TechTube -- that can be used to find video segments that are relevant to a given technical task. TechTube allows a developer to express the task as a natural language query. To account for missing vocabularies in the query, TechTube automatically reformulates the query using techniques based on information retrieval. The reformulated query is matched against a repository of online technical videos. The output from TechTube is a sequence of relevant video segments that can be useful to implement the task at hand. Unlike previous research, our approach splits the video by detecting silence in the video audio tracks. Experiments using 98 programming-related search queries show that our approach delivers the relevant videos within the Top-5 results 93% of the time with a mean average precision of 76%. We also find that TechTube can deliver the most relevant section of a technical video with 67% precision and 53% recall, which outperforms the closely related existing approach from the literature. Our developer study involving 16 participants reports that they found the video summaries generated by TechTube accurate, precise, concise, and more useful for their programming tasks than the original complete videos.
... This feedback can take a variety of modalities. The oldest but still highly effective and prominent form of user feedback is relevance feedback, which had already been used in the field of text retrieval for a long time (Rocchio, 1971) before it was adopted for CBIR applications in the late-1990s (e.g., Picard et al., 1996;Cox et al., 2000). Under this regime, the system first performs an initial baseline retrieval based on the query image provided by the user and presents the top-scoring results. ...
... Query-Point Movement (QPM) approaches are among the oldest techniques for the integration of relevance feedback. Their most popular instance is Rocchio's algorithm (Rocchio, 1971), which shifts the query towards the direction of the relevant images and away from the irrelevant ones, resulting in a new query q': ...
Thesis
Content-based image retrieval (CBIR) aims at finding images in large databases such as the internet based on their content. Given an exemplary query image provided by the user, the retrieval system provides a ranked list of similar images. Most contemporary CBIR systems compare images solely by means of their visual similarity, i.e., the occurrence of similar textures and the composition of colors. However, visual similarity does not necessarily coincide with semantic similarity. For example, images of butterflies and caterpillars can be considered as similar, because the caterpillar turns into a butterfly at some point in time. Visually, however, they do not have much in common. In this work, we propose to integrate such human prior knowledge about the semantics of the world into deep learning techniques. Class hierarchies, which are readily available for a plethora of domains and encode is-a relationships (e.g., a poodle is a dog is an animal etc.), serve as a source for this knowledge. Our hierarchy-based semantic embeddings improve the semantic consistency of CBIR results substantially compared to conventional image representations and features. We furthermore present three different mechanisms for interactive image retrieval by incorporating user feedback to resolve the inherent semantic ambiguity present in the query image. One of the proposed methods reduces the required user feedback to a single click using clustering, while another keeps the human in the loop by actively asking for feedback regarding those images which are expected to improve the relevance model the most. The third method allows the user to select particularly interesting regions in images. These techniques yield more relevant results after a few rounds of feedback, which reduces the total number of retrieved images the user needs to inspect to find relevant ones.
... Content-based recommender systems recommend to the user items that are similar to those they have liked in the past. Several techniques can be used, including Rocchio's method (Rocchio, 1971), the probabilistic method, Bayesian approaches (Pazzani and Billsus, 2007), decision trees (Quinlan, 2014), and techniques based on vector-space similarity (kNN) (Billsus and Pazzani, 2000). ...
... Machine learning techniques are often used (Adomavicius and Tuzhilin, 2005; Jannach et al., 2010; Lops, de Gemmis and Semeraro, 2011). The last step is carried out by the 'filtering component', which computes the similarity between new items and those the user has rated, in order to propose recommendations. Several methods can be used to compute this similarity (Dice, 1945; Rocchio, 1971; Salton, Wong and Yang, 1975; Chuang and Sher, 1993; Billsus and Pazzani, 2000; Herlocker et al., 2004; Pazzani and Billsus, 2007; Quinlan, 2014). The simplest way to describe a catalogue of items is to have an explicit list of the characteristics (also called attributes, item profile, properties, etc.) of each item. Items can then be recommended to the user according to their characteristics. When the user profile is expressed as a list of interests based on the same characteristics, the recommendation task consists of matching the item characteristics against the user profile (Jannach et al., 2010; Negre, 2015). Many content-based filtering systems focus on recommending items containing textual information (Adomavicius and Tuzhilin, 2005). A classical method is to extract a list of relevant information (keywords) from the textual information contained in the item itself. To handle multimedia items (films, music, images, etc.), metadata and tags are much more widely used, even though they are not the 'content' of the items (Jannach et al., 2010). ...
Thesis
The development of the Internet and of Web 2.0 technology, which adds user-generated content to the ease of publication, places at users' disposal a variety of information whose volume keeps growing. Faced with this problem of information overload, it is difficult for users to orient themselves and to locate the information that meets their needs. Many information filtering systems have been developed to address this problem; one of them is the recommender system. The main objective of recommender systems is to provide users with personalised content suggestions. The underlying principle is to infer the user's information needs, then to identify within the system the information that meets these needs and to recommend it. Recommender systems, widely used in various domains, can also be integrated into social networks. Most social networks are characterised both by a large number of interactions and by user anonymity. These characteristics correspond to the conditions described in social psychology for a state of deindividuation to be triggered. Social network users are likely to find themselves in a situation where group identity is significantly heightened and their individual identity restricted. Their thoughts, their behaviours and even their preferences are strongly influenced by group norms, including, of course, their feedback on the information they receive. This feedback could be biased, that is, it may not reflect the users' true individual preferences. Recommendations based on such biased feedback would therefore run counter to the original intention of personalised recommendations. This thesis is devoted to exploring the deindividuation phenomenon that can exist in social networks and its impact on users' rating behaviour, while also taking cultural differences into account. We chose movie recommender systems as our field of study, which led us to examine the users of four platforms for film enthusiasts through their movie rating behaviour. The results confirm the existence of the deindividuation phenomenon in social networks and its significant impact on users' rating behaviour. Cultural difference is also an important factor influencing rating behaviour. On this basis, we argue that recommender systems applied in social networks should pay attention to this, and that certain measures aiming to individualise users should be taken before collecting and analysing user feedback.
... Further domain knowledge is incorporated through the use of well-known models and algorithms from NLU and information retrieval (IR). Most notably, we develop a self-supervised learning scheme for generating high-quality search sessions by exploiting insights from relevance feedback [Rocchio, 1971]. We train a supervised LM search agent based on T5 [Raffel et al., 2020] directly on this data. ...
... To create candidate refinements, we make use of the idea of pseudo-relevance feedback as suggested in Rocchio [1971]. An elementary refinement -- called a Rocchio expansion -- then takes the form q_{t+1} := q_t \, \Delta q_t, \quad \Delta q_t := [+\mid-]\,[\text{TITLE}\mid\text{CONTENT}]\; w_t, \quad w_t \in \Sigma_t := \Sigma_t^{q} \cup \Sigma_t^{\tau} \cup \Sigma_t^{\alpha} \cup \Sigma_t^{\beta} \quad (2) ...
Preprint
Can machines learn to use a search engine as an interactive tool for finding information? That would have far reaching consequences for making the world's knowledge more accessible. This paper presents first steps in designing agents that learn meta-strategies for contextual query refinements. Our approach uses machine reading to guide the selection of refinement terms from aggregated search results. Agents are then empowered with simple but effective search operators to exert fine-grained and transparent control over queries and search results. We develop a novel way of generating synthetic search sessions, which leverages the power of transformer-based generative language models through (self-)supervised learning. We also present a reinforcement learning agent with dynamically constrained actions that can learn interactive search strategies completely from scratch. In both cases, we obtain significant improvements over one-shot search with a strong information retrieval baseline. Finally, we provide an in-depth analysis of the learned search policies.
... Pseudo relevance feedback (Rocchio, 1971;Robertson and Jones, 1976), also known as blind relevance feedback, is an important technique that can be used to improve the effectiveness of IR systems. The idea behind this technique is to assume that the initially returned top-k ranked documents are relevant and therefore their terms are used to perform a new query. ...
... This approach was taken further either manually by certain teams (i.e., OHSU) or automatically, as seen in initial iterations of Covidex [33], a consistently high-performing neural re-ranking methodology that was an early adopter of the "Udel" preprocessing method. In fact, adapting the queries to better match document representations, or to minimize query-document mismatch, has long been researched and includes work using relevance judgments [34]. Second, it was difficult to capture run-specific differences between runs submitted by the same team, as team-specific features were often not provided. This had important implications for runs submitted in Round 5, where teams were allowed to submit up to 8 runs. ...
Article
The COVID-19 pandemic has resulted in a rapidly growing quantity of scientific publications from journal articles, preprints, and other sources. The TREC-COVID Challenge was created to evaluate information retrieval methods and systems for this quickly expanding corpus. Using the COVID-19 Open Research Dataset (CORD-19), several dozen research teams participated in over 5 rounds of the TREC-COVID Challenge. While previous work has compared IR techniques used on other test collections, there are no studies that have analyzed the methods used by participants in the TREC-COVID Challenge. We manually reviewed team run reports from Rounds 2 and 5, extracted features from the documented methodologies, and used a univariate and multivariate regression-based analysis to identify features associated with higher retrieval performance. We observed that fine-tuning datasets with relevance judgments, MS-MARCO, and CORD-19 document vectors was associated with improved performance in Round 2 but not in Round 5. Though the relatively decreased heterogeneity of runs in Round 5 may explain the lack of significance in that round, fine-tuning has been found to improve search performance in previous challenge evaluations by improving a system’s ability to map relevant queries and phrases to documents. Furthermore, term expansion was associated with improvement in system performance, and the use of the narrative field in the TREC-COVID topics was associated with decreased system performance in both rounds. These findings emphasize the need for clear queries in search. While our study has some limitations in its generalizability and scope of techniques analyzed, we identified some IR techniques that may be useful in building search systems for COVID-19 using the TREC-COVID test collections.
... There are several query expansion methods. A well-known and effective technique in ad-hoc IR to address the vocabulary mismatch problem is PRF (Rocchio, 1971; Carpineto & Romano, 2012). It assumes that a small set of top-retrieved documents is relevant to the query. ...
Article
Query Expansion (QE) approaches, which involve the reformulation of queries by adding new terms to the initial user query, are intended to ameliorate the vocabulary mismatch between the query keywords and the documents in Information Retrieval Systems (IRS). One big issue in QE is the selection of the right candidate terms for expansion. For this purpose Linked Data can be used, as a valuable resource, for providing additional expansion features such as the values of sub- and super-classes of resources. The underlying research question is whether interlinked data and vocabulary items provide features which can be taken into account for query expansion. In this paper, we introduced a new QE approach aimed at improving IRS by using the well-known distribution-based method Bose-Einstein statistics (Bo1) as well as Linked Data from the knowledge base DBpedia, using different numbers of expansion terms. We evaluated the effectiveness of each method individually as well as their combinations using two Text REtrieval Conference (TREC) test collections. Our approach has led to significant improvement in terms of precision, recall, Mean Average Precision (MAP) at rank 10, and normalized Discounted Cumulative Gain (nDCG) at different ranks compared to Pseudo Relevance Feedback (PRF), which we used as a baseline. The results show that the inclusion of semantic annotations clearly improves the retrieval performance over the baseline method.
... As evidenced by these results, the Nearest Centroid (NC) model clearly outperforms the other risk-assessment methods. NC is a powerful similarity-based classification method that has been successfully applied to cancer class prediction from gene expression profiling [32] and to text classification [27], where it is usually known as the Rocchio classifier. In our scenario, the NC algorithm seeks to extract the main features of each of the archetypes of aggressors and, using them, analyzes new cases by computing their similarity to each of these general patterns. ...
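As a rough, generic illustration of the Nearest Centroid idea described in this excerpt (the toy feature vectors below are invented for the example and are unrelated to the paper's gender-violence data), a minimal sketch in Python could look like:

# Minimal Nearest Centroid (Rocchio-classifier-style) sketch.
# Toy 2-D feature vectors and labels are illustrative only.
import numpy as np
from sklearn.neighbors import NearestCentroid

X_train = np.array([[0.9, 0.1], [0.8, 0.2],   # samples of "archetype" 0
                    [0.1, 0.9], [0.2, 0.8]])  # samples of "archetype" 1
y_train = np.array([0, 0, 1, 1])

clf = NearestCentroid()        # builds one mean vector (centroid) per class
clf.fit(X_train, y_train)

X_new = np.array([[0.85, 0.15]])
print(clf.predict(X_new))      # -> [0], the class whose centroid is closest

New cases are thus scored purely by proximity to each class's "general pattern", which is what makes the method easy to interpret.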
Preprint
Gender-based crime is one of the most concerning scourges of contemporary society. Governments worldwide have invested substantial economic and human resources to radically eliminate this threat. Despite these efforts, providing accurate predictions of the risk that a victim of gender violence has of being attacked again is still a very hard open problem. The development of new methods for issuing accurate, fair and quick predictions would allow police forces to select the most appropriate measures to prevent recidivism. In this work, we propose to apply Machine Learning (ML) techniques to create models that accurately predict the recidivism risk of a gender-violence offender. The relevance of the contribution of this work is threefold: (i) the proposed ML method outperforms the preexisting risk assessment algorithm based on classical statistical techniques, (ii) the study has been conducted through an official specific-purpose database with more than 40,000 reports of gender violence, and (iii) two new quality measures are proposed for assessing the effective police protection that a model supplies and the overload in the invested resources that it generates. Additionally, we propose a hybrid model that combines the statistical prediction methods with the ML method, permitting authorities to implement a smooth transition from the preexisting model to the ML-based model. This hybrid nature enables a decision-making process that optimally balances the efficiency of the police system and the aggressiveness of the protection measures taken.
... Conventional Query Expansion. GAR shares some merits with query expansion (QE) methods based on pseudo relevance feedback (Rocchio, 1971;Abdul-Jaleel et al., 2004;Lv and Zhai, 2010) in that they both expand the queries with relevant contexts (terms) without the use of external supervision. GAR is superior as it expands the queries with knowledge stored in the PLMs rather than the retrieved passages and its expanded terms are learned through text generation. ...
... On the other hand, the local methods adjust a query based on the top-ranked documents retrieved by the original query. This kind of query expansion is called pseudo-relevance feedback (PRF) [22,163], which has proven highly effective in improving the performance of many retrieval models [94,108,146]. The relevance model [94], the mixture model and the divergence minimization model [186] were the first PRF methods proposed under the language modeling framework. Since then, several other local methods have been proposed, but the relevance model remains among the state-of-the-art PRF methods and performs more robustly than many other methods [108]. ...
Preprint
Full-text available
Multi-stage ranking pipelines have been a practical solution in modern search systems, where the first-stage retrieval returns a subset of candidate documents and the later stages attempt to re-rank those candidates. Unlike the re-ranking stages, which have gone through quick technique shifts during the past decades, the first-stage retrieval has long been dominated by classical term-based models. Unfortunately, these models suffer from the vocabulary mismatch problem, which may block the re-ranking stages from relevant documents at the very beginning. Therefore, it has been a long-term desire to build semantic models for the first-stage retrieval that can achieve high recall efficiently. Recently, we have witnessed an explosive growth of research interest in first-stage semantic retrieval models. We believe it is the right time to survey the current status, learn from existing methods, and gain some insights for future development. In this paper, we describe the current landscape of semantic retrieval models from three major paradigms, paying special attention to recent neural-based methods. We review the benchmark datasets, optimization methods and evaluation metrics, and summarize the state-of-the-art models. We also discuss the unresolved challenges and suggest potentially promising directions for future work.
... After the initial retrieval of the documents, the primary query is reformulated by adding and re-weighting some extra terms [also known as pseudo-relevance feedback (PRF)] (Lee et al. 2008; Bendersky and Croft 2008; Robertson and Jones 1976; Carpineto and Romano 2012). For ranking and weighting query terms, a variety of weighting schemes are available, such as term frequency-inverse document frequency (TF-IDF) (Carpineto and Romano 2012), Rocchio's weight (Rocchio 1971), the binary independence model (BIM) (Robertson and Jones 1976), chi-square (CHI) (Zia et al. 2015), the Robertson selection value (RSV) (Walker et al. 1996), Kullback-Leibler divergence (KLD), and Bose-Einstein1 (Bo1) and Bose-Einstein2 (Bo2) (Amati and Van Rijsbergen 2002), to name a few. It is well established that inconsistency in PRF can lead to inaccurate retrieval of top documents (Xu et al. 2009). ...
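As a rough sketch of how one such weighting scheme can rank candidate expansion terms from pseudo-relevant documents, the toy Python snippet below scores terms with a simple TF-IDF-style weight (the mini-corpus and helper names are invented for illustration, not taken from the cited work):

# Toy sketch: rank candidate expansion terms from pseudo-relevant documents
# with a TF-IDF-style weight. Corpus and numbers are illustrative only.
import math
from collections import Counter

corpus = [
    "covid vaccine trial results",
    "vaccine efficacy in clinical trial",
    "stock market results today",
]
feedback_docs = corpus[:2]          # top-ranked docs assumed pseudo-relevant
N = len(corpus)

def df(term):                       # document frequency over the whole corpus
    return sum(term in doc.split() for doc in corpus)

tf = Counter(t for doc in feedback_docs for t in doc.split())
scores = {t: tf[t] * math.log((N + 1) / (df(t) + 1)) for t in tf}

# the highest-scoring terms become candidate expansion terms
print(sorted(scores, key=scores.get, reverse=True)[:3])

Other schemes from the list above (KLD, Bo1, Bo2, RSV, ...) would only change the scoring function; the selection loop stays the same.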
Article
Full-text available
Retrieving relevant documents from a large set using the original query is a formidable challenge. A generic approach to improve the retrieval process is realized using pseudo-relevance feedback techniques. This technique allows the expansion of original queries with conducive keywords that return the most relevant documents corresponding to the original query. In this paper, five different hybrid techniques were tested utilizing traditional query expansion methods. Later, the boosting query term method was proposed to reweight and strengthen the original query. The query-wise analysis revealed that the proposed approach effectively identified the most relevant keywords, and that was true even for short queries. All the proposed methods' potency was evaluated on three different datasets: Roshni, Hamshahri1, and FIRE2011. Compared to the traditional query expansion methods, the proposed methods improved the mean average precision values of the Urdu, Persian, and English datasets by 14.02%, 9.93%, and 6.60%, respectively. The obtained results were also validated using analysis of variance and post-hoc analysis.
... Simultaneously, many of these works such as those of Lewis (1998) and Turney (2002) had their origins in information retrieval and information theory. In these works, it was standard to use order-agnostic term frequency (TF) (Luhn, 1957) and inverse document frequency (IDF) (Spärck Jones, 1972) features, commonly under the joint framing of the TF-IDF weighting schema (Salton, 1991) as in the Rocchio classifier (Rocchio, 1971;Joachims, 1997). Comprehensive surveys and analyses of models for text classification are provided by Yang and Pedersen (1997); Yang and Liu (1999); Yang (1999); Aggarwal and Zhai (2012). ...
Preprint
Full-text available
The sequential structure of language, and the order of words in a sentence specifically, plays a central role in human language processing. Consequently, in designing computational models of language, the de facto approach is to present sentences to machines with the words ordered in the same order as in the original human-authored sentence. The very essence of this work is to question the implicit assumption that this is desirable and inject theoretical soundness into the consideration of word order in natural language processing. In this thesis, we begin by uniting the disparate treatments of word order in cognitive science, psycholinguistics, computational linguistics, and natural language processing under a flexible algorithmic framework. We proceed to use this heterogeneous theoretical foundation as the basis for exploring new word orders with an undercurrent of psycholinguistic optimality. In particular, we focus on notions of dependency length minimization given the difficulties in human and computational language processing in handling long-distance dependencies. We then discuss algorithms for finding optimal word orders efficiently in spite of the combinatorial space of possibilities. We conclude by addressing the implications of these word orders on human language and their downstream impacts when integrated in computational models.
... Our system incorporated this feedback only in a trivial manner, simply by removing documents or snippets that were identified as known negatives. There has been research on relevance feedback since at least 1971 [14], which could be incorporated into the system. More recent approaches, such as using a twin neural network with a contrastive loss [15], may work here. ...
Preprint
Full-text available
This paper presents Macquarie University's participation in the BioASQ Synergy Task and BioASQ9b Phase B. In each of these tasks, our participation focused on the use of query-focused extractive summarisation to obtain the ideal answers to medical questions. The Synergy Task is an end-to-end question answering task on COVID-19 where systems are required to return relevant documents, snippets, and answers to a given question. Given the absence of training data, we used a query-focused summarisation system that was trained with the BioASQ8b training data set, and we experimented with methods to retrieve the documents and snippets. Considering the poor quality of the documents and snippets retrieved by our system, we observed reasonably good quality in the answers returned. For Phase B of the BioASQ9b task, the relevant documents and snippets were already included in the test data. Our system split the snippets into candidate sentences and used BERT variants under a sentence classification setup. The system used the question and candidate sentence as input and was trained to predict the likelihood of the candidate sentence being part of the ideal answer. The runs obtained either the best or second best ROUGE-F1 results of all participants for all batches of BioASQ9b. This shows that using BERT in a classification setup is a very strong baseline for the identification of ideal answers.
... Various techniques have been presented, which fall into three categories: (i) relevance/pseudo-relevance feedback, (ii) ontology/thesaurus-based, and (iii) leveraging user search logs. Relevance feedback-based techniques [33] involve submitting the original query to retrieve initial results, then asking the user to select the relevant ones, and finally expanding the query with additional terms extracted from the relevant results and conducting the search with the modified query. Pseudo-relevance [4,36,43], a variant of relevance feedback, assumes that the top few results returned from the original query are relevant [34]. ...
Conference Paper
Full-text available
In e-commerce search engines, query rewriting (QR) is a crucial technique that improves the shopping experience by reducing the vocabulary gap between user queries and the product catalog. Recent works have mainly adopted the generative paradigm. However, they hardly ensure high-quality generated rewrites and do not consider personalization, which leads to degraded search relevance. In this work, we present Contrastive Learning Enhanced Query Rewriting (CLE-QR), the solution used in Taobao product search. It uses a novel contrastive learning enhanced architecture based on “query retrieval−semantic relevance ranking−online ranking”. It finds the rewrites from hundreds of millions of historical queries while considering relevance and personalization. Specifically, we first alleviate the representation degeneration problem during the query retrieval stage by using an unsupervised contrastive loss, and then further propose an interaction-aware matching method to find the beneficial and incremental candidates, thus improving the quality and relevance of candidate queries. We then present a relevance-oriented contrastive pre-training paradigm on the noisy user feedback data to improve semantic ranking performance. Finally, we rank these candidates online with the user profile to model personalization for the retrieval of more relevant products. We evaluate CLE-QR on Taobao Product Search, one of the largest e-commerce platforms in China. Significant metric gains are observed in online A/B tests. CLE-QR has been deployed to our large-scale commercial retrieval system and has served hundreds of millions of users since December 2021. We also introduce its online deployment scheme, and share practical lessons and optimization tricks of our lexical match system.
... Modern QE techniques either involve document collection analysis (globally or locally) or are dictionary, thesaurus or ontology-based [5,9]. In global QE methods, the document corpus is analysed globally to extract similarity relations for query terms while in local QE methods relevance feedback as proposed by Rocchio is used [34]. This involves selecting expansion terms by exploring the top retrieved documents presuming these to be relevant. ...
Article
Full-text available
In Information Retrieval (IR) Systems, an essential technique employed to improve accuracy and efficiency is Query Expansion (QE). QE is the technique that reformulates the original query by adding relevant terms that aid the retrieval process in generating more relevant outcomes. Numerous methods proposed in the literature generate desirable results; however, they do not provide uniformly favourable results for all types of queries. One of the primary reasons for this is their inability to capture holistic relationships among the query terms. To tackle this issue, we have proposed a novel technique for QE that leverages a game-theoretic framework to recommend contextually relevant expansion terms for each query. In our approach, the query terms are interpreted as players that play a game with the other terms in the query in order to maximize their payoffs; the payoffs are determined using similarity measures between two query terms in the game. Our framework also works well for disambiguating polysemous query terms. The experimental section presents an analysis of the combination of various similarity and association measures employed in the proposed framework and a comparative analysis against state-of-the-art approaches. In addition to this, we present our analysis over three datasets, namely AP89, INEX and CLUEWEB, in combination with WordNet and BabelNet as knowledge bases. The results show that the proposed work outperforms state-of-the-art algorithms.
... The iterative query reformulation process can be effective in tracking a user's evolving information need [Manning, Raghavan and Schütze, 2008; Baeza-Yates et al., 2011]. The Rocchio algorithm [Rocchio, 1971] is a classic query refinement method where the feedback data is used to re-weight query terms in a vector space. ...
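A minimal sketch of that re-weighting step, assuming the query and the feedback documents are already vectors in a shared term space (the alpha/beta/gamma defaults below are common textbook choices, not values prescribed by the cited book):

# Rocchio-style query refinement sketch: re-weight the query vector using
# vectors of documents judged relevant / non-relevant. Values are illustrative.
import numpy as np

def rocchio_update(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the updated query vector (all inputs are term-space vectors)."""
    new_q = alpha * query
    if len(relevant):
        new_q += beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        new_q -= gamma * np.mean(non_relevant, axis=0)
    return np.clip(new_q, 0.0, None)   # negative term weights are often dropped

q = np.array([1.0, 0.0, 0.5])
rel = np.array([[0.8, 0.6, 0.0], [0.9, 0.4, 0.1]])
nonrel = np.array([[0.0, 0.0, 0.9]])
print(rocchio_update(q, rel, nonrel))

Each feedback round simply replaces q with the returned vector, so the query drifts toward terms that occur in the relevant documents and away from those in the non-relevant ones.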
Preprint
Full-text available
A conversational information retrieval (CIR) system is an information retrieval (IR) system with a conversational interface which allows users to interact with the system to seek information via multi-turn conversations of natural language, in spoken or written form. Recent progress in deep learning has brought tremendous improvements in natural language processing (NLP) and conversational AI, leading to a plethora of commercial conversational services that allow naturally spoken and typed interaction, increasing the need for more human-centric interactions in IR. As a result, we have witnessed a resurgent interest in developing modern CIR systems in both research communities and industry. This book surveys recent advances in CIR, focusing on neural approaches that have been developed in the last few years. This book is based on the authors' tutorial at SIGIR'2020 (Gao et al., 2020b), with the IR and NLP communities as the primary target audience. However, audiences with other backgrounds, such as machine learning and human-computer interaction, will also find it an accessible introduction to CIR. We hope that this book will prove a valuable resource for students, researchers, and software developers. This manuscript is a working draft. Comments are welcome.
... In a study on text classification, Rocchio [2] first proposed the Rocchio text classification algorithm, which uses a training set to construct a prototype vector for each class and allocates an input document to a class by calculating the similarity between the document and each class's prototype vector. This method is easy to implement and compute. ...
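A compact sketch of this prototype-vector idea on toy text data, using TF-IDF features and cosine similarity as the (assumed) similarity measure:

# Rocchio-style text classification sketch: build one prototype vector per class
# and assign a new document to the most similar prototype. Toy data only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["cheap flights hotel deals", "book hotel holiday package",
              "football match final score", "league cup match highlights"]
train_labels = np.array(["travel", "travel", "sport", "sport"])

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs).toarray()

# prototype = mean vector of all training documents of a class
prototypes = {c: X[train_labels == c].mean(axis=0) for c in set(train_labels)}

new_doc = vec.transform(["hotel booking for holiday"]).toarray()[0]
pred = max(prototypes, key=lambda c: cosine_similarity([new_doc], [prototypes[c]])[0, 0])
print(pred)   # expected: "travel"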
Article
Full-text available
A single model is often used to classify text data, but the generalization effect of a single model on text data sets is poor. To improve the model classification accuracy, a method is proposed that is based on a deep neural network (DNN), recurrent neural network (RNN), and convolutional neural network (CNN) and integrates multiple models trained by a deep learning network architecture to obtain a strong text classifier. Additionally, to increase the flexibility and accuracy of the model, various optimizer algorithms are used to train data sets. Moreover, to reduce the interference in the classification results caused by stop words in the text data, data preprocessing and text feature vector representation are used before training the model to improve its classification accuracy. The final experimental results show that the proposed model fusion method can achieve not only improved classification accuracy but also good classification effects on a variety of data sets.
... An application of relevance feedback is the Rocchio algorithm [Rocchio (1971)], a nearest centroid technique for classification that assigns to documents the class label of the training samples whose mean is closest to the document's class, modifying the weights of the vectors representing both the query and the documents. (Figure 1: Relevance feedback.) Rocchio ... the whole collection rather than to the user query, which is potentially an unwanted behavior. ...
Preprint
Full-text available
With the increasing demand for intelligent systems capable of operating in different user contexts (e.g., users on the move), the correct interpretation of the user need by such systems has become crucial for giving a consistent answer to the user query. The most effective techniques used to address this task lie in the fields of natural language processing and semantic expansion of terms. Such systems aim at estimating the actual meaning of input queries by addressing the concepts of the words expressed within the user questions. The aim of this paper is to demonstrate which semantic relation has the greatest impact in semantic expansion-based retrieval systems and to identify the best tradeoff between accuracy and noise introduction when combining such relations. The evaluations are made by building a simple natural language processing system capable of querying any taxonomy-driven domain, making use of the combination of different semantic expansions as knowledge resources. The proposed evaluation employs a wide and varied taxonomy as a use case, exploiting its labels as the basis for the expansions. To build the knowledge resources, several corpora have been produced and integrated as gazetteers into the NLP infrastructure with the purpose of estimating the pseudo-queries corresponding to the taxonomy labels, considered as the possible intents.
... Query Expansion. A widely used approach to improving recall uses query expansion from relevance feedback, which takes a user's judgment of a result's relevance and uses it to build an updated query model (Rocchio 1971). Pseudo-relevance feedback (PRF) (Lavrenko & Croft, 2001; Lv & Zhai, 2009; Zhai & Lafferty, 2001) approaches perform this task automatically, assuming the top documents are relevant. ...
Article
Full-text available
In this work, we study recent advances in context-sensitive language models for the task of query expansion. We study the behavior of existing and new approaches for lexical word-based expansion in both unsupervised and supervised contexts. For unsupervised models, we study the behavior of the Contextualized Embeddings for Query Expansion (CEQE) model. We introduce a new model, Supervised Contextualized Query Expansion with Transformers (SQET), that performs expansion as a supervised classification task and leverages context in pseudo-relevant results. We study the behavior of these expansion approaches for the tasks of ad-hoc document and passage retrieval. We conduct experiments combining expansion with probabilistic retrieval models as well as neural document ranking models. We evaluate expansion effectiveness on three standard TREC collections: Robust, Complex Answer Retrieval, and Deep Learning. We analyze the results of extrinsic retrieval effectiveness and the intrinsic ability to rank expansion terms, and perform a qualitative analysis of the differences between the methods. We find that CEQE statistically significantly outperforms static embeddings across all three datasets for Recall@1000. Moreover, CEQE outperforms static embedding-based expansion methods on multiple collections (by up to 18% on Robust and 31% on Deep Learning in average precision) and also improves over proven probabilistic pseudo-relevance feedback (PRF) models. SQET outperforms CEQE by 6% in P@20 on the intrinsic term ranking evaluation and is approximately as effective in retrieval performance. Models incorporating neural and CEQE-based expansion scores achieve gains of up to 5% in P@20 and 2% in AP on Robust over the state-of-the-art transformer-based re-ranking model, Birch.
... Kowsari et al. (2019) provide a recent survey of text classification approaches. Traditional approaches include techniques such as the Rocchio algorithm (Rocchio, 1971), boosting (Schapire, 1990) and bagging (Breiman, 1996), and logistic regression (Cox & Snell, 1989), as well as naïve Bayes. Clustering-based approaches include k-nearest neighbor and support vector machines (Vapnik & Chervonenkis, 1964). ...
Article
Full-text available
Although several large knowledge graphs have been proposed in the scholarly field, such graphs are limited with respect to several data quality dimensions such as accuracy and coverage. In this article, we present methods for enhancing the Microsoft Academic Knowledge Graph (MAKG), a recently published large-scale knowledge graph containing metadata about scientific publications and associated authors, venues, and affiliations. Based on a qualitative analysis of the MAKG, we address three aspects. First, we adopt and evaluate unsupervised approaches for large-scale author name disambiguation. Second, we develop and evaluate methods for tagging publications by their discipline and by keywords, facilitating enhanced search and recommendation of publications and associated entities. Third, we compute and evaluate embeddings for all 254 million authors, 210 million papers, 49,000 journals, and 16,000 conference entities in the MAKG based on several state-of-the-art embedding techniques. Finally, we provide statistics for the updated MAKG. Our final MAKG is publicly available at https://makg.org and can be used for the search or recommendation of scholarly entities, as well as enhanced scientific impact quantification. Peer Review https://publons.com/publon/10.1162/qss_a_00183
... Document expansion techniques address the vocabulary mismatch problem [Zhao 2012]: queries can use terms semantically similar but lexically different from those used in the relevant documents. Traditionally, this problem has been addressed using query expansion techniques, such as relevance feedback [Rocchio 1971] and pseudo relevance feedback [Lavrenko and Croft 2001]. The advances in neural networks and natural language processing have paved the way to different techniques to address the vocabulary mismatch problem by expanding the documents by learning new terms. ...
Preprint
Full-text available
These lecture notes focus on the recent advancements in neural information retrieval, with particular emphasis on the systems and models exploiting transformer networks. These networks, originally proposed by Google in 2017, have seen a large success in many natural language processing and information retrieval tasks. While there are many fantastic textbooks on information retrieval and natural language processing, as well as specialised books for a more advanced audience, these lecture notes target people aiming to develop a basic understanding of the main information retrieval techniques and approaches based on deep learning. These notes have been prepared for an IR graduate course of the MSc program in Artificial Intelligence and Data Engineering at the University of Pisa, Italy.
... Moreover, C-Rank builds a co-occurrence graph based on the disambiguated concepts and ranks them according to their centrality in the graph, which is further used to identify the keyphrases from a scientific document. Kowsari et al. (2019) indicate distinct learning models, such as Rocchio classification (Rocchio, 1971), Naïve Bayes Classifier (Maron, 1961), k-nearest neighbor (Altman, 1992), and support vector machine (SVM) (Cortes & Vapnik, 1995). According to the authors, those and other cited models either are computationally expensive, do not solve nonlinear problems, are not robust, or require supervised training. ...
Article
Understanding the structure of a scientific domain and extracting specific information from it is laborious. The high amount of manual effort required to this end indicates that the way knowledge has been structured and visualized until the present day should be improved in software tools. Nowadays, scientific domains are organized based on citation networks or bag-of-words techniques, disregarding the intrinsic semantics of concepts presented in literature documents. We propose a novel approach to structure scientific fields, which uses semantic analysis of natural language texts to construct knowledge graphs. Then, our approach clusters knowledge graphs into their main topics and automatically extracts information such as the most relevant concepts in topics and overlapping concepts between topics. We evaluate the proposed model on two datasets from distinct areas. The results achieve up to 84% accuracy in the task of document classification without using annotated data to segment topics from a set of input documents. Our solution identifies coherent keyphrases and key concepts for the datasets used. The SciKGraph framework contributes by structuring knowledge that might aid researchers in the study of their areas, reducing the effort and amount of time devoted to groundwork.
... Our experiments are carried out using the well-known TFIDF method. This type of technique is based on the relevance feedback algorithm proposed by Rocchio (Rocchio, 1971). The idea of the TFIDF algorithm is to represent each document d by a vector D = (d_1, d_2, ..., d_v) in a vector space. ...
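As a brief sketch of that representation step (using scikit-learn for illustration rather than the implementation from the cited paper, and toy English documents instead of the Arabic Khaleej-2004 corpus it uses):

# Represent each document as a TF-IDF vector in a common vector space.
# Documents are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the economy grew this quarter",
        "the team won the final match",
        "markets react to economy news"]

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)          # shape: (number of documents, vocabulary size v)

print(D.shape)                              # each row is one document vector D = (d_1, ..., d_v)
print(vectorizer.get_feature_names_out()[:5])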
Conference Paper
Full-text available
In this paper, we investigate an empirical term selection method for text categorization, namely the Transition Point (TP) technique, and we compare it to two other widely used methods: Term Frequency (TF) and Document Frequency (DF). For evaluation, we have used the well-known TFIDF technique. Experiments have been conducted using the Arabic corpus Khaleej-2004, which is composed of 4 categories. The results obtained from this study show that performance is almost the same for the three techniques. However, we should note that TP is advantageous since it uses a vocabulary much smaller than the ones used in TF and DF.
... Source: Author's own work, based on [Rossi 2015]. Finally, the PUL algorithm RC-SVM [Li and Liu 2003] was also implemented in this project for comparison with the literature. In its first stage it uses the Rocchio algorithm [Rocchio 1971] to generate prototypes of the positive and of the unlabeled documents. The unlabeled documents that are more similar to the prototypes of the unlabeled documents than to the documents of the class of interest are then chosen as negative examples. ...
Preprint
Full-text available
Currently, a massive amount of text is being produced in the digital universe. This large set of texts may contain knowledge that is useful to many areas, both academic and industrial. One way of extracting knowledge from and managing large volumes of text is automatic classification. A way to make automatic classification more attractive and feasible is to use single-class learning (AMUC), in which a classification model is learned considering only documents of the user's class of interest. However, even when AMUC techniques are used, a large number of labeled examples of the class of interest must be provided for accurate classification, which can still make the practical use of AMUC unfeasible. One can then resort to single-class semi-supervised learning (Positive and Unlabeled Learning, PUL), which uses unlabeled examples to improve classification performance. However, the PUL techniques found in the literature rely on algorithms that, in general, do not achieve classification performance that is satisfactory or superior to other semi-supervised learning algorithms. Given this, the goal of this project is the implementation and use of semi-supervised learning techniques better suited to text classification, such as network-based ones. Experiments were run on 10 text collections considering different numbers of labeled examples. We observed that choosing more suitable semi-supervised algorithms yielded gains for all text collections. Moreover, better results were obtained compared with baseline algorithms, and better results than AMUC algorithms when only 1 labeled example was used, for most collections, showing that the use of unlabeled examples in PUL algorithms contributes to increased classification performance. Keywords: semi-supervised learning, single-class learning, text classification, positive and unlabeled learning.
... For such a connection, we are inspired by the way a query (a keyphrase in our case) and a document are connected using relevance feedback, such as clicks on matching query-document pairs. As we have no such feedback between each keyphrase r_j ∈ R and X, we can adopt zero-shot query log synthesis by using pseudo-relevance feedback (Rocchio, 1971): we treat the supporting documents S_{r_j} as feedback documents with pseudo-relevance to r_j, to connect r_j with X using the top-M overlapping words between S_{r_j} and X, denoted by S*_{r_j}. For selecting the top-M, an unsupervised signal frequently used is tf-idf, which favors words appearing frequently in S_{r_j} (i.e., representative words) but infrequently in other documents (i.e., discriminative words) (Xu and Croft, 2017). ...
... For example, queries are typically composed of just a few keywords, which may not be sufficient to assess the relevance of documents to a query effectively. In classical IR, the technique of query expansion is employed to provide more context about users' actual needs (Rocchio, 1971), by exploiting synonymous terms to overcome the vocabulary mismatch problem. However, this is not suitable for neural language models which are trained to process well-formed sentences. ...
Conference Paper
Full-text available
With the advent of contextualized embeddings, attention towards neural ranking approaches for Information Retrieval increased considerably. However, two aspects have remained largely neglected: i) queries usually consist of few keywords only, which increases ambiguity and makes their contextualization harder, and ii) performing neural ranking on non-English documents is still cumbersome due to shortage of labeled datasets. In this paper we present SIR (Sense-enhanced Information Retrieval) to mitigate both problems by leveraging word sense information. At the core of our approach lies a novel multilingual query expansion mechanism based on Word Sense Disambiguation that provides sense definitions as additional semantic information for the query. Importantly, we use senses as a bridge across languages, thus allowing our model to perform considerably better than its supervised and unsupervised alternatives across French, German, Italian and Spanish languages on several CLEF benchmarks, while being trained on English Robust04 data only. We release SIR at https://github.com/SapienzaNLP/sir.
... Additional modifications are needed for both the VSM and language modelling approaches to account for relevance information. One of the early and most widely used approaches in this domain is the Rocchio algorithm (Rocchio, 1971), which modifies the query vector q in the VSM to reweight a query's terms based on their occurrences in relevant and non-relevant documents. In its most common form, the new query vector q̃ is computed as follows
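The excerpt ends before the formula itself; for reference, the textbook form of the Rocchio update (notation assumed here rather than quoted from the thesis) is

\tilde{q} \;=\; \alpha\, q \;+\; \frac{\beta}{|D_r|} \sum_{d_j \in D_r} d_j \;-\; \frac{\gamma}{|D_{nr}|} \sum_{d_k \in D_{nr}} d_k

where D_r and D_{nr} are the sets of relevant and non-relevant documents, and \alpha, \beta, \gamma weight the contributions of the original query and of each feedback set.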
Thesis
A user model is a fundamental component in user-centred information retrieval systems. It enables personalization of a user's search experience. The development of such a model involves three phases: collecting information about each user, representing such information, and integrating the model into a retrieval application. Progress in this area is typically met with privacy and scalability challenges that hinder the ability to synthesize collective knowledge from each user's search behaviour. In this thesis, I propose a framework that addresses each of these three phases. The proposed framework is based on social role theory from the social science literature and at the centre of this theory is the concept of a social position. A social position is a label for a group of users with similar behavioural patterns. Examples of such positions are traveller, patient, movie fan, and computer scientist. In this thesis, a social position acts as a label for users who are expected to have similar interests. The proposed framework does not require real users' data; rather it uses the web as a resource to model users. The proposed framework offers a data-driven and modular design for each of the three phases of building a user model. First, I present an approach to identify social positions from natural language sentences. I formulate this task as a binary classification task and develop a method to enumerate candidate social positions. The proposed classifier achieves an accuracy score of 85.8%, which indicates that social positions can be identified with good accuracy. Through an inter-annotator agreement study, I further show a reasonable level of agreement between users when identifying social positions. Second, I introduce a novel topic modelling-based approach to represent each social position as a multinomial distribution over words. This approach estimates a topic from a document collection for each position. To construct such a collection for a particular position, I propose a seeding algorithm that extracts a set of terms relevant to the social position. Coherence-based evaluation shows that the proposed approach learns significantly more coherent representations when compared with a relevance modelling baseline. Third, I present a diversification approach based on the proposed framework. Diversification algorithms aim to return a result list for a search query that would potentially satisfy users with diverse information needs. I propose to identify social positions that are relevant to a search query. These positions act as an implicit representation of the many possible interpretations of the search query. Then, relevant positions are provided to a diversification technique that proportionally diversifies results based on each social position's importance. I evaluate my approach using four test collections provided by the diversity task of the Text REtrieval Conference (TREC) web tracks for 2009, 2010, 2011, and 2012. Results demonstrate that my proposed diversification approach is effective and provides statistically significant improvements over various implicit diversification approaches. Fourth, I introduce a session-based search system under the framework of learning to rank. Such a system aims to improve the retrieval performance for a search query using previous user interactions during the search session. I present a method to match a search session to its most relevant social positions based on the session's interaction data. 
I then suggest identifying related sessions from query logs that are likely to be issued by users with similar information needs. Novel learning features are then estimated from the session's social positions, related sessions, and interaction data. I evaluate the proposed system using four test collections from the TREC session track. This approach achieves state-of-the-art results compared with effective session-based search systems. I demonstrate that such a strong performance is mainly attributed to features that are derived from social positions' data.
Article
Full-text available
Being light-weight and cost-effective, IR-based approaches for bug localization have shown promise in finding software bugs. However, the accuracy of these approaches heavily depends on their used bug reports. A significant number of bug reports contain only plain natural language texts. According to existing studies, IR-based approaches cannot perform well when they use these bug reports as search queries. On the other hand, there is a piece of recent evidence that suggests that even these natural language-only reports contain enough good keywords that could help localize the bugs successfully. On one hand, these findings suggest that natural language-only bug reports might be a sufficient source for good query keywords. On the other hand, they cast serious doubt on the query selection practices in the IR-based bug localization. In this article, we attempted to clear the sky on this aspect by conducting an in-depth empirical study that critically examines the state-of-the-art query selection practices in IR-based bug localization. In particular, we use a dataset of 2,320 bug reports, employ ten existing approaches from the literature, exploit the Genetic Algorithm-based approach to construct optimal, near-optimal search queries from these bug reports, and then answer three research questions. We confirmed that the state-of-the-art query construction approaches are indeed not sufficient for constructing appropriate queries (for bug localization) from certain natural language-only bug reports. However, these bug reports indeed contain high-quality search keywords in their texts even though they might not contain explicit hints for localizing bugs (e.g., stack traces). We also demonstrate that optimal queries and non-optimal queries chosen from bug report texts are significantly different in terms of several keyword characteristics (e.g., frequency, entropy, position, part of speech). Such an analysis has led us to four actionable insights on how to choose appropriate keywords from a bug report. Furthermore, we demonstrate 27%–34% improvement in the performance of non-optimal queries through the application of our actionable insights to them. Finally, we summarize our study findings with future research directions (e.g., machine intelligence in keyword selection).
Chapter
There has been an explosive growth of data and information in recent years with the coming of the World Wide Web. A major challenge in this arena is to serve the correct information to the correct person, which makes efficient decision making a complex task. To solve these problems, the recommender system plays a vital role. Most of the e-commerce websites used today make use of recommender systems for effective decision making. Today's recommender systems take into account only content information, ignoring sequential details, which also play a vital role in recognizing the behavior of users. The present paper explores the different types of recommender techniques with their mathematical foundations and also discusses some of the problems in the prevailing systems. Our proposed approach makes use of sequential patterns of web navigation along with content information and is based on the set and sequence similarity measure (S3M) for generating recommendations on web data. The paper makes use of the mathematics involved in finding set and sequence similarity for recommendations to users on the CTI news dataset. To create suggestions for users, our proposed method uses the principles of upper approximation and singular value decomposition.
Article
Text Classification methods have been improving at an unparalleled speed in the last decade thanks to the success brought about by deep learning. Historically, state-of-the-art approaches have been developed for and benchmarked against English datasets, while other languages have had to catch up and deal with inevitable linguistic challenges. This paper offers a survey with practical and linguistic connotations, showcasing the complications and challenges tied to the application of modern Text Classification algorithms to languages other than English. We engage this subject from the perspective of the Italian language, and we discuss in detail issues related to the scarcity of task-specific datasets, as well as the issues posed by the computational expensiveness of modern approaches. We substantiate this by providing an extensively researched list of available datasets in Italian and comparing it with a similarly compiled list for French. In order to simulate a real-world practical scenario, we apply a number of representative methods to custom-tailored multilabel classification datasets in Italian, French, and English. We conclude by discussing results, future challenges, and research directions from a linguistically inclusive perspective.