Chapter

Relevance Feedback in Information Retrieval

... Among the low-level features, color features and texture features hold a dominant and significant position. In the literature, various studies have focused on either color features or texture features [11][12][13][14][15]. Each feature type has its own strengths and limitations; for instance, color features can be used to identify objects that have distinct and consistent color patterns. ...
... Additionally, the texture feature was derived by calculating the mean and standard deviation of all components of the wavelet transform. In [12], the authors introduce three image features for image retrieval. A feature selection technique is also presented to choose optimal features, aiming to maximize the detection rate and simplify the computation of image retrieval. ...
... With an average precision of 99.32%, our CBIR scheme achieves the highest accuracy among state-of-the-art CBIR systems. Specifically, our proposed AVR framework outperforms those of Jhanwar et al. [11], Lin et al. [12], ElAlami [18], Murala et al. [73], Hamroun [53], Yildizer et al. [15], Kundu et al. [19], Dubey et al. [16], [72], and [14]. These initial findings demonstrate that our AVR scheme achieves superior performance, correctly retrieving 20 images from the flower category. In contrast, the results for the CBIR system in reference [53] show retrieval of 17 images. ...
Article
Full-text available
Currently, there is a lack of theoretical studies that approach classification and machine learning from a general perspective, without focusing on specific applications. While these studies tackle real problems like combining low-level and high-level descriptors, they fail to offer a comprehensive approach that effectively addresses the combination issue in all cases. Consequently, several questions arise when implementing such an approach, including: (i) how to model the combination of information from low-level and high-level descriptors; and (ii) how to evaluate the method's robustness across different applications. As these questions remain open and challenging, this study aims to provide answers. In this paper, we propose a new framework called AVR ("Advancing Video Retrieval"), which consists of three subsystems and is based on an optimized combination of low-level and high-level descriptors to enhance data retrieval accuracy. Performance analysis demonstrates that our AVR system significantly improves accuracy metrics compared to existing systems. For instance, using the Corel dataset, AVR increases average accuracy from 79.5% to 98.12% for the Beach category and from 77.75% to 98.45% for the Mountain category, surpassing the performance of our previous ISE system [38]. Furthermore, AVR achieves an average accuracy of 99.32%, compared to 91.76% for ISE. Our system also outperforms our previous VINAS [40] and VISEN [120] systems in terms of concept detection. For instance, the Car_Racing concept's value increases from 0.03 to 0.45 with AVR, leading to improved search results on the TRECVID dataset and a 98.41% average accuracy, compared to VINAS (85%) and VISEN (88%).
... Perhaps one of the first generic text classifiers was proposed by Rocchio [5]; it works by generating object prototypes based on centroids of a Voronoi partition over TF-IDF vectors. This strategy reflects the effort to reduce the memory needed to fit the hardware available at that time. ...
... where T_i is defined as the identity function I and a set of related, mutually exclusive functions. We define the function ...
Preprint
A great variety of text tasks, such as topic or spam identification, user profiling, and sentiment analysis, can be posed as supervised learning problems and tackled using a text classifier. A text classifier consists of several subprocesses; some are general enough to be applied to any supervised learning problem, whereas others are specifically designed to tackle a particular task, using complex and computationally expensive processes such as lemmatization, syntactic analysis, etc. Contrary to traditional approaches, we propose a minimalistic and wide system able to tackle text classification tasks independently of domain and language, namely microTC. It is composed of easy-to-implement text transformations, text representations, and a supervised learning algorithm. These pieces produce a competitive classifier even in the domain of informally written text. We provide a detailed description of microTC along with an extensive experimental comparison against relevant state-of-the-art methods. microTC was compared on 30 different datasets. Regarding accuracy, microTC obtained the best performance on 20 datasets and achieves competitive results on the remaining 10. The compared datasets include several problems, like topic and polarity classification, spam detection, user profiling, and authorship attribution. Furthermore, it is important to state that our approach allows the usage of the technology even without knowledge of machine learning and natural language processing.
... Generally speaking, pseudo-relevance feedback refers to techniques that average top-retrieved documents to automatically expand an initial query. It has been studied widely in information retrieval, both to extend existing retrieval models [10,11,12,13] and as part of query expansion frameworks [14,15]. Specifically, Lavrenko and Croft [10] and Zhai and Lafferty [11] propose two methods, the Relevance Model and the Mixture Model, respectively, to include feedback information in the Kullback-Leibler (KL) divergence retrieval model [16]. ...
... After the expansion terms have been selected, we can proceed to re-weighting the query terms. A classical approach for this is Rocchio's algorithm [15] using Rocchio's Beta equation [33], given by: ...
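The equation itself is elided in the snippet above, but the update it refers to is the classic Rocchio formulation. Below is a minimal Python/NumPy sketch of that textbook form; the function name and the default weights alpha, beta, and gamma are illustrative choices, not the exact parameterization of the cited Beta equation [33].

```python
import numpy as np

def rocchio_update(query, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio re-weighting: move the query vector toward the
    centroid of relevant documents and away from non-relevant ones."""
    q_new = alpha * np.asarray(query, dtype=float)
    if len(relevant_docs) > 0:
        q_new = q_new + beta * np.mean(relevant_docs, axis=0)
    if len(nonrelevant_docs) > 0:
        q_new = q_new - gamma * np.mean(nonrelevant_docs, axis=0)
    # Negative term weights are usually clipped to zero in practice.
    return np.maximum(q_new, 0.0)
```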
Preprint
Media sharing applications, such as Flickr and Panoramio, contain a large amount of pictures related to real life events. For this reason, the development of effective methods to retrieve these pictures is important, but still a challenging task. Recognizing this importance, and to improve the retrieval effectiveness of tag-based event retrieval systems, we propose a new method to extract a set of geographical tag features from raw geo-spatial profiles of user tags. The main idea is to use these features to select the best expansion terms in a machine learning-based query expansion approach. Specifically, we apply rigorous statistical exploratory analysis of spatial point patterns to extract the geo-spatial features. We use the features both to summarize the spatial characteristics of the spatial distribution of a single term, and to determine the similarity between the spatial profiles of two terms -- i.e., term-to-term spatial similarity. To further improve our approach, we investigate the effect of combining our geo-spatial features with temporal features on choosing the expansion terms. To evaluate our method, we perform several experiments, including well-known feature analyses. Such analyses show how much our proposed geo-spatial features contribute to improve the overall retrieval performance. The results from our experiments demonstrate the effectiveness and viability of our method.
... Briefly, the library uses a scalable vectorization of documents through online Latent Semantic Analysis (LSA) [15]. For the recommendation part, it pairs the Rocchio algorithm [22] with a large-scale approximate nearest neighbor search based on ball trees [23]. The library aims at providing responsive content-based recommendations utilizing only a user's votes. ...
... The Rocchio algorithm is used to produce recommendations based on relevant and non-relevant documents previously voted on by the user [22]. The method can work with term frequency, tf-idf, or any other term weighting scheme. ...
Preprint
Full-text available
Finding relevant publications is important for scientists who have to cope with exponentially increasing amounts of scholarly material. Algorithms can help with this task as they do for music, movie, and product recommendations. However, we know little about the performance of these algorithms on scholarly material. Here, we develop an algorithm, and an accompanying Python library, that implements a recommendation system based on the content of articles. Design principles are to adapt to new content, provide near-real-time suggestions, and be open source. We tested the library on 15K posters from the Society for Neuroscience Conference 2015. Human-curated topics are used to cross-validate parameters in the algorithm and produce a similarity metric that maximally correlates with human judgments. We show that our algorithm significantly outperformed suggestions based on keywords. The work presented here promises to make the exploration of scholarly material faster and more accurate.
... These types of problems belong to the area of IR research known as Dynamic Information Retrieval (DIR), which we define as exhibiting three characteristics: user feedback, temporal dependency and an overall goal. In this paper we present DIR as a natural progression in IR research complexity; where early research concerned static problems such as ad hoc retrieval, which gave way to interactive tasks such as those incorporating relevance feedback [27], finally leading to dynamic systems where tasks such as ranking for session search are optimized [24]. ...
... With these features in mind, we extend the vector space example to the interactive scenario in Fig. 1b by introducing the Rocchio relevance feedback algorithm [27] for interactively re-ranking documents. Here, clicked documents in the PRP-ranked first stage are used as implicit signals of relevance to modify the user's original query Q1 to Q2. Document re-retrieval occurs using Q2, returning documents with updated relevance scores given by τ, which is a function of Q2 and thus of the original ranking aPRP, document relevancies r, and observations o. ...
Preprint
Theoretical frameworks like the Probability Ranking Principle and its more recent Interactive Information Retrieval variant have guided the development of ranking and retrieval algorithms for decades, yet they are not capable of helping us model problems in Dynamic Information Retrieval, which exhibit the following three properties: an observable user signal, retrieval over multiple stages, and an overall search intent. In this paper a new theoretical framework for retrieval in these scenarios is proposed. We derive a general dynamic utility function for optimizing over these types of tasks that takes into account the utility of each stage and the probability of observing user feedback. We apply our framework to experiments over TREC data in the dynamic multi-page search scenario as a practical demonstration of its effectiveness, to frame the discussion of its use and its limitations, and to compare it against the existing frameworks.
... Query expansion (Rocchio, 1971; Lavrenko and Croft, 2001) is a long-standing technique that rewrites the query based on pseudo-relevance feedback or external knowledge sources such as WordNet. For sparse retrieval, it can help bridge the lexical gap between the query and the documents. ...
... Both techniques aim to minimize the lexical gap between the query and the documents. Query expansion typically involves rewriting the query based on relevance feedback (Lavrenko and Croft, 2001; Rocchio, 1971) or lexical resources such as WordNet (Miller, 1992). In cases where labels are not available, the top-k retrieved documents can serve as pseudo-relevance feedback signals (Lv and Zhai, 2009). ...
Preprint
Full-text available
This paper introduces a simple yet effective query expansion approach, denoted as query2doc, to improve both sparse and dense retrieval systems. The proposed method first generates pseudo-documents by few-shot prompting large language models (LLMs), and then expands the query with generated pseudo-documents. LLMs are trained on web-scale text corpora and are adept at knowledge memorization. The pseudo-documents from LLMs often contain highly relevant information that can aid in query disambiguation and guide the retrievers. Experimental results demonstrate that query2doc boosts the performance of BM25 by 3% to 15% on ad-hoc IR datasets, such as MS-MARCO and TREC DL, without any model fine-tuning. Furthermore, our method also benefits state-of-the-art dense retrievers in terms of both in-domain and out-of-domain results.
... directed searching). The widely known example of interactive and directed searching is Relevance Feedback [208], which is also considered a query expansion method (as will be explained later). Ruthven [213] investigates existing user-interaction-based IR methods for query formulation and reformulation, complex query languages, clustering, categorization, automated search and assistance, and implicit feedback and evidence. ...
... The logic behind local analysis rests on the assumption that, since the retrieved documents were deemed relevant to the user's information need, the terms in these documents should also be relevant. Two main methods are used for local query expansion: (1) Relevance Feedback (RF) [208] and (2) Pseudo Relevance Feedback (PRF) [34]. The Relevance Model [125] was among the first proposed approaches for PRF, and it still delivers state-of-the-art PRF performance compared to other, more recent methods [32]. ...
Thesis
Full-text available
Despite the advantages of their low-resource settings, traditional sparse retrievers depend on exact matching between high-dimensional bag-of-words (BoW) representations of the queries and the collection. As a result, retrieval performance is restricted by semantic discrepancies and vocabulary gaps. On the other hand, transformer-based dense retrievers introduce significant improvements in information retrieval tasks by exploiting low-dimensional contextualized representations of the corpus. While dense retrievers are known for their relative effectiveness, they suffer from lower efficiency and a lack of generalization when compared to sparse retrievers. For a lightweight retrieval task, high computational resources and time consumption are major barriers encouraging the renunciation of dense models despite potential gains. In this work, I propose boosting the performance of sparse retrievers by expanding both the queries and the documents with linked entities, in two formats for the entity names: 1) explicit and 2) hashed. A zero-shot end-to-end dense entity linking system is employed for entity recognition and disambiguation to augment the corpus. By leveraging advanced entity linking methods, I believe that the effectiveness gap between sparse and dense retrievers can be narrowed. Experiments are conducted on the MS MARCO passage dataset using the original qrel set, the re-ranked qrels favoured by MonoT5, and the latter set further re-ranked by DuoT5. Since I am concerned with early-stage retrieval in cascaded ranking architectures of large information retrieval systems, the results are evaluated using recall@1000. The suggested approach is also capable of retrieving documents for query subsets judged to be particularly difficult in prior work. In addition, it is demonstrated that the non-expanded runs and the runs expanded with both explicit and hashed entities retrieve complementary results. Consequently, run combination methods such as run fusion and classifier selection are experimented with to maximize the benefits of entity linking. Due to the success of entity methods for sparse retrieval, the proposed approach is also tested on dense retrievers. The corresponding results are reported in MRR@10.
... The user template modification used Rocchio model iteration [6]. In the Rocchio update, Q_new = α·Q_orig + (β/n_1)·Σ_{d∈R} d − (γ/n_2)·Σ_{d∈NR} d, where n_1 is the number of related texts, n_2 is the number of unrelated texts, and β and γ weight the positive and negative feedback contributions. ...
Article
Full-text available
With the development of network information, information processing is vital in all aspects, and information filtering is an increasingly important research topic; Chinese information filtering, in particular, is an urgent matter. Building on domestic and international research, this article combines the vector space method with a hyB+ tree index method to filter text. Experimental results show that this method is feasible.
... A practical implementation of this idea is offered by tools like Dimension IMportance Estimators (DIMEs), which rank dimensions in the latent space by their importance and retain only the most relevant ones. DIMEs compute importance scores for each dimension based on a relevance feedback document, using methods such as Pseudo-Relevance Feedback (PRF) [33,39] and Large Language Models (LLMs) [10,15,29,31]. In the PRF-based DIME approach, the embeddings of the top-ranked documents (retrieved from the corpus using similarity measures like cosine similarity or the inner product) are averaged to form the pseudo-relevant document representation. ...
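As a rough illustration of the PRF averaging step just described, the sketch below (with illustrative names, not the DIME authors' code) scores documents by inner product with the query and averages the top-k embeddings:

```python
import numpy as np

def prf_centroid(query_emb, doc_embs, k=10):
    """Average the embeddings of the k documents most similar to the
    query (by inner product) into a pseudo-relevant representation."""
    scores = doc_embs @ query_emb        # one similarity score per document
    top_k = np.argsort(-scores)[:k]      # indices of the k highest scores
    return doc_embs[top_k].mean(axis=0)  # pseudo-relevant document vector
```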
Preprint
Recent advances in Information Retrieval have leveraged high-dimensional embedding spaces to improve the retrieval of relevant documents. Moreover, the Manifold Clustering Hypothesis suggests that despite these high-dimensional representations, documents relevant to a query reside on a lower-dimensional, query-dependent manifold. While this hypothesis has inspired new retrieval methods, existing approaches still face challenges in effectively separating non-relevant information from relevant signals. We propose a novel methodology that addresses these limitations by leveraging information from both relevant and non-relevant documents. Our method, ECLIPSE, computes a centroid based on irrelevant documents as a reference to estimate noisy dimensions present in relevant ones, enhancing retrieval performance. Extensive experiments on three in-domain benchmarks and one out-of-domain benchmark demonstrate an average improvement of up to 19.50% (resp. 22.35%) in mAP (AP) and 11.42% (resp. 13.10%) in nDCG@10 w.r.t. the DIME-based baseline (resp. the baseline using all dimensions). Our results pave the way for more robust, pseudo-irrelevance-based retrieval systems in future IR research.
... Query suggestion, instead, is an interactive approach that proposes different queries to the user to refine the search. Those techniques base their reformulations either on the initial search results, as happens in Rocchio's algorithm [23] with its Pseudo Relevance Feedback (PRF), or on external knowledge sources, such as thesauri, ontologies, or large-scale language models. In the case of PRF, the system uses feedback from the initial set of retrieved documents to automatically adjust the query, while knowledge-based approaches leverage pre-defined relationships between terms to suggest more effective formulations. ...
Preprint
Full-text available
Query suggestion, a technique widely adopted in information retrieval, enhances system interactivity and the browsing experience of document collections. In cross-modal retrieval, many works have focused on retrieving relevant items from natural language queries, while few have explored query suggestion solutions. In this work, we address query suggestion in cross-modal retrieval, introducing a novel task that focuses on suggesting minimal textual modifications needed to explore visually consistent subsets of the collection, following the premise of ''Maybe you are looking for''. To facilitate the evaluation and development of methods, we present a tailored benchmark named CroQS. This dataset comprises initial queries, grouped result sets, and human-defined suggested queries for each group. We establish dedicated metrics to rigorously evaluate the performance of various methods on this task, measuring representativeness, cluster specificity, and similarity of the suggested queries to the original ones. Baseline methods from related fields, such as image captioning and content summarization, are adapted for this task to provide reference performance scores. Although relatively far from human performance, our experiments reveal that both LLM-based and captioning-based methods achieve competitive results on CroQS, improving the recall on cluster specificity by more than 115% and representativeness mAP by more than 52% with respect to the initial query. The dataset, the implementation of the baseline methods and the notebooks containing our experiments are available here: https://paciosoft.com/CroQS-benchmark/
... Typically, relevance feedback can be gathered explicitly by asking the user to judge documents, implicitly using logs, or automatically from pseudo-relevant documents [9]. This feedback is then used to expand the query, most commonly using Rocchio's algorithm [22] or [18]. The difficulty in query expansion lies in selecting appropriate terms to expand the original query. ...
Preprint
Linking entities like people, organizations, books, music groups and their songs in text to knowledge bases (KBs) is a fundamental task for many downstream search and mining applications. Achieving high disambiguation accuracy crucially depends on a rich and holistic representation of the entities in the KB. For popular entities, such a representation can be easily mined from Wikipedia, and many current entity disambiguation and linking methods make use of this fact. However, Wikipedia does not contain long-tail entities that only few people are interested in, and also at times lags behind until newly emerging entities are added. For such entities, mining a suitable representation in a fully automated fashion is very difficult, resulting in poor linking accuracy. What can automatically be mined, though, is a high-quality representation given the context of a new entity occurring in any text. Due to the lack of knowledge about the entity, no method can retrieve these occurrences automatically with high precision, resulting in a chicken-egg problem. To address this, our approach automatically generates candidate occurrences of entities, prompting the user for feedback to decide if the occurrence refers to the actual entity in question. This feedback gradually improves the knowledge and allows our methods to provide better candidate suggestions to keep the user engaged. We propose novel human-in-the-loop retrieval methods for generating candidates based on gradient interleaving of diversification and textual relevance approaches. We conducted extensive experiments on the FACC dataset, showing that our approaches convincingly outperform carefully selected baselines in both intrinsic and extrinsic measures while keeping users engaged.
... Similar to Li et al. (2016), we assign a given concept to a target category using Rocchio classification (Rocchio, 1971), where the centroid of each category is set to the category's corresponding embedding vector. Formally, given a set of n candidate concept categories G = {g_1, ..., g_n}, an instance concept c, an embedding function f, and a similarity function Sim, c is assigned to the i-th category g_i such that g_i = argmax_i Sim(f(g_i), f(c)). ...
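The assignment rule above is plain nearest-centroid classification. A minimal sketch, assuming cosine similarity for Sim and pre-computed embedding vectors standing in for f(g_i) and f(c):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity, a common choice for Sim."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def assign_category(concept_vec, category_vecs):
    """Rocchio/nearest-centroid assignment: g_i = argmax_i Sim(f(g_i), f(c))."""
    sims = [cosine_sim(g, concept_vec) for g in category_vecs]
    return int(np.argmax(sims))
```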
Preprint
Text representations using neural word embeddings have proven effective in many NLP applications. Recent researches adapt the traditional word embedding models to learn vectors of multiword expressions (concepts/entities). However, these methods are limited to textual knowledge bases (e.g., Wikipedia). In this paper, we propose a novel and simple technique for integrating the knowledge about concepts from two large scale knowledge bases of different structure (Wikipedia and Probase) in order to learn concept representations. We adapt the efficient skip-gram model to seamlessly learn from the knowledge in Wikipedia text and Probase concept graph. We evaluate our concept embedding models on two tasks: (1) analogical reasoning, where we achieve a state-of-the-art performance of 91% on semantic analogies, (2) concept categorization, where we achieve a state-of-the-art performance on two benchmark datasets achieving categorization accuracy of 100% on one and 98% on the other. Additionally, we present a case study to evaluate our model on unsupervised argument type identification for neural semantic parsing. We demonstrate the competitive accuracy of our unsupervised method and its ability to better generalize to out of vocabulary entity mentions compared to the tedious and error prone methods which depend on gazetteers and regular expressions.
... The query refinement could be in the form of re-weighting the query terms or automatically expanding the query with new terms. Rocchio [21] is widely considered to be the first formalization of the relevance feedback technique, developed on the vector space model. He proposes query refinement based on the difference between the average vector of the relevant documents and the average vector of the non-relevant documents. ...
Preprint
One key challenge in talent search is to translate complex criteria of a hiring position into a search query, while it is relatively easy for a searcher to list examples of suitable candidates for a given position. To improve search efficiency, we propose the next generation of talent search at LinkedIn, also referred to as Search By Ideal Candidates. In this system, a searcher provides one or several ideal candidates as the input to hire for a given position. The system then generates a query based on the ideal candidates and uses it to retrieve and rank results. Shifting from the traditional Query-By-Keyword to this new Query-By-Example system poses a number of challenges: How to generate a query that best describes the candidates? When moving to a completely different paradigm, how does one leverage previous product logs to learn ranking models and/or evaluate the new system with no existing usage logs? Finally, given the different nature between the two search paradigms, the ranking features typically used for Query-By-Keyword systems might not be optimal for Query-By-Example. This paper describes our approach to solving these challenges. We present experimental results confirming the effectiveness of the proposed solution, particularly on query building and search ranking tasks. As of writing this paper, the new system has been available to all LinkedIn members.
... The field of information retrieval has long-standing experience with using (pseudo-)relevance feedback in the retrieval process [8]. However, explicit non-relevance information has been shown to be more difficult to incorporate. ...
Preprint
In this paper, we reflect on ways to improve the quality of bio-medical information retrieval by drawing implicit negative feedback from negated information in noisy natural language search queries. We begin by studying the extent to which negations occur in clinical texts and quantify their detrimental effect on retrieval performance. Subsequently, we present a number of query reformulation and ranking approaches that remedy these shortcomings by resolving natural language negations. Our experimental results are based on data collected in the course of the TREC Clinical Decision Support Track and show consistent improvements compared to state-of-the-art methods. Using our novel algorithms, we are able to reduce the negative impact of negations on early precision by up to 65%.
... One of the remedies for the vocabulary problem is query expansion, which aims to enrich the query by adding more relevant words to it. The earliest mentions of such methods were in the seminal works of Maron and Kuhns [20] and Rocchio [21]. ...
Preprint
Full-text available
CQA services are valuable sources of knowledge that can be used to find answers to users' information needs. In these services, question retrieval aims to help users with their information needs by finding questions similar to theirs. However, finding similar questions is obstructed by the lexical gap that exists between relevant questions. In this work, we target this problem by using query expansion methods. We use word-similarity-based methods, propose a question-similarity-based method, and apply selective expansion of these methods to expand a submitted question and mitigate the lexical gap problem. Our best method achieves a significant relative improvement of 1.8% compared to the best-performing baseline without query expansion.
... Our method can be compared to other solutions to test-time adaptation, a problem that has been well-studied across a variety of domains (Jang et al., 2023). In retrieval, one form of test-time adaptation is pseudo-relevance feedback (PRF) (Rocchio, 1971; Li et al., 2018), where documents relevant to the query are used to construct a final, enhanced query representation. The query side of our model can be seen as a form of pseudo-relevance feedback; however, we train from scratch to support a more general form of PRF natively, on the document representation as well as the query. ...
Preprint
Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.
... Query expansion and pseudo relevance feedback offer a partial remedy but come with a certain latency overhead [8]. KL expansion [47], Rocchio [33], Relevance Modelling [25], and RM3 [1] are popular pseudo relevance feedback methods. The efficiency of lexical methods is exploited by using them as a first-stage document ranker that produces a shortlisted input for intricate dense retrieval and large language model based systems [18]. ...
Preprint
Sparse retrieval methods like BM25 are based on lexical overlap, focusing on the surface form of the terms that appear in the query and the document. The use of inverted indices in these methods leads to high retrieval efficiency. On the other hand, dense retrieval methods are based on learned dense vectors and, consequently, are effective but comparatively slow. Since sparse and dense methods approach problems differently and use complementary relevance signals, approximation methods were proposed to balance effectiveness and efficiency. For efficiency, approximation methods like HNSW are frequently used to approximate exhaustive dense retrieval. However, approximation techniques still exhibit considerably higher latency than sparse approaches. We propose LexBoost, which first builds a network of dense neighbors (a corpus graph) using a dense retrieval approach while indexing. Then, during retrieval, we consider both a document's lexical relevance scores and its neighbors' scores to rank the documents. In LexBoost, this remarkably simple application of the Cluster Hypothesis contributes to stronger ranking effectiveness while adding little computational overhead (since the corpus graph is constructed offline). The method is robust across the number of neighbors considered, various fusion parameters for determining the scores, and different dataset construction methods. We also show that re-ranking on top of LexBoost outperforms traditional dense re-ranking and leads to results comparable with higher-latency exhaustive dense retrieval.
... Carmel et al. [4] examine an idealized term-based pruning algorithm. The score of a term for a document is computed from term dependencies of text relevance, i.e., a TF-IDF [37] scoring function. The pruned index has been shown, to some extent, to provide results that are almost as good as search results derived from the full index. ...
... Responses update the latent estimation with a variant of the well-known Rocchio algorithm [50], chosen as the simplest vector interpolation. That is, the latent vectors of images corresponding to target classifications update the GAN estimate, so the generative output captures the group's opinion. ...
Article
Full-text available
Generative models are powerful tools for producing novel information by learning from example data. However, the current approaches require explicit manual input to steer generative models to match human goals. Furthermore, how these models would integrate implicit, diverse feedback and goals of multiple users remains largely unexplored. Here, we present a first-of-its-kind system that produces novel images of faces by inferring human goals directly from cross-subject brain signals while study subjects are looking at example images. We report on an experiment where brain responses to images of faces were recorded using electroencephalography in 30 subjects, focusing on specific salient visual features (VFs). Preferences toward VFs were decoded from subjects’ brain responses and used as implicit feedback for a generative adversarial network (GAN), which generated new images of faces. The results from a follow-up user study evaluating the presence of the target salient VFs show that the images generated from brain feedback represent the goal of the study subjects and are comparable to images generated with manual feedback. The methodology provides a stepping stone toward humans-in-the-loop image generation.
... In order to do so, LR is trained on a dataset of abstracts in a binary classification setup: 1 if the first author of the abstract is r and 0 otherwise. Then, in a fashion similar to the Rocchio algorithm [27], the final embedding of the abstract a is the average of the embeddings of the sentences in a, positively or negatively weighted by the LR predictions. The LR predictions are therefore used to estimate the relevance of the sentences in the average. ...
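One plausible reading of that weighted average is sketched below, under the assumption that the LR probabilities are centered at 0.5 so that low-probability sentences contribute negatively; the paper's exact weighting may differ, and all names here are illustrative.

```python
import numpy as np

def abstract_embedding(sentence_embs, lr_probs):
    """Rocchio-like combination: average sentence embeddings, each weighted
    by its LR relevance estimate shifted to [-0.5, 0.5] (assumed centering)."""
    weights = np.asarray(lr_probs, dtype=float) - 0.5
    embs = np.asarray(sentence_embs, dtype=float)
    return (weights[:, None] * embs).mean(axis=0)
```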
Article
Full-text available
The more science advances, the more questions are asked. This compounding growth can make it difficult to keep up with current research directions. Furthermore, this difficulty is exacerbated for junior researchers who enter fields with already large bases of potentially fruitful research avenues. In this paper, we propose a novel task and a recommender system for research directions, RecSOI, that draws from statements of ignorance (SOIs) found in the research literature. By building researchers’ profiles based on textual elements, RecSOI generates personalized recommendations of potential research directions tailored to their interests. In addition, RecSOI provides context for the recommended SOIs, so that users can quickly evaluate how relevant the research direction is for them. In this paper, we provide an overview of RecSOI’s functioning, implementation, and evaluation, demonstrating its effectiveness in guiding researchers through the vast landscape of potential research directions.
... We propose to generate QAC candidates from PRF [1,16,22]. The classic PRF paradigm assumes that the set of top-k ranked documents (emails) are relevant to the query. ...
Chapter
Full-text available
Traditional query auto-completion (QAC) relies heavily on search logs collected over many users. However, in on-device email search, the scarcity of logs and the governing privacy constraints make QAC a challenging task. In this work, we propose an on-device QAC method that runs directly on users’ devices, where users’ sensitive data and interaction logs are not collected, shared, or aggregated through web services. This method retrieves candidates using pseudo relevance feedback, and ranks them based on relevance signals that explore the textual and structural information from users’ emails. We also propose a private corpora based evaluation method, and empirically demonstrate the effectiveness of our proposed method.
... If the item description is a q-dimensional VSM using TF-IDF weights, the user profile may be a q-dimensional vector of weights where each entry characterizes the user's value for that keyword (Adomavicius and Tuzhilin 2005b). An averaging approach such as the Rocchio algorithm (Rocchio 1971) can be used to compute the user profile (Lang 1995). Ratings are predicted for a user-item pair by evaluating a similarity score, such as cosine similarity, on the relevant user and item vectors. ...
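A minimal sketch of this profile-building and scoring scheme, assuming the TF-IDF item vectors are already computed (function names are illustrative):

```python
import numpy as np

def user_profile(liked_item_vectors):
    """Rocchio-style profile: average of the TF-IDF vectors of items
    the user rated positively."""
    return np.mean(liked_item_vectors, axis=0)

def predicted_preference(profile, item_vector):
    """Predict a rating for a user-item pair via cosine similarity."""
    denom = np.linalg.norm(profile) * np.linalg.norm(item_vector) + 1e-12
    return float(profile @ item_vector) / denom
```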
... Query expansion approaches have been adopted to bridge this gap. These approaches typically involve the addition of terms to the original query based on relevance feedback [22,37]. When user relevance feedback is unavailable, a pseudo-relevance feedback mechanism is applied [10,28]. ...
Preprint
Full-text available
Query rewriting refers to an established family of approaches that are applied to underspecified and ambiguous queries to overcome the vocabulary mismatch problem in document ranking. Queries are typically rewritten during query processing time for better query modelling for the downstream ranker. With the advent of large language models (LLMs), there have been initial investigations into using generative approaches to generate pseudo documents to tackle this inherent vocabulary gap. In this work, we analyze the utility of LLMs for improved query rewriting for text ranking tasks. We find that there are two inherent limitations of using LLMs as query re-writers: concept drift when using only queries as prompts, and large inference costs during query processing. We adopt a simple, yet surprisingly effective, approach called context-aware query rewriting (CAR) to leverage the benefits of LLMs for query understanding. First, we rewrite ambiguous training queries by context-aware prompting of LLMs, where we use only relevant documents as context. Unlike existing approaches, we use LLM-based query rewriting only during the training phase. A ranker is then fine-tuned on the rewritten queries instead of the original queries. In our extensive experiments, we find that fine-tuning a ranker on rewritten queries offers a significant improvement of up to 33% on the passage ranking task and up to 28% on the document ranking task when compared to the baseline performance of using original queries.
Article
Full-text available
With the popularity of the Internet, e-mail, with its speed and convenience, has gradually developed into one of the most important communication tools in people's lives. However, the accompanying problem of spam is increasingly severe: spam not only disseminates harmful information but also wastes public resources. To solve this problem, the author proposes a mail filtering algorithm based on feedback-correction probability learning. The feedback-correction probability training requires little feedback learning data and uses error-driven training to achieve a high classification effect. Experiments also tested this idea.
Chapter
The field of biomedical research generates vast amounts of data from various sources, including electronic health records, scientific publications, clinical trials, and experimental studies. This data provides valuable insights into the underlying mechanisms of diseases and their treatments. However, the sheer volume and complexity of biomedical data pose significant challenges in retrieving and analyzing relevant information. This chapter provides an overview of information retrieval with a focus on retrieving biomedical data. It begins by introducing information retrieval (IR) and highlighting its significance. Further, it discusses the role of query expansion in improving the performance of an information retrieval system (IRS). The insights gained from this chapter can help researchers design and develop more effective IR systems that can unlock the full potential of biomedical data and accelerate progress in biomedical research.
Chapter
Pseudo-relevance feedback (PRF) can enhance average retrieval effectiveness over a sufficiently large number of queries. However, PRF often introduces a drift into the original information need, thus hurting the retrieval effectiveness of several queries. While a selective application of PRF can potentially alleviate this issue, previous approaches have largely relied on unsupervised or feature-based learning to determine whether a query should be expanded. In contrast, we revisit the problem of selective PRF from a deep learning perspective, presenting a model that is entirely data-driven and trained in an end-to-end manner. The proposed model leverages a transformer-based bi-encoder architecture. Additionally, to further improve retrieval effectiveness with this selective PRF approach, we make use of the model’s confidence estimates to combine the information from the original and expanded queries. In our experiments, we apply this selective feedback on a number of different combinations of ranking and feedback models, and show that our proposed approach consistently improves retrieval effectiveness for both sparse and dense ranking models, with the feedback models being either sparse, dense or generative.
Chapter
We investigate the integration of Large Language Models (LLMs) into query encoders to improve dense retrieval without increasing latency and cost, by circumventing the dependency on LLMs at inference time. SoftQE incorporates knowledge from LLMs by mapping embeddings of input queries to those of the LLM-expanded queries. While improvements over various strong baselines on in-domain MS-MARCO metrics are marginal, SoftQE improves performance by 2.83 absolute percentage points on average on five out-of-domain BEIR tasks.
Article
Full-text available
We live in a “search engine society”. Underlying this self-description of post-modern society is the crucial dependency of social memory on archives. Apart from moral and legal concerns, search engines are a sociologically intriguing subject because of their close connection with the evolution of social memory. In this contribution I argue that search engines are non-semantic indexing systems which turn the circular interplay between users and the machine into a cybernetic system. The main function of this cybernetic system is to minimize the deviation from a difference, that between relevant and not-relevant. Through mechanical archives, post-modern social memory can cope with increasing knowledge complexity. The main challenge in this respect is how to preserve the capability of discarding in order to produce information.
Article
The Relevance Feedback (RF) process relies on accurate and real-time relevance estimation of feedback documents to improve retrieval performance. Since collecting explicit relevance annotations imposes an extra burden on the user, extensive studies have explored using pseudo-relevance signals and implicit feedback signals as substitutes. However, such signals are indirect indicators of relevance and suffer from complex search scenarios where user interactions are absent or biased. Recently, advances in portable and high-precision brain-computer interface (BCI) devices have shown the possibility of monitoring users' brain activities during the search process. Brain signals can directly reflect users' psychological responses to search results, and thus they can act as additional, unbiased RF signals. To explore the effectiveness of brain signals in the context of RF, we propose a novel RF framework that combines BCI-based relevance feedback with pseudo-relevance signals and implicit signals to improve the performance of document re-ranking. The experimental results on the user study dataset show that incorporating brain signals leads to significant performance improvement in our RF framework. Besides, we observe that brain signals perform particularly well in several hard search scenarios, especially when implicit feedback signals are missing or noisy. This reveals when and how to exploit brain signals in the context of RF.
Article
As image datasets become ubiquitous, the problem of ad-hoc searches over image data is increasingly important. Many high-level data tasks in machine learning, such as constructing datasets for training and testing object detectors, imply finding ad-hoc objects or scenes within large image datasets as a key sub-problem. New foundational visual-semantic embeddings trained on massive web datasets, such as Contrastive Language-Image Pre-Training (CLIP), can help users start searches on their own data, but we find there is a long tail of queries where these models fall short in practice. Seesaw is a system for interactive ad-hoc searches on image datasets that integrates state-of-the-art embeddings like CLIP with user feedback in the form of box annotations to help users quickly locate images of interest in their data, even in the long tail of harder queries. One key challenge for Seesaw is that, in practice, many sensible approaches to incorporating feedback into future results, including state-of-the-art active-learning algorithms, can worsen results compared to introducing no feedback, partly due to CLIP's high average performance. Therefore, Seesaw includes several algorithms that empirically result in larger and also more consistent improvements. We compare Seesaw's accuracy to both using CLIP alone and to a state-of-the-art active-learning baseline and find Seesaw consistently helps improve results for users across four datasets and more than a thousand queries. Seesaw increases Average Precision (AP) on search tasks by an average of .08 on a wide benchmark (from a base of .72), and by .27 on a subset of more difficult queries where CLIP alone performs poorly.
Article
Personalized Query Expansion, the task of expanding queries with additional terms extracted from the user-related vocabulary, is a well-known solution to improve the retrieval performance of a system w.r.t. short queries. Recent approaches rely on word embeddings to select expansion terms from user-related texts. Although promising results have been delivered with former word embedding techniques, we argue that these methods are not suited for contextual word embeddings, which produce a unique vector representation for each term occurrence. In this article, we propose a Personalized Query Expansion method designed to solve the issues arising from the use of contextual word embeddings with the current Personalized Query Expansion approaches based on word embeddings. Specifically, we employ a clustering-based procedure to identify the terms that better represent the user interests and to improve the diversity of those selected for expansion, achieving improvements of up to 4% w.r.t. the best-performing baseline in terms of MAP@100. Moreover, our approach outperforms previous ones in terms of efficiency, allowing us to achieve sub-millisecond expansion times even in data-rich scenarios. Finally, we introduce a novel metric to evaluate the expansion terms’ diversity and empirically show the unsuitability of previous approaches based on word embeddings when employed along with contextual word embeddings, which cause the selection of semantically overlapping expansion terms.