Conference Paper · PDF available

Expected reciprocal rank for graded relevance

Authors: Olivier Chapelle, Donald Metzler, Ya Zhang, Pierre Grinspan

Abstract

While numerous metrics for information retrieval are available in the case of binary relevance, there is only one commonly used metric for graded relevance, namely the Discounted Cumulative Gain (DCG). A drawback of DCG is its additive nature and the underlying independence assumption: a document in a given position always has the same gain and discount, independently of the documents shown above it. Inspired by the "cascade" user model, we present a new editorial metric for graded relevance which overcomes this difficulty and implicitly discounts documents which are shown below very relevant documents. More precisely, this new metric is defined as the expected reciprocal length of time that the user will take to find a relevant document. This can be seen as an extension of the classical reciprocal rank to the graded relevance case, and we call this metric Expected Reciprocal Rank (ERR). We conduct an extensive evaluation on the query logs of a commercial search engine and show that ERR correlates better with click metrics than other editorial metrics.
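The abstract does not spell the metric out, so here is a minimal sketch of ERR as defined in the paper, assuming integer grades on a 0..g_max scale (e.g., 0-4 for a five-point judgment scale) and the paper's mapping from grade g to stopping probability (2^g - 1) / 2^g_max:

    def err(grades, g_max=4):
        """Expected Reciprocal Rank for one ranked list of graded judgments."""
        p_reach = 1.0                       # probability the user reaches the current rank
        score = 0.0
        for rank, g in enumerate(grades, start=1):
            r = (2 ** g - 1) / 2 ** g_max   # P(user is satisfied | grade g)
            score += p_reach * r / rank     # user stops here after `rank` steps
            p_reach *= 1.0 - r              # otherwise the cascade continues
        return score

    print(err([4, 4, 4]))  # ~0.97: a very relevant document at rank 1 dominates
    print(err([0, 0, 4]))  # ~0.31: the same document is worth far less at rank 3

Note how, unlike DCG, the contribution of rank 3 in the first call is tiny: it is discounted by the probability that the user already stopped at ranks 1 or 2.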
... Normalized discounted cumulative gain (NDCG) [59] is another known metric, which we expand upon below. Other common metrics such as expected reciprocal rank (ERR) [60] consider the user's satisfaction with a ranked list and are irrelevant in our context. ...
... The second evaluation metric, presented in Sect. 5, relates additionally to the relevance scores from which the ranking is derived, and is based on the MRR measure [60]. The choice to consider two distinct metrics follows the fact that there are two main approaches that assess ranked lists in the IR literature. ...
... This measure is inspired by the reciprocal rank measure [60], which is commonly used in information retrieval literature. We begin with explaining the original measure and why the fit is not straightforward, and then proceed to present our enhancements. ...
Article
Full-text available
The problem of attacks on neural networks through input modification (i.e., adversarial examples) has attracted much attention recently. Being relatively easy to generate and hard to detect, these attacks pose a security breach that many suggested defenses try to mitigate. However, the evaluation of the effect of attacks and defenses commonly relies on traditional classification metrics, without adequate adaptation to adversarial scenarios. Most of these metrics are accuracy-based and therefore may have a limited scope and low distinctive power. Other metrics do not consider the unique characteristics of neural network functionality or measure the effectiveness of the attacks indirectly (e.g., through the complexity of their generation). In this paper, we present two metrics that are specifically designed to measure the effect of attacks, or the recovery effect of defenses, on the output of neural networks in multiclass classification tasks. Inspired by the normalized discounted cumulative gain and the reciprocal rank metrics used in information retrieval literature, we treat the neural network predictions as ranked lists of results. Using additional information about the probability of the rank enabled us to define novel metrics that are suited to the task at hand. We evaluate our metrics using various attacks and defenses on a pre-trained VGG19 model and the ImageNet dataset. Compared to the common classification metrics, our proposed metrics demonstrate superior informativeness and distinctiveness.
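The abstract above does not give the exact definitions of the two proposed metrics; as a minimal, hypothetical illustration of the underlying idea (treating a softmax output as a ranked list), one can score how far an attack demotes the true class:

    import numpy as np

    def true_class_reciprocal_rank(probs, true_label):
        """Reciprocal rank of the true class in the model's ranked predictions."""
        ranking = np.argsort(probs)[::-1]   # class indices, best first
        rank = int(np.where(ranking == true_label)[0][0]) + 1
        return 1.0 / rank

    clean    = np.array([0.70, 0.20, 0.05, 0.05])   # true class 0 ranked first
    attacked = np.array([0.10, 0.60, 0.25, 0.05])   # an attack demotes it to rank 3
    print(true_class_reciprocal_rank(clean, 0))     # 1.0
    print(true_class_reciprocal_rank(attacked, 0))  # 0.333...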
... Multi-aspect evaluation measures for IR have been studied for different tasks and aspects, starting from the INEX initiative with relevance and coverage [30]. Since then, measures have been proposed to evaluate relevance and novelty or diversity, such as α-nDCG [16], MAP-IA [1] and IA-ERR [10]; relevance, novelty and the amount of user effort, such as nCT [42]; relevance, redundancy and user effort, such as RBU [5]; relevance and understandability, such as uRBP [49] and the Multidimensional Measure (MM) framework [37]; and relevance and credibility, such as NLRE, NGRE, nWCS, Convex Aggregating Measure (CAM) and WHAM [35]. All these measures have limitations; we describe these next. ...
... This might be a problem for gain-based measures, thus a possible solution is to use TOMA to define the ideal ranking and then use effectiveness measures based on similarity to ideal rankings [17][18][19]. Finally, the empirical impact of varying both distance and weight functions should also be investigated, as should the impact of employing further multi-graded measures such as ERR [10], and the alignment of our current approach with real user preferences. ...
Preprint
Full-text available
Information Retrieval evaluation has traditionally focused on defining principled ways of assessing the relevance of a ranked list of documents with respect to a query. Several methods extend this type of evaluation beyond relevance, making it possible to evaluate different aspects of a document ranking (e.g., relevance, usefulness, or credibility) using a single measure (multi-aspect evaluation). However, these methods either are (i) tailor-made for specific aspects and do not extend to other types or numbers of aspects, or (ii) have theoretical anomalies, e.g. assign maximum score to a ranking where all documents are labelled with the lowest grade with respect to all aspects (e.g., not relevant, not credible, etc.). We present a theoretically principled multi-aspect evaluation method that can be used for any number, and any type, of aspects. A thorough empirical evaluation using up to 5 aspects and a total of 425 runs officially submitted to 10 TREC tracks shows that our method is more discriminative than the state-of-the-art and overcomes theoretical limitations of the state-of-the-art.
... More generally, most IR metrics have a corresponding user browsing model, which hypothesizes the way in which users interact with each SERP, and the subconscious process they follow as they consume SERPs and assess usefulness, the attribute that we are trying to measure. Thus, one way in which IR effectiveness metrics have been studied is via the development of user browsing models of increasing sophistication [2], [7], [9], [26], [27], [48], [50]. Each such model maps a categorical SERP to a numeric assessment of that SERP's value on the real number line, usually between 0.0 and 1.0 inclusive, often in units of "expected utility gained per document inspected", using the corresponding browsing model as a guide to the manner in which the user consumes, and ends their consumption of, the SERP. ...
... The second mapping then takes an n- or k-vector of numeric gain values, combines them in a way that discounts gains as ranks increase, and generates a single numeric score. The metrics discounted cumulative gain (DCG) and normalized discounted cumulative gain (NDCG) [19] make quite deliberate use of real-valued document gains, as do RBP [25] and expected reciprocal rank (ERR) [9], with the goal of providing more nuanced effectiveness measurements, and hence the ability to respond with more sensitivity to perceived differences in SERP usefulness [39]. Average precision can also be broadened to make use of graded document relevance categories [12], [29]. ...
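For concreteness, a sketch of DCG/NDCG with graded, real-valued gains (the standard Järvelin & Kekäläinen formulation, using the common exponential gain 2^g - 1):

    from math import log2

    def dcg(grades):
        """Discounted cumulative gain with exponential gains 2^g - 1."""
        return sum((2 ** g - 1) / log2(rank + 1)
                   for rank, g in enumerate(grades, start=1))

    def ndcg(grades):
        """DCG normalized by the DCG of the ideal (sorted) ordering."""
        ideal = dcg(sorted(grades, reverse=True))
        return dcg(grades) / ideal if ideal > 0 else 0.0

    print(ndcg([1, 3, 2, 0]))   # < 1.0: the best documents are not ranked first
    print(ndcg([3, 2, 1, 0]))   # 1.0: ideal ordering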
Article
Full-text available
A sequence of recent papers, including in this journal, has considered the role of measurement scales in information retrieval (IR) experimentation, and presented the argument that (only) uniform-step interval scales should be used. Hence, it has been argued, well-known metrics such as reciprocal rank, expected reciprocal rank, normalized discounted cumulative gain, and average precision, should be either discarded as measurement tools, or adapted so that their metric values lie at uniformly-spaced points on the number line. These papers paint a rather bleak picture of past decades of IR evaluation, at odds with the IR community’s overall emphasis on practical experimentation and measurable improvement. Our purpose in this work is to challenge that pessimistic assessment. In particular, we argue that mappings from categorical and ordinal data to sets of points on the number line are valid provided there is an external reason for each target point to have been selected. We first consider the general role of measurement scales, and of categorical, ordinal, interval, ratio, and absolute data collections. In connection with the first two of those categories we also provide examples of the knowledge that is captured and represented by numeric mappings to the real number line. Focusing then on information retrieval, we argue that document rankings are categorical data, and that the role of an effectiveness metric is to provide a single value that summarizes the usefulness to a user or population of users of any given ranking, with usefulness able to be represented as a continuous variable on a ratio scale. That is, we argue that most current IR metrics are well-founded, and, moreover, that those metrics are more meaningful in their current form than in the proposed “intervalized” versions.
... Given the train dataset, to select the best-performing model, we have applied a 5-fold cross-validation approach to evaluate different ranking models. The results are reported in Table 5 in terms of Normalized Discounted Cumulative Gain (NDCG) [47] and Expected Reciprocal Rank (ERR) [48]. Based on the results, we selected LambdaMART for the rest of the experiments and use it to report the results of our explanation model. ...
Preprint
Full-text available
Twitter List recommender systems have the ability to generate accurate recommendations, but since they utilize heterogeneous user and List information on Twitter and usually apply complex hybrid prediction models, they cannot provide user-friendly intrinsic explanations. In this paper, we propose an explanation model to provide post-hoc explanations for recommended Twitter Lists based on the user's own actions, which consequently helps improve recommendation acceptance by end users. The proposed model includes two main components: (1) candidate explanation generation, in which the most semantically related actions of a user on Twitter to the recommended List are retrieved as candidate explanations; and (2) explanation ranking, to re-rank candidates based on relatedness to the List and their informativeness. Through experiments on a real-world Twitter dataset, we demonstrate that the proposed explanation model can effectively generate related, informative and useful post-hoc explanations for the recommended Lists to users, while maintaining parity in recommendation performance.
... This work validates the ranking quality of the generated RL using the NDCG@t (Järvelin & Kekäläinen, 2002) (Normalized Discounted Cumulative Gain) and MAP@t (Chapelle et al., 2009) (Mean Average Precision) measures as described in Eq. (10) and Eq. (11). ...
Article
A commercially viable multi-stakeholder recommendation system maximizes the utility gain by learning the personalized preferences of multiple stakeholders, such as consumers and producers. Existing multi-stakeholder studies rely on a consumer-item interaction matrix to evaluate the producers' preferences and utility gain. However, these methods result in a negligible boost in producers' utility, as consumer-item interaction provides only limited insight into producers' preferences. Instead, an independent producer-item interaction matrix may better represent the needs and interests of producers. Deep neural networks have recently achieved encouraging results in recommendation by estimating user preferences and learning non-linear user-item features. A multi-stakeholder recommendation system may employ this strength of deep neural networks to combine consumer-producer preferences and generate the optimal estimate of their common interest. Hence this study proposes a deep neural network-based multi-stakeholder recommendation model for aggregating consumer and producer preferences. Next, a multi-criteria rating-based interaction matrix is proposed to learn producers' preferences over items. Further, we train a deep neural network to generate the cumulative preference matrix by learning and aggregating the preferences of consumer and producer stakeholders. This work performs extensive experiments over the MovieLens-100K and 1M datasets with numerous activation functions, hidden-layer configurations, and optimizers. The prediction accuracy, ranking, and utility gain-based evaluation results validate the success of the proposed model, both in developing a multi-criteria matrix for producers and in deep neural network-based multi-stakeholder preference aggregation, over the baseline models.
... Two rank aggregation methods are provided, RankAggreg [50] and RobustRankAggreg [51]. Three measures, Normalized Discounted Cumulative Gain (NDCG) [52], Expected Reciprocal Rank (ERR) [53], and Proportion (P) [54], are considered for evaluating the ranking results. The NDCG considers the position of a relevant pathway in the aggregated ranked list. ...
Article
Full-text available
Pathway analysis has been widely used to detect pathways and functions associated with complex disease phenotypes. The proliferation of this approach is due to better interpretability of its results and its higher statistical power compared with the gene-level statistics. A plethora of pathway analysis methods that utilize multi-omics setup, rather than just transcriptomics or proteomics, have recently been developed to discover novel pathways and biomarkers. Since multi-omics gives multiple views into the same problem, different approaches are employed in aggregating these views into a comprehensive biological context. As a result, a variety of novel hypotheses regarding disease ideation and treatment targets can be formulated. In this article, we review 32 such pathway analysis methods developed for multi-omics and multi-cohort data. We discuss their availability and implementation, assumptions, supported omics types and databases, pathway analysis techniques and integration strategies. A comprehensive assessment of each method's practicality, and a thorough discussion of the strengths and drawbacks of each technique will be provided. The main objective of this survey is to provide a thorough examination of existing methods to assist potential users and researchers in selecting suitable tools for their data and analysis purposes, while highlighting outstanding challenges in the field that remain to be addressed for future development.
... In this paper, we evaluate our model via both the accuracy and the expected reciprocal rank [12]. For accuracy, it takes the maximum value of the likelihood. ...
Conference Paper
Full-text available
In crowdsourced map services, digital maps are created and updated manually by volunteer users. Existing service providers usually give users a feature-rich map editor to add, drop, and modify roads. To make the map data more useful for widely-used applications such as navigation systems and travel planning services, it is important to provide not only the topology of the road network and the shapes of the roads, but also the type of each road segment (e.g., highway, regular road, secondary way, etc.). To reduce the cost of manual map editing, it is desirable to generate proper recommendations for users to choose from or refine further. Several recent works have aimed at generating road shapes from large numbers of historical trajectories; to the best of our knowledge, however, none of the existing works have addressed the problem of inferring road types from historical trajectories. In this paper, we propose a model-based approach to infer road types from taxi trajectories. We use a combined inference method based on stacked generalization, taking into account both the topology of the road network and the historical trajectories. The experimental results show that our approach can generate quality recommendations of road types for users to choose from.
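The abstract does not name the base learners or features, so the following is only a generic sketch of stacked generalization with scikit-learn; X_train, y_train, and X_test stand for hypothetical feature matrices (trajectory plus topology features) and road-type labels:

    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    stack = StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=100)),
            ("svm", LinearSVC()),
        ],
        final_estimator=LogisticRegression(),  # meta-learner over base predictions
        cv=5,  # base predictions for the meta-learner come from held-out folds
    )
    # stack.fit(X_train, y_train); road_types = stack.predict(X_test)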
Article
Stock market prediction is considered an important yet challenging aspect of financial analysis. The difficulty of forecasting arises from the volatile and non-linear nature of the stock market, which is affected by varied uncertain factors, ranging from financial ratios to macroeconomic indicators. Recent advances in machine learning, particularly ensembles, have made it possible for academic researchers and financial practitioners to forecast the stock market more efficiently. The novelty of this work is to evaluate how stock returns in an oil-dependent country (i.e., Iran), which has been facing stagflation for a long time due to economic and political issues, are affected by fundamental and macroeconomic indicators. Our main objectives are to (1) find the most important fundamental and macroeconomic indicators that control the stock returns of companies listed on the Tehran Stock Exchange (TSE); (2) compare the performance of newly developed bagging- and boosting-based ensembles in predicting annual real stock returns of the TSE; and (3) develop multiclass classification models to forecast stock returns. Prior studies mainly focused on developing binary classification models, which simply predict whether stock returns will be positive or negative in the future. We, however, design multiclass classification models to provide more information for investors and reduce the uncertainties associated with the prediction. To this end, we first provide a comprehensive list of 57 potential features affecting stock returns. Next, the data are carefully preprocessed and fed to 14 different bagging- and boosting-based ensembles (e.g., Random Forest, LightGBM, XGBoost, Extra-Trees, AdaBoost, CatBoost) to predict the stock returns. The performance of the ensembles is evaluated through different measures (e.g., accuracy, F-score, G-mean). We then propose a novel feature selection method to identify the features that contribute most to the stock returns. Our proposed model identifies nearly 65% of the 57 original features as redundant, leaving the 20 most significant features. The selected features are fed to the mentioned ensembles to re-predict the stock returns. Finally, the performance of stock return forecasts with and without the selected features is compared. To design the ensembles, we employ data from listed companies on the TSE for a 15-year period spanning 2005 to 2020. Results suggest that boosting ensembles, in general, outperform bagging-based methods. Among the boosting ensembles, XGBoost and AdaBoost provide the best and worst predictive performance, respectively. Among the bagging-based ensembles, Rotation Forest is the most accurate, whereas Random Patches performs the worst. Further, our proposed feature selection approach effectively identifies the most representative features for stock return prediction and can be used as a reliable framework for future investment decisions.
Article
Full-text available
Although Average Precision (AP) has been the most widely-used retrieval effectiveness metric since the advent of the Text Retrieval Conference (TREC), the general belief among researchers is that it lacks a user model. In light of this, Robertson recently pointed out that AP can be interpreted as a special case of Normalised Cumulative Precision (NCP), computed as an expectation of precision over a population of users who eventually stop at different ranks in a list of retrieved documents. He regards AP as a crude version of NCP, in that the probability distribution of the user's stopping behaviour is uniform across all relevant documents. In this paper, we generalise NCP further and demonstrate that AP and its graded-relevance version Q-measure are in fact reasonable metrics despite the above uniform probability assumption. From a probabilistic perspective, these metrics emphasise long-tail users who tend to dig deep into the ranked list, and thereby achieve high reliability. We also demonstrate that one of our new metrics, called NCU_{gu,β=1}, maintains high correlation with AP and shows the highest discriminative power, i.e., the proportion of statistically significantly different system pairs given a confidence level, by utilising graded relevance in a novel way. Our experimental results are consistent across NTCIR and TREC.
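Robertson's reading of AP mentioned above can be written compactly; in LaTeX, with R relevant documents and Prec@k the precision at rank k:

    % AP as an expectation of precision over users who stop, uniformly at
    % random, at one of the R relevant documents:
    \[
      \mathrm{AP}
        = \frac{1}{R} \sum_{k:\,\mathrm{rel}(k)=1} \mathrm{Prec}@k
        = \mathbb{E}_{k \sim \mathrm{Uniform}(\mathrm{relevant\ ranks})}\bigl[\mathrm{Prec}@k\bigr]
    \]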
Conference Paper
Full-text available
Recent work has shown that average precision can be accurately estimated from a small random sample of judged documents. Unfortunately, such "random pools" cannot be used to evaluate retrieval measures in any standard way. In this work, we show that given such estimates of average precision, one can accurately infer the relevances of the remaining unjudged documents, thus obtaining a fully judged pool that can be used in standard ways for system evaluation of all kinds. Using TREC data, we demonstrate that our inferred judged pools are well correlated with assessor judgments, and we further demonstrate that our inferred pools can be used to accurately infer precision recall curves and all commonly used measures of retrieval performance.
Conference Paper
Full-text available
We investigate using gradient descent methods for learning ranking functions; we propose a simple probabilistic cost function, and we introduce RankNet, an implementation of these ideas using a neural network to model the underlying ranking function. We present test results on toy data and on data from a commercial internet search engine.
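The cost function itself is not stated in this summary; RankNet's pairwise cross-entropy over a sigmoid of score differences, sketched outside any particular network, is:

    import math

    def ranknet_cost(s_i, s_j, p_target):
        """p_target: target probability that item i should rank above item j."""
        p_model = 1.0 / (1.0 + math.exp(-(s_i - s_j)))  # sigma(s_i - s_j)
        return -(p_target * math.log(p_model)
                 + (1.0 - p_target) * math.log(1.0 - p_model))

    print(ranknet_cost(2.0, 1.0, 1.0))  # ~0.31: the model agrees that i > j
    print(ranknet_cost(1.0, 2.0, 1.0))  # ~1.31: the model has the pair reversed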
Article
This paper presents a study of the applicability of three user-effort-sensitive evaluation measures - "first 20 full precision," "search length," and "rank correlation" - on four Web-based search engines (Google, AltaVista, Excite and Metacrawler). The authors argue that these measures are better alternatives than precision and recall in Web search situations because of their emphasis on the quality of ranking. Eight sets of search topics were collected from four Ph.D. students in four different disciplines (biochemistry, industrial engineering, economics, and urban planning). Each participant was asked to provide two topics along with the corresponding query terms. Their relevance and credibility judgment of the Web pages were then used to compare the performance of the search engines using these three measures. The results show consistency among these three ranking evaluation measures, more so between "first 20 full precision" and search length than between rank correlation and the other two measures. Possible reasons for rank correlation's disagreement with the other two measures are discussed. Possible future research to improve these measures is also addressed.
Conference Paper
We study the problem of answering ambiguous web queries in a setting where there exists a taxonomy of information, and that both queries and documents may belong to more than one category according to this taxonomy. We present a systematic approach to diversifying results that aims to minimize the risk of dissatisfaction of the average user. We propose an algorithm that well approximates this objective in general, and is provably optimal for a natural special case. Furthermore, we generalize several classical IR metrics, including NDCG, MRR, and MAP, to explicitly account for the value of diversification. We demonstrate empirically that our algorithm scores higher in these generalized metrics compared to results produced by commercial search engines.
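The intent-aware generalization referenced above takes, for each metric, an expectation over the query's taxonomy categories; e.g., for NDCG (a sketch in LaTeX, following the paper's formulation):

    % Condition the metric on each taxonomy category c and weight by P(c | q):
    \[
      \mathrm{NDCG\text{-}IA}(S, k) = \sum_{c} P(c \mid q)\,\mathrm{NDCG}(S, k \mid c)
    \]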
Article
Evaluation measures act as objective functions to be optimized by information retrieval systems. Such objective functions must accurately reflect user requirements, particularly when tuning IR systems and learning ranking functions. Ambiguity in queries and redundancy in retrieved documents are poorly reflected by current evaluation measures. In this paper, we present a framework for evaluation that systematically rewards novelty and diversity. We develop this framework into a specific evaluation measure, based on cumulative gain. We demonstrate the feasibility of our approach using a test collection based on the TREC question answering track.
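The specific measure developed from this framework is α-nDCG; its novelty-rewarding part is the gain function, in which each information nugget's contribution decays by a factor (1 - α) every time it reappears. A minimal sketch:

    def alpha_gains(docs_nuggets, alpha=0.5):
        """docs_nuggets: per-rank sets of nugget ids covered by each document."""
        seen = {}     # nugget id -> number of times already returned
        gains = []
        for nuggets in docs_nuggets:
            gains.append(sum((1 - alpha) ** seen.get(n, 0) for n in nuggets))
            for n in nuggets:
                seen[n] = seen.get(n, 0) + 1
        return gains

    # The second document repeats nugget "a", so that part of its gain is halved:
    print(alpha_gains([{"a"}, {"a", "b"}]))   # [1.0, 1.5]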
Article
A measure of document retrieval system performance called the “expected search length reduction factor” is defined and compared with indicators, such as precision and recall, that have been suggested by other workers. The new measure is based on calculations of the expected number of irrelevant documents in the collection which would have to be searched through before the desired number of relevant documents could be found. Its advantages are: (1) it provides a single index for the property it attempts to measure; (2) it allows for gradations of retrieval status, through the mathematical concept of a “weak ordering”; (3) it evaluates retrieval performance relative to random searching; and (4) it takes into account the amount of relevant material desired by the requester.
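Ignoring Cooper's weak-ordering (tie) machinery, the core quantity is simply the number of irrelevant documents examined before the desired number of relevant ones is found; a simplified sketch:

    def search_length(relevance, q=1):
        """relevance: 0/1 list in ranked order; q: number of relevant docs wanted."""
        found = irrelevant = 0
        for rel in relevance:
            if rel:
                found += 1
                if found == q:
                    return irrelevant
            else:
                irrelevant += 1
        return irrelevant   # fewer than q relevant documents in the list

    print(search_length([0, 0, 1, 0, 1], q=2))   # 3 irrelevant documents seen first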
Article
In this study the rankings of IR systems based on binary and graded relevance in TREC 7 and 8 data are compared. The relevance of a sample of TREC results is reassessed using a relevance scale with four levels: non-relevant, marginally relevant, fairly relevant, highly relevant. Twenty-one topics and 90 systems from TREC 7 and 20 topics and 121 systems from TREC 8 form the data. Binary precision, cumulated gain, discounted cumulated gain and normalised discounted cumulated gain are the measures compared. Different weighting schemes for relevance levels are tested with the cumulated gain measures. Kendall's rank correlations are computed to determine to what extent the rankings produced by different measures are similar. Weighting schemes from binary to emphasising highly relevant documents form a continuum, where the measures correlate strongly at the binary end, and less at the heavily weighted end. The results show the different character of the measures.
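The paper's exact weighting schemes are not reproduced here, but hypothetical examples of the continuum it describes, from binary to heavily emphasizing highly relevant documents, would look like this with cumulated gain:

    # Hypothetical weightings for the four relevance levels (0..3) above:
    SCHEMES = {
        "binary":   {0: 0, 1: 1, 2: 1, 3: 1},    # any relevant doc counts equally
        "linear":   {0: 0, 1: 1, 2: 2, 3: 3},    # graded, mild emphasis
        "weighted": {0: 0, 1: 1, 2: 10, 3: 100}, # highly relevant dominates
    }

    def cumulated_gain(levels, scheme):
        """Running sum of weighted gains down the ranked list."""
        return [sum(scheme[l] for l in levels[:k + 1]) for k in range(len(levels))]

    print(cumulated_gain([3, 1, 0, 2], SCHEMES["weighted"]))  # [100, 101, 101, 111]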
Conference Paper
We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. After an initial training phase using a set of relevance judgments paired with click data, we show that our model can predict the relevance score of documents that have not been judged. These predictions can be used to evaluate the performance of a search engine, using our novel formalization of the confidence of the standard evaluation metric discounted cumulative gain (DCG), so comparisons can be made across time and datasets. This contrasts with previous methods which can provide only pair-wise relevance judgments between results shown for the same query. When no relevance judgments are available, we can identify the better of two ranked lists up to 82% of the time, and with only two relevance judgments for each query, we can identify the better ranking up to 94% of the time. While our experiments are on sponsored search results, which is the financial backbone of web search, our method is general enough to be applicable to algorithmic web search results as well. Furthermore, we give an algorithm to guide the selection of additional documents to judge to improve confidence.