A central problem in ranking is to design a ranking measure for evaluation of
ranking functions. In this paper we study, from a theoretical perspective, the
widely used Normalized Discounted Cumulative Gain (NDCG)-type ranking measures.
Although there are extensive empirical studies of NDCG, little is known about
its theoretical properties. We first show that, whatever the ranking function
is, the standard NDCG, which adopts a logarithmic discount, converges to 1 as
the number of items to rank goes to infinity. At first sight, this result
is very surprising. It seems to imply that NDCG cannot differentiate good and
bad ranking functions, contradicting the empirical success of NDCG in many
applications. In order to have a deeper understanding of ranking measures in
general, we propose a notion referred to as consistent distinguishability. This
notion captures the intuition that a ranking measure should have such a
property: For every pair of substantially different ranking functions, the
ranking measure can decide which one is better in a consistent manner on almost
all datasets. We show that NDCG with logarithmic discount has consistent
distinguishability although it converges to the same limit for all ranking
functions. We next characterize the set of all feasible discount functions for
NDCG according to the concept of consistent distinguishability. Specifically, we
show that whether NDCG has consistent distinguishability depends on how fast
the discount decays, and 1/r is a critical point. We then turn to the cut-off
version of NDCG, i.e., NDCG@k. We analyze the distinguishability of NDCG@k for
various choices of k and the discount functions. Experimental results on real
Web search datasets agree well with the theory.
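
The convergence claim can be illustrated numerically. The following is a minimal sketch, not the paper's experiment: it assumes the common exponential gain `2**rel - 1`, the discount `1 / log2(rank + 1)`, and hypothetical toy labels cycling through 0, 1, 2. It scores the worst possible ordering (least relevant items first) for growing list sizes.

```python
import math


def dcg(relevances):
    """DCG of relevance labels listed in ranked order (rank 1 first).

    Assumptions: exponential gain 2**rel - 1 and discount 1 / log2(rank + 1).
    """
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def worst_case_ndcg(n):
    """NDCG of the worst ordering of n synthetic items with labels 0, 1, 2."""
    labels = [i % 3 for i in range(n)]  # hypothetical toy labels
    return ndcg(sorted(labels))         # ascending: least relevant items first


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, round(worst_case_ndcg(n), 3))
```

Under these assumptions, even the worst ordering scores higher as the list grows, which is consistent with the limit described in the abstract. The increase is slow, so the sketch shows the trend rather than the limit itself.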

... We applied the second metric, nDCG, to measure the ranking ability of gRNA design tools [45]. One interesting characteristic of nDCG is that the top results get more attention than the last ones through a discount function. ...

... One interesting characteristic of nDCG is that the top results get more attention than the last ones through a discount function. This function can be set to zero for a specific cut-off k, whereby the remaining results after the k-th one are completely ignored [45]. This is interesting because we do not want to base our judgment on how well a tool is doing in predicting inefficiency. ...

... In our experiments, we set k to be a constant proportion of the total samples n of each dataset (i.e., k = n/5). This choice was based on the results of a previous study [45], which showed that nDCG with such cut-off converges and consistently distinguishes between various ranking functions. The nDCG value for each tool was calculated as follows: We obtained a gRNA ranking based on the predictions of the tested tool, where variable i represents the i-th gRNA in the rank list. ...
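
Since the snippet elides the formula itself, the following is only a rough sketch of a cut-off nDCG with `k = n/5`. It assumes that the measured efficiency is used directly as the gain and that the discount is `1 / log2(rank + 1)`; the function name `ndcg_at_k` and the sample values are hypothetical.

```python
import math


def ndcg_at_k(true_scores, predicted_scores, k):
    """nDCG where every position after the k-th one is ignored.

    true_scores[i] is the measured value of item i (used as gain, an assumption);
    predicted_scores[i] is the tool's prediction for the same item.
    """
    if k <= 0:
        return 0.0
    # Rank items by the tool's prediction, best first, and keep the top k.
    order = sorted(range(len(true_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)[:k]
    dcg = sum(true_scores[i] / math.log2(rank + 1)
              for rank, i in enumerate(order, start=1))
    # Ideal ranking: the k items with the highest measured values.
    ideal = sorted(true_scores, reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 1)
               for rank, gain in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical measured efficiencies and tool predictions for n = 10 gRNAs.
measured = [0.9, 0.1, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6, 0.5, 0.05]
predicted = [0.8, 0.2, 0.5, 0.9, 0.1, 0.6, 0.3, 0.4, 0.7, 0.0]
k = len(measured) // 5  # constant proportion of the dataset size, here k = 2
score = ndcg_at_k(measured, predicted, k)
```

Because only the top `k` predicted items contribute, reshuffling predictions among the low-ranked items leaves the score unchanged, which matches the stated goal of not judging a tool on how well it predicts inefficiency.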

The development of the CRISPR-Cas9 technology has provided a simple yet powerful system for targeted genome editing. Compared with previous gene-editing tools, the CRISPR-Cas9 system identifies target sites by the complementarity between the guide RNA (gRNA) and the DNA sequence, which is less expensive and time-consuming, as well as more precise and scalable. To effectively apply the CRISPR-Cas9 system, researchers need to identify target sites that can be cleaved efficiently and for which the candidate gRNAs have little or no cleavage at other genomic locations. For this reason, numerous computational approaches have been developed to predict cleavage efficiency and exclude undesirable targets. However, current design tools cannot robustly predict experimental success as prediction accuracy depends on the assumptions of the underlying model and how closely the experimental setup matches the training data. Moreover, the most successful tools implement complex machine learning and deep learning models, leading to predictions that are not easily interpretable.
Here, we introduce CRISPRedict, a simple linear model that provides accurate and interpretable predictions for guide design. Comprehensive evaluation on twelve independent datasets demonstrated that CRISPRedict performs on par with the currently most accurate tools and outperforms the remaining ones. Moreover, it has the most robust performance for both U6 and T7 data, illustrating its applicability to tasks under different conditions. Therefore, our system can assist researchers in the gRNA design process by providing accurate and explainable predictions. These predictions can then be used to guide genome editing experiments and make plausible hypotheses for further investigation. The source code of CRISPRedict along with instructions for use is available at https://github.com/VKonstantakos/CRISPRedict.

... Normalized discounted cumulative gain (nDCG) is a measure of ranking quality proposed by Wang et al. (WANG et al., 2013). This is a variation of the discounted cumulative gain (DCG), which is a variation of cumulative gain (CG) proposed by Järvelin (JÄRVELIN; KEKÄLÄINEN, 2002). ...

... We also perform an additional evaluation of the prediction results using normalized discounted cumulative gain (nDCG). As described in Chapter 2, it is a technique proposed by Wang et al. (WANG et al., 2013) to measure ranking quality. In this work, the goal is to evaluate the results incorporating the idea of ranking relevance. ...

Studies based on traditional data sources like surveys, for instance, offer poor scalability. The experiments are limited, and the results are restricted to small regions (such as a city or a state). The use of location-based social network (LBSN) data can mitigate the scalability problem by enabling the study of social behavior in large populations. When explored with Data Mining and Machine Learning techniques, LBSN data can be used to provide predictions of relevant cultural and behavioral data from cities or countries around the world. The main goal of this work is to predict and explore user behavior from LBSNs in the context of tourists’ mobility patterns. To achieve this goal, we propose PredicTour, which is an approach used to process LBSN users’ check-ins and to predict mobility patterns of tourists with or without previous visiting records when visiting new countries. PredicTour is composed of three key blocks: mobility modeling, profile extraction, and tourists’ mobility prediction. In the first block, sequences of check-ins in a time interval are associated with other user information to produce a new structure called “mobility descriptor”. In the profile extraction, self-organizing maps and fuzzy C-means work together to group users according to their mobility descriptors. PredicTour then identifies tourists’ profiles and estimates their mobility patterns in new countries. When comparing the performance of PredicTour with three well-known machine learning-based models, the results indicate that PredicTour outperforms the baseline approaches. Therefore, it is a good alternative for predicting and understanding international tourists’ mobility, which has an economic impact on the tourism industry, particularly when services and logistics across international borders should be provided.
The proposed approach can be used in different applications such as recommender systems for tourists, and decision-making support for urban planners interested in improving both the tourists’ experience and attractiveness of venues through personalized services.

... A commonly used metric for measuring ranking quality is the discounted cumulative gain (DCG) (Wang et al. 2013). The DCG of a ranking is calculated as ...

... Finally, we introduce the normalized DCG (NDCG) (Wang et al. 2013) which is calculated as ...

The mining sector is a very relevant part of the Chilean economy, representing more than 14% of the country’s GDP and more than 50% of its exports. However, mining is also a high-risk activity where health, safety, and environmental aspects are fundamental concerns to take into account to render it viable in the longer term. The Chilean National Geology and Mining Service (SERNAGEOMIN, after its name in Spanish) is in charge of ensuring the safe operation of mines. On-site inspections are their main tool for detecting issues, proposing corrective measures, and tracking compliance with those measures. Consequently, it is necessary to create inspection programs relying on a data-based decision-making strategy. This paper reports the work carried out in one of the most relevant dimensions of said strategy: predicting the accident risk of mining worksites, that is, how likely a mining worksite is to have accidents in the future. This risk is then used to create a priority ranking that is used to devise the inspection program. Estimating this risk at the government regulator level is particularly challenging, as the available data are very limited and biased. Our main contribution is to apply a multi-task learning approach to train the risk prediction model in such a way that it is able to overcome the constraints of the limited availability of data by fusing different sources. As part of this work, we also implemented a human-experience-based model that captures the procedures currently used by the experts in charge of elaborating the inspection priority ranking. The mining worksite risk rankings built by the model achieve a 121.2% NDCG performance improvement over the rankings based on the currently used experts’ model and outperform the non-multi-task learning alternatives.

... The scores are ranked based on their degree of score value and then scaled using the binary logarithm. The nDCG-norm can be calculated as shown in equations (5) and (6) [36]. ...

... The 30 coefficient CQCC includes the delta and double delta coefficients and GMM is used as a two-class classifier with 512 components [37]. To evaluate the proposed and baseline normalization techniques, we used the ASV spoof 2019 dataset [3] which includes all three attacks including voice converted speech, text-to-speech (TTS) [36], and replay speech. For objective evaluation, the EER [39] and t-DCF are used to measure the performance of the score normalization techniques along with the DET curve [15] on the test dataset. ...

A spoof detection algorithm supports the speaker verification system to examine the false claims by an imposter through careful analysis of input test speech. The scores are employed to categorize the genuine and spoofed samples effectively. Under the mismatch conditions, the false acceptance ratio increases and can be reduced by appropriate score normalization techniques. In this article, we are using the normalized Discounted Cumulative Gain (nDCG) norm derived from ranking the speaker’s log-likelihood scores. The proposed scoring technique smoothens the decaying process due to logarithm with an added advantage from the ranking. The baseline spoof detection system employs Constant Q-Cepstral Coefficient (CQCC) as the base features with a Gaussian Mixture Model (GMM) based classifier. The scores are computed using the ASVspoof 2019 dataset with and without normalization. The baseline techniques including the Zero normalization (Z-norm) and Test normalization (T-norm) are also considered. The proposed technique is found to perform better in terms of improved Equal Error Rate (EER) of 0.35 as against 0.43 for the baseline system (no normalization) with respect to synthetic attacks using development data. Similarly, improvements are seen in the case of replay attack with EER of 7.83 for nDCG-norm and 9.87 with no normalization (no-norm). Furthermore, the tandem-Detection Cost Function (t-DCF) scores for synthetic attack are 0.015 for no-norm and 0.010 for the proposed normalization. Additionally, for the replay attack the t-DCF scores are 0.195 for no-norm and 0.17 for the proposed normalization. The system performance is satisfactory when evaluated using evaluation data with EER of 8.96 for nDCG-norm as against 9.57 with no-norm for synthetic attacks while the EER of 9.79 for nDCG-norm as against 11.04 with no-norm for replay attacks.
Supporting the EER, the t-DCF for nDCG-norm is 0.1989 and for no-norm is 0.2636 for synthetic attacks; while in case of replay attacks, the t-DCF is 0.2284 for the nDCG-norm and 0.2454 for no-norm. The proposed scoring technique is found to increase spoof detection accuracy and overall accuracy of speaker verification system.

... In addition, we borrowed nDCG from the information retrieval (IR) literature, in order to capture the ability of each tool to rank sequences correctly, according to their efficiency, without necessarily requesting an accurate prediction of the efficiency itself (85). Similarly, we implemented Precision (86) at different thresholds for the same purpose, which did not prove useful due to the discretization of the continuous actual efficiency into distinct intervals. ...

... One interesting characteristic of nDCG is that the top results get more attention than the last ones through a discount function. This function can be set to zero for a specific cut-off k, whereby the remaining results after the kth one are completely ignored (85). This is interesting because we do not want to base our judgment on how well a tool is doing in predicting inefficiency. ...

The clustered regularly interspaced short palindromic repeat (CRISPR)/CRISPR-associated protein 9 (Cas9) system has become a successful and promising technology for gene-editing. To facilitate its effective application, various computational tools have been developed. These tools can assist researchers in the guide RNA (gRNA) design process by predicting cleavage efficiency and specificity and excluding undesirable targets. However, while many tools are available, assessment of their application scenarios and performance benchmarks are limited. Moreover, new deep learning tools have been explored lately for gRNA efficiency prediction, but have not been systematically evaluated. Here, we discuss the approaches that pertain to the on-target activity problem, focusing mainly on the features and computational methods they utilize. Furthermore, we evaluate these tools on independent datasets and give some suggestions for their usage. We conclude with some challenges and perspectives about future directions for CRISPR–Cas9 guide design.

... To evaluate the performance of a top-K recommendation, we used common metrics precision (P@K), recall (R@K), mean reciprocal rank (MRR, M@K), and normalized discounted cumulative gain (NDCG, N@K), which are expressed as follows (Hsieh et al., 2017; Li et al., 2020; SeongKu et al., 2019; Wang et al., 2013): ...

In this study, a novel top-K ranking recommendation method called collaborative social metric learning (CSML) is proposed, which implements a trust network that provides both user-item and user-user interactions in a simple structure. Most existing recommender systems adopting trust networks focus on item ratings, but this does not always guarantee optimal top-K ranking prediction. Conventional direct ranking systems in trust networks are based on sub-optimal correlation approaches that do not consider item-item relations. The proposed CSML algorithm utilizes the metric learning method to directly predict the top-K items in a trust network. A new triplet loss is further proposed, called socio-centric loss, which represents user-user interactions to fully exploit the information contained in a trust network, as an addition to the two commonly used triplet losses in metric learning for recommender systems, which consider user-item and item-item relations. Experimental results demonstrate that the proposed CSML outperformed existing recommender systems for real-world trust network data.

... The first term in (9) is the NMSE which penalizes any shift from the target values in the aforementioned relative sense, rather than in the absolute sense. The second term is the NDCG ranking loss [43] which is defined as ...

Cellular networks are becoming increasingly heterogeneous with higher base station (BS) densities and ever more frequency bands, making BS selection and band assignment key decisions in terms of rate and coverage. In this paper, we decompose the mobility-aware user association task into (i) forecasting of user rate and then (ii) convex utility maximization for user association accounting for the effects of BS load and handover overheads. Using a linear combination of normalized mean-squared error and normalized discounted cumulative gain as a novel loss function, a recurrent deep neural network is trained to reliably forecast the mobile users' future rates. Based on the forecast, the controller optimizes the association decisions to maximize the service rate-based network utility using our computationally efficient (speed up of 100x versus generic convex solver) algorithm based on the Frank-Wolfe method. Using an industry-grade network simulator developed by Meta, we show that the proposed model predictive control (MPC) approach improves the 5th percentile service rate by 3.5x compared to the traditional signal strength-based association, reduces the median number of handovers by 7x compared to a handover agnostic strategy, and achieves service rates close to a genie-aided scheme. Furthermore, our model-based approach is significantly more sample-efficient (needs 100x less training data) compared to model-free reinforcement learning (RL), and generalizes well across different user drop scenarios.

... Concerning recommendation utility for consumers, we monitored the Normalized Discounted Cumulative Gain (NDCG) [40], using binary relevance scores and a base-2 logarithm decay, and the Mean Reciprocal Rank (MRR) [11]. Differently from recall and accuracy, NDCG takes into account the position of the relevant products in the recommended list. ...
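
As an illustration of the two metrics named above, here is a minimal per-user sketch, assuming binary relevance (an item is either relevant or not) and a base-2 logarithmic decay. The function names and the sample list are hypothetical, and the cited work's exact implementation may differ.

```python
import math


def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is recommended."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def binary_ndcg(recommended, relevant):
    """NDCG with gain 1 for relevant items, 0 otherwise, and log2 decay."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(recommended, start=1)
              if item in relevant)
    # Ideal list: all relevant items that fit are placed at the top.
    n_ideal = min(len(relevant), len(recommended))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, n_ideal + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical recommendation list for one user.
recommended = ["a", "b", "c", "d"]
relevant = {"b", "d"}
rr = reciprocal_rank(recommended, relevant)
ndcg = binary_ndcg(recommended, relevant)
```

MRR is then the mean of `reciprocal_rank` over all users. Unlike recall, `binary_ndcg` drops when the same relevant products move further down the list.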

Path reasoning is a notable recommendation approach that models high-order user-product relations, based on a Knowledge Graph (KG). This approach can extract reasoning paths between recommended products and already experienced products and, then, turn such paths into textual explanations for the user. Unfortunately, evaluation protocols in this field appear heterogeneous and limited, making it hard to contextualize the impact of the existing methods. In this paper, we replicated three state-of-the-art relevant path reasoning recommendation methods proposed in top-tier conferences. Under a common evaluation protocol, based on two public data sets and in comparison with other knowledge-aware methods, we then studied the extent to which they meet recommendation utility and beyond objectives, explanation quality, and consumer and provider fairness. Our study provides a picture of the progress in this field, highlighting open issues and future directions. Source code: https://github.com/giacoballoccu/rep-path-reasoning-recsys.

... The position bias captures this intuition by assigning weights to each rank in a way that the highest ranks obtain particularly high weights [6]. We adopt its common formulation as a logarithmic discount, as it provides a smooth reduction and exhibits favorable theoretical properties [33]: $b(k) := \frac{1}{\log_2(k + 1)}$ ...

In recent years, several metrics have been developed for evaluating group fairness of rankings. Given that these metrics were developed with different application contexts and ranking algorithms in mind, it is not straightforward which metric to choose for a given scenario. In this paper, we perform a comprehensive comparative analysis of existing group fairness metrics developed in the context of fair ranking. By virtue of their diverse application contexts, we argue that such a comparative analysis is not straightforward. Hence, we take an axiomatic approach whereby we design a set of thirteen properties for group fairness metrics that consider different ranking settings. A metric can then be selected depending on whether it satisfies all or a subset of these properties. We apply these properties on eleven existing group fairness metrics, and through both empirical and theoretical results we demonstrate that most of these metrics only satisfy a small subset of the proposed properties. These findings highlight limitations of existing metrics, and provide insights into how to evaluate and interpret different fairness metrics in practical deployment. The proposed properties can also assist practitioners in selecting appropriate metrics for evaluating fairness in a specific application.

... Since we are to compare the performances of our models with baseline models, and our goal is to rank positive samples higher and return as many of them as possible, the metrics are chosen. mAP and NDCG are very sensitive to positive sample rankings (Yohanandan, 2020; Y. Wang et al., 2013); Recall gives a good evaluation of positive sample coverage in returned courses, as in its definition. ...

Many students have been complaining about how challenging it is to make course plans. This project is targeted at designing a course recommender system that has knowledge of course relationships to make well-informed suggestions. To achieve the objective, we propose to embed course knowledge graphs into the recommendation system and use hyperbolic embedding to learn the graph structure. We identify the sparsity graph issue and propose a graph augmentation approach from texts which is coined anchored edge addition. Experiments show that our model outperforms the baseline models and achieves consistently high performances at the faculty level. Also, our knowledge graph learning process is inductive so it can be extended to the recommendation scenarios of many universities.

... Mean Squared Error [12]) or how well it ranks relevant items (e.g. Normalized Discounted Cumulative Gain [22,23,23,45]). The "relevance" can be based on human ratings but, for real-world personalized systems where such ratings are unavailable, we typically utilize implicit feedback and define relevance based on user engagement and satisfaction. This data collection from an already deployed recommender can lead to sampling biases [24,43], and offline evaluations might therefore not reflect well the behavior of the actual system [21]. ...

There has been a flurry of research in recent years on notions of fairness in ranking and recommender systems, particularly on how to evaluate if a recommender allocates exposure equally across groups of relevant items (also known as provider fairness). While this research has laid an important foundation, it gave rise to different approaches depending on whether relevant items are compared per-user/per-query or aggregated across users. Despite both being established and intuitive, we discover that these two notions can lead to opposite conclusions, a form of Simpson's Paradox. We reconcile these notions and show that the tension is due to differences in distributions of users where items are relevant, and break down the important factors of the user's recommendations. Based on this new understanding, practitioners might be interested in either notions, but might face challenges with the per-user metric due to partial observability of the relevance and user satisfaction, typical in real-world recommenders. We describe a technique based on distribution matching to estimate it in such a scenario. We demonstrate on simulated and real-world recommender data the effectiveness and usefulness of such an approach.

... To calculate the reciprocal rank of the first relevant item recommended to each user, the second evaluation metric Mean Reciprocal Rank (MRR@N) 47 is denoted as where $\mathrm{rank}_u^*$ is the rank position of the first relevant item recommended to user $u$. Moreover, as the third evaluation metric, the Normalized Discounted Cumulative Gain (NDCG@N) 48 can further measure the overall ranking quality in a manner that accounts for the position of the hit by assigning higher scores to hits at top ranks as where $\delta(\cdot)$ is an indicator function and positions are discounted logarithmically. ...

As an intuitive description of complex physical, social, or brain systems, complex networks have fascinated scientists for decades. Recently, to abstract a network’s topological and dynamical attributes, network representation has been a prevalent technique, which can map a network or substructures (like nodes) into a low-dimensional vector space. Since its mainstream methods are mostly based on machine learning, a black box of an input-output data fitting mechanism, the learned vector’s dimension is indeterminable and the elements are not interpreted. Although massive efforts to cope with this issue have included, say, automated machine learning by computer scientists and learning theory by mathematicians, the root causes still remain unresolved. Consequently, enterprises need to spend enormous computing resources to work out a set of model hyperparameters that can bring good performance, and business personnel still finds difficulties in explaining the learned vector’s practical meaning. Given that, from a physical perspective, this article proposes two determinable and interpretable node representation methods. To evaluate their effectiveness and generalization, this article proposes Adaptive and Interpretable ProbS (AIProbS), a network-based model that can utilize node representations for link prediction. Experimental results showed that the AIProbS can reach state-of-the-art precision beyond baseline models on some small data whose distribution of training and test sets is usually not unified enough for machine learning methods to perform well. Besides, it can make a good trade-off with machine learning methods on precision, determinacy (or robustness), and interpretability. In practice, this work contributes to industrial companies without enough computing resources but who pursue good results based on small data during their early stage of development and who require high interpretability to better understand and carry out their business.

... We use word-level Levenshtein distance and sentence embedding cosine similarity as two numerical features for each hypothesis. • Other features (2): hypothesis length and signal-noise ratio (SNR) are used as two additional features to LTR models. 1 In this paper NDCG [21] is used. The target rank for each hypothesis in the N-best list is based on their WER. ...

... The DCG measures the best-ranking result, and nDCG is the DCG normalized by the ideal DCG. Thus, the nDCG measure is always a number within [0, 1] [51]. The nDCG is computed for the Top@k patient rankings retrieved from a query patient. ...

Patient similarity assessment, which identifies patients similar to a given patient, is a fundamental component of many secondary uses of medical data. The assessment can be performed using electronic medical records (EMRs). Patient similarity measurement requires converting heterogeneous EMRs into comparable formats to calculate distance. This study presents a new data representation method for EMRs that considers the information in clinical narratives. To address the limitations of previous approaches in handling complex parts of EMR data, an unsupervised manner is proposed for building a patient representation, which integrates unstructured and structured data extracted from patients' EMRs. We employed a tree structure to model the extracted data that capture the temporal relations of multiple medical events from EMR. We processed clinical notes to extract medical concepts using Python libraries such as MedspaCy and ScispaCy and mapped entities to the Unified Medical Language System (UMLS). To capture temporal aspects of the extracted events, we developed two new relabeling methods for the non-leaf nodes of the tree. To create an embedding vector for each patient, we traversed the tree to generate sequences that the Doc2vec algorithm would use. The comprehensive evaluation of the proposed method for patient similarity and mortality prediction tasks demonstrated that our proposed model leads to lower mean-squared error (MSE), higher precision, and normalized discounted cumulative gain (NDCG) relative to baselines.

... To analyze the influence of the different methods in the sentence relation graph, we use Normalized Discounted Cumulative Gain (NDCG) [42] for evaluation. NDCG is a ranking evaluation metric. ...

In this paper, we develop a neural multi-document summarization model, named MuD2H (short for Multi-Document to Headline), to generate an attractive and customized headline from a set of product descriptions. To the best of our knowledge, no one has used a technique for multi-document summarization to generate headlines in the past. Therefore, multi-document headline generation can be considered a new problem setting. Our model implements a two-stage architecture, including an extractive stage and an abstractive stage. The extractive stage is a graph-based model that identifies salient sentences, whereas the abstractive stage uses existing summaries as soft templates to guide the seq2seq model. A series of experiments is conducted using the KKday dataset. Experimental results show that the proposed method outperforms the others in terms of quantitative and qualitative aspects.

... NDCG is defined as standard when utilizing the inverse logarithmic decay (i.e., $\frac{1}{\log(i+1)}$). Note that the base of the logarithm is not important, as constant scaling will cancel out due to normalization [277]. ...
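
The remark about the logarithm base can be checked with a short derivation. The symbols here are assumptions introduced for illustration: $g_i$ denotes the gain of the item at rank $i$, and $\mathrm{IDCG}$ denotes the DCG of the ideal ordering. By the change-of-base identity,

$$
\log_a(i+1) = \frac{\log_b(i+1)}{\log_b a}
\quad\Longrightarrow\quad
\mathrm{DCG}_a = \sum_i \frac{g_i}{\log_a(i+1)} = (\log_b a) \sum_i \frac{g_i}{\log_b(i+1)} = (\log_b a)\,\mathrm{DCG}_b .
$$

The same factor $\log_b a$ multiplies the ideal DCG, so it cancels in the ratio:

$$
\mathrm{NDCG}_a = \frac{\mathrm{DCG}_a}{\mathrm{IDCG}_a} = \frac{(\log_b a)\,\mathrm{DCG}_b}{(\log_b a)\,\mathrm{IDCG}_b} = \mathrm{NDCG}_b .
$$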

Research on recommendation systems is swiftly producing an abundance of novel methods, constantly challenging the current state-of-the-art. Inspired by advancements in many related fields, like Natural Language Processing and Computer Vision, many hybrid approaches based on deep learning are being proposed, making solid improvements over traditional methods. On the downside, this flurry of research activity, often focused on improving over a small number of baselines, makes it hard to identify reference methods and standardized evaluation protocols. Furthermore, the traditional categorization of recommendation systems into content-based, collaborative filtering and hybrid systems lacks the informativeness it once had. With this work, we provide a gentle introduction to recommendation systems, describing the task they are designed to solve and the challenges faced in research. Building on previous work, an extension to the standard taxonomy is presented, to better reflect the latest research trends, including the diverse use of content and temporal information. To ease the approach toward the technical methodologies recently proposed in this field, we review several representative methods selected primarily from top conferences and systematically describe their goals and novelty. We formalize the main evaluation metrics adopted by researchers and identify the most commonly used benchmarks. Lastly, we discuss issues in current research practices by analyzing experimental results reported on three popular datasets.

... We use the NDCG (Wang et al. 2013) as a measure of the overall quality of the recommendation. For each user u, we compute a ranked list of items in the evaluation (i.e., validation or test) set based on the predicted preferences $r_u$. ...

State-of-the-art music recommender systems are based on collaborative filtering, which builds upon learning similarities between users and songs from the available listening data. These approaches inherently face the cold-start problem, as they cannot recommend novel songs with no listening history. Content-aware recommendation addresses this issue by incorporating content information about the songs on top of collaborative filtering. However, methods falling in this category rely on a shallow user/item interaction that originates from a matrix factorization framework. In this work, we introduce neural content-aware collaborative filtering, a unified framework which alleviates these limits, and extends the recently introduced neural collaborative filtering to its content-aware counterpart. This model leverages deep learning for both extracting content information from low-level acoustic features and for modeling the interaction between users and songs embeddings. The deep content feature extractor can either directly predict the item embedding, or serve as a regularization prior, yielding two variants (strict and relaxed) of our model. Experimental results show that the proposed method reaches state-of-the-art results for both warm- and cold-start music recommendation tasks. We notably observe that exploiting deep neural networks for learning refined user/item interactions outperforms approaches using a more simple interaction model in a content-aware framework.

... We evaluate the developed MSE with our own dataset. It is evaluated using a new metric, the average hit rank (AHR), based on the prediction hit rate, and a classic metric, the normalized discounted cumulative gain (NDCG) [9]. We compare our engine to the major search engines. ...

Typically, search engines provide query suggestions to assist users in the search process. Query suggestions are very important for improving the user's search experience. However, most query suggestions are based on the user's search logs, and they can be influenced by infrequently searched queries. Depending on the user's query, query suggestions can be ineffective in global search engines but effective in a domestic search engine. Conversely, they can be effective in global engines and weak in domestic engines. In addition, log-based query suggestions require many search logs, which makes them difficult to construct outside of a large search engine. Some search engines do not provide query suggestions, making searches difficult for users. These query suggestion vulnerabilities degrade the user's search experience. In this study, we develop a meta-suggestion, a new query suggestion scheme. Similar to meta-searches, meta-suggestions retrieve candidate queries of suggestions from other search engines. Meta-suggestions generate suggestions by reranking the aggregated candidate queries. We develop a meta-suggestion engine (MSE) browser extension that generates meta-suggestions. It can provide query suggestions for any webpage and does not require a search log. A comparison of our meta-suggestions with major search engines such as Google showed a 17% performance improvement on normalized discounted cumulative gain (NDCG) and a 31% improvement on precision. If more detailed factors, such as user preferences, are examined in future work, the user search experience is expected to improve greatly.

... Evaluation metrics [402], [403] used in experiments include mean absolute error (MAE) [404] and root mean squared error (RMSE) [191], [404] for evaluations on recommendation accuracy in predicting explicit user-item interactions, and Precision [348], Recall [348], [355] and Rank [405] for those in predicting implicit user-item interactions. Conceivably, in addition to these metrics, more like AUC and NDCG are also essential for evaluating recommendation accuracy. ...

As a pivotal tool to alleviate the information overload problem, recommender systems aim to predict a user's preferred items from millions of candidates by analyzing observed user-item relations. To alleviate the sparsity and cold start problems encountered by recommender systems, researchers generally resort to employing side information or knowledge in recommendation as a strategy for uncovering hidden (indirect) user-item relations, aiming to enrich observed information (or data) for recommendation. However, in the face of the high complexity and large scale of side information and knowledge, the efficient implementation of this strategy largely relies on the scalability of recommendation models. Graph embedding techniques, which can efficiently utilize complex and large-scale data, became a research focus only after machine learning became prevalent. In light of that, equipping recommender systems with graph embedding techniques has been widely studied in recent years, appearing to outperform conventional recommendation implemented directly based on graph topological analysis (or resolution). As its focus, this article systematically reviews graph embedding-based recommendation from embedding techniques for bipartite graphs, general graphs and knowledge graphs, and proposes a general design pipeline for it. In addition, after comparing several representative graph embedding-based recommendation models with the most commonly used conventional recommendation models in simulations, this article shows that the conventional models can still overall outperform the graph embedding-based ones in predicting implicit user-item interactions, revealing the comparative weakness of graph embedding-based recommendation in these tasks. To foster future research, this article proposes constructive suggestions on making a trade-off between graph embedding-based recommendation and conventional recommendation in different tasks, and puts forward some open questions.

... Our evaluation covered recommendation utility, explanation quality, and fairness, computed on top-10 recommended lists ( = 10) for the sake of conciseness and clarity. We assessed recommendation utility through the Normalized Discounted Cumulative Gain (NDCG) [31], using binary relevance scores and a base-2 logarithm decay. We assessed explanation quality from three perspectives, namely linking interaction recency (Eq. ...

Existing explainable recommender systems have mainly modeled relationships between recommended and already experienced products, and shaped explanation types accordingly (e.g., movie "x" starred by actress "y" recommended to a user because that user watched other movies with "y" as an actress). However, none of these systems has investigated the extent to which properties of a single explanation (e.g., the recency of interaction with that actress) and of a group of explanations for a recommended list (e.g., the diversity of the explanation types) can influence the perceived explanation quality. In this paper, we conceptualized three novel properties that model the quality of the explanations (linking interaction recency, shared entity popularity, and explanation type diversity) and proposed re-ranking approaches able to optimize for these properties. Experiments on two public data sets showed that our approaches can increase explanation quality according to the proposed properties, fairly across demographic groups, while preserving recommendation utility. The source code and data are available at https://github.com/giacoballoccu/explanation-quality-recsys.

... which, instead of uniformly weighting all positions, introduces a logarithm discount function over the ranks where larger weights are applied to recommended items that appear at higher ranks [54]. NDCG@M is calculated by normalizing the DCG@M to [0, 1] by the ideal DCG@M where all relevant items are ranked at the top. ...
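The NDCG@M computation described in the snippet above can be sketched as follows. This is a minimal illustration, not the cited paper's implementation: the function names `dcg_at_m` and `ndcg_at_m` are hypothetical, the base-2 logarithm is an assumed choice of discount, and the input list is assumed to hold the relevance of every candidate item in predicted rank order, so that sorting it yields the ideal ranking.

```python
import math


def dcg_at_m(relevances, m):
    """DCG@M: sum of rel_i / log2(i + 1) over ranks i = 1..M (assumed base-2 discount)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:m], start=1))


def ndcg_at_m(relevances, m):
    """Normalize DCG@M to [0, 1] by the ideal DCG@M (all relevant items ranked at the top)."""
    ideal = dcg_at_m(sorted(relevances, reverse=True), m)
    # A list with no relevant items has no meaningful ideal ranking; return 0.0 by convention.
    return dcg_at_m(relevances, m) / ideal if ideal > 0 else 0.0


# Relevance of the items in predicted order (1 = relevant, 0 = not relevant).
print(ndcg_at_m([1, 0, 1, 0], m=3))
```

With binary relevance, a relevant item at rank 3 contributes 1 / log2(4) = 0.5, half of what it would contribute at rank 1, which is how the measure rewards placing relevant items near the top.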

Hybrid recommendations have recently attracted a lot of attention where user features are utilized as auxiliary information to address the sparsity problem caused by insufficient user-item interactions. However, extracted user features generally contain rich multimodal information, and most of them are irrelevant to the recommendation purpose. In this article, we propose a variational bandwidth auto-encoder (VBAE) for recommendations, aiming to address the sparsity and noise problems simultaneously. VBAE first encodes user collaborative and feature information into Gaussian latent variables via deep neural networks to capture non-linear user similarities. Moreover, by considering the fusion of collaborative and feature variables as a virtual communication channel from an information-theoretic perspective, we introduce a user-dependent channel to dynamically control the information allowed to be accessed from the feature embeddings. A quantum-inspired uncertainty measurement of the hidden rating embeddings is proposed accordingly to infer the channel bandwidth by disentangling the uncertainty information in the ratings from the semantic information. Through this mechanism, VBAE incorporates adequate auxiliary information from user features if collaborative information is insufficient, while avoiding excessive reliance on noisy user features to improve its generalization ability to new users. Extensive experiments conducted on three datasets demonstrate the effectiveness of the proposed method.

Multi-behavior recommendation leverages auxiliary behaviors (e.g., view, add-to-cart) to improve the prediction for target behaviors (e.g., buy). Most existing works are built upon the assumption that all the auxiliary behaviors are positively correlated with target behaviors. However, we empirically find that such an assumption may not hold in real-world datasets. In fact, some auxiliary feedback is too noisy to be helpful, and it is necessary to restrict its influence for better performance. To this end, in this paper we propose a Bi-directional Contrastive Distillation (BCD) model for multi-behavior recommendation, aiming to distill valuable knowledge (about user preference) from the interplay of multiple user behaviors. Specifically, we design a forward distillation to distill the knowledge from auxiliary behaviors to help model target behaviors, and then a backward distillation to distill the knowledge from target behaviors to enhance the modelling of auxiliary behaviors. Through this circular learning, we can better extract the common knowledge from multiple user behaviors, where noisy auxiliary behaviors will not be involved. The experimental results on two real-world datasets show that our approach outperforms other counterparts in accuracy. Keywords: Recommender system; Contrastive distillation; Multi-behavior recommender

Path reasoning is a notable recommendation approach that models high-order user-product relations, based on a Knowledge Graph (KG). This approach can extract reasoning paths between recommended products and already experienced products and, then, turn such paths into textual explanations for the user. Unfortunately, evaluation protocols in this field appear heterogeneous and limited, making it hard to contextualize the impact of the existing methods. In this paper, we replicated three state-of-the-art relevant path reasoning recommendation methods proposed in top-tier conferences. Under a common evaluation protocol, based on two public data sets and in comparison with other knowledge-aware methods, we then studied the extent to which they meet recommendation utility and beyond objectives, explanation quality, and consumer and provider fairness. Our study provides a picture of the progress in this field, highlighting open issues and future directions. Source code: https://github.com/giacoballoccu/rep-path-reasoning-recsys. Keywords: Recommender systems; Knowledge graphs; Replicability

Session-based recommender systems have evolved as a new paradigm in recent years, intending to capture short-term yet dynamic user preferences to give more timely and accurate suggestions that are responsive to the change in their session contexts. However, sparse user-item interaction data has been one of the most significant issues, as a colossal amount of memory is needed to store such sparse data. Seasonality is another major issue in recommendation systems, as there are many variations in the pattern of customers' interests at different time intervals. In our study, we resolve the above-mentioned issues by using graph collaborative filtering and creating feature bins. As a case study, we used sequential data from YooChoose customers to validate the efficacy of our proposed methodology. Further, we use five state-of-the-art graph neural network models to get the best recommendation. The performance of those models is evaluated using the NDCG (Normalized Discounted Cumulative Gain) and ROC-AUC (Area under the Receiver operating characteristic curve) metrics. In our study, we find that a Residual Gated Convolutional Neural Network with four layers and the Adam optimizer gave the best recommendations.

Numerous Knowledge Graphs (KGs) are being created to make Recommender Systems (RSs) not only intelligent but also knowledgeable. Integrating a KG in the recommendation process allows the underlying model to extract reasoning paths between recommended products and already experienced products from the KG. These paths can be leveraged to generate textual explanations to be provided to the user for a given recommendation. However, the existing explainable recommendation approaches based on KG merely optimize the selected reasoning paths for product relevance, without considering any user-level property of the paths for explanation. In this paper, we propose a series of quantitative properties that monitor the quality of the reasoning paths from an explanation perspective, based on recency, popularity, and diversity. We then combine in- and post-processing approaches to optimize for both recommendation quality and reasoning path quality. Experiments on three public data sets show that our approaches significantly increase reasoning path quality according to the proposed properties, while preserving recommendation quality. Source code, data sets, and KGs are available at https://tinyurl.com/bdbfzr4n.

There has been growing attention on explainable recommendation that is able to provide high-quality results as well as intuitive explanations. However, most existing studies use offline prediction strategies where recommender systems are trained once and then used forever, which ignores the dynamic and evolving nature of user–item interactions. There are two main issues with these methods. First, their random dataset split setting results in data leakage, in which knowledge that should not be known at the time of training is utilized. Second, the dynamic characteristics of user preferences are overlooked, resulting in a model aging issue where the model's performance degrades over time. In this paper, we propose an updating-enabled online prediction framework for time-aware explainable recommendation. Specifically, we propose an online prediction scheme to eliminate the data leakage issue and two novel updating strategies to relieve the model aging issue. Moreover, we conduct extensive experiments on four real-world datasets to evaluate the effectiveness of our proposed methods. Compared with the state-of-the-art, our time-aware approach achieves higher accuracy results and more convincing explanations for the entire lifetime of recommendation systems, i.e., both the initial period and the long-term usage.

Recently, extensive attention from researchers has been paid to users referring to product review comments when choosing products while shopping online. These shoppers are also more frequently demanding a visual search facility to identify similar or identical products based on images they input. In this paper, the authors propose a product recommendation method to support a visual product search that combines the similarities of both visual and textual information to recommend products with a high level of satisfaction. The authors first utilize the image-based recognition method to calculate the similarities between user-inputted images and product images based on their SIFT features and the surrounding text. Next, to select satisfying products, the authors perform sentiment analysis on product reviews and combine this with users' repeat purchase behavior to recommend products that have a high level of satisfaction rating. Finally, the authors evaluate and discuss the proposed method using real e-commerce data.

This paper presents our work to recommend brands to customers that might be relevant to their style but the brands are new to them. To promote the exploration and discovery of new brands, we leverage article-embeddings, also known as Fashion DNA, a learned encoding for each article of clothing at Zalando, that is utilized for product and outfit recommendations. The model used in Fashion DNA’s work proposed a Logistic Matrix Factorization approach using sales data to learn customer style preferences. In this work, we evolved that approach to circumvent the cold-start problem for recommending new brands that do not have enough sales or digital footprint. First, we computed an embedding per brand, named Brand DNA, from the Fashion DNA of all articles that belong to a given brand. Then, we trained a model using Logistic Matrix Factorization to predict sales for a set of frequent customers and brands. That allowed us to learn customer style representations that can be leveraged to predict the likelihood of purchasing from a new brand by using its Brand DNA. Customers are also able to further explore Zalando’s assortment moving from the more popular products and brands. Keywords: Embeddings; Neural networks; Latent representations; Deep learning

With the emergence of various online trading technologies, fraudulent cases begin to occur frequently. The problem of fraud in public trading companies is a hot topic in financial field. This paper proposes a fraud detection model for public trading companies using datasets collected from SEC’s Accounting and Auditing Enforcement Releases (AAERs). At the same time, this computational finance model is solved with a nonlinear activated Beetle Antennae Search (NABAS) algorithm, which is a variant of the meta-heuristic optimization algorithm named Beetle Antennae Search (BAS) algorithm. Firstly, the fraud detection model is transformed into an optimization problem of minimizing loss function and using the NABAS algorithm to find the optimal solution. NABAS has only one search particle and explores the space under a given gradient estimation until it is less than an “Activated Threshold” and the algorithm is efficient in computation. Then, the random under-sampling with AdaBoost (RUSBoost) algorithm is employed to comprehensively evaluate the performance of NABAS. In addition, to reflect the superiority of NABAS in the fraud detection problem, it is compared with some popular methods in recent years, such as the logistic regression model and Support Vector Machine with Financial Kernel (SVM-FK) algorithm. Finally, the experimental results show that the NABAS algorithm has higher accuracy and efficiency than other methods in the fraud detection of public datasets.

The co-opinionatedness measure, that is, the similarity of cociting documents in their opinions about their cocited articles, has been recently proposed. The present study uses a wider range of baselines and benchmarks to investigate the measure’s effectiveness in retrieval ranking that was previously confirmed in a pilot study. A test collection was built including 30 seed documents and their 4702 cocited articles. Their citances and full-texts were analysed using natural language processing (NLP) and opinion mining techniques. Cocitation values, syntactical similarity and contexts similarity were used as baselines. The distributional semantic similarity and the linear and hierarchical Medical Subject Headings (MeSH) similarities served as benchmarks to evaluate the effect of the co-opinionatedness as a boosting factor on the performance of the baselines. The improvements in the rankings were measured by normalised discounted cumulative gain (nDCG). According to the findings, there existed significant differences between the nDCG mean values obtained before and after weighting the baselines by the co-opinionatedness measures. The results of the generalisability study corroborated the reliability and generalisability of the systems. Accordingly, the similarity in the opinions of the cociting papers towards their cocited articles can explain the cocitation relation in the scientific papers network and can be effectively utilised for improving the results of the cocitation-based retrieval systems.

As a vital energy resource and raw material for many industrial products, syngas (CO and H2) is of great significance. Dry reforming of methane (DRM) is an important approach to producing syngas (with a hydrogen-to-carbon-monoxide ratio of 1:1 in principle) from methane and carbon dioxide, with a lower operational cost as compared to other reforming techniques. However, many pure metallic catalysts used in DRM face deactivation issues due to coke formation or sintering of the metal particles. A systematic search for highly efficient metallic catalysts, which reduce the reaction barriers for the rate-determining steps and resist carbon deposition, is urgently needed. Nickel is a typical low-cost transition metal for activating the C-H bond in methane. In this work, we applied a two-step workflow to search for nickel-based bimetallic catalysts with doping metals M (M-Ni) by combining density functional theory (DFT) calculations and machine learning (ML). We focus on the two critical steps in DRM-CH4 and CO2 direct activations. We used DFT and slab models for the Ni(111) facet to explore the relevant reaction pathways and constructed a data set containing structural and energetic information for representative M-Ni systems. We used this dataset to train ML models with chemical-knowledge-based features and predicted CH4 and CO2 dissociation energies and barriers, which revealed the composition—activity relationships of the bimetallic catalysts. We also used these models to rank the predicted catalytic performance of candidate systems to demonstrate the applicability of ML for catalyst screening. We emphasized that ML ranking models would be more valuable than regression models in high-throughput screenings. Finally, we used our trained model to screen 12 unexplored M-Ni systems and showed that the DFT-computed energies and barriers are very close to the ML-predicted values for top candidates, validating the robustness of the trained model.

A textual data processing task that involves the automatic extraction of relevant and salient keyphrases from a document that expresses all the important concepts of the document is called keyphrase extraction. Due to technological advancements, the amount of textual information on the Internet is rapidly increasing as a lot of textual information is processed online in various domains such as offices, news portals, or for research purposes. Given the exponential increase of news articles on the Internet, manually searching for similar news articles by reading the entire news content that matches the user’s interests has become a time-consuming and tedious task. Therefore, automatically finding similar news articles can be a significant task in text processing. In this context, keyphrase extraction algorithms can extract information from news articles. However, selecting the most appropriate algorithm is also a problem. Therefore, this study analyzes various supervised and unsupervised keyphrase extraction algorithms, namely KEA, KP-Miner, YAKE, MultipartiteRank, TopicRank, and TeKET, which are used to extract keyphrases from news articles. The extracted keyphrases are used to compute lexical and semantic similarity to find similar news articles. The lexical similarity is calculated using the Cosine and Jaccard similarity techniques. In addition, semantic similarity is calculated using a word embedding technique called Word2Vec in combination with the Cosine similarity measure. The experimental results show that the KP-Miner keyphrase extraction algorithm, together with the Cosine similarity calculation using Word2Vec (Cosine-Word2Vec), outperforms the other combinations of keyphrase extraction algorithms and similarity calculation techniques to find similar news articles. 
The similar articles identified using KP-Miner and the Cosine similarity measure with Word2Vec appear to be relevant to a particular news article and thus show satisfactory performance with a Normalized Discounted Cumulative Gain (NDCG) value of 0.97. This study proposes a method for finding similar news articles that can be used in conjunction with other methods already in use.

Recommender Systems (RecSys) provide suggestions in many decision-making processes. Given that groups of people can perform many real-world activities (e.g., a group of people attending a conference looking for a place to dine), the need for recommendations for groups has increased. A wide range of Group Recommender Systems (GRecSys) has been developed to aggregate individual preferences to group preferences. We analyze 175 studies related to GRecSys. Previous works evaluate their systems using different types of groups (sizes and cohesiveness), and most of such works focus on testing their systems using only one type of item, called experience goods (EG). As a consequence, it is hard to get consistent conclusions about the performance of GRecSys. We present the aggregation strategies and aggregation functions that GRecSys commonly use to aggregate group members’ preferences. This study experimentally compares the performance (i.e., accuracy, ranking quality, and usefulness) using four metrics (Hit Ratio, nDCG, Diversity, and Coverage) of eight representative RecSys for group recommendations on ephemeral groups. Moreover, we use two different aggregation strategies, ten different aggregation functions, and two different types of items on two types of datasets (Experience Goods (EG) and Search Goods (SG)) containing real-life datasets. The results show that the evaluation of GRecSys needs to use both EG and SG types of data because the different characteristics of datasets lead to different performance. GRecSys using Singular Value Decomposition (SVD) or Neural Collaborative Filtering (NCF) methods work better than others. It is observed that the Average aggregation function is the one that produces better results.

Function drives many early design considerations in product development, highlighting the importance of finding functionally similar examples if searching for sources of inspiration or evaluating designs against existing technology. However, it is difficult to capture what people consider is functionally similar and therefore, if measures that quantify and compare function using the products themselves are meaningful. In this work, human evaluations of similarity are compared to computationally determined values, shedding light on how quantitative measures align with human perceptions of functional similarity. Human perception of functional similarity is considered at two levels of abstraction: (1) the high-level purpose of a product and (2) how the product works. These human similarity evaluations are quantified by crowdsourcing 1360 triplet ratings at each functional abstraction and creating low-dimensional embeddings from the triplets. The triplets and embeddings are then compared to similarities that are computed between functional models using six representative measures, including both matching measures (e.g., cosine similarity) and network-based measures (e.g., spectral distance). The outcomes demonstrate how levels of abstraction and the fuzzy line between “highly similar” and “somewhat similar” products may impact human functional similarity representations and their subsequent alignment with computed similarity. The results inform how functional similarity can be leveraged by designers, with applications in creativity support tools, such as those used for design-by-analogy, or other computational methods in design that incorporate product function.

We reduce ranking, as measured by the Area Under the Receiver Operating Characteristic Curve (AUC), to binary classification. The core theorem shows that a binary classification regret of r on the induced binary problem implies an AUC regret of at most 2r. This is a large improvement over approaches such as ordering according to regressed scores, which have a regret transform of r ↦ nr where n is the number of elements.
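To make the quantity in the abstract above concrete, the sketch below computes AUC as the fraction of (positive, negative) item pairs that a scoring function orders correctly, with ties counted as half. It illustrates only the measure itself, not the paper's reduction or its regret bound; the function name `pairwise_auc` and the 0/1 label encoding are assumptions made for this example.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked in the right order.

    labels are assumed to be 1 for positive items and 0 for negative items.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative item")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0   # positive scored above negative: correctly ordered
            elif p == n:
                correct += 0.5   # tie: counted as half a correct pair
    return correct / (len(positives) * len(negatives))


print(pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Because the measure depends only on which element of each mixed-label pair is ranked higher, each pair can be viewed as one binary decision, which is the intuition behind treating ranking as a binary classification problem over pairs.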

The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions.

Learning to rank is a relatively new field of study, aiming to learn a ranking function from a set of training data with relevancy labels. The ranking algorithms are often evaluated using information retrieval measures, such as Normalized Discounted Cumulative Gain (NDCG) (1) and Mean Average Precision (MAP) (2). Until recently, most learning to rank algorithms were not using a loss function related to the above mentioned evaluation measures. The main difficulty in direct optimization of these measures is that they depend on the ranks of documents, not the numerical values output by the ranking function. We propose a probabilistic framework that addresses this challenge by optimizing the expectation of NDCG over all the possible permutations of documents. A relaxation strategy is used to approximate the average of NDCG over the space of permutation, and a bound optimization approach is proposed to make the computation efficient. Extensive experiments show that the proposed algorithm outperforms state-of-the-art ranking algorithms on several benchmark data sets.

The nDCG measure has proven to be a popular measure of retrieval effectiveness utilizing graded relevance judgments. However, a number of different instantiations of nDCG exist, depending on the arbitrary definition of the gain and discount functions used (1) to dictate the relative value of documents of different relevance grades and (2) to weight the importance of gain values at different ranks, respectively. In this work we discuss how to empirically derive a gain and discount function that optimizes the efficiency or stability of nDCG. First, we describe a variance decomposition analysis framework and an optimization procedure utilized to find the efficiency- or stability-optimal gain and discount functions. Then we use TREC data sets to compare the optimal gain and discount functions to the ones that have appeared in the IR literature with respect to (a) the efficiency of the evaluation, (b) the induced ranking of systems, and (c) the discriminative power of the resulting nDCG measure.

Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature.

We present a theoretical analysis of supervised ranking, providing necessary and sufficient conditions for the asymptotic consistency of algorithms based on minimizing a surrogate loss function. We show that many commonly used surrogate losses are inconsistent; surprisingly, we show inconsistency even in low-noise settings. We present a new value-regularized linear loss, establish its consistency under reasonable assumptions on noise, and show that it outperforms conventional ranking losses in a collaborative filtering experiment.

The paper is concerned with learning to rank, which is to construct a model or a function for ranking objects. Learning to rank is useful for document retrieval, collaborative filtering, and many other applications. Several methods for learning to rank have been proposed, which take object pairs as 'instances' in learning. We refer to them as the pairwise approach in this paper. Although the pairwise approach offers advantages, it ignores the fact that ranking is a prediction task on lists of objects. The paper postulates that learning to rank should adopt the listwise approach in which lists of objects are used as 'instances' in learning. The paper proposes a new probabilistic method for the approach. Specifically it introduces two probability models, respectively referred to as permutation probability and top k probability, to define a listwise loss function for learning. Neural Network and Gradient Descent are then employed as model and algorithm in the learning method. Experimental results on information retrieval show that the proposed listwise approach performs better than the pairwise approach.

This paper aims to conduct a study on the listwise approach to learning to rank. The listwise approach learns a ranking function by taking individual lists as instances and minimizing a loss function defined on the predicted list and the ground-truth list. Existing work on the approach mainly focused on the development of new algorithms; methods such as RankCosine and ListNet have been proposed and good performances by them have been observed. Unfortunately, the underlying theory was not sufficiently studied so far. To amend the problem, this paper proposes conducting theoretical analysis of learning to rank algorithms through investigations on the properties of the loss functions, including consistency, soundness, continuity, differentiability, convexity, and efficiency. A sufficient condition on consistency for ranking is given, which seems to be the first such result obtained in related research. The paper then conducts analysis on three loss functions: likelihood loss, cosine loss, and cross entropy loss. The latter two were used in RankCosine and ListNet. The use of the likelihood loss leads to the development of a new listwise method called ListMLE, whose loss function offers better properties, and also leads to better experimental results.

Several recent studies have demonstrated that the type of improvements in information retrieval system effectiveness reported in forums such as SIGIR and TREC do not translate into a benefit for users. Two of the studies used an instance recall task, and a third used a question answering task, so perhaps it is unsurprising that the precision based measures of IR system effectiveness on one-shot query evaluation do not correlate with user performance on these tasks. In this study, we evaluate two different information retrieval tasks on TREC Web-track data: a precision-based user task, measured by the length of time that users need to find a single document that is relevant to a TREC topic; and, a simple recall-based task, represented by the total number of relevant documents that users can identify within five minutes. Users employ search engines with controlled mean average precision (MAP) of between 55% and 95%. Our results show that there is no significant relationship between system effectiveness measured by MAP and the precision-based task. A significant, but weak relationship is present for the precision at one document returned metric. A weak relationship is present between MAP and the simple recall-based task.

Machine learning is commonly used to improve ranked retrieval systems. Due to computational difficulties, few learning techniques have been developed to directly optimize for mean average precision (MAP), despite its widespread use in evaluating such systems. Existing approaches optimizing MAP either do not find a globally optimal solution, or are computationally expensive. In contrast, we present a general SVM learning algorithm that efficiently finds a globally optimal solution to a straightforward relaxation of MAP. We evaluate our approach using the TREC 9 and TREC 10 Web Track corpora (WT10g), comparing against SVMs optimized for accuracy and ROCArea. In most cases we show our method to produce statistically significant improvements in MAP scores.

We present a model, based on the maximum entropy method, for analyzing various measures of retrieval performance such as average precision, R-precision, and precision-at-cutoffs. Our methodology treats the value of such a measure as a constraint on the distribution of relevant documents in an unknown list, and the maximum entropy distribution can be determined subject to these constraints. For good measures of overall performance (such as average precision), the resulting maximum entropy distributions are highly correlated with actual distributions of relevant documents in lists as demonstrated through TREC data; for poor measures of overall performance, the correlation is weaker. As such, the maximum entropy method can be used to quantify the overall quality of a retrieval measure. Furthermore, for good measures of overall performance (such as average precision), we show that the corresponding maximum entropy distributions can be used to accurately infer precision-recall curves and the values of other measures of performance, and we demonstrate that the quality of these inferences far exceeds that predicted by simple retrieval measure correlation, as demonstrated through TREC data.

This paper presents an experimental study of users assessing the quality of Google web search results. In particular we look at how users' satisfaction correlates with the effectiveness of Google as quantified by IR measures such as precision and the suite of Cumulative Gain measures (CG, DCG, NDCG). Results indicate strong correlation between users' satisfaction, CG and precision, moderate correlation with DCG, with perhaps surprisingly negligible correlation with NDCG. The reasons for the low correlation with NDCG are examined.

The problem of combining preferences arises in several applications, such as combining the results of different search engines. This work describes an efficient algorithm for combining multiple preferences. We first give a formal framework for the problem. We then describe and analyze a new boosting algorithm for combining preferences called RankBoost. We also describe an efficient implementation of the algorithm for a restricted case. We discuss two experiments we carried out to assess the performance of RankBoost. In the first experiment, we used the algorithm to combine different WWW search strategies, each of which is a query expansion for a given domain. For this task, we compare the performance of RankBoost to the individual search strategies. The second experiment is a collaborative-filtering task for making movie recommendations. Here, we present results comparing RankBoost to nearest-neighbor and regression algorithms.

The ROC curve is known to be the golden standard for measuring performance of a test/scoring statistic regarding its capacity of discrimination between two populations in a wide variety of applications, ranging from anomaly detection in signal processing to information retrieval, through medical diagnosis. Most practical performance measures used in scoring applications such as the AUC, the local AUC, the p-norm push, the DCG and others, can be seen as summaries of the ROC curve. This paper highlights the fact that many of these empirical criteria can be expressed as (conditional) linear rank statistics. We investigate the properties of empirical maximizers of such performance criteria and provide preliminary results for the concentration properties of a novel class of random variables that we will call a linear rank process.

While numerous metrics for information retrieval are available in the case of binary relevance, there is only one commonly used metric for graded relevance, namely the Discounted Cumulative Gain (DCG). A drawback of DCG is its additive nature and the underlying independence assumption: a document in a given position has always the same gain and discount independently of the documents shown above it. Inspired by the "cascade" user model, we present a new editorial metric for graded relevance which overcomes this difficulty and implicitly discounts documents which are shown below very relevant documents. More precisely, this new metric is defined as the expected reciprocal length of time that the user will take to find a relevant document. This can be seen as an extension of the classical reciprocal rank to the graded relevance case and we call this metric Expected Reciprocal Rank (ERR). We conduct an extensive evaluation on the query logs of a commercial search engine and show that ERR correlates better with clicks metrics than other editorial metrics.
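The cascade idea behind ERR can be illustrated with a minimal Python sketch. It assumes the commonly cited mapping from a relevance grade g to a stopping probability, (2^g - 1) / 2^g_max; the abstract above does not state this mapping, and the function name `expected_reciprocal_rank` and its parameters are hypothetical.

```python
def expected_reciprocal_rank(grades, max_grade):
    """Sketch of ERR under a cascade user model.

    grades:    relevance grades of the documents, in ranked order.
    max_grade: highest grade on the scale (assumed known).
    Assumption: a document with grade g satisfies the user with
    probability (2**g - 1) / 2**max_grade.
    """
    p_reach = 1.0  # probability that the user reaches the current rank
    total = 0.0
    for rank, grade in enumerate(grades, start=1):
        p_stop = (2 ** grade - 1) / (2 ** max_grade)
        # The user stops here with probability p_reach * p_stop,
        # and the reward for stopping at this rank is 1 / rank.
        total += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return total
```

Unlike DCG, the contribution of a document here depends on the documents ranked above it: a highly relevant document near the top lowers `p_reach` for everything below.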

Many of the classification algorithms developed in the machine learning literature, including the support vector machine and boosting, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. The convexity makes these algorithms computationally efficient. The use of a surrogate, however, has statistical consequences that must be balanced against the computational virtues of convexity. To study these issues, we provide a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial upper bounds on excess risk under the weakest possible condition on the loss function: that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a simple variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise.

Given the size of the web, the search engine industry has argued that engines should be evaluated by their ability to retrieve highly relevant pages rather than all possible relevant pages. To explore the role highly relevant documents play in retrieval system evaluation, assessors for the TREC-9 web track used a three-point relevance scale and also selected best pages for each topic. The relative effectiveness of runs evaluated by different relevant document sets differed, confirming the hypothesis that different retrieval techniques work better for retrieving highly relevant documents. Yet evaluating by highly relevant documents can be unstable since there are relatively few highly relevant documents. TREC assessors frequently disagreed in their selection of the best page, and subsequent evaluation by best page across different assessors varied widely. The discounted cumulative gain measure introduced by Järvelin and Kekäläinen increases evaluation stability by incorporating all relevance judgments while still giving precedence to highly relevant documents.

The problem of ranking/ordering instances, instead of simply classifying them, has recently gained much attention in machine learning. In this paper we formulate the ranking problem in a rigorous statistical framework. The goal is to learn a ranking rule for deciding, among two instances, which one is "better," with minimum ranking risk. Since the natural estimates of the risk are of the form of a U-statistic, results of the theory of U-processes are required for investigating the consistency of empirical risk minimizers. We establish in particular a tail inequality for degenerate U-processes, and apply it for showing that fast rates of convergence may be achieved under specific noise assumptions, just like in classification. Convex risk minimization methods are also studied.

We study surrogate losses for learning to rank, in a framework where the rankings are induced by scores and the task is to learn the scoring function. We focus on the calibration of surrogate losses with respect to a ranking evaluation metric, where the calibration is equivalent to the guarantee that near-optimal values of the surrogate risk imply near-optimal values of the risk defined by the evaluation metric. We prove that if a surrogate loss is a convex function of the scores, then it is not calibrated with respect to two evaluation metrics widely used for search engine evaluation, namely the Average Precision and the Expected Reciprocal Rank. We also show that such convex surrogate losses cannot be calibrated with respect to the Pairwise Disagreement, an evaluation metric used when learning from pair-wise preferences. Our results cast lights on the intrinsic difficulty of some ranking problems, as well as on the limitations of learning-to-rank algorithms based on the minimization of a convex surrogate risk.

We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different term than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter.
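For orientation, the empirical AUC discussed above can be computed as the fraction of positive-negative pairs that a scoring function orders correctly. The sketch below is a minimal Python illustration; the function name `empirical_auc` is hypothetical, and counting ties as half a correct pair is a convention assumed here rather than something the abstract specifies.

```python
def empirical_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly.

    pos_scores: scores assigned to positive instances.
    neg_scores: scores assigned to negative instances.
    Assumption: a tied pair counts as half a correct pair.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))
```

The quadratic loop is kept for clarity; a sort-based computation would give the same value more efficiently on large samples.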

We address the problem of designing surrogate losses for learning scoring functions in the context of label ranking. We extend to ranking problems a notion of order-preserving losses previously introduced for multiclass classification, and show that these losses lead to consistent formulations with respect to a family of ranking evaluation metrics. An order-preserving loss can be tailored for a given evaluation metric by appropriately setting some weights depending on this metric and the observed supervision. These weights, called the standard form of the supervision, do not always exist, but we show that previous consistency results for ranking were proved in special cases where they do. We then evaluate a new pairwise loss consistent with the (Normalized) Discounted Cumulative Gain on benchmark datasets.

This paper proposes evaluation methods based on the use of non-dichotomous relevance judgements in IR experiments. It is argued that evaluation methods should credit IR methods for their ability to retrieve highly relevant documents. This is desirable from the user point of view in modern large IR environments. The proposed methods are (1) a novel application of P-R curves and average precision computations based on separate recall bases for documents of different degrees of relevance, and (2) two novel measures computing the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. We then demonstrate the use of these evaluation methods in a case study on the effectiveness of query types, based on combinations of query structures and expansion, in retrieving documents of various degrees of relevance. The test was run with a best match retrieval system (In-Query1) in a text database consisting of newspaper articles. The results indicate that the tested strong query structures are most effective in retrieving highly relevant documents. The differences between the query types are practically essential and statistically significant. More generally, the novel evaluation methods and the case demonstrate that non-dichotomous relevance assessments are applicable in IR experiments, may reveal interesting phenomena, and allow harder testing of IR methods.

This paper describes how the Bootstrap approach to statistics can be applied to the evaluation of IR effectiveness metrics. First, we argue that Bootstrap Hypothesis Tests deserve more attention from the IR community, as they are based on fewer assumptions than traditional statistical significance tests. We then describe straightforward methods for comparing the sensitivity of IR metrics based on Bootstrap Hypothesis Tests. Unlike the heuristics-based "swap" method proposed by Voorhees and Buckley, our method estimates the performance difference required to achieve a given significance level directly from Bootstrap Hypothesis Test results. In addition, we describe a simple way of examining the accuracy of rank correlation between two metrics based on the Bootstrap Estimate of Standard Error. We demonstrate the usefulness of our methods using test collections and runs from the NTCIR CLIR track for comparing seven IR metrics, including those that can handle graded relevance and those based on the Geometric Mean.

The ranking problem has become increasingly important in modern applications of statistical methods in automated decision making systems. In particular, we consider a formulation of the statistical ranking problem which we call subset ranking, and focus on the DCG (discounted cumulated gain) criterion that measures the quality of items near the top of the rank-list. Similar to error minimization for binary classification, direct optimization of natural ranking criteria such as DCG leads to non-convex optimization problems that can be NP-hard. Therefore a computationally more tractable approach is needed. We present bounds that relate the approximate optimization of DCG to the approximate minimization of certain regression errors. These bounds justify the use of convex learning formulations for solving the subset ranking problem. The resulting estimation methods are not conventional, in that we focus on the estimation quality in the top-portion of the rank-list. We further investigate the asymptotic statistical behavior of these formulations. Under appropriate conditions, the consistency of the estimation schemes with respect to the DCG metric can be derived.

Modern large retrieval environments tend to overwhelm their users by their large output. Since all documents are not of equal relevance to their users, highly relevant documents should be identified and ranked first for presentation. In order to develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents. This can be done by extending traditional evaluation methods, that is, recall and precision based on binary relevance judgments, to graded relevance judgments. Alternatively, novel measures based on graded relevance judgments may be developed. This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. The first one accumulates the relevance scores of retrieved documents along the ranked result list. The second one is similar but applies a discount factor to the relevance scores in order to devaluate late-retrieved documents. The third one computes the relative-to-the-ideal performance of IR techniques, based on the cumulative gain they are able to yield. These novel measures are defined and discussed and their use is demonstrated in a case study using TREC data: sample system run results for 20 queries in TREC-7. As a relevance base we used novel graded relevance judgments on a four-point scale. The test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences. The graphs based on the measures also provide insight into the performance of IR techniques and allow interpretation, for example, from the user point of view.
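The three measures described above (accumulated gain, discounted gain, and gain relative to the ideal ordering) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the article's exact definition: the gain is taken to be the raw relevance score, the discount is the widely used 1 / log2(rank + 1) variant, and the function names are hypothetical.

```python
import math


def cumulative_gain(relevances, k=None):
    """Sum of relevance scores over the top-k positions (all if k is None)."""
    top = relevances if k is None else relevances[:k]
    return sum(top)


def dcg(relevances, k=None):
    """Discounted cumulative gain.

    Assumption: the discount at rank r is 1 / log2(r + 1), a common
    variant; other discount functions are possible.
    """
    top = relevances if k is None else relevances[:k]
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(top, start=1))


def ndcg(relevances, k=None):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    # Convention assumed here: return 0.0 when no document is relevant.
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

With these definitions an ideally ordered list scores an NDCG of 1, and any ordering that places relevant documents lower scores strictly less, because the discount shrinks with rank.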

The purpose of this paper is to investigate statistical properties of risk minimization based multi-category classification methods. These methods can be considered as natural extensions of binary large margin classification. We establish conditions that guarantee the consistency of classifiers obtained in the risk minimization framework with respect to the classification error. Examples are provided for four specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information can be useful for statistical inferencing tasks beyond classification.

We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This “push” near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose ℓp-norms to provide a specific type of push; if the user sets p larger, the objective concentrates harder on the top of the list. We derive a generalization bound based on the p-norm objective, working around the natural asymmetry of the problem. We then derive a boosting-style algorithm for the problem of ranking with a push at the top. The usefulness of the algorithm is illustrated through experiments on repository data. We prove that the minimizer of the algorithm’s objective is unique in a specific sense. Furthermore, we illustrate how our objective is related to quality measurements for information retrieval.

Translation of: Sviluppi in serie di funzioni ortogonali. Contents: Expansions in series of orthogonal functions and preliminary notions of Hilbert spaces; Expansions in Fourier series; Expansions in series of Legendre polynomials and spherical harmonics; Expansions in Laguerre and Hermite series.

Discriminative models have been preferred over generative models in many machine learning problems in the recent past owing to some of their attractive theoretical properties. In this paper, we explore the applicability of discriminative classifiers for IR. We have compared the performance of two popular discriminative models, namely the maximum entropy model and support vector machines with that of language modeling, the state-of-the-art generative model for IR. Our experiments on ad-hoc retrieval indicate that although maximum entropy is significantly worse than language models, support vector machines are on par with language models. We argue that the main reason to prefer SVMs over language models is their ability to learn arbitrary features automatically as demonstrated by our experiments on the home-page finding task of TREC-10.

We study how close the optimal Bayes error rate can be approximately reached using a classification algorithm that computes a classifier by minimizing a convex upper bound of the classification error function. The measurement of closeness is characterized by the loss function used in the estimation. We show that such a classification scheme can be generally regarded as a (non maximum-likelihood) conditional in-class probability estimate, and we use this analysis to compare various convex loss functions that have appeared in the literature. Furthermore, the theoretical insight allows us to design good loss functions with desirable properties. Another aspect of our analysis is to demonstrate the consistency of certain classification methods using convex risk minimization.

This paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. Intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. While previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. This makes them difficult and expensive to apply. The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. Such clickthrough data is available in abundance and can be recorded at very low cost. Taking a Support Vector Machine (SVM) approach, this paper presents a method for learning retrieval functions. From a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. Furthermore, it is shown to be feasible even for large sets of queries and features. The theoretical results are verified in a controlled experiment. It shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming Google in terms of retrieval quality after only a couple of hundred training examples.

We discuss the problem of ranking instances. In our framework each instance is associated with a rank or a rating, which is an integer from 1 to k. Our goal is to find a rank-prediction rule that assigns each instance a rank which is as close as possible to the instance's true rank. We describe a simple and efficient online algorithm, analyze its performance in the mistake bound model, and prove its correctness. We describe two sets of experiments, with synthetic data and with the EachMovie dataset for collaborative filtering. In the experiments we performed, our algorithm outperforms online algorithms for regression and classification applied to ranking.

R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval, volume 82. Addison-Wesley New York, 1999.

P. Ravikumar, A. Tewari, and E. Yang. On NDCG consistency of listwise ranking methods. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, AISTATS, 2011.