Conference Paper

Revisiting the Performance of iALS on Item Recommendation Benchmarks

... Reproducibility studies in [18,19,54] showed that sometimes even decade-old and conceptually quite simple methods, when properly tuned, almost consistently outperformed the most recent deep learning models of the time for top-n recommendation tasks. Later research then re-assessed the effectiveness of widely used matrix factorization models that are more than fifteen years old, finding that they are still competitive with models that would be considered the state-of-the-art today [48,55]. Similar findings were also reported for sequential recommendation models [43], models based on architectures like Graph Neural Networks (GNNs) [3,17,60], and even in other areas of applied machine learning like time-series forecasting [46]. ...
... Later on, further research works 'revisited' the effectiveness of traditional matrix factorization models. In [55], Rendle et al. reassessed the effectiveness of iALS from 2008 for implicit feedback datasets. They found that iALS, if properly tuned, was still competitive with much more recent models and at the same time benefited from better scalability. ...
... Otherwise, the expressive power of certain models, e.g., traditional ones based on matrix factorization, may be artificially constrained, making it easier to demonstrate that a newly proposed model "outperforms the state-of-the-art." As a concrete example, Rendle et al. [55] showed that iALS can achieve very competitive effectiveness with an embedding size in the thousands, which is in striking contrast to the commonly used sizes of 64 or 128. ...
Preprint
Full-text available
Countless new machine learning models are published every year and are reported to significantly advance the state-of-the-art in top-n recommendation. However, earlier reproducibility studies indicate that progress in this area may be quite limited. Specifically, various widespread methodological issues, e.g., comparisons with untuned baseline models, have led to an illusion of progress. In this work, our goal is to examine whether these problems persist in today's research. To this end, we aim to reproduce the latest advancements reported from applying modern Denoising Diffusion Probabilistic Models to recommender systems, focusing on four models published at the top-ranked SIGIR conference in 2023 and 2024. Our findings are concerning, revealing persistent methodological problems. Alarmingly, through experiments, we find that the latest recommendation techniques based on diffusion models, despite their computational complexity and substantial carbon footprint, are consistently outperformed by simpler existing models. Furthermore, we identify key mismatches between the characteristics of diffusion models and those of the traditional top-n recommendation task, raising doubts about their suitability for recommendation. We also note that, in the papers we analyze, the generative capabilities of these models are constrained to a minimum. Overall, our results and continued methodological issues call for greater scientific rigor and a disruptive change in the research and publication culture in this area.
... In this paper, we focus on whole-data models with weighted square loss. Weighted Matrix Factorization (WMF), also called iALS [14,20], pioneered this class of models and is still known to achieve competitive results while having highly scalable learning and prediction routines [22]. After its introduction, many extensions were proposed, among which are three variants for context-aware recommender systems (CARS) [5,10,11], where each variant uses a different tensor decomposition method. ...
... To keep the notation simple, we depicted a single scalar parameter for regularizing all factors. In practice, it is often found that scaling the strength of the regularization with the number of observations improves performance [22]. ...
... Hence, we adapted the exponent hyperparameter in [0, 1] of [22], which is applied to the sum of the weights associated with the factor in the loss. Combined with the global regularization scalar, this exponent determines the regularization strength per factor. ...
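As an illustration of the frequency-scaled regularization described in the snippet above, the following sketch computes a per-factor regularization strength from the sum of observation weights. The names lam (global scalar) and nu (exponent) are placeholders introduced here, since the snippet's original symbols were lost in extraction.

```python
import numpy as np

def per_factor_regularization(weight_sums, lam, nu):
    """Per-factor regularization strength, as a rough sketch.

    weight_sums : for each user/item factor, the sum of the loss weights of
                  the observations that involve that factor
    lam         : global regularization scalar (placeholder name)
    nu          : exponent in [0, 1]; nu = 0 gives uniform regularization,
                  nu = 1 scales linearly with the accumulated weights
    """
    return lam * np.power(np.asarray(weight_sums, dtype=float), nu)

# Example: factors with accumulated observation weights 5, 50 and 500
print(per_factor_regularization([5, 50, 500], lam=0.1, nu=0.5))
```

With nu between 0 and 1, heavily observed factors are regularized more strongly, but sub-linearly in their accumulated weight.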
Preprint
Full-text available
Over recent years it has become well accepted that user interest is not static or immutable. There are a variety of contextual factors, such as time of day, the weather or the user's mood, that influence the current interests of the user. Modelling approaches need to take these factors into account if they want to succeed at finding the most relevant content to recommend given the situation. A popular method for context-aware recommendation is to encode context attributes as extra dimensions of the classic user-item interaction matrix, effectively turning it into a tensor, followed by applying the appropriate tensor decomposition methods to learn missing values. However, unlike with matrix factorization, where all decompositions are essentially a product of matrices, there exist many more options for decomposing tensors by combining vector, matrix and tensor products. We study the most successful decomposition methods that use weighted square loss and categorize them based on their tensor structure and regularization strategy. Additionally, we further extend the pool of methods by filling in the missing combinations. In this paper we provide an overview of the properties of the different decomposition methods, such as their complexity, scalability, and modelling capacity. These benefits are then contrasted with the performances achieved in offline experiments to gain more insight into which method to choose depending on a specific situation and constraints.
... To deal with the aforementioned challenges, aside from standardized benchmarks and evaluation criteria, research communities have devoted efforts to developing and optimizing simple and stronger baseline models, such as the latest effort on fine-tuning iALS [67,68] and SimpleX (CCL) [3]. In the meantime, researchers [4] have started to theoretically analyze and compare different models, such as matrix factorization (iALS) ...
... How can we debias the L2 (mean-squared-error, MSE) or L1 (mean-absolute-error, MAE) loss function with respect to the contrastive learning loss? Following this, do the well-known linear models, such as iALS [36,67,68] and EASE [5], need to be debiased with respect to contrastive learning losses? ...
... Besides the typical listwise (softmax, InfoNCE) and pairwise (BPR) losses, another type of recommendation loss is the pointwise loss. Most of the linear (non-deep) models are based on pointwise losses, including the well-known iALS [36,67], EASE [5], and the latest CCL [3]. Here, these loss functions aim to pull the estimated score ŷ_ui closer to its default score r_ui. ...
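For reference, a common instantiation of such a pointwise objective is the weighted squared loss used by iALS-style models; the exact weighting scheme differs between the papers cited above, so this is only the generic form:

$$\mathcal{L}_{\text{pointwise}} \;=\; \sum_{u,i} w_{ui}\,\bigl(\hat{y}_{ui} - r_{ui}\bigr)^{2},$$

where r_ui is the default score (e.g., 1 for an observed interaction and 0 otherwise) and w_ui is a confidence weight that typically down-weights the unobserved entries.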
Preprint
Full-text available
Recommender systems have become increasingly important with the rise of the web as a medium for electronic and business transactions. One of the key drivers of this technology is the ease with which users can provide feedback about their likes and dislikes through simple clicks of a mouse. This feedback is commonly collected in the form of ratings, but can also be inferred from a user's browsing and purchasing history. Recommender systems utilize users' historical data to infer customer interests and provide personalized recommendations. The basic principle of recommendations is that significant dependencies exist between user- and item-centric activity, which can be learned in a data-driven manner to make accurate predictions. Collaborative filtering is one family of recommendation algorithms that uses ratings from multiple users to predict missing ratings or uses binary click information to predict potential clicks. However, recommender systems can be more complex and incorporate auxiliary data such as content-based attributes, user interactions, and contextual information.
... Considering that the performance of most recommendation models is highly dependent on finding the right hyperparameters [21], we believe this to be the fairest comparison. Hyperparameter tuning often requires an extensive understanding of the underlying algorithms [20,21]. Additionally, when reproducing algorithms, implementation errors or misinterpretations of the original method can also cause major differences in the measured performance [7]. ...
... First, we observe that LogWMF and LogEASE outperform their linear counterparts (iALS and EASE) on both datasets. On the MSD dataset, LogEASE is even the best-performing algorithm of all the baselines reported in [20]. Though the difference in metrics is not large, we can still draw two conclusions from this result. ...
... For this model there are two variants: one with uniform regularization (no scaling), and one with frequency scaling. The results in Figure 2 were obtained on the test set with the hyperparameters of Table 1, as determined on the validation set with an embedding dimension of 4096 for ML20M and 8192 for the MSD dataset, similar to [20]. [24] Because the hyperparameters were not tuned independently of the embedding dimension, some models may achieve slightly higher results at lower dimensions. ...
Conference Paper
Full-text available
Matrix factorization is a well-known and effective methodology for top-k list recommendation. It became widely known during the Netflix challenge in 2006, and since then, many adapted and improved versions have been published. A particularly interesting matrix factorization algorithm called iALS (for implicit Alternating Least Squares) adapts the method for implicit feedback, i.e. a setting where only a very small number of positive labels are available along with a majority of unknown labels. Compared to the classical task of rating prediction, learning from implicit feedback is applicable to many more domains, as the data is more abundant and requires less effort to elicit from users. However, the sparsity, imbalance, and implicit nature of the signal also pose unique challenges to retrieving the most relevant items to recommend. We revisit the role of unknown interactions in implicit matrix factorization. Traditionally, all unknowns are interpreted as negative samples and their importance in the training objective is then down-weighted to balance them out with the known, positive interactions. Interestingly, by adapting a probabilistic view of matrix factorization, we can retain the unknown nature of these interactions by modelling them as either positive or negative. With this new formulation that better fits the underlying data, we gain improved performance on the downstream recommendation task without any computational overhead compared to the popular iALS method. This paper outlines the key insights needed to adapt iALS to use logistic regression. Furthermore, a logistic version of the popular full-rank EASE model is introduced in a similar fashion. An extensive experimental evaluation on several real-world datasets demonstrates the effectiveness of our approach. Additionally, a discrepancy between the need for weighting between factorization and autoencoder models is discovered, leading towards a better understanding of these methods.
... It has also been a workhorse model in many industrial applications due to its simplicity, scalability, and strong performance [7]. Despite the invention of more powerful deep-learning models, IMF has proven to be a strong baseline model for collaborative filtering recommender systems, provided hyper-parameters are properly tuned [27]. However, this implies that all the user data is located on one server (or cluster) and that the data comes from a single domain. ...
... We adapt the solver proposed by Rendle et al. [27] to solve the optimization problem in step 4 of Algorithm (1). We chose that notation of ALS since it helps to guide the hyperparameter search iteratively rather than jointly. ...
... , where is the embedding size of the factors. It helps to control the values of the dot products of factors at the beginning of training [27]. ...
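A minimal sketch of this kind of scaled initialization is given below; the function and parameter names are illustrative, and only the scaling of the standard deviation by one over the square root of the embedding size reflects the idea described in the snippet.

```python
import numpy as np

def init_factors(num_rows, emb_size, sigma=0.1, seed=0):
    # Scale the standard deviation by 1/sqrt(emb_size) so that the typical
    # magnitude of user-item dot products stays roughly constant at the
    # beginning of training, independent of the embedding size.
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, sigma / np.sqrt(emb_size), size=(num_rows, emb_size))

U = init_factors(num_rows=1000, emb_size=256)  # user factors
V = init_factors(num_rows=5000, emb_size=256)  # item factors
```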
Preprint
Full-text available
Data sparsity has been one of the long-standing problems for recommender systems. One of the solutions to mitigate this issue is to exploit knowledge available in other source domains. However, many cross-domain recommender systems introduce a complex architecture that makes them less scalable in practice. On the other hand, matrix factorization methods are still considered to be strong baselines for single-domain recommendations. In this paper, we introduce CDIMF, a model that extends the standard implicit matrix factorization with ALS to cross-domain scenarios. We apply the Alternating Direction Method of Multipliers to learn shared latent factors for overlapped users while factorizing the interaction matrix. In a dual-domain setting, experiments on industrial datasets demonstrate competitive performance of CDIMF for both cold-start and warm-start. The proposed model can outperform most other recent cross-domain and single-domain models. We also provide the code to reproduce experiments on GitHub.
... Thus, despite its widespread adoption, the BPR model might face similar challenges. Moreover, careful tuning of hyperparameters in well-established matrix factorization-based models has been shown to achieve performance near state-of-the-art (SOTA) methods [15,51]. This aspect further emphasizes the importance of a thorough evaluation of BPR, as many recent papers report a subpar performance of this model in several evaluation settings. ...
... Despite recent reviews of several popular models [15,23,47,51], a comprehensive reproducibility study of the BPR model has yet to be conducted. The literature employing BPR as a baseline often omits a detailed description of its implementation, with key features such as sampling methods, learnable item biases, optimizer selection, and separate regularization factors missing from many open-source frameworks. ...
... Another comprehensive study [47] revisited BERT4Rec [56], where the authors found that various implementations yielded different quality metrics. Similarly, a reproducibility study on iALS [26] reported that minor changes to the original implementation greatly improved quality, emphasizing the importance of hyperparameter tuning [51]. However, to our knowledge, no study has yet re-examined BPR, despite it being a highly cited model that is often used as a baseline. ...
Preprint
Full-text available
Bayesian Personalized Ranking (BPR), a collaborative filtering approach based on matrix factorization, frequently serves as a benchmark for recommender systems research. However, numerous studies often overlook the nuances of BPR implementation, claiming that it performs worse than newly proposed methods across various tasks. In this paper, we thoroughly examine the features of the BPR model, indicating their impact on its performance, and investigate open-source BPR implementations. Our analysis reveals inconsistencies between these implementations and the original BPR paper, leading to a significant decrease in performance of up to 50% for specific implementations. Furthermore, through extensive experiments on real-world datasets under modern evaluation settings, we demonstrate that with proper tuning of its hyperparameters, the BPR model can achieve performance levels close to state-of-the-art methods on the top-n recommendation tasks and even outperform them on specific datasets. Specifically, on the Million Song Dataset, the BPR model with hyperparameter tuning statistically significantly outperforms Mult-VAE by 10% in NDCG@100 with a binary relevance function.
... Providing optimal hyperparameters for datasets and/or giving a general guide on how to get a good parameterization is important for having strong baselines [5,7,21]. • Evaluation setups: data preprocessing, offline metrics, and executing appropriate experiments can make or break the evaluation part of a research project. ...
... Many algorithms might be affected by incorrect unofficial implementations. To somewhat alleviate the harm, we release feature-complete reimplementations of GRU4Rec for both TensorFlow and PyTorch that are validated against the official version. ...
... The latter was also examined in detail in [21] through the example of the now 15-year-old iALS algorithm. With the appropriate parameterization, iALS can be a much stronger baseline, capable of beating more modern algorithms. ...
Preprint
Reproducibility of recommender systems research has come under scrutiny during recent years. Along with works focusing on repeating experiments with certain algorithms, the research community has also started discussing various aspects of evaluation and how these affect reproducibility. We add a novel angle to this discussion by examining how unofficial third-party implementations could benefit or hinder reproducibility. Besides giving a general overview, we thoroughly examine six third-party implementations of a popular recommender algorithm and compare them to the official version on five public datasets. In the light of our alarming findings we aim to draw the attention of the research community to this neglected aspect of reproducibility.
... Additionally, we also considered Enhancing VAEs for Collaborative Filtering (EVCF) [24], an enhanced VAE model for collaborative filtering, which adopts a flexible prior and gating mechanism to enhance the Gaussian prior and encoder in the original Multi-VAE. Finally, we considered implicit alternating least squares (iALS), a collaborative filtering model proposed by Rendle et al. [25]. With proper tuning, this model outperforms iALS with its conventional hyperparameters. ...
... [Table excerpt, three metric columns per model: WMF [22] (0.3836, 0.5139, 0.4112), Multi-DAE [13] (0.3854, 0.5202, 0.4129), Multi-VAE [13] (0.3879, 0.5216, 0.4160), RaCT [23] (0.3942, 0.5272, 0.4242), EVCF [24] (0.4167, 0.5518, 0.4451); the iALS [25] row is truncated in the source.] In particular, User Profile (Categories + Images) outperforms other user profile approaches and shows significant improvements across all metrics when compared to the RecVAE base model, demonstrating increases of +0.0860 in NDCG@100, +0.0746 in Recall@20, and +0.0562 in Recall@50. These improvements suggest that incorporating both genres and visual information into the user profile enables our model to better capture user preferences and generate more accurate recommendations. ...
Article
Full-text available
We propose a novel recommendation model for diversifying furniture recommendations and aligning them more closely with user preferences. Our model builds upon the Recommender Variational Autoencoder (RecVAE), known for its effectiveness and ability to overcome overfitting by linking user feedback with user representation. However, since RecVAE relies on implicit feedback data, it tends to exhibit bias towards popular items, potentially creating a recommendation filter bubble. While previous work has proposed user profiles learned from a user’s personal information and the textual data of an item, we propose user profiles generated from item image data, motivated by the visual points of interest when selecting items in e-commerce and the ease of data acquisition. We hypothesize that to capture user preferences and provide tailored furniture recommendations accurately, it is essential to incorporate both reviewed text information and visual data on furniture pieces. To utilize user preferences well, we incorporate the Conditional Variational Autoencoder (CVAE) architecture, where both the encoder and decoder are conditioned on a user profile indicating the user’s preference information. Additionally, the user profile is trained to capture the user’s preference for a specific predefined style. We trained our models using MovieLens-20M and the Amazon Furniture Review Dataset, a new dataset dedicated to furniture recommendations. As a result, on both datasets, our model outperformed previous models, including RecVAE. These findings show the effectiveness of our user profile approach in diversifying and personalizing furniture recommendations.
... Despite the increased sophistication of IARS models and recommendation models in general, and despite the substantial computational complexity and carbon footprint of such models [43,49], a number of recent studies have indicated that more complex models are not necessarily more effective than longer-existing simpler ones. Ferrari Dacrema et al. [10,11] for example found that in almost all cases they examined, a recent neural recommendation model was outperformed by traditional algorithms based, e.g., on matrix factorization. Similar observations of the competitive performance of traditional learning techniques were later reported also in [1,32,36,37]. For the case of sequential recommendation settings, related reports on the effectiveness of simple models, also in practical settings, can be found in [16,18,20,27,41]. ...
... In terms of the baselines considered in our study, we so far only included EASE as a non-neural model-based technique. The exploration of traditional models based on matrix factorization, like ALS or BPR, which both were found to be competitive with recent models [32,37], is part of our ongoing and future work. ...
Preprint
Full-text available
Lately, we have observed a growing interest in intent-aware recommender systems (IARS). The promise of such systems is that they are capable of generating better recommendations by predicting and considering the underlying motivations and short-term goals of consumers. From a technical perspective, various sophisticated neural models were recently proposed in this emerging and promising area. In the broader context of complex neural recommendation models, a growing number of research works unfortunately indicates that (i) reproducing such works is often difficult and (ii) that the true benefits of such models may be limited in reality, e.g., because the reported improvements were obtained through comparisons with untuned or weak baselines. In this work, we investigate if recent research in IARS is similarly affected by such problems. Specifically, we tried to reproduce five contemporary IARS models that were published in top-level outlets, and we benchmarked them against a number of traditional non-neural recommendation models. In two of the cases, running the provided code with the optimal hyperparameters reported in the paper did not yield the results reported in the paper. Worryingly, we find that all examined IARS approaches are consistently outperformed by at least one traditional model. These findings point to sustained methodological issues and to a pressing need for more rigorous scholarly practices.
... [17], factorization models, e.g. [24], to sequential models, e.g. [35], latent generative models, e.g. ...
... The Netflix competition and subsequent research works on available public datasets have shown that SLIM [17] and matrix factorization techniques tend to outperform neighbourhood methods [14,28]. A push for neural approximations of matrix factorization took place [8,3], with the conclusion that the proposed MLP architectures fail to learn a better nonlinear variant of the dot product and are outperformed by careful implementations of matrix factorization [5,23,24]. More recently, some neural approaches have obtained state-of-the-art results. ...
Preprint
Latent variable collaborative filtering methods have been a standard approach to modelling user-click interactions due to their simplicity and effectiveness. However, there is limited work on analyzing the mathematical properties of these methods, in particular on preventing overfitting towards the identity, and such methods typically utilize loss functions that overlook the geometry between items. In this work, we introduce a notion of generalization gap in collaborative filtering and analyze this with respect to latent collaborative filtering models. We present a geometric upper bound that gives rise to loss functions, and a way to meaningfully utilize the geometry of item-metadata to improve recommendations. We show how these losses can be minimized and give the recipe for a new latent collaborative filtering algorithm, which we refer to as GeoCF, due to the geometric nature of our results. We then show experimentally that our proposed GeoCF algorithm can outperform all other existing methods on the Movielens20M and Netflix datasets, as well as two large-scale internal datasets. In summary, our work proposes a theoretically sound method which paves a way to better understand generalization of collaborative filtering at large.
... Efficiently representing users and items in recommender systems is a rich field with years of work on traditional techniques such as Matrix Factorization [30,44]. When applying LLMs to recommender systems, users and items are key objects, and it is critical for LLMs to be able to understand them. ...
... The training target for each example is the ground truth item id. For training inputs, we append each item's random indexing id with its behavioral embedding, which is computed using the iALS matrix factorization algorithm [44] on the user sequence training set. We use the provided train, development, and test split in the OpenP5 dataset, which uses the last item in the user sequence for testing and the second from the last item in the user sequence for development. ...
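As a rough illustration of how such behavioural item embeddings can be obtained with an iALS-style factorization, the sketch below uses the open-source implicit library on a synthetic interaction matrix; the library choice, data, and hyperparameters are assumptions for illustration only and are not taken from the paper.

```python
import numpy as np
import scipy.sparse as sp
from implicit.als import AlternatingLeastSquares  # open-source iALS implementation

# Synthetic placeholder data: rows = users, columns = items.
user_items = sp.random(1000, 500, density=0.01, format="csr", dtype=np.float32)

model = AlternatingLeastSquares(factors=64, regularization=0.05, iterations=20)
model.fit(user_items)  # recent library versions expect a user-by-item matrix

item_embeddings = model.item_factors  # one behavioural embedding per item
print(item_embeddings.shape)          # (500, 64)
```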
Preprint
Large-language Models (LLMs) have been extremely successful at tasks like complex dialogue understanding, reasoning and coding due to their emergent abilities. These emergent abilities have been extended with multi-modality to include image, audio, and video capabilities. Recommender systems, on the other hand, have been critical for information seeking and item discovery needs. Recently, there have been attempts to apply LLMs for recommendations. One difficulty of current attempts is that the underlying LLM is usually not trained on the recommender system data, which largely contains user interaction signals and is often not publicly available. Another difficulty is user interaction signals often have a different pattern from natural language text, and it is currently unclear if the LLM training setup can learn more non-trivial knowledge from interaction signals compared with traditional recommender system methods. Finally, it is difficult to train multiple LLMs for different use-cases, and to retain the original language and reasoning abilities when learning from recommender system data. To address these three limitations, we propose an Item-Language Model (ILM), which is composed of an item encoder to produce text-aligned item representations that encode user interaction signals, and a frozen LLM that can understand those item representations with preserved pretrained knowledge. We conduct extensive experiments which demonstrate both the importance of the language-alignment and of user interaction knowledge in the item encoder.
... As we can see from the figure, a larger embedding size generally positively affects the model performance. This result echoes similar findings of a recent reproducibility paper [47]; however, interestingly, in the RecJPQ case, increasing embedding dimensionality does not change the amount of information we store per item, as it is defined by the code length rather than the embedding size. Instead, it increases model capacity, increasing the amount of information that can be stored in each sub-item embedding, allowing the model to account for more item characteristics. ...
... Existing embedding compression methods have limitations, leading to our proposed method, RecJPQ. Our evaluation of RecJPQ on three datasets resulted in significant model size reduction, e.g., 47.94× compression of the SASRec model on the Gowalla dataset. ...
Conference Paper
Sequential Recommendation is a popular recommendation task that uses the order of user-item interaction to model evolving users' interests and sequential patterns in their behaviour. Current state-of-the-art Transformer-based models for sequential recommendation, such as BERT4Rec and SASRec, generate sequence embeddings and compute scores for catalogue items, but the increasing catalogue size makes training these models costly. The Joint Product Quantisation (JPQ) method, originally proposed for passage retrieval, markedly reduces the size of the retrieval index with minimal effect on model effectiveness, by replacing passage embeddings with a limited number of shared sub-embeddings. This paper introduces RecJPQ, a novel adaptation of JPQ for sequential recommendations, which takes the place of the item embeddings tensor and replaces item embeddings with a concatenation of a limited number of shared sub-embeddings and, therefore, limits the number of learnable model parameters. The main idea of RecJPQ is to split items into sub-item entities before training the main recommendation model, which is inspired by splitting words into tokens and training tokenisers in language models. We apply RecJPQ to SASRec, BERT4Rec, and GRU4rec models on three large-scale sequential datasets. Our results showed that RecJPQ could notably reduce the model size (e.g., 48× reduction for the Gowalla dataset with no effectiveness degradation). RecJPQ can also improve model performance through a regularisation effect (e.g. +0.96% NDCG@10 improvement on the Booking.com dataset). Overall, RecJPQ allows the training of state-of-the-art transformer recommenders in industrial applications, where datasets with millions of items are common.
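The core idea of representing an item by a concatenation of shared sub-embeddings can be sketched as follows; the code structure (number of splits, codebook sizes) is hypothetical and only illustrates the mechanism described in the abstract.

```python
import numpy as np

def item_embedding_from_codes(item_codes, codebooks):
    """Concatenate the sub-embeddings selected by an item's discrete codes.

    item_codes : G integer codes assigned to one item (one per split)
    codebooks  : list of G arrays, each of shape (num_codes, sub_dim)

    Only the G small codebooks are learned, instead of one embedding row
    per catalogue item.
    """
    return np.concatenate([codebooks[g][c] for g, c in enumerate(item_codes)])

# Toy setup: 4 splits, 256 codes each, sub-dimension 32 -> 128-d item embedding
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 32)) for _ in range(4)]
print(item_embedding_from_codes([3, 17, 250, 42], codebooks).shape)  # (128,)
```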
... Similar observations were made in the area of recommender systems [9,23], as well as in other fields of applied machine learning, e.g., in time series forecasting [21]. In these and in several other works it turned out that the latest published models are in fact often not outperforming existing models and sometimes conceptually simple or longer-known methods can reach at least similar performance levels, at least in offline evaluations [19,24]. ...
... We reiterate here that the ranking of the tuned models in Table 4 is not important, because we limited our search for hyperparameter ranges to typical values, and we limited the number of tuning iterations for the computationally complex models. Exploring alternative or more unusual ranges, as done recently in [24], may help to further improve the performance of the individual models. ...
... Third, sampled versions of these metrics have been adopted for efficiency reasons, but are inconsistent with their unsampled counterparts and should thus be avoided [8,50]. Fourth, multiple recent works have reported troubling trends in the reproducibility of widely cited methods [25,26,60-62]; similar issues plagued the adjacent IR field a decade earlier [3]. ...
... Most traditional ranking evaluation metrics stem from IR. Valcarce et al. find that nDCG offers the best discriminative power among them [79]. Findings like this reinforce the community's trust in nDCG, and it is commonly used to compare novel top-n recommendation methods to the state-of-the-art, also in reproducibility studies [25,26,60,61]. Ferrante et al. argue that while nDCG can be preferable because it is bounded and normalised, problems can arise because the metric is not easily transformed to an interval scale [24]. ...
Preprint
Full-text available
Approaches to recommendation are typically evaluated in one of two ways: (1) via a (simulated) online experiment, often seen as the gold standard, or (2) via some offline evaluation procedure, where the goal is to approximate the outcome of an online experiment. Several offline evaluation metrics have been adopted in the literature, inspired by ranking metrics prevalent in the field of Information Retrieval. (Normalised) Discounted Cumulative Gain (nDCG) is one such metric that has seen widespread adoption in empirical studies, and higher (n)DCG values have been used to present new methods as the state-of-the-art in top-n recommendation for many years. Our work takes a critical look at this approach, and investigates when we can expect such metrics to approximate the gold standard outcome of an online experiment. We formally present the assumptions that are necessary to consider DCG an unbiased estimator of online reward and provide a derivation for this metric from first principles, highlighting where we deviate from its traditional uses in IR. Importantly, we show that normalising the metric renders it inconsistent, in that even when DCG is unbiased, ranking competing methods by their normalised DCG can invert their relative order. Through a correlation analysis between off- and on-line experiments conducted on a large-scale recommendation platform, we show that our unbiased DCG estimates strongly correlate with online reward, even when some of the metric's inherent assumptions are violated. This statement no longer holds for its normalised variant, suggesting that nDCG's practical utility may be limited.
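For context, the conventional IR definitions that this abstract re-examines are, for a ranked list truncated at K:

$$\mathrm{DCG@}K \;=\; \sum_{k=1}^{K} \frac{\mathrm{rel}_k}{\log_2(k+1)}, \qquad \mathrm{nDCG@}K \;=\; \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K},$$

where rel_k is the relevance of the item at rank k and IDCG@K is the DCG of the ideal ordering; the paper's unbiased estimator deliberately deviates from this traditional use, as stated in the abstract.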
... Most modern recommender systems leverage the implicit feedback paradigm, which utilizes data that is not explicitly provided by the user, such as click data, purchase history, browsing behavior. Research in recommender systems employs simpler linear models [24,31,46,47], or neural models, many of which employ the variational autoencoder [28] framework, e.g., cVAE [9], RecVAE [55] or the MultVAE [36]. Neural models have benefits beyond recommendation performance, e.g., in controllability / critiquing [33,38,64,68], with some models utilizing disentanglement [2,17] for this purpose [6,39,44]. ...
... Recommendation (together with, arguably, time series and tabular data) is one of only a few areas where neural models do not seem to have gained supremacy yet. This has been shown in the settings of general recommendation [13,24,46,47], sparse interactions [31], session-based [37] and next basket recommendation [22,32,34]. In these benchmarks, winning methods are variations of matrix factorization (MF) techniques (SVD++, (i)ALS, EASE [59], and SLIM [45]) or even the most popular benchmark. ...
Preprint
Full-text available
In this paper we propose RecFusion, which comprises a set of diffusion models for recommendation. Unlike image data which contain spatial correlations, a user-item interaction matrix, commonly utilized in recommendation, lacks spatial relationships between users and items. We formulate diffusion on a 1D vector and propose binomial diffusion, which explicitly models binary user-item interactions with a Bernoulli process. We show that RecFusion approaches the performance of complex VAE baselines on the core recommendation setting (top-n recommendation for binary non-sequential feedback) and the most common datasets (MovieLens and Netflix). Our proposed diffusion models that are specialized for 1D and/or binary setups have implications beyond recommendation systems, such as in the medical domain with MRI and CT scans.
... Previous studies [29,45,70] have reported the effectiveness of large dimensionalities in rating prediction tasks. Rendle et al. [49] also recently showed that high-dimensional models can achieve very high ranking accuracy under appropriate regularization. Furthermore, successful models in top-item recommendation, such as variational autoencoder (VAE)-based models [32], often use large dimensions. ...
... For our experiments, we use implicit alternating least squares (iALS) [24,28,49], which is widely used in practical applications and implemented in distributed frameworks, such as Apache Spark [39]. Formally, we can say f is order-preserving if f satisfies, for any x ∈ R^|V| and all i, j ∈ V such that i ≠ j, it holds that ...
Preprint
Full-text available
Beyond accuracy, there are a variety of aspects to the quality of recommender systems, such as diversity, fairness, and robustness. We argue that many of the prevalent problems in recommender systems are partly due to low-dimensionality of user and item embeddings, particularly when dot-product models, such as matrix factorization, are used. In this study, we showcase empirical evidence suggesting the necessity of sufficient dimensionality for user/item embeddings to achieve diverse, fair, and robust recommendation. We then present theoretical analyses of the expressive power of dot-product models. Our theoretical results demonstrate that the number of possible rankings expressible under dot-product models is exponentially bounded by the dimension of item factors. We empirically found that the low-dimensionality contributes to a popularity bias, widening the gap between the rank positions of popular and long-tail items; we also give a theoretical justification for this phenomenon.
... WSL is frequently employed in dual encoders. iALS (Rendle et al. (2022)) utilizes this loss to optimize matrix factorization models, while SAGram (Krichene et al. (2018)) applies it to optimize non-linear encoders. A noteworthy advantage of WSL stems from the applicability of ALS or higher-order gradient descent optimization methods like Newton's method, which involve updating the left and right latent matrices using the closed-form solutions or the second-order Hessian matrix. ...
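A minimal sketch of the closed-form alternating update that makes WSL attractive is given below; the notation and variable names are introduced here for illustration, and the function solves a ridge-regularized weighted least-squares problem for a single user factor.

```python
import numpy as np

def als_user_update(V, c_u, r_u, lam):
    """One closed-form update of a single user factor under weighted squared loss.

    V   : (num_items, d) current item factors
    c_u : (num_items,)   per-item confidence weights for this user
    r_u : (num_items,)   target scores (e.g., 1 for observed items, 0 otherwise)
    lam : L2 regularization strength
    """
    d = V.shape[1]
    A = V.T @ (c_u[:, None] * V) + lam * np.eye(d)  # weighted Gram matrix
    b = V.T @ (c_u * r_u)
    return np.linalg.solve(A, b)                     # exact minimizer, no step size
```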
Preprint
Ranking tasks constitute fundamental components of extreme similarity learning frameworks, where extremely large corpora of objects are modeled through relative similarity relationships adhering to predefined ordinal structures. Among various ranking surrogates, Softmax (SM) Loss has been widely adopted due to its natural capability to handle listwise ranking via global negative comparisons, along with its flexibility across diverse application scenarios. However, despite its effectiveness, SM Loss often suffers from significant computational overhead and scalability limitations when applied to large-scale object spaces. To address this challenge, we propose novel loss formulations that align directly with ranking metrics: the Ranking-Generalizable squared (RG²) Loss and the Ranking-Generalizable interactive (RG×) Loss, both derived through Taylor expansions of the SM Loss. Notably, RG² reveals the intrinsic mechanisms underlying weighted squared losses (WSL) in ranking methods and uncovers fundamental connections between sampling-based and non-sampling-based loss paradigms. Furthermore, we integrate the proposed RG losses with the highly efficient Alternating Least Squares (ALS) optimization method, providing both generalization guarantees and convergence rate analyses. Empirical evaluations on real-world datasets demonstrate that our approach achieves comparable or superior ranking performance relative to SM Loss, while significantly accelerating convergence. This framework offers the similarity learning community both theoretical insights and practically efficient tools, with methodologies applicable to a broad range of tasks where balancing ranking quality and computational efficiency is essential.
... In that context, we however have to keep in mind that more recent models are not necessarily always favorable over more traditional approaches in terms of prediction accuracy [14]. In contrast, the recent literature suggests that the BPR and ALS models chosen for our experiments are still competitive with today's methods [32,36,37]. ...
Article
The essence of calibration in recommender systems is to generate recommendations that match the distribution of a given user’s past preferences regarding certain item features—e.g., in terms of preferred genres in the case of movies—while preserving relevance. The user’s past preference distribution is usually derived by considering the features of all items that the user previously liked. However, the most common approach in the literature to derive this distribution has certain limitations. First, it does not consider that user preferences may change over time. Second, there are domains where the relevant item features are set-valued, e.g., a movie can have several genres. In such cases, existing calibration approaches may represent the true user’s preference distribution in a suboptimal way. In this work, we, therefore, propose two novel approaches to derive the preference distributions of users for the purpose of calibration. The first method allows us to decrease the relevance of possibly outdated preference information. The second method is an entropy-based approach, which aims to capture better the user’s true preferences towards certain item features. Extensive experimental evaluations on four distinct datasets confirm that the proposed techniques are more effective in reducing the level of miscalibration than the common state-of-the-art calibration approach.
... Combined, these aspects may lead to a certain stagnation in our field, as discussed already a decade ago [24,17,71]. Similar discussions have been ongoing more recently, e.g., [13,18,33]. ...
Chapter
Full-text available
To date, there have been a large number of papers written on challenges and best practices for evaluating recommender systems [6, 9, 13, 17, 18, 24, 36, 38, 48]. Still, papers written and published today often fall short of embracing the practices suggested in prior works. Hence, we aim to suggest practical methods for the recommender systems community to guide researchers toward embracing such practices. We suggest concrete tools that can be immediately implemented in prominent recommendation system research venues such as ACM RecSys and ACM TORS. We believe that the research community, as a whole, largely agrees on many of the practices that should be embraced. However, it is often the case that individuals are unaware of the many challenges of rigorous evaluation. In addition, adopting these practices often comes at a significant cost in terms of the invested effort and required time. Hence, it may be tempting for researchers not to prioritize such issues when preparing their work for publication.
... As such, it is not too surprising that a number of studies in recent years have revealed that the progress that we make in terms of algorithms that lead to better offline accuracy results may actually be quite limited. Quite worryingly, these studies show that often decade-old methods or conceptually simple techniques based on nearest-neighbor search can outperform even the latest neural models [30,31,57,61,72,73,82]. One main reason for this phantom progress [31] lies in the apparently common practice of benchmarking a newly proposed and meticulously fine-tuned model against baselines that were not particularly tuned for the given dataset(s). ...
Preprint
Full-text available
In the area of recommender systems, the vast majority of research efforts is spent on developing increasingly sophisticated recommendation models, also using increasingly more computational resources. Unfortunately, most of these research efforts target a very small set of application domains, mostly e-commerce and media recommendation. Furthermore, many of these models are never evaluated with users, let alone put into practice. The scientific, economic and societal value of much of these efforts by scholars therefore remains largely unclear. To achieve a stronger positive impact resulting from these efforts, we posit that we as a research community should more often address use cases where recommender systems contribute to societal good (RS4Good). In this opinion piece, we first discuss a number of examples where the use of recommender systems for problems of societal concern has been successfully explored in the literature. We then proceed by outlining a paradigmatic shift that is needed to conduct successful RS4Good research, where the key ingredients are interdisciplinary collaborations and longitudinal evaluation approaches with humans in the loop.
... Moreover, considering that a good intrinsic meaning is commonly achieved by context-window models [32], we have also implemented two contextual neural embeddings-based recommenders, Item2Vec (I2V) [2] and User2Vec (U2V) [18], using the gensim [36] library. All models were selected based on their popularity in recent studies [38,39], ease of replication, and the fact that they do not rely on item metadata, demonstrating how much these methods can learn about the items without consuming such information. ...
Conference Paper
With the constant growth in available information and the popularization of technology, recommender systems have to deal with an increasing number of users and items. This leads to two problems in representing items: scalability and sparsity. Therefore, many recommender systems aim to generate low-dimensional dense representations of items. Matrix factorization techniques are popular, but models based on neural embeddings have recently been proposed and are gaining ground in the literature. Their main goal is to learn dense representations with intrinsic meaning. However, most studies proposing embeddings for recommender systems ignore this property and focus only on extrinsic evaluations. This study presents a guideline for assessing the intrinsic quality of matrix factorization and neural-based embedding models for collaborative filtering, comparing the results with a traditional extrinsic evaluation. To enrich the evaluation pipeline, we suggest adapting an intrinsic evaluation task commonly employed in the Natural Language Processing literature, and we propose a novel strategy for evaluating the learned representation compared to a content-based scenario. Finally, every mentioned technique is analyzed over established recommender models, and the results show how vector representations that do not yield good recommendations can still be useful in other tasks that demand intrinsic knowledge, highlighting the potential of this perspective of evaluation.
... If the rank is unknown, an overestimated rank coupled with a proper penalization of ½∥W∥²_F + ∥H∥²_F can yield state-of-the-art results. For example, in [13], a properly tuned matrix factorization model using the above regularizer can outperform deep neural networks on recommendation systems. In [14], they showed that the slightly different regularizer ∥W∥_* + ½∥H∥²_F yields better results than ½∥W∥²_F + ∥H∥²_F, both with uniform or nonuniform samplings. ...
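Written out, the regularized low-rank problem discussed in this snippet takes the generic form below, where Ω denotes the set of observed entries and R(W, H) stands for one of the regularizers compared above; λ and Ω are notation introduced here for clarity.

$$\min_{W,H}\ \sum_{(i,j)\in\Omega} \bigl(X_{ij} - (WH)_{ij}\bigr)^{2} \;+\; \lambda\,\mathcal{R}(W,H)$$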
Preprint
Full-text available
Low-rank matrix approximation is a standard, yet powerful, embedding technique that can be used to tackle a broad range of problems, including the recovery of missing data. In this paper, we focus on the performance of nonnegative matrix factorization (NMF) with minimum-volume (MinVol) regularization on the task of nonnegative data imputation. The particular choice of the MinVol regularization is justified by its interesting identifiability property and by its link with the nuclear norm. We show experimentally that MinVol NMF is a relevant model for nonnegative data recovery, especially when the recovery of a unique embedding is desired. Additionally, we introduce a new version of MinVol NMF that exhibits some promising results.
... This performance is influenced by the data set, the algorithm, and its hyperparameters. Hyperparameter optimization techniques like grid search, random search, or Bayesian optimization are commonly applied to improve recommendation performance by determining the best hyperparameter values for an algorithm [3,12,21,30,39,42,49,55,56]. ...
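As a concrete (hypothetical) example of such tuning, a random search over typical iALS hyperparameters could look like the sketch below; the ranges are placeholders and would need to be adapted per dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_ials_config():
    # Hypothetical search space; useful ranges depend on the dataset.
    return {
        "factors": int(rng.choice([128, 256, 512, 1024, 2048])),
        "regularization": float(10 ** rng.uniform(-4, 0)),
        "alpha": float(10 ** rng.uniform(-1, 2)),  # confidence weight on positives
        "iterations": int(rng.integers(10, 40)),
    }

candidates = [sample_ials_config() for _ in range(50)]
# Each candidate would be trained on the training split and scored with the
# optimization-target metric (e.g., nDCG@100) on the validation split; the
# best configuration is then evaluated once on the test set.
```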
Chapter
The hyperparameters of recommender systems for top-n predictions are typically optimized to enhance the predictive performance of algorithms. Thereby, the optimization algorithm, e.g., grid search or random search, searches for the best hyperparameter configuration according to an optimization-target metric, like nDCG or Precision. In contrast, the optimized algorithm, e.g., Alternating Least Squares Matrix Factorization or Bayesian Personalized Ranking, internally optimizes a different loss function during training, like squared error or cross-entropy. To tackle this discrepancy, recent work focused on generating loss functions better suited for recommender systems. Yet, when evaluating an algorithm using a top-n metric during optimization, another discrepancy between the optimization-target metric and the training loss has so far been ignored. During optimization, the top-n items are selected for computing a top-n metric, ignoring that the top-n items are selected from the recommendations of a model trained with an entirely different loss function. Item recommendations suitable for optimization-target metrics could be outside the top-n recommended items, silently impacting the optimization performance. Therefore, we were motivated to analyze whether the top-n items are optimal for optimization-target top-n metrics. In pursuit of an answer, we exhaustively evaluate the predictive performance of 250 selection strategies besides selecting the top-n. We extensively evaluate each selection strategy over twelve implicit feedback and eight explicit feedback data sets with eleven recommender systems algorithms. Our results show that there exist selection strategies other than top-n that increase predictive performance for various algorithms and recommendation domains. However, the performance of the top ∼43% of selection strategies is not significantly different. We discuss the impact of our findings on optimization and re-ranking in recommender systems and feasible solutions. The implementation of our study is publicly available.
... Quite surprisingly, for the category of neural SBRS models that operate solely on user-item interactions like GRU4Rec, various studies have shown that simpler methods, e.g., based on nearest-neighbor techniques, can lead to competitive accuracy results and in many cases even outperform the more complex neural models ([7,12,14,20]). Similar observations were made for traditional top-n recommendation tasks ([1,5,6,24,25]), where the latest neural models were found to be outperformed by longer-existing approaches, e.g., based on matrix factorization. Various factors that can contribute to this phenomenon of 'phantom progress' were discussed in the literature [3]. Besides the issue that the baseline algorithms in many cases may not have been properly tuned in the reported experiments [26], a central issue lies in the selection of the baseline models themselves. ...
Chapter
In session-based recommendation settings, a recommender system has to base its suggestions on the user interactions that are observed in an ongoing session. Since such sessions can consist of only a small set of interactions, various approaches based on Graph Neural Networks (GNN) were recently proposed, as they allow us to integrate various types of side information about the items in a natural way. Unfortunately, a variety of evaluation settings are used in the literature, e.g., in terms of protocols, metrics and baselines, making it difficult to assess what represents the state of the art. In this work, we present the results of an evaluation of eight recent GNN-based approaches that were published in high-quality outlets. For a fair comparison, all models are systematically tuned and tested under identical conditions using three common datasets. We furthermore include k-nearest-neighbor and sequential rules-based models as baselines, as such models have previously exhibited competitive performance results for similar settings. To our surprise, the evaluation showed that the simple models outperform all recent GNN models in terms of the Mean Reciprocal Rank, which we used as an optimization criterion, and were only outperformed in three cases in terms of the Hit Rate. Additional analyses furthermore reveal that several other factors that are often not deeply discussed in papers, e.g., random seeds, can markedly impact the performance of GNN-based models. Our results therefore (a) point to continuing issues in the community in terms of research methodology and (b) indicate that there is ample room for improvement in session-based recommendation.
... Evaluation has many aspects, from metrics [17,35] to hyperparameter optimization [27]. The authors of [6,7] suggest that one of the reasons behind the low reproducibility rate of recent papers is that evaluation setups are lifted from earlier work without their validity being checked. ...
Preprint
Even though offline evaluation is just an imperfect proxy of online performance -- due to the interactive nature of recommenders -- it will probably remain the primary way of evaluation in recommender systems research for the foreseeable future, since the proprietary nature of production recommenders prevents independent validation of A/B test setups and verification of online results. Therefore, it is imperative that offline evaluation setups are as realistic and as flawless as they can be. Unfortunately, evaluation flaws are quite common in recommender systems research nowadays, due to later works copying flawed evaluation setups from their predecessors without questioning their validity. In the hope of improving the quality of offline evaluation of recommender systems, we discuss four of these widespread flaws and why researchers should avoid them.
Article
Recommender systems have become integral to many online services, leveraging user data to provide personalized recommendations. However, as these systems grow in complexity, understanding the rationale behind their recommendations becomes increasingly difficult. Explainable Artificial Intelligence (XAI) has emerged as a crucial field addressing this challenge, particularly in ensuring transparency and trustworthiness in automated decision-making processes. In this paper, we introduce Learning to eXplain Recommendations (LXR), a scalable, model-agnostic framework designed to generate counterfactually correct explanations for recommender systems. LXR generates explanations for recommendations produced by any differentiable recommender system. By leveraging both factual and counterfactual loss terms, LXR offers robust, accurate, and computationally efficient explanations that reflect the model’s internal decision-making process. A key feature of LXR is its focus on the factual correctness of explanations through counterfactual reasoning, bridging the gap between plausible and accurate explanations. Unlike traditional approaches that rely on exhaustive perturbations of user data, LXR uses a self-supervised learning method to generate explanations efficiently, without sacrificing accuracy. LXR operates in two stages: a pre-training step and a novel Inference-Time Fine-tuning (ITF) step that refines explanations at the individual recommendation level, significantly improving accuracy with minimal computational overhead. Additionally, LXR is applied to hybrid recommender models incorporating demographic data, demonstrating its versatility across real-world scenarios. Finally, we also showcase LXR’s ability to explain recommendations at various ranks within a user’s recommendation list. As a secondary contribution, we introduce several novel evaluation metrics, inspired by saliency maps from computer vision, to rigorously assess the counterfactual correctness of explanations in recommender systems. Our results demonstrate that LXR sets a new benchmark for explainability, providing accurate, transparent, and interpretable explanations. The code is available on our GitHub repository: https://github.com/DeltaLabTLV/LXR.
Article
A large catalogue size is one of the central challenges in training recommendation models: a large number of items makes it memory- and compute-inefficient to compute scores for all items during training, forcing these models to deploy negative sampling. However, negative sampling increases the proportion of positive interactions in the training data, and therefore, models trained with negative sampling tend to overestimate the probabilities of positive interactions – a phenomenon we call overconfidence. While the absolute values of the predicted scores/probabilities are not important for the ranking of retrieved recommendations, overconfident models may fail to estimate nuanced differences in the top-ranked items, resulting in degraded performance. In this paper, we show that overconfidence explains why the popular SASRec model underperforms when compared to BERT4Rec. This is contrary to the BERT4Rec authors’ explanation that the difference in performance is due to the bi-directional attention mechanism. To mitigate overconfidence, we propose a novel Generalised Binary Cross-Entropy Loss function (gBCE) and theoretically prove that it can mitigate overconfidence. We further propose the gSASRec model, an improvement over SASRec that deploys an increased number of negatives and the gBCE loss. Through detailed experiments on three datasets, we show that gSASRec does not exhibit the overconfidence problem. As a result, gSASRec can outperform BERT4Rec (e.g. +9.47% NDCG on the MovieLens-1M dataset), while requiring less training time (e.g. -73% training time on MovieLens-1M). Moreover, in contrast to BERT4Rec, gSASRec is suitable for large datasets that contain more than 1 million items. Finally, we show how addressing overconfidence can improve model calibration – the ability of a model to predict actual interaction probabilities accurately. By applying gBCE to the SASRec model on MovieLens-1M dataset, we reduce the models’ expected calibration error by 98.9% (from 0.966 to 0.01).
Article
Embedding representations are a popular approach for modeling users and items in recommender systems, e.g., matrix factorization, two-tower models or autoencoders, where items and users are embedded in a low-dimensional, dense embedding space. On the other hand, there are methods that model high-dimensional relationships between items, most notably item-based collaborative filtering (CF), which is based on an item-to-item similarity matrix. Item-based CF was proposed over two decades ago and gained new interest through new learning methods in the form of SLIM [18] and most recently EASE [25]. In this work, we rephrase traditional item-based collaborative filtering as sparse user encoders where the user encoder is an (arbitrary) function and the item representation is learned. Item-based CF is a special case where the sparse user encoding is the one-hot encoding of a user’s history. Different from typical dense user/item encoder models, this work targets high-dimensional and sparse user encoders. The core contribution is an efficient closed-form learning algorithm that can solve arbitrary sparse user encoders. Several applications of this algorithm, including higher-order encoders, hashed encoders, and feature-based encoders, are presented.
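For intuition on what a closed-form solve for a fixed user encoder can look like, here is a minimal ridge-regression sketch: given encoded users E = enc(X), the item representation that minimizes the squared reconstruction error has the familiar normal-equation form. The paper's actual algorithm additionally exploits sparsity and handles constraints such as the zero diagonal of the item-based CF special case; this dense version is only illustrative:

```python
import numpy as np

def fit_item_representation(E, X, lam=100.0):
    """Ridge-regression closed form for a fixed user encoder.
    E: n_users x d encoded user representations, e.g. enc(X).
    X: n_users x n_items interaction matrix.
    Returns B (d x n_items) so that a user's scores are enc(x_user) @ B."""
    d = E.shape[1]
    gram = E.T @ E + lam * np.eye(d)
    return np.linalg.solve(gram, E.T @ X)
```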
Article
Full-text available
In this paper, we propose a new low-rank matrix factorization model dubbed bounded simplex-structured matrix factorization (BSSMF). Given an input matrix X and a factorization rank r, BSSMF looks for a matrix W with r columns and a matrix H with r rows such that X ≈ WH, where the entries in each column of W are bounded, that is, they belong to given intervals, and the columns of H belong to the probability simplex, that is, H is column stochastic. BSSMF generalizes nonnegative matrix factorization (NMF) and simplex-structured matrix factorization (SSMF). BSSMF is particularly well suited when the entries of the input matrix X belong to a given interval; for example when the rows of X represent images, or X is a rating matrix such as in the Netflix and MovieLens datasets where the entries of X belong to the interval [1,5]. The simplex-structured matrix H not only leads to an easily understandable decomposition providing a soft clustering of the columns of X, but implies that the entries of each column of WH belong to the same intervals as the columns of W. In this paper, we first propose a fast algorithm for BSSMF, even in the presence of missing data in X. Then we provide identifiability conditions for BSSMF, that is, we provide conditions under which BSSMF admits a unique decomposition, up to trivial ambiguities. Finally, we illustrate the effectiveness of BSSMF on two applications: extraction of features in a set of images, and the matrix completion problem for recommender systems.
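The two constraint sets that define BSSMF are easy to state in code: the entries of W are clipped to their intervals, and each column of H is projected onto the probability simplex (using the standard sort-based projection). The sketch below shows only these projections, not the paper's fast factorization algorithm, and assumes scalar bounds for brevity where the paper allows per-column intervals:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex (standard sort-based rule)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1.0) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_bssmf(W, H, lower=0.0, upper=1.0):
    """Enforce the BSSMF constraints: entries of W lie in the given interval,
    and every column of H lies in the probability simplex (H column stochastic)."""
    W = np.clip(W, lower, upper)
    H = np.apply_along_axis(project_simplex, 0, H)
    return W, H
```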
Article
Most recommender systems rely on user interaction data for personalization. Usually, the recommendation quality improves with more data. In this work, we study the quality implications when limiting user interaction data for personalization purposes. We formalize this problem and provide algorithms for selecting a smaller subset of user interaction data. We propose a selection method that picks the subset of a user’s history items that maximizes the expected recommendation quality. We show on well-studied benchmarks that it is possible to achieve high-quality results with small subsets of less than ten items per user.
Conference Paper
Full-text available
Collaborative filtering models based on matrix factorization and learned similarities using Artificial Neural Networks (ANNs) have gained significant attention in recent years. This is, in part, because ANNs have demonstrated very good results in a wide variety of recommendation tasks. However, the introduction of ANNs within the recommendation ecosystem has recently been questioned, prompting several comparisons in terms of efficiency and effectiveness. One aspect most of these comparisons have in common is their focus on accuracy, neglecting other evaluation dimensions important for recommendation, such as novelty, diversity, or accounting for biases. In this work, we replicate experiments from three different papers that compare Neural Collaborative Filtering (NCF) and Matrix Factorization (MF), to extend the analysis to other evaluation dimensions. First, our contribution shows that the experiments under analysis are entirely reproducible, and we extend the study by including other accuracy metrics and two statistical hypothesis tests. Second, we investigate the diversity and novelty of the recommendations, showing that MF also provides better accuracy on the long tail, although NCF provides better item coverage and more diversified recommendation lists. Lastly, we discuss the bias effect generated by the tested methods. They show a relatively small bias, but other recommendation baselines, with competitive accuracy performance, are consistently less affected by this issue. This is the first work, to the best of our knowledge, where several complementary evaluation dimensions have been explored for an array of state-of-the-art algorithms covering recent adaptations of ANNs and MF. Hence, we aim to show the potential these techniques may have for beyond-accuracy evaluation, while analyzing the effect these complementary dimensions may have on reproducibility. The code to reproduce the experiments is publicly available on GitHub at https://tny.sh/Reenvisioning
Article
Full-text available
The design of algorithms that generate personalized ranked item lists is a central topic of research in the field of recommender systems. In the past few years, in particular, approaches based on deep learning (neural) techniques have become dominant in the literature. For all of them, substantial progress over the state-of-the-art is claimed. However, indications exist of certain problems in today’s research practice, e.g., with respect to the choice and optimization of the baselines used for comparison, raising questions about the published claims. To obtain a better understanding of the actual progress, we have compared recent results in the area of neural recommendation approaches based on collaborative filtering against a consistent set of existing simple baselines. The worrying outcome of the analysis of these recent works—all were published at prestigious scientific conferences between 2015 and 2018—is that 11 of the 12 reproducible neural approaches can be outperformed by conceptually simple methods, e.g., based on the nearest-neighbor heuristic or linear models. None of the computationally complex neural methods was actually consistently better than already existing learning-based techniques, e.g., using matrix factorization or linear models. In our analysis, we discuss common issues in today’s research practice, which, despite the many papers that are published on the topic, have apparently led the field to a certain level of stagnation.
Conference Paper
Full-text available
Although the implicit feedback based recommendation problem - when only the user history is available but there are no ratings - is the most typical setting in real-world applications, it is much less researched than the explicit feedback case. State-of-the-art algorithms that are efficient on the explicit case cannot be straightforwardly transferred to the implicit case if scalability is to be maintained. There are few implicit feedback benchmark datasets; therefore, new ideas are usually evaluated on explicit benchmarks. In this paper, we propose a generic context-aware implicit feedback recommender algorithm, coined iTALS. iTALS applies a fast, ALS-based tensor factorization learning method that scales linearly with the number of non-zero elements in the tensor. The method also allows us to incorporate various contextual information into the model while maintaining its computational efficiency. We present two context-aware implementation variants of iTALS. The first incorporates seasonality and makes it possible to distinguish user behavior in different time intervals. The other views the user history as sequential information and has the ability to recognize usage patterns typical of certain groups of items, e.g. to automatically tell apart product types that are typically purchased repetitively or once. Experiments performed on five implicit datasets (LastFM 1K, Grocery, VoD, and 'implicitized' Netflix and MovieLens 10M) show that integrating context-aware information into the state-of-the-art implicit recommender algorithm through our factorization framework improves recommendation quality significantly.
Conference Paper
Full-text available
The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions.
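A commonly cited instantiation of these implicit cost functions works pairwise: the RankNet-style gradient for each preferred/non-preferred pair is scaled by the NDCG change obtained from swapping the two documents, and the pair gradients are accumulated per document. The sketch below follows that form; it is illustrative and not necessarily the exact variant evaluated in the paper:

```python
import numpy as np

def lambda_gradients(scores, relevance, sigma=1.0):
    """Pairwise lambda gradients in the spirit of LambdaRank: RankNet gradient per pair,
    scaled by |Delta NDCG| from swapping the pair, accumulated per document."""
    scores = np.asarray(scores, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    order = np.argsort(-scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))                       # 0-based rank of each document
    gains = 2.0 ** relevance - 1.0
    discounts = 1.0 / np.log2(ranks + 2.0)
    ideal = np.sum(np.sort(gains)[::-1] / np.log2(np.arange(len(scores)) + 2.0))
    lambdas = np.zeros(len(scores))
    for i in range(len(scores)):
        for j in range(len(scores)):
            if relevance[i] <= relevance[j]:
                continue                                         # only pairs where i is preferred over j
            delta = abs((gains[i] - gains[j]) * (discounts[i] - discounts[j])) / max(ideal, 1e-12)
            rho = 1.0 / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
            lambdas[i] += sigma * rho * delta                    # push the preferred document up
            lambdas[j] -= sigma * rho * delta                    # push the other document down
    return lambdas
```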
Conference Paper
Full-text available
In many commercial systems, the 'best bet' recommendations are shown, but the predicted rating values are not. This is usually referred to as a top-N recommendation task, where the goal of the recommender system is to find a few specific items which are supposed to be most appealing to the user. Common methodologies based on error metrics (such as RMSE) are not a natural fit for evaluating the top-N recommendation task. Rather, top-N performance can be directly measured by alternative methodologies based on accuracy metrics (such as precision/recall). An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected in terms of the top-N recommendation task. Results show that improvements in RMSE often do not translate into accuracy improvements. In particular, a naive non-personalized algorithm can outperform some common recommendation approaches and almost match the accuracy of sophisticated algorithms. Another finding is that the very few top popular items can skew the top-N performance. The analysis points out that when evaluating a recommender algorithm on the top-N recommendation task, the test set should be chosen carefully in order not to bias accuracy metrics towards non-personalized solutions. Finally, we offer practitioners new variants of two collaborative filtering algorithms that, regardless of their RMSE, significantly outperform other recommender algorithms in pursuing the top-N recommendation task, while offering additional practical advantages. This comes as a surprise given the simplicity of these two methods.
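For reference, the precision/recall-style top-N methodology advocated here reduces to counting hits among the first N ranked items, as in this minimal sketch:

```python
def precision_recall_at_n(ranked_items, relevant_items, n=10):
    """Top-N accuracy metrics for one user.
    ranked_items: items ordered by predicted score (best first).
    relevant_items: held-out items the user actually interacted with."""
    hits = len(set(ranked_items[:n]) & set(relevant_items))
    precision = hits / n
    recall = hits / max(len(relevant_items), 1)
    return precision, recall
```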
Conference Paper
Full-text available
Alternating least squares (ALS) is a powerful matrix factorization (MF) algorithm for both explicit and implicit feedback based recommender systems. As shown in many articles, increasing the number of latent factors (denoted by K) boosts the prediction accuracy of MF-based recommender systems, including ALS. The price of the better accuracy is paid by the increased running time: the running time of the original version of ALS is proportional to K^3. Yet, the running time of model building can be important in recommendation systems; if the model cannot keep up with the changing item portfolio and/or user profile, the prediction accuracy can be degraded. In this paper we present novel and fast ALS variants both for the implicit and explicit feedback datasets, which offer a better trade-off between running time and accuracy. Due to the significantly lower computational complexity of the algorithm - linear in terms of K - the model generated in the same amount of time is more accurate, since faster training makes it possible to build a model with more latent factors. We demonstrate the efficiency of our ALS variants on two datasets using two performance measures, RMSE and average relative position (ARP), and show that either a significantly more accurate model can be generated under the same amount of time or a model with similar prediction accuracy can be created faster; for explicit feedback the speed-up factor can be as high as 5-10.
Conference Paper
Full-text available
Many recommendation systems suggest items to users by utilizing the techniques of collaborative filtering (CF) based on historical records of items that the users have viewed, purchased, or rated. Two major problems that most CF approaches have to contend with are scalability and sparseness of the user profiles. To tackle these issues, in this paper, we describe a CF algorithm, alternating least squares with weighted-λ-regularization (ALS-WR), which is implemented on a parallel Matlab platform. We show empirically that the performance of ALS-WR (in terms of root mean squared error (RMSE)) monotonically improves with both the number of features and the number of ALS iterations. We applied the ALS-WR algorithm on a large-scale CF problem, the Netflix Challenge, with 1000 hidden features and obtained an RMSE score of 0.8985, which is one of the best results based on a pure method. In addition, by combining it with parallel versions of other known methods, we achieved a performance improvement of 5.91% over Netflix’s own CineMatch recommendation system. Our method is simple and scales well to very large datasets.
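The per-user step of ALS-WR is an ordinary ridge regression over the user's observed ratings, with the regularization term scaled by the number of ratings (the 'weighted-λ' part). A minimal numpy sketch, with illustrative argument names:

```python
import numpy as np

def als_wr_user_update(M, ratings, rated_items, lam):
    """One ALS-WR user update: ridge regression over the user's observed ratings,
    with regularization scaled by the number of ratings.
    M: n_items x k item-factor matrix; rated_items: indices the user rated."""
    k = M.shape[1]
    M_u = M[rated_items]                              # factors of the items this user rated
    n_u = len(rated_items)
    A = M_u.T @ M_u + lam * n_u * np.eye(k)
    b = M_u.T @ np.asarray(ratings, dtype=float)
    return np.linalg.solve(A, b)
```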
Conference Paper
Full-text available
This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse Linear Method (SLIM) is proposed, which generates top-N recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned by SLIM by solving an ℓ1-norm and ℓ2-norm regularized optimization problem. W is demonstrated to produce high-quality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements both in runtime performance and recommendation quality over the best existing methods.
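A common way to fit a SLIM-style model in practice is to solve one elastic-net regression per item, regressing its column on all other columns with non-negative coefficients and the target column zeroed out to keep the diagonal empty. The sketch below uses scikit-learn's ElasticNet; the hyperparameter mapping and the dense matrices are simplifications, not the paper's exact solver:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_slim(X, l1_reg=1e-3, l2_reg=1e-3):
    """Fit a SLIM-style item-item weight matrix W column by column with elastic-net
    regression (non-negative weights, zero diagonal).
    X: n_users x n_items implicit interaction matrix (dense here for brevity)."""
    n_items = X.shape[1]
    alpha = l1_reg + l2_reg
    l1_ratio = l1_reg / alpha
    W = np.zeros((n_items, n_items))
    for j in range(n_items):
        target = X[:, j].copy()
        X_j = X.copy()
        X_j[:, j] = 0.0                               # item j may not explain itself (zero diagonal)
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, positive=True,
                           fit_intercept=False, max_iter=200)
        model.fit(X_j, target)
        W[:, j] = model.coef_
    return W                                          # recommendation scores: X @ W
```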
Conference Paper
Full-text available
A common task of recommender systems is to improve customer experience through personalized recommendations based on prior implicit feedback. These systems passively track different sorts of user behavior, such as purchase history, watching habits and browsing activity, in order to model user preferences. Unlike the much more extensively researched explicit feedback, we do not have any direct input from the users regarding their preferences. In particular, we lack substantial evidence on which products consumers dislike. In this work we identify unique properties of implicit feedback datasets. We propose treating the data as indication of positive and negative preference associated with vastly varying confidence levels. This leads to a factor model which is especially tailored for implicit feedback recommenders. We also suggest a scalable optimization procedure, which scales linearly with the data size. The algorithm is used successfully within a recommender system for television shows. It compares favorably with well-tuned implementations of other known methods. In addition, we offer a novel way to give explanations to recommendations given by this factor model.
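The factor model described here (iALS / weighted matrix factorization) alternates closed-form solves in which every item acts as a weak negative with confidence 1 and observed items receive confidence 1 + α·r. A sketch of the per-user solve with the precomputed Gramian, using dense linear algebra for brevity:

```python
import numpy as np

def ials_user_update(Y, item_ids, r_u, alpha=40.0, lam=0.1):
    """Per-user solve of iALS / WMF: all items are weak negatives with confidence 1,
    observed items get confidence 1 + alpha * r. The precomputed Gramian Y^T Y keeps
    the cost independent of the catalogue size except for the user's observed items.
    Y: n_items x k item factors; item_ids, r_u: the user's observed items and raw feedback."""
    k = Y.shape[1]
    YtY = Y.T @ Y                                     # in practice computed once per sweep
    Y_u = Y[item_ids]
    c_u = 1.0 + alpha * np.asarray(r_u, dtype=float)  # confidence of the observed entries
    A = YtY + Y_u.T @ ((c_u - 1.0)[:, None] * Y_u) + lam * np.eye(k)
    b = Y_u.T @ c_u                                   # Y^T C_u p_u with p_ui = 1 on observed items
    return np.linalg.solve(A, b)
```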
Conference Paper
Full-text available
We introduce the Million Song Dataset, a freely-available collection of audio features and metadata for a million contemporary popular music tracks. We describe its creation process, its content, and its possible uses. Attractive features of the Million Song Database include the range of existing resources to which it is linked, and the fact that it is the largest current research dataset in our field. As an illustration, we present year prediction as an example application, a task that has, until now, been difficult to study owing to the absence of a large set of suitable data. We show positive results on year prediction, and discuss more generally the future development of the dataset.
Chapter
The annual Neural Information Processing Systems (NIPS) conference is the flagship meeting on neural computation and machine learning. It draws a diverse group of attendees—physicists, neuroscientists, mathematicians, statisticians, and computer scientists—interested in theoretical and applied aspects of modeling, simulating, and building neural-like or intelligent systems. The presentations are interdisciplinary, with contributions in algorithms, learning theory, cognitive science, neuroscience, brain imaging, vision, speech and signal processing, reinforcement learning, and applications. Only twenty-five percent of the papers submitted are accepted for presentation at NIPS, so the quality is exceptionally high. This volume contains the papers presented at the December 2006 meeting, held in Vancouver.
Chapter
The task of item recommendation is to select the best items for a user from a large catalogue of items. Item recommenders are commonly trained from implicit feedback which consists of past actions that are positive only. Core challenges of item recommendation are (1) how to formulate a training objective from implicit feedback and (2) how to efficiently train models over a large item catalogue. This chapter formulates the item recommendation problem and points out its unique characteristics. Then different training objectives are discussed. The main body deals with learning algorithms and presents sampling based algorithms for general recommenders and more efficient algorithms for dot product models. Finally, the application of item recommenders for retrieval tasks is discussed.
Conference Paper
Neural network based models for collaborative filtering have started to gain attention recently. One branch of research is based on using deep generative models to model user preferences, where variational autoencoders were shown to produce state-of-the-art results. However, there are some potentially problematic characteristics of the current variational autoencoder for CF. The first is the overly simplistic prior that VAEs incorporate for learning the latent representations of user preference. The other is the model's inability to learn deeper representations with more than one hidden layer for each network. Our goal is to incorporate appropriate techniques to mitigate the aforementioned problems of variational autoencoder CF and further improve the recommendation performance. Our work is the first to apply flexible priors to collaborative filtering and to show that the simple priors of the original VAEs may be too restrictive to fully model user preferences, and that setting a more flexible prior gives significant gains. We experiment with the VampPrior, originally proposed for image generation, to examine the effect of flexible priors in CF. We also show that VampPriors coupled with gating mechanisms outperform SOTA results including the Variational Autoencoder for Collaborative Filtering by meaningful margins on two popular benchmark datasets (MovieLens & Netflix).
Conference Paper
Combining simple elements from the literature, we define a linear model that is geared toward sparse data, in particular implicit feedback data for recommender systems. We show that its training objective has a closed-form solution, and discuss the resulting conceptual insights. Surprisingly, this simple model achieves better ranking accuracy than various state-of-the-art collaborative-filtering approaches, including deep non-linear models, on most of the publicly available data-sets used in our experiments.
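The closed-form solution referred to in this abstract (the EASE model) can be written in a few lines: invert the regularized item-item Gram matrix and rescale its columns so that the learned weight matrix has a zero diagonal. A dense sketch:

```python
import numpy as np

def fit_ease(X, lam=500.0):
    """Closed-form solution of the EASE linear model.
    X: n_users x n_items implicit feedback matrix (dense here for brevity)."""
    G = X.T @ X + lam * np.eye(X.shape[1])
    P = np.linalg.inv(G)
    B = -P / np.diag(P)[None, :]                      # B_ij = -P_ij / P_jj
    np.fill_diagonal(B, 0.0)                          # zero-diagonal constraint
    return B                                          # recommendation scores: X @ B
```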
Conference Paper
We extend variational autoencoders (VAEs) to collaborative filtering for implicit feedback. This non-linear probabilistic model enables us to go beyond the limited modeling capacity of linear factor models, which still largely dominate collaborative filtering research. We introduce a generative model with multinomial likelihood and use Bayesian inference for parameter estimation. Despite widespread use in language modeling and economics, the multinomial likelihood receives less attention in the recommender systems literature. We introduce a different regularization parameter for the learning objective, which proves to be crucial for achieving competitive performance. Remarkably, there is an efficient way to tune the parameter using annealing. The resulting model and learning algorithm have information-theoretic connections to maximum entropy discrimination and the information bottleneck principle. Empirically, we show that the proposed approach significantly outperforms several state-of-the-art baselines, including two recently-proposed neural network approaches, on several real-world datasets. We also provide extended experiments comparing the multinomial likelihood with other commonly used likelihood functions in the latent factor collaborative filtering literature and show favorable results. Finally, we identify the pros and cons of employing a principled Bayesian inference approach and characterize settings where it provides the most significant improvements.
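The training objective described here combines a multinomial log-likelihood over the user's interaction vector with a KL term whose weight β is annealed. A per-user sketch of that objective (the variable names are illustrative):

```python
import numpy as np

def mult_vae_loss(x, logits, mu, logvar, beta):
    """Per-user objective in the spirit of this model: multinomial log-likelihood of the
    interaction vector under the decoder's softmax, plus a KL term scaled by the
    annealed weight beta. x: binary interaction vector; logits: decoder outputs;
    mu, logvar: encoder outputs for the Gaussian posterior."""
    shifted = logits - logits.max()
    log_softmax = shifted - np.log(np.exp(shifted).sum())
    neg_ll = -np.sum(x * log_softmax)                 # multinomial log-likelihood term
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    return neg_ll + beta * kl
```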
Conference Paper
In recent years, deep neural networks have yielded immense success on speech recognition, computer vision and natural language processing. However, the exploration of deep neural networks for recommender systems has received relatively little scrutiny. In this work, we strive to develop techniques based on neural networks to tackle the key problem in recommendation – collaborative filtering – on the basis of implicit feedback. Although some recent work has employed deep learning for recommendation, it primarily used it to model auxiliary information, such as textual descriptions of items and acoustic features of music. When it comes to modeling the key factor in collaborative filtering – the interaction between user and item features – such work still resorted to matrix factorization and applied an inner product on the latent features of users and items. By replacing the inner product with a neural architecture that can learn an arbitrary function from data, we present a general framework named NCF, short for Neural network-based Collaborative Filtering. NCF is generic and can express and generalize matrix factorization under its framework. To supercharge NCF modelling with non-linearities, we propose to leverage a multi-layer perceptron to learn the user-item interaction function. Extensive experiments on two real-world datasets show significant improvements of our proposed NCF framework over the state-of-the-art methods. Empirical evidence shows that using deeper layers of neural networks offers better recommendation performance.
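The core architectural idea, replacing the inner product with a learned interaction function, can be sketched as a small model that concatenates user and item embeddings and feeds them through an MLP. The layer sizes below are illustrative, and the original NCF additionally fuses a generalized-MF branch that this sketch omits:

```python
import torch
import torch.nn as nn

class NCFStyleScorer(nn.Module):
    """Minimal NCF-style scorer: an MLP over concatenated user and item embeddings
    learns the interaction function instead of a fixed inner product."""
    def __init__(self, n_users, n_items, dim=32, hidden=(64, 32, 16)):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        layers, width = [], 2 * dim
        for h in hidden:
            layers += [nn.Linear(width, h), nn.ReLU()]
            width = h
        self.mlp = nn.Sequential(*layers, nn.Linear(width, 1))

    def forward(self, users, items):
        z = torch.cat([self.user_emb(users), self.item_emb(items)], dim=-1)
        return torch.sigmoid(self.mlp(z)).squeeze(-1)   # predicted interaction probability
```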
Conference Paper
In recent years, interest in recommender research has shifted from explicit feedback towards implicit feedback data. A diversity of complex models has been proposed for a wide variety of applications. Despite this, learning from implicit feedback is still computationally challenging. So far, most work relies on stochastic gradient descent (SGD) solvers which are easy to derive, but in practice challenging to apply, especially for tasks with many items. For the simple matrix factorization model, an efficient coordinate descent (CD) solver has been previously proposed. However, efficient CD approaches have not been derived for more complex models. In this paper, we provide a new framework for deriving efficient CD algorithms for complex recommender models. We identify and introduce the property of k-separable models. We show that k-separability is a sufficient property to allow efficient optimization of implicit recommender problems with CD. We illustrate this framework on a variety of state-of-the-art models including factorization machines and Tucker decomposition. To summarize, our work provides the theory and building blocks to derive efficient implicit CD algorithms for complex recommender models.
Conference Paper
Most real-world recommender services measure their performance based on the top-N results shown to the end users. Thus, advances in top-N recommendation have far-ranging consequences in practical applications. In this paper, we present a novel method, called Collaborative Denoising Auto-Encoder (CDAE), for top-N recommendation that utilizes the idea of Denoising Auto-Encoders. We demonstrate that the proposed model is a generalization of several well-known collaborative filtering models but with more flexible components. Thorough experiments are conducted to understand the performance of CDAE under various component settings. Furthermore, experimental results on several public datasets demonstrate that CDAE consistently outperforms state-of-the-art top-N recommendation methods on a variety of common evaluation metrics.
Conference Paper
This paper contributes improvements to both the effectiveness and efficiency of Matrix Factorization (MF) methods for implicit feedback. We highlight two critical issues of existing works. First, due to the large space of unobserved feedback, most existing works resort to assigning a uniform weight to the missing data to reduce computational complexity. However, such a uniform assumption is invalid in real-world settings. Second, most methods are also designed in an offline setting and fail to keep up with the dynamic nature of online data. We address the above two issues in learning MF models from implicit feedback. We first propose to weight the missing data based on item popularity, which is more effective and flexible than the uniform-weight assumption. However, such a non-uniform weighting poses an efficiency challenge in learning the model. To address this, we specifically design a new learning algorithm based on the element-wise Alternating Least Squares (eALS) technique, for efficiently optimizing an MF model with variably-weighted missing data. We exploit this efficiency to then seamlessly devise an incremental update strategy that instantly refreshes an MF model given new feedback. Through comprehensive experiments on two public datasets in both offline and online protocols, we show that our open-source implementation of eALS (https://github.com/hexiangnan/sigir16-eals) consistently outperforms state-of-the-art implicit MF methods.
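The popularity-aware weighting of missing data described here assigns each unobserved entry an item-dependent confidence instead of a single uniform weight. The sketch below uses the commonly cited normalization (popularity raised to an exponent α, scaled by an overall weight w0); see the paper for the exact definition and for the eALS solver itself:

```python
import numpy as np

def popularity_missing_weights(X, w0=512.0, alpha=0.75):
    """Popularity-aware confidences for missing entries, in the spirit of eALS:
    unobserved entries of popular items get larger weight than those of rare items,
    instead of the uniform weight used by classic iALS. Returns one weight per item."""
    freq = X.astype(bool).sum(axis=0).astype(float)   # item interaction counts
    p = freq ** alpha
    return w0 * p / p.sum()
```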
Chapter
The collaborative filtering (CF) approach to recommenders has recently enjoyed much interest and progress. The fact that it played a central role within the recently completed Netflix competition has contributed to its popularity. This chapter surveys the recent progress in the field. Matrix factorization techniques, which became a first choice for implementing CF, are described together with recent innovations. We also describe several extensions that bring competitive accuracy into neighborhood methods, which used to dominate the field. The chapter demonstrates how to utilize temporal models and implicit feedback to extend the models’ accuracy. In passing, we include detailed descriptions of some of the central methods developed for tackling the challenge of the Netflix Prize competition.
Article
The MovieLens datasets are widely used in education, research, and industry. They are downloaded hundreds of thousands of times each year, reflecting their use in popular press programming books, traditional and online courses, and software. These datasets are a product of member activity in the MovieLens movie recommendation system, an active research platform that has hosted many experiments since its launch in 1997. This article documents the history of MovieLens and the MovieLens datasets. We include a discussion of lessons learned from running a long-standing, live research platform from the perspective of a research organization. We document best practices and limitations of using the MovieLens datasets in new research.
Conference Paper
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at the top of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method, called WSABIE, outperforms several baseline methods while being faster and consuming less memory.
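WSABIE's training signal comes from the WARP (Weighted Approximate-Rank Pairwise) loss: negatives are sampled until one violates the margin, the number of trials yields an estimate of the positive item's rank, and the hinge loss is weighted accordingly. A single sampling step, sketched with assumed interfaces (pass a numpy Generator such as np.random.default_rng() as rng):

```python
import numpy as np

def warp_sample(scores, pos_item, rng, margin=1.0, max_trials=None):
    """One WARP sampling step: draw negatives until one violates the margin, estimate
    the positive item's rank from the number of trials, and weight the hinge loss by
    L(rank) = sum_{i<=rank} 1/i. scores: current model scores over the catalogue."""
    n_items = len(scores)
    max_trials = max_trials or (n_items - 1)
    for trials in range(1, max_trials + 1):
        neg = rng.integers(n_items)
        if neg == pos_item:
            continue
        if scores[neg] + margin > scores[pos_item]:   # violating negative found
            rank = (n_items - 1) // trials            # approximate rank of the positive item
            weight = np.sum(1.0 / np.arange(1, rank + 1))
            loss = weight * (margin + scores[neg] - scores[pos_item])
            return loss, neg
    return 0.0, None                                  # no violation found: zero loss
```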
2021. iALS++: Speeding up Matrix Factorization with Subspace Optimization
  • Steffen Rendle
  • Walid Krichene
  • Li Zhang
  • Yehuda Koren
On the Difficulty of Evaluating Baselines: A Study on Recommender Systems
  • Steffen Rendle
  • Li Zhang
  • Yehuda Koren
RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering
  • Sam Lobel
  • Chunyuan Li
  • Jianfeng Gao
  • Lawrence Carin