Performance of Recommender Algorithms
on Top-N Recommendation Tasks
Paolo Cremonesi
Politecnico di Milano
Milan, Italy
Yehuda Koren
Yahoo! Research
Haifa, Israel
Roberto Turrin
Milan, Italy
ABSTRACT

In many commercial systems, the ‘best bet’ recommendations are shown, but the predicted rating values are not. This is usually referred to as a top-N recommendation task, where the goal of the recommender system is to find a few specific items which are supposed to be most appealing to the user. Common methodologies based on error metrics (such as RMSE) are not a natural fit for evaluating the top-N recommendation task. Rather, top-N performance can be directly measured by alternative methodologies based on accuracy metrics (such as precision/recall). An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on the top-N recommendation task. Results show that improvements in RMSE often do not translate into accuracy improvements. In particular, a naive non-personalized algorithm can outperform some common recommendation approaches and almost match the accuracy of sophisticated algorithms. Another finding is that the very few top popular items can skew the top-N performance. The analysis points out that when evaluating a recommender algorithm on the top-N recommendation task, the test set should be chosen carefully in order not to bias accuracy metrics towards non-personalized solutions. Finally, we offer practitioners new variants of two collaborative filtering algorithms that, regardless of their RMSE, significantly outperform other recommender algorithms in pursuing the top-N recommendation task, while offering additional practical advantages. This comes as a surprise given the simplicity of these two methods.
Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems
and Software—user profiles and alert services; performance
evaluation (efficiency and effectiveness); H.3.3 [Information
Storage and Retrieval]: Information Search and Retrieval—
Information filtering
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
RecSys2010, September 26–30, 2010, Barcelona, Spain.
Copyright 2010 ACM 978-1-60558-906-0/10/09 ...$10.00.
General Terms
Algorithms, Experimentation, Measurement, Performance
1. INTRODUCTION
A common practice with recommender systems is to evaluate their performance through error metrics such as RMSE
(root mean squared error), which capture the average error
between the actual ratings and the ratings predicted by the
system. However, in many commercial systems only the
‘best bet’ recommendations are shown, while the predicted
rating values are not [8]. That is, the system suggests a few
specific items to the user that are likely to be very appealing
to him. While the majority of the literature is focused on
convenient error metrics (RMSE, MAE), such classical error
criteria do not really measure top-N performance. At most,
they can serve as proxies of the true top-N experience. Di-
rect evaluation of top-N performance must be accomplished
by means of alternative methodologies based on accuracy
metrics (e.g., recall and precision).
In this paper we evaluate – through accuracy metrics –
the performance of several collaborative filtering algorithms
in pursuing the top-N recommendation task. Evaluation is
contrasted with performance of the same methods on the
RMSE metric. Tests have been performed on the Netflix
and Movielens datasets.
The contribution of the work is threefold: (i) we show
that there is no trivial relationship between error metrics
and accuracy metrics; (ii) we propose a careful construction of the test set to avoid biasing accuracy metrics; (iii) we introduce new variants of existing algorithms that improve top-N performance together with other practical benefits.
We first compare some state-of-the-art algorithms (e.g.,
Asymmetric SVD) with a non-personalized algorithm based
on item popularity. The surprising result is that the perfor-
mance of the non-personalized algorithm on top-N recom-
mendations are comparable to the performance of sophisti-
cated, personalized algorithms, regardless of their RMSE.
However, a non-personalized, popularity-based algorithm can only provide trivial recommendations, interesting neither to users, who may get bored and disappointed by the recommender system, nor to content providers, who invest in a recommender system to push up the sales of less known items. For such a reason, we run an additional set of experiments in order to evaluate the performance of the algorithms
while excluding the extremely popular items. As expected,
the accuracy of all algorithms decreases, as it is more difficult
to recommend non-trivial items. Yet, ranking of the differ-
ent algorithms aligns better with our expectations, with the
non-personalized methods being ranked lower. Thus, when
evaluating algorithms in the top-N recommendation task,
we advise choosing the test set carefully; otherwise, accuracy metrics are strongly biased.
Finally, when pursuing a top-N recommendation task, exact rating prediction is not required. We present new variants of two collaborative filtering algorithms that are not designed for minimizing RMSE, but consistently outperform other recommender algorithms in top-N recommendations. This finding becomes even more important when considering the simple and less conventional nature of the outperforming methods.
2. TESTING METHODOLOGY
The testing methodology adopted in this study is similar to the one described in [6] and, in particular, in [10]. For each dataset, known ratings are split into two subsets: a training set M and a test set T. The test set T contains only 5-star ratings, so we can reasonably state that T contains items relevant to the respective users.
The detailed procedure used to create M and T from the Netflix dataset is similar to the one set for the Netflix prize, maintaining compatibility with results published in other research papers [3]. Netflix released a dataset containing about 100M ratings, referred to as the training dataset. In addition to the training set, Netflix also provided a validation set, referred to as the probe set, containing 1.4M ratings. In this work, the training set M is the original Netflix training set, while the test set T contains all the 5-star ratings from the probe set (|T| = 384,573). As expected, the probe set was not used for training.
We adopted a similar procedure for the Movielens dataset [13]. We randomly sub-sampled 1.4% of the ratings from the dataset in order to create a probe set. The training set M contains the remaining ratings. The test set T contains all the 5-star ratings from the probe set.
In order to measure recall and precision, we first train the model over the ratings in M. Then, for each item i rated 5 stars by user u in T:

(i) We randomly select 1000 additional items unrated by user u. We may assume that most of them will not be of interest to user u.

(ii) We predict the ratings for the test item i and for the additional 1000 items.

(iii) We form a ranked list by ordering all the 1001 items according to their predicted ratings. Let p denote the rank of the test item i within this list. The best result corresponds to the case where the test item i precedes all the random items (i.e., p = 1).

(iv) We form a top-N recommendation list by picking the N top-ranked items from the list. If p ≤ N we have a hit (i.e., the test item i is recommended to the user). Otherwise we have a miss. Chances of a hit increase with N. When N = 1001 we always have a hit.
The computation of recall and precision proceeds as follows. For any single test case, we have a single relevant item (the tested item i). By definition, recall for a single test can assume either the value 0 (in case of a miss) or 1 (in case of a hit). Similarly, precision can assume either the value 0 or 1/N. The overall recall and precision are defined by averaging over all test cases:

recall(N) = #hits / |T|

precision(N) = #hits / (N · |T|) = recall(N) / N

where |T| is the number of test ratings. Note that the hypothesis that all the 1000 random items are non-relevant to user u tends to underestimate the computed recall and precision with respect to the true recall and precision.

Figure 1: Rating distribution for Netflix (solid line) and Movielens (dashed line). Items are ordered according to popularity (most popular at the bottom).
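The protocol above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `predict` callable, the data layout, and the function name are assumptions for the example.

```python
import random

def topn_eval(predict, test_cases, all_items, rated_by, N=10, n_random=1000, seed=7):
    """Leave-one-out top-N evaluation: for each 5-star test item, rank it
    against n_random items the user has not rated; count a hit if it lands
    in the top N. Returns (recall, precision) averaged over all test cases."""
    rng = random.Random(seed)
    hits = 0
    for user, test_item in test_cases:
        # (i) sample items unrated by this user (assumed non-relevant)
        candidates = [i for i in all_items
                      if i not in rated_by[user] and i != test_item]
        sampled = rng.sample(candidates, min(n_random, len(candidates)))
        # (ii)-(iii) score the test item and the sampled items, rank by score
        scored = sampled + [test_item]
        scored.sort(key=lambda i: predict(user, i), reverse=True)
        rank = scored.index(test_item) + 1   # p in the text
        # (iv) hit iff p <= N
        if rank <= N:
            hits += 1
    recall = hits / len(test_cases)          # recall(N) = #hits / |T|
    precision = recall / N                   # precision(N) = recall(N) / N
    return recall, precision
```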
2.1 Popular items vs. long-tail
According to the well known long-tail distribution of rated
items applicable to many commercial systems, the majority
of ratings are condensed in a small fraction of the most pop-
ular items [1].
Figure 1 plots the empirical rating distributions of the
Netflix and Movielens datasets. Items in the vertical axis
are ordered according to their popularity, most popular at
the bottom. We observe that about 33% of the ratings collected by Netflix involve only 1.7% of the most popular items (i.e., 302 items). We refer to this small set of very popular items as the short-head, and to the remaining set of less popular items – about 98% of the total – as the long-tail [5]. We also note that Movielens’ rating distribution is slightly less long-tailed than Netflix’s: the short-head (33% of ratings) involves 5.5% of the most popular items (i.e., 213 items).
Recommending popular items is trivial and does not bring much benefit to users and content providers. On the other hand, recommending less known items adds novelty and serendipity for users, but it is usually a more difficult task. In this study we aim at evaluating the accuracy of recommender algorithms in suggesting non-trivial items. To this purpose, the test set T has been further partitioned into two subsets, T_head and T_long, such that items in T_head are in the short-head while items in T_long are in the long-tail of the rating distribution.
3. COLLABORATIVE ALGORITHMS
Most recommender systems are based on collaborative filtering (CF), where recommendations rely only on past user behavior (referred to here as ‘ratings’, though such behavior can include other user activities on items, like purchases, rentals and clicks), regardless of domain knowledge. There are two primary approaches to CF: (i) the neighborhood approach and (ii) the latent factor approach.
Neighborhood models represent the most common appro-
ach to CF. They are based on the similarity among either
users or items. For instance, two users are similar because
they have rated similarly the same set of items. A dual
concept of similarity can be defined among items.
Latent factor approaches model users and items as vectors
in the same ‘latent factor’ space by means of a reduced num-
ber of hidden factors. In such a space, users and items are
directly comparable: the rating of user uon item iis pre-
dicted by the proximity (e.g., inner-product) between the
related latent factor vectors.
3.1 Non-personalized models
Non-personalized recommenders present to any user a pre-
defined, fixed list of items, regardless of his/her preferences.
Such algorithms serve as baselines for the more complex per-
sonalized algorithms.
A simple estimation rule, referred to as Movie Average
(MovieAvg), recommends top-N items with the highest av-
erage rating. The rating of user uon item iis predicted
as the mean rating expressed by the community on item i,
regardless of the ratings given by u.
A similar prediction schema, denoted by Top Popular (Top-
Pop), recommends top-N items with the highest popularity
(largest number of ratings). Notice that in this case the rat-
ing of user uabout item icannot be inferred, but the output
of this algorithm is only a ranked list of items. As a conse-
quence, RMSE or other error metrics are not applicable.
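Both baselines reduce to a single pass over the ratings. A minimal sketch (the tuple-list input format and the function name are illustrative assumptions):

```python
from collections import defaultdict

def movie_avg_and_top_pop(ratings, N):
    """ratings: iterable of (user, item, rating) tuples.
    Returns (top-N items by mean rating, top-N items by rating count)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, r in ratings:
        sums[item] += r
        counts[item] += 1
    # MovieAvg: rank items by community mean rating
    movie_avg = sorted(sums, key=lambda i: sums[i] / counts[i], reverse=True)[:N]
    # TopPop: rank items by popularity (number of ratings)
    top_pop = sorted(counts, key=lambda i: counts[i], reverse=True)[:N]
    return movie_avg, top_pop
```

Note that TopPop produces only a ranking, never a rating estimate, which is why RMSE is undefined for it.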
3.2 Neighborhood models
Neighborhood models base their prediction on the simi-
larity relationships among either users or items.
Algorithms centered on user-user similarity predict the
rating by a user based on the ratings expressed by users sim-
ilar to him about such item. On the other hand, algorithms
centered on item-item similarity compute the user preference
for an item based on his/her own ratings on similar items.
The latter is usually the preferred approach (e.g., [15]), as it
usually performs better in RMSE terms, while being more
scalable. Both advantages are related to the fact that the
number of items is typically smaller than the number of
users. Another advantage of item-item algorithms is that
reasoning behind a recommendation to a specific user can be
explained in terms of the items previously rated by him/her.
In addition, basing the system parameters on items (rather than users) allows a seamless handling of users and ratings new to the system. For such reasons, we focus on item-item neighborhood algorithms.
The similarity between item i and item j is measured as the tendency of users to rate items i and j similarly. It is typically based either on the cosine, the adjusted cosine, or (most commonly) the Pearson correlation coefficient [15].

Item-item similarity is computed on the common raters. In the typical case of a very sparse dataset, it is likely that some pairs of items have a poor support, leading to a non-reliable similarity measure. For such a reason, if n_ij denotes the number of common raters and s_ij the similarity between item i and item j, we can define the shrunk similarity d_ij as the coefficient

d_ij = (n_ij / (n_ij + λ1)) · s_ij

where λ1 is a shrinking factor [10]. A typical value of λ1 is 100.
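The shrinkage is a one-line formula; a sketch, with the default set to the typical λ1 = 100 mentioned in the text:

```python
def shrunk_similarity(s_ij, n_ij, shrink=100.0):
    """Shrink a raw item-item similarity s_ij toward zero when the number
    of common raters n_ij is small: d_ij = n_ij / (n_ij + lambda1) * s_ij."""
    return n_ij / (n_ij + shrink) * s_ij
```

With few common raters the similarity is heavily damped; with many, d_ij approaches s_ij.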
Neighborhood models are further enhanced by means of a kNN (k-nearest-neighborhood) approach. When predicting rating r_ui, we consider only the k items rated by u that are the most similar to i. We denote the set of most similar items by D^k(u;i). The kNN approach discards the items poorly correlated to the target item, thus decreasing noise and improving the quality of recommendations.
Prior to comparing and summing different ratings, it is advised to remove the different biases which mask the more fundamental relations between items. Such biases include item effects, which represent the fact that certain items tend to receive higher ratings than others. They also include user effects, which represent the tendency of certain users to rate higher than others. A more delicate calculation of the biases would also estimate temporal effects [11], but this is beyond the scope of this work. We take as baselines the static item- and user-biases, following [10]. Formally, the bias associated with the rating of user u to item i is denoted by b_ui.
An item-item kNN method predicts the residual rating r_ui − b_ui as the weighted average of the residual ratings of similar items:

r̂_ui = b_ui + Σ_{j∈D^k(u;i)} d_ij (r_uj − b_uj) / Σ_{j∈D^k(u;i)} d_ij    (1)

Hereinafter, we refer to this model as Correlation Neighborhood (CorNgbr), where s_ij is measured as the Pearson correlation coefficient.
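The prediction rule (1) can be sketched as follows; the helper name and the per-neighbor triple format are illustrative, and the bias b_ui is assumed precomputed:

```python
def corngbr_predict(b_ui, neighbors):
    """Eq. (1): weighted average of residuals over the k most similar
    items D^k(u;i). neighbors is a list of (d_ij, r_uj, b_uj) triples."""
    num = sum(d * (r - b) for d, r, b in neighbors)
    den = sum(d for d, _, _ in neighbors)
    # fall back to the bias alone when no neighbors are available
    return b_ui if den == 0 else b_ui + num / den
```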
3.2.1 Non-normalized Cosine Neighborhood
Notice that in (1), the denominator forces the predicted rating values to fall in the correct range, e.g., [1...5] for a typical star-rating system. However, for a top-N recommendation task, exact rating values are not necessary. We simply want to rank items by their appeal to the user. In such a case, we can simplify the formula by removing the denominator. A benefit of this is a higher ranking for items with many similar neighbors (that is, a high Σ_{j∈D^k(u;i)} d_ij), where we have a higher confidence in the recommendation. Therefore, we propose to rank items by the following coefficient, denoted by r̂_ui:

r̂_ui = b_ui + Σ_{j∈D^k(u;i)} d_ij (r_uj − b_uj)    (2)

Here r̂_ui does not represent a proper rating, but is rather a metric for the association between user u and item i. We should note that similar non-normalized neighborhood rules were mentioned by others [7, 10].
In our experiments, the best results in terms of accuracy metrics have been obtained by computing s_ij as the cosine similarity. Unlike the Pearson correlation, which is computed only on ratings shared by common raters, the cosine coefficient between items i and j is computed over all ratings (taking missing values as zeroes), that is: cos(i, j) = (i · j) / (‖i‖₂ · ‖j‖₂), where i and j here denote the zero-imputed rating vectors of the two items. We denote such model by Non-Normalized Cosine Neighborhood (NNCosNgbr).
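A dense-matrix sketch of NNCosNgbr follows. Two simplifying assumptions, not from the paper: per-item means stand in for the full bias model b_ui, and the matrix is small enough to keep dense.

```python
import numpy as np

def nncos_scores(R, user, k=2):
    """Sketch of NNCosNgbr (Eq. (2)) with a simplified per-item-mean bias.
    R: dense user x item matrix, 0 marks a missing rating."""
    # cosine similarity over full item columns, missing ratings kept as zeros
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0
    S = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(S, 0.0)
    # simplified biases: per-item mean over observed ratings only
    n_obs = np.maximum((R > 0).sum(axis=0), 1)
    b = R.sum(axis=0) / n_obs
    rated = np.where(R[user] > 0)[0]
    scores = np.empty(R.shape[1])
    for i in range(R.shape[1]):
        # k most similar rated items; no denominator -> an association
        # score for ranking, not a calibrated rating
        neigh = rated[np.argsort(S[i, rated])[::-1][:k]]
        scores[i] = b[i] + np.sum(S[i, neigh] * (R[user, neigh] - b[neigh]))
    return scores
```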
3.3 Latent Factor Models
Recently, several recommender algorithms based on la-
tent factor models have been proposed. Most of them are
based on factoring the user-item ratings matrix [12], also
informally known as SVD models after the related Singular
Value Decomposition.
The key idea of SVD models is to factorize the user-item rating matrix into a product of two lower rank matrices, one containing the so-called ‘user factors’, the other one containing the so-called ‘item factors’. Thus, each user u is represented by an f-dimensional user factors vector p_u ∈ ℜ^f. Similarly, each item i is represented by an item factors vector q_i ∈ ℜ^f. The prediction of the rating given by user u for item i is computed as the inner product between the related factor vectors (adjusted for biases), i.e.,

r̂_ui = b_ui + p_u · q_i^T    (3)
Since conventional SVD is undefined in the presence of
unknown values – i.e., the missing ratings – several solutions have been proposed. Earlier works addressed the issue by filling missing ratings with baseline estimations (e.g.,
[16]). However, this leads to a very large, dense user rating
matrix, whose factorization becomes computationally infea-
sible. More recent works learn factor vectors directly on
known ratings through a suitable objective function which
minimizes prediction error. The proposed objective func-
tions are usually regularized in order to avoid overfitting
(e.g., [14]). Typically, gradient descent is applied to mini-
mize the objective function.
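A single regularized gradient step for one observed rating might look like the following sketch; the learning rate and regularization constants are illustrative, not values from the cited works:

```python
def sgd_step(p_u, q_i, r_ui, lr=0.005, reg=0.02):
    """One stochastic gradient step on the regularized squared error
    (r_ui - p_u.q_i)^2 + reg * (|p_u|^2 + |q_i|^2) for a single rating."""
    err = r_ui - sum(p * q for p, q in zip(p_u, q_i))
    # move both factor vectors against the gradient
    new_p = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_p, new_q
```

Iterating this step over all known ratings, for several epochs, is the usual training loop.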
As with neighborhood methods, this article concentrates
on methods which represent users as a combination of item
features, without requiring any user-specific parameteriza-
tion. The advantages of these methods are that they can
create recommendations for users new to the system without
re-evaluation of parameters. Likewise, they can immediately adjust their recommendations to newly entered ratings, providing users with immediate feedback on their actions.
Finally, such methods can explain their recommendations in
terms of items previously rated by the user.
Thus, we experimented with a powerful matrix factoriza-
tion model, which indeed represents users as a combination
of item features. The method is known as Asymmetric-SVD
(AsySVD) and is reported to reach an RMSE of 0.9000 on
the Netflix dataset [10].
In addition, we have experimented with a beefed up matrix-
factorization approach known as SVD++ [10], which repre-
sents highest quality in RMSE-optimized factorization meth-
ods, albeit users are no longer represented as a combination
of item features; see [10].
3.3.1 PureSVD
While pursuing a top-N recommendation task, we are in-
terested only in a correct item ranking, not caring about
exact rating prediction. This grants us some flexibility, like
considering all missing values in the user rating matrix as
zeros, despite being out of the 1-to-5 star rating range. In
terms of predictive power, the choice of zero is not very im-
portant, and we have received similar results with higher
imputed values. Importantly, now we can leverage existing
highly optimized software packages for performing conven-
tional SVD on sparse matrices, which becomes feasible since
all matrix entries are now non-missing. Thus, the user rating matrix R is estimated by the factorization [2]:

R̂ = U · Σ · Q^T    (4)

where U is an n × f orthonormal matrix, Q is an m × f orthonormal matrix, and Σ is an f × f diagonal matrix containing the first f singular values.

Table 1: Statistical properties of the Movielens and Netflix datasets.

Dataset     Users     Items    Ratings   Density
Movielens   6,040     3,883    1M        4.26%
Netflix     480,189   17,770   100M      1.18%

In order to demonstrate the ease of imputing zeroes, we should mention that we used a non-multithreaded SVD package (SVDLIBC, based on the SVDPACKC library [4]), which factorized the 480K-user by 17,770-movie Netflix dataset in under 10 minutes on an i7 PC (f = 150).
Let us define P = U · Σ, so that the u-th row of P represents the user factors vector p_u, while the i-th row of Q represents the item factors vector q_i. Accordingly, r̂_ui can be computed similarly to (3). In addition, since U and Q have orthonormal columns, we can straightforwardly derive that:

U · Σ = R · Q    (5)

where R is the user rating matrix. Consequently, by denoting with r_u the u-th row of the user rating matrix – i.e., the vector of ratings of user u – we can rewrite the prediction rule as:

r̂_ui = r_u · Q · q_i^T    (6)

Note that, similarly to (1), in a slight abuse of notation, the symbol r̂_ui is not exactly a valid rating value, but an association measure between user u and item i.
In the following we will refer to this model as PureSVD.
As with item-item kNN and AsySVD, PureSVD offers all the benefits of representing users as a combination of item features (by Eq. (5)), without any user-specific parameterization. It also offers convenient optimization, which does not require tuning learning constants.
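PureSVD can be sketched in a few lines with a dense SVD; the paper used the sparse SVDLIBC package, and `scipy.sparse.linalg.svds` would be the sparse analogue. The dense call and toy-sized input are assumptions of this sketch:

```python
import numpy as np

def pure_svd_scores(R, f):
    """PureSVD sketch: truncated SVD of the zero-imputed rating matrix R,
    then score every user-item pair via Eq. (6): r_hat_u = r_u . Q . Q^T."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    Q = Vt[:f].T          # item factors (m x f, orthonormal columns)
    return R @ Q @ Q.T    # association scores, not calibrated ratings
```

Ranking each user's unrated items by their row of the returned matrix yields the top-N list; new ratings only change r_u, so no retraining is needed per user.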
4. RESULTS
In this section we present the quality of the recommender algorithms presented in Section 3 on two standard datasets:
MovieLens [13] and Netflix [3]. Both are publicly available
movie rating datasets. Collected ratings are in a 1-to-5 star
scale. Table 1 summarizes their statistical properties.
We used the methodology defined in Section 2 for evaluating seven recommender algorithms. The first two - MovieAvg and TopPop - are non-personalized algorithms, and we would expect them to be outperformed by any recommender algorithm. The third prediction rule - CorNgbr - is a well tuned neighborhood-based algorithm, probably the most popular in the literature of collaborative filtering. The fourth algorithm is a variant of CorNgbr - NNCosNgbr - and it is one of the two proposed algorithms oriented to accuracy metrics. Fifth is the latent factor model AsySVD with 200 factors. Sixth is a 200-D SVD++, among the most powerful latent factor models in terms of RMSE. Finally, we consider our variant of latent factor models, PureSVD, which is shown in two configurations: one with fewer latent factors (50), and one with a larger number of latent factors (150 for Movielens and 300 for the larger Netflix dataset).
Three of the algorithms – TopPop, NNCosNgbr, and PureSVD
– are not sensible from an error minimization viewpoint and
cannot be assessed by an RMSE measure. The other four
algorithms were optimized to deliver best RMSE results,
and their RMSE scores on the Netflix test set are as fol-
lows: 1.053 for MovieAvg, 0.9406 for CorNgbr, 0.9000 for
AsySVD, and 0.8911 for SVD++ [10].
For each dataset, we have performed one set of experi-
ments on the full test set and one set of experiments on the
long-tail test set. We report the recall as a function of N
(i.e., the number of items recommended), and the precision
as a function of the recall. As for recall(N), we have zoomed in on N in the range [1...20]. Larger values of N can be ignored for a typical top-N recommendation task. Indeed, there is no difference whether an appealing movie is placed within the top 100 or the top 200, because in neither case will it be presented to the user.
4.1 Movielens dataset
Figure 2 reports the performance of the algorithms on the Movielens dataset over the full test set. It is apparent that the algorithms have significant performance disparity in terms of top-N accuracy. For instance, the recall of AsySVD at N = 10 is about 0.28, i.e., the model has a probability of 28% to place an appealing movie in the top-10. Surprisingly, the recall of the non-personalized TopPop is very similar to that of AsySVD (e.g., at N = 10 recall is about 0.29). The best algorithms in terms of accuracy are the non-RMSE-oriented NNCosNgbr and PureSVD, which reach, at N = 10, recalls of about 0.44 and 0.52, respectively. As for the latter, this means that about 50% of 5-star movies are presented in a top-10 recommendation. The best algorithm in the RMSE-oriented family is SVD++, with a recall close to that of NNCosNgbr.
Figure 2(b) confirms that PureSVD also outperforms the other algorithms in terms of precision metrics, followed by the other non-RMSE-oriented algorithm – NNCosNgbr. Each line represents the precision of the algorithm at a given recall. For example, when the recall is about 0.2, the precision of NNCosNgbr is about 0.12. Again, TopPop performance is aligned with that of a state-of-the-art algorithm such as AsySVD. We should note the gross underperformance of the widely used CorNgbr algorithm, whose performance is in line with the very naive MovieAvg. We should also note that the precision of SVD++ is competitive with that of PureSVD50 for small values of recall.
The strange and somewhat unexpected result of TopPop motivates the second set of experiments, accomplished over the long-tail items, whose results are drawn in Figure 3. As a reminder, we now exclude the very popular items from consideration. Here the ordering among the several recommender algorithms aligns better with our expectations. In fact, the recall and precision of the non-personalized TopPop fall dramatically, and it is very unlikely to recommend a 5-star movie within the first 20 positions. However, even when focusing on the long-tail, the best algorithm is still PureSVD, whose recall at N = 10 is about 40%. Note that while the best performance of PureSVD was with 50 latent factors in the case of the full test set, here the best performance is reached with a larger number of latent factors, i.e., 150. The performance of NNCosNgbr now becomes significantly worse than PureSVD, while SVD++ is the best within the RMSE-oriented algorithms.
4.2 Netflix dataset
Analogously to the results presented for Movielens, Fig-
ures 4 and 5 show the performance of the algorithms on the
Netflix dataset. As before, we focus on both the full test set
and the long-tail test set.
Once again, the non-personalized TopPop shows surprisingly good results when including the 2% head items, outperforming the widely popular CorNgbr. However, the more powerful AsySVD and SVD++, which were possibly better tuned for the Netflix data, slightly outperform TopPop. Note also that AsySVD is now in line with SVD++.
Consistent with the Movielens experience, the best per-
forming algorithm in terms of recall and precision is still the
non-RMSE-oriented PureSVD. As for the other non-RMSE-
oriented NNCosNgbr, the picture becomes mixed. It is still
outperforming the RMSE-oriented algorithms when includ-
ing the head items, but somewhat underperforms them when
the top-2% most popular items are excluded.
The behavior of the commonly used CorNgbr on the Netflix dataset is very surprising. While it significantly underperforms the others on the full test set, it becomes among the top performers when concentrating on the longer tail. In fact, while all the algorithms decrease their precision and recall when passing from the full to the long-tail test set, CorNgbr appears more accurate in recommending long-tail items. After all, the wide acceptance of the CorNgbr approach might be for a reason, given the importance of long-tail items.
5. DISCUSSION
Over both the Movielens and Netflix datasets, regardless of the inclusion of head items, PureSVD is consistently the top performer, beating more detailed and sophisticated latent factor models. Given its simplicity, and its poor design in terms of RMSE optimization, we did not expect this result. In fact, we view it as good news for practitioners of recommender systems, as PureSVD combines multiple advantages. First, it is very easy to code, without a need to tune learning constants, and it fully relies on off-the-shelf optimized SVD packages. This comes with good computational performance in both offline and online modes. PureSVD also has the convenience of representing the users as a combination of item features (Eq. (6)), offering designers flexibility in handling new users, new ratings by existing users, and explaining the reasoning behind the generated recommendations.
An interesting finding, observed on both the Movielens and Netflix datasets, is that when moving to longer tail items, accuracy improves when raising the dimensionality of the PureSVD model. This may be related to the fact that the first latent factors of PureSVD capture properties of the most popular items, while the additional latent factors represent more refined features related to unpopular items. Hence, when practitioners use PureSVD, they should pick its dimensionality while accounting for the fact that the number of latent factors influences the quality of recommendations on long-tail items differently than on head items.
We would like to offer an explanation as to why PureSVD consistently delivers better top-N results than the best RMSE-refined latent factor models. This may have to do with a limitation of RMSE testing, which concentrates only on the ratings that the user provided to the system. This way, RMSE (or MAE, for that matter) is measured on a held-out test set containing only items that the user chose to rate, while completely missing any evaluation of the method on
Figure 2: Movielens: (a) recall-at-N and (b) precision-versus-recall on all items.

Figure 3: Movielens: (a) recall-at-N and (b) precision-versus-recall on long-tail (94% of items).

Figure 4: Netflix: (a) recall-at-N and (b) precision-versus-recall on all items.

Figure 5: Netflix: (a) recall-at-N and (b) precision-versus-recall on long-tail (98% of items).
items that the user has never rated. This testing mode suits the RMSE-oriented models, which are trained only on the known ratings while largely ignoring the missing entries. Yet, such a testing methodology misses much of the reality, where all items should count, not only those the user actually rated in the past. The proposed top-N accuracy measures do better in this respect, by directly involving all possible items (including unrated ones) in the testing phase. This may explain the outperformance of PureSVD, which considers all possible user-item pairs (regardless of rating availability) in the training phase.
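The all-items testing mode advocated here can be sketched as follows. This is our own hedged illustration (function names and the candidate-pool size are assumptions, not the paper's code), in the spirit of ranking each held-out relevant item against a large pool of items the user never rated and counting a hit when it lands in the top N:

```python
import numpy as np

def recall_at_n(score_fn, test_cases, all_items, n=10,
                n_candidates=1000, rng=None):
    """Each test case is (user, held-out relevant item, items the user
    rated in training). The held-out item is ranked against up to
    `n_candidates` random items the user never rated; a hit means it
    ranks within the top n. `score_fn(user, items)` is any
    recommender's scoring function."""
    rng = rng or np.random.default_rng(0)
    hits = 0
    for user, target, rated in test_cases:
        pool = np.setdiff1d(all_items, list(rated))   # unrated items
        candidates = rng.choice(pool, size=min(n_candidates, len(pool)),
                                replace=False)
        items = np.append(candidates, target)          # target goes last
        scores = np.asarray(score_fn(user, items))
        rank = int((scores > scores[-1]).sum())        # items beating the target
        hits += rank < n
    return hits / len(test_cases)
```

Because the candidate pool deliberately includes items with no rating from the user, a model that only fits the observed ratings gets no free pass on the missing entries.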
Our general advice to practitioners is to consider PureSVD as a recommender algorithm. Still, there are several unexplored ways that may improve PureSVD. First, one can optimize the value imputed at the missing entries. Direct usage
of ready sparse SVD solvers (which usually assume a default
value of zero) would still be possible by translating all given
scores. For example, imputing a value of 3 instead of zero
would be effectively achieved by translating the given star
ratings from the [1...5] range into the [-2...2] range. Second,
one can grant a lower confidence to the imputed values, so that the SVD spends more effort on fitting the real ratings. For an explanation of how this can be accomplished, see [9]. However, we would expect such confidence weighting to significantly increase the time complexity of the offline training phase.
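The rating-translation trick can be made concrete with a short sketch (our own illustrative code; the function name and defaults are assumptions). Shifting the stars makes the desired imputed value coincide with the sparse solver's implicit zero:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def shifted_pure_svd(rows, cols, stars, shape, n_factors=50, imputed=3.0):
    """Run PureSVD while effectively imputing `imputed` (rather than 0)
    at the missing entries: shift the observed star ratings so the
    chosen imputed value maps onto the solver's implicit zero, e.g.
    imputing 3 turns [1..5] ratings into [-2..2]."""
    shifted = np.asarray(stars, dtype=np.float64) - imputed
    R = csr_matrix((shifted, (rows, cols)), shape=shape)
    _, _, qt = svds(R, k=n_factors)
    return qt.T   # item factors; for rating prediction, add `imputed` back
```

For top-N ranking the constant shift is immaterial, so the item ordering produced from the shifted model can be used directly.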
Evaluation of recommender systems has long been divided between
accuracy metrics (e.g., precision/recall) and error metrics
(notably, RMSE and MAE). Their mathematical convenience and fit with formal optimization methods have made error metrics like RMSE more popular, and they indeed dominate the literature. However, it is well recognized
that accuracy measures may be a more natural yardstick, as
they directly assess the quality of top-N recommendations.
This work shows, through an extensive empirical study,
that the convenient assumption that an error metric such as RMSE can serve as a good proxy for top-N accuracy is questionable at best. There is no monotonic relation between
error metrics and accuracy metrics. This may call for a re-
evaluation of optimization goals for top-N systems. On the
bright side we have presented simple and efficient variants
of known algorithms, which are useless in RMSE terms, and
yet deliver superior results when pursuing top-N accuracy.
In passing, we have also discussed possible pitfalls in the
design of a test set for conducting a top-N accuracy eval-
uation. In particular, a careless construction of the test
set would make recall and precision strongly biased towards
non-personalized algorithms. An easy solution, which we
adopted, was excluding the extremely popular items from
the test set (while retaining 98% of the items). The resulting test set, which emphasizes the important non-trivial items, agrees better with our expectations.
First, it correctly shows the lower value of non-personalized
algorithms. Second, it shows a good behavior for the widely used correlation-based kNN approach, which otherwise (when evaluated on the full set of items) exhibits extremely poor results, in sharp contrast with accepted practice.
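The test-set safeguard described above can be sketched as a simple filter (illustrative code with our own names; the head fraction is a tunable, with the paper retaining 94%/98% of the items on Movielens/Netflix respectively):

```python
import numpy as np

def long_tail_test_set(test_items, train_item_counts, head_frac=0.02):
    """Drop test ratings that fall on the short head: the `head_frac`
    most popular items, with popularity measured on the training
    ratings. This keeps recall/precision from being dominated by a
    trivial most-popular recommender."""
    order = np.argsort(train_item_counts)[::-1]        # most popular first
    n_head = int(np.ceil(head_frac * len(train_item_counts)))
    head = set(order[:n_head].tolist())
    keep = np.array([i not in head for i in test_items])
    return np.asarray(test_items)[keep]
```

Measuring popularity on the training ratings (not the test set) avoids leaking test information into the split itself.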
[1] C. Anderson. The Long Tail: Why the Future of
Business Is Selling Less of More. Hyperion, July 2006.
[2] R. Bambini, P. Cremonesi, and R. Turrin.
Recommender Systems Handbook, chapter A
Recommender System for an IPTV Service Provider:
a Real Large-Scale Production Environment. Springer.
[3] J. Bennett and S. Lanning. The Netflix Prize. Proceedings of KDD Cup and Workshop, pages 3–6, 2007.
[4] M. W. Berry. Large-scale sparse singular value
computations. The International Journal of
Supercomputer Applications, 6(1):13–49, Spring 1992.
[5] O. Celma and P. Cano. From hits to niches? or how popular artists can bias music recommendation and discovery. In 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition, Las Vegas, USA, August 2008.
[6] P. Cremonesi, E. Lentini, M. Matteucci, and
R. Turrin. An evaluation methodology for
recommender systems. 4th Int. Conf. on Automated
Solutions for Cross Media Content and Multi-channel
Distribution, pages 224–231, Nov 2008.
[7] M. Deshpande and G. Karypis. Item-based top-n
recommendation algorithms. ACM Transactions on
Information Systems (TOIS), 22(1):143–177, 2004.
[8] J. Herlocker, J. Konstan, L. Terveen, and J. Riedl.
Evaluating collaborative filtering recommender
systems. ACM Transactions on Information Systems
(TOIS), 22(1):5–53, 2004.
[9] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In IEEE International Conference on Data Mining (ICDM '08), pages 263–272, 2008.
[10] Y. Koren. Factorization meets the neighborhood: a
multifaceted collaborative filtering model. In KDD ’08:
Proceeding of the 14th ACM SIGKDD international
conference on Knowledge discovery and data mining,
pages 426–434, New York, NY, USA, 2008. ACM.
[11] Y. Koren. Collaborative filtering with temporal
dynamics. In KDD ’09: Proceedings of the 15th ACM
SIGKDD international conference on Knowledge
discovery and data mining, pages 447–456, New York,
NY, USA, 2009. ACM.
[12] Y. Koren, R. M. Bell, and C. Volinsky. Matrix
factorization techniques for recommender systems.
IEEE Computer, 42(8):30–37, 2009.
[13] B. Miller, I. Albert, S. Lam, J. Konstan, and J. Riedl.
MovieLens unplugged: experiences with an
occasionally connected recommender system.
Proceedings of the 8th international conference on
Intelligent user interfaces, pages 263–266, 2003.
[14] A. Paterek. Improving regularized singular value
decomposition for collaborative filtering. Proceedings
of KDD Cup and Workshop, 2007.
[15] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl.
Item-based collaborative filtering recommendation
algorithms. 10th Int. Conf. on World Wide Web,
pages 285–295, 2001.
[16] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl.
Application of Dimensionality Reduction in
Recommender System-A Case Study. Defense
Technical Information Center, 2000.