Performance of Recommender Algorithms
on Top-N Recommendation Tasks
Paolo Cremonesi
Politecnico di Milano
Milan, Italy
paolo.cremonesi@polimi.it
Yehuda Koren
Yahoo! Research
Haifa, Israel
yehuda@yahoo-inc.com
Roberto Turrin
Neptuny
Milan, Italy
roberto.turrin@polimi.it
ABSTRACT
In many commercial systems, the 'best bet' recommendations are shown, but the predicted rating values are not. This is usually referred to as a top-N recommendation task, where the goal of the recommender system is to find a few specific items which are supposed to be most appealing to the user. Common methodologies based on error metrics (such as RMSE) are not a natural fit for evaluating the top-N recommendation task. Rather, top-N performance can be directly measured by alternative methodologies based on accuracy metrics (such as precision/recall).

An extensive evaluation of several state-of-the-art recommender algorithms suggests that algorithms optimized for minimizing RMSE do not necessarily perform as expected on the top-N recommendation task. Results show that improvements in RMSE often do not translate into accuracy improvements. In particular, a naive non-personalized algorithm can outperform some common recommendation approaches and almost match the accuracy of sophisticated algorithms. Another finding is that the very few top popular items can skew the top-N performance. The analysis points out that when evaluating a recommender algorithm on the top-N recommendation task, the test set should be chosen carefully in order not to bias accuracy metrics towards non-personalized solutions. Finally, we offer practitioners new variants of two collaborative filtering algorithms that, regardless of their RMSE, significantly outperform other recommender algorithms in pursuing the top-N recommendation task, while offering additional practical advantages. This comes as a surprise given the simplicity of these two methods.
Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software—user profiles and alert services; performance evaluation (efficiency and effectiveness); H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
RecSys 2010, September 26–30, 2010, Barcelona, Spain.
Copyright 2010 ACM 978-1-60558-906-0/10/09 ...$10.00.
General Terms
Algorithms, Experimentation, Measurement, Performance
1. INTRODUCTION
A common practice with recommender systems is to evaluate their performance through error metrics such as RMSE (root mean squared error), which capture the average error between the actual ratings and the ratings predicted by the system. However, in many commercial systems only the 'best bet' recommendations are shown, while the predicted rating values are not [8]. That is, the system suggests a few specific items to the user that are likely to be very appealing to him. While the majority of the literature focuses on convenient error metrics (RMSE, MAE), such classical error criteria do not really measure top-N performance. At most, they can serve as proxies of the true top-N experience. Direct evaluation of top-N performance must be accomplished by means of alternative methodologies based on accuracy metrics (e.g., recall and precision).

In this paper we evaluate – through accuracy metrics – the performance of several collaborative filtering algorithms in pursuing the top-N recommendation task. This evaluation is contrasted with the performance of the same methods on the RMSE metric. Tests have been performed on the Netflix and Movielens datasets.

The contribution of the work is threefold: (i) we show that there is no trivial relationship between error metrics and accuracy metrics; (ii) we propose a careful construction of the test set to avoid biasing accuracy metrics; (iii) we introduce new variants of existing algorithms that improve top-N performance together with other practical benefits.
We first compare some state-of-the-art algorithms (e.g., Asymmetric SVD) with a non-personalized algorithm based on item popularity. The surprising result is that the performance of the non-personalized algorithm on top-N recommendations is comparable to the performance of sophisticated, personalized algorithms, regardless of their RMSE.

However, a non-personalized, popularity-based algorithm can only provide trivial recommendations, interesting neither to users, who can get bored and disappointed by the recommender system, nor to content providers, who invest in a recommender system to push up sales of less known items. For this reason, we run an additional set of experiments in order to evaluate the performance of the algorithms while excluding the extremely popular items. As expected, the accuracy of all algorithms decreases, as it is more difficult to recommend non-trivial items. Yet, the ranking of the different algorithms aligns better with our expectations, with the non-personalized methods being ranked lower. Thus, when evaluating algorithms on the top-N recommendation task, we advise choosing the test set carefully; otherwise accuracy metrics are strongly biased.

Finally, when pursuing a top-N recommendation task, exact rating prediction is not required. We present new variants of two collaborative filtering algorithms that are not designed for minimizing RMSE, but consistently outperform other recommender algorithms in top-N recommendations. This finding becomes even more important when considering the simple and less conventional nature of the outperforming methods.
2. TESTING METHODOLOGY
The testing methodology adopted in this study is similar to the one described in [6] and, in particular, in [10]. For each dataset, known ratings are split into two subsets: a training set M and a test set T. The test set T contains only 5-star ratings, so we can reasonably state that T contains items relevant to the respective users.

The detailed procedure used to create M and T from the Netflix dataset is similar to the one set for the Netflix prize, maintaining compatibility with results published in other research papers [3]. Netflix released a training set containing about 100M ratings. In addition, Netflix also provided a validation set, referred to as the probe set, containing 1.4M ratings. In this work, the training set M is the original Netflix training set, while the test set T contains all the 5-star ratings from the probe set (|T| = 384,573). As expected, the probe set was not used for training.

We adopted a similar procedure for the Movielens dataset [13]. We randomly sub-sampled 1.4% of the ratings from the dataset in order to create a probe set. The training set M contains the remaining ratings. The test set T contains all the 5-star ratings from the probe set.
In order to measure recall and precision, we first train the model over the ratings in M. Then, for each item i rated 5 stars by user u in T:

(i) We randomly select 1000 additional items unrated by user u. We may assume that most of them will not be of interest to user u.

(ii) We predict the ratings for the test item i and for the additional 1000 items.

(iii) We form a ranked list by ordering all 1001 items according to their predicted ratings. Let p denote the rank of the test item i within this list. The best result corresponds to the case where the test item i precedes all the random items (i.e., p = 1).

(iv) We form a top-N recommendation list by picking the N top-ranked items from the list. If p ≤ N we have a hit (i.e., the test item i is recommended to the user). Otherwise we have a miss. Chances of a hit increase with N. When N = 1001 we always have a hit.
[Figure 1: Rating distribution for Netflix (solid line) and Movielens (dashed line). Items are ordered according to popularity (most popular at the bottom); the short-head (popular) items sit below the long-tail (unpopular) items.]

The computation of recall and precision proceeds as follows. For any single test case, we have a single relevant item (the tested item i). By definition, recall for a single test case can assume either the value 0 (in case of a miss) or 1 (in case of a hit). Similarly, precision can assume either the value 0 or 1/N. The overall recall and precision are defined by averaging over all test cases:

$$\mathrm{recall}(N) = \frac{\#\mathrm{hits}}{|T|}$$

$$\mathrm{precision}(N) = \frac{\#\mathrm{hits}}{N \cdot |T|} = \frac{\mathrm{recall}(N)}{N}$$

where |T| is the number of test ratings. Note that the hypothesis that all the 1000 random items are non-relevant to user u tends to underestimate the computed recall and precision with respect to true recall and precision.
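To make the protocol concrete, here is a minimal Python sketch of the loop above. It assumes a hypothetical trained model exposed as a `score(u, i)` function, a `test_set` of (user, item) 5-star pairs, and an `unrated_items(u, k)` sampler; none of these names come from the paper.

```python
# Hypothetical sketch of the testing protocol of Section 2 (steps i-iv).
def evaluate_topn(score, test_set, unrated_items, N=10, n_random=1000):
    hits = 0
    for u, i in test_set:
        # (i) sample 1000 items unrated by u; (ii) score them plus the test item
        candidates = unrated_items(u, n_random) + [i]
        # (iii) rank all 1001 items by predicted rating, best first
        ranked = sorted(candidates, key=lambda j: score(u, j), reverse=True)
        p = ranked.index(i) + 1
        # (iv) hit if the test item enters the top-N list
        if p <= N:
            hits += 1
    recall = hits / len(test_set)
    precision = recall / N  # one relevant item per test case
    return recall, precision
```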
2.1 Popular items vs. long-tail
According to the well-known long-tail distribution of rated items applicable to many commercial systems, the majority of ratings are condensed in a small fraction of the most popular items [1].

Figure 1 plots the empirical rating distributions of the Netflix and Movielens datasets. Items on the vertical axis are ordered according to their popularity, most popular at the bottom. We observe that about 33% of the ratings collected by Netflix involve only 1.7% of the most popular items (i.e., 302 items). We refer to this small set of very popular items as the short-head, and to the remaining set of less popular items – about 98% of the total – as the long-tail [5]. We also note that Movielens' rating distribution is slightly less long-tailed than Netflix's: the short-head (33% of ratings) involves 5.5% of the most popular items (i.e., 213 items).

Recommending popular items is trivial and does not bring much benefit to users and content providers. On the other hand, recommending less known items adds novelty and serendipity for users, but it is usually a more difficult task. In this study we aim at evaluating the accuracy of recommender algorithms in suggesting non-trivial items. To this purpose, the test set T has been further partitioned into two subsets, T_head and T_long, such that items in T_head are in the short-head while items in T_long are in the long-tail of the distribution.
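As an illustration, the following sketch partitions items by the 33% coverage rule described above; the `ratings` list of (user, item) pairs is an assumed input, not an interface from the paper.

```python
from collections import Counter

def short_head(ratings, coverage=0.33):
    """Return the smallest set of most-popular items covering ~33% of ratings."""
    counts = Counter(item for _, item in ratings)
    target = coverage * len(ratings)
    head, covered = set(), 0
    for item, c in counts.most_common():  # items in decreasing popularity
        if covered >= target:
            break
        head.add(item)
        covered += c
    return head  # test cases whose item is outside this set form T_long
```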
3. COLLABORATIVE ALGORITHMS
Most recommender systems are based on collaborative filtering (CF), where recommendations rely only on past user behavior (referred to here as 'ratings', though such behavior can include other user activities on items, like purchases, rentals and clicks), regardless of domain knowledge. There are two primary approaches to CF: (i) the neighborhood approach and (ii) the latent factor approach.

Neighborhood models represent the most common approach to CF. They are based on the similarity among either users or items. For instance, two users are similar because they have rated the same set of items similarly. A dual concept of similarity can be defined among items.

Latent factor approaches model users and items as vectors in the same 'latent factor' space by means of a reduced number of hidden factors. In such a space, users and items are directly comparable: the rating of user u on item i is predicted by the proximity (e.g., inner product) between the related latent factor vectors.

3.1 Non-personalized models
Non-personalized recommenders present to any user a predefined, fixed list of items, regardless of his/her preferences. Such algorithms serve as baselines for the more complex personalized algorithms.

A simple estimation rule, referred to as Movie Average (MovieAvg), recommends the top-N items with the highest average rating. The rating of user u on item i is predicted as the mean rating expressed by the community on item i, regardless of the ratings given by u.

A similar prediction schema, denoted Top Popular (TopPop), recommends the top-N items with the highest popularity (largest number of ratings). Notice that in this case the rating of user u on item i cannot be inferred; the output of this algorithm is only a ranked list of items. As a consequence, RMSE or other error metrics are not applicable.
3.2 Neighborhood models
Neighborhood models base their prediction on the similarity relationships among either users or items.

Algorithms centered on user-user similarity predict the rating of a user for an item based on the ratings expressed for that item by similar users. On the other hand, algorithms centered on item-item similarity compute the user preference for an item based on his/her own ratings on similar items. The latter is usually the preferred approach (e.g., [15]), as it usually performs better in RMSE terms while being more scalable. Both advantages are related to the fact that the number of items is typically smaller than the number of users. Another advantage of item-item algorithms is that the reasoning behind a recommendation to a specific user can be explained in terms of the items previously rated by him/her. In addition, basing the system parameters on items (rather than users) allows seamless handling of users and ratings new to the system. For such reasons, we focus on item-item neighborhood algorithms.

The similarity between item i and item j is measured as the tendency of users to rate items i and j similarly. It is typically based either on the cosine, the adjusted cosine, or (most commonly) the Pearson correlation coefficient [15].

Item-item similarity is computed on the common raters. In the typical case of a very sparse dataset, it is likely that some pairs of items have poor support, leading to an unreliable similarity measure. For this reason, if $n_{ij}$ denotes the number of common raters and $s_{ij}$ the similarity between item i and item j, we can define the shrunk similarity $d_{ij}$ as the coefficient

$$d_{ij} = \frac{n_{ij}}{n_{ij} + \lambda_1} s_{ij}$$

where $\lambda_1$ is a shrinking factor [10]. A typical value of $\lambda_1$ is 100.

Neighborhood models are further enhanced by means of a kNN (k-nearest-neighbors) approach. When predicting the rating $r_{ui}$, we consider only the k items rated by u that are the most similar to i. We denote the set of most similar items by $D^k(u;i)$. The kNN approach discards the items poorly correlated to the target item, thus decreasing noise and improving the quality of recommendations.

Prior to comparing and summing different ratings, it is advised to remove the different biases which mask the more fundamental relations between items. Such biases include item effects, which represent the fact that certain items tend to receive higher ratings than others. They also include user effects, which represent the tendency of certain users to rate higher than others. A more delicate calculation of the biases would also estimate temporal effects [11], but this is beyond the scope of this work. We take as baselines the static item and user biases, following [10]. Formally, the bias associated with the rating of user u to item i is denoted by $b_{ui}$.

An item-item kNN method predicts the residual rating $r_{ui} - b_{ui}$ as the weighted average of the residual ratings of similar items:

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in D^k(u;i)} d_{ij} (r_{uj} - b_{uj})}{\sum_{j \in D^k(u;i)} d_{ij}} \qquad (1)$$

Hereinafter, we refer to this model as Correlation Neighborhood (CorNgbr), where $s_{ij}$ is measured as the Pearson correlation coefficient.
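The sketch below implements Eq. (1) under stated assumptions: precomputed shrunk similarities `d[i]` (a dict mapping item j to $d_{ij}$), a bias function `b(u, i)` returning $b_{ui}$, and `user_ratings[u]` mapping the items rated by u to their ratings. All of these containers are hypothetical scaffolding, not part of the paper.

```python
def predict_corngbr(u, i, d, b, user_ratings, k=50):
    """Item-item kNN prediction of Eq. (1)."""
    rated = user_ratings[u]
    # D^k(u;i): the k items rated by u that are most similar to i
    neighbors = sorted(rated, key=lambda j: d[i].get(j, 0.0), reverse=True)[:k]
    num = sum(d[i].get(j, 0.0) * (rated[j] - b(u, j)) for j in neighbors)
    den = sum(d[i].get(j, 0.0) for j in neighbors)
    return b(u, i) + (num / den if den > 0 else 0.0)
```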
3.2.1 Non-normalized Cosine Neighborhood
Notice that in (1) the denominator forces the predicted rating values to fall in the correct range, e.g., [1...5] for a typical star-rating system. However, for a top-N recommendation task, exact rating values are not necessary. We simply want to rank items by their appeal to the user. In such a case, we can simplify the formula by removing the denominator. A benefit of this is a higher ranking for items with many similar neighbors (that is, a high $\sum_{j \in D^k(u;i)} d_{ij}$), where we have higher confidence in the recommendation. Therefore, we propose to rank items by the following coefficient, denoted by $\hat{r}_{ui}$:

$$\hat{r}_{ui} = b_{ui} + \sum_{j \in D^k(u;i)} d_{ij} (r_{uj} - b_{uj}) \qquad (2)$$

Here $\hat{r}_{ui}$ does not represent a proper rating, but is rather a metric for the association between user u and item i. We should note that similar non-normalized neighborhood rules were mentioned by others [7, 10].

In our experiments, the best results in terms of accuracy metrics have been obtained by computing $s_{ij}$ as the cosine similarity. Unlike the Pearson correlation, which is computed only on ratings shared by common raters, the cosine coefficient between items i and j is computed over all ratings (taking missing values as zeroes), that is:

$$\cos(i, j) = \frac{\vec{i} \cdot \vec{j}}{\|\vec{i}\|_2 \cdot \|\vec{j}\|_2}$$

We denote this model by Non-Normalized Cosine Neighborhood (NNCosNgbr).
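A hedged sketch of the NNCosNgbr score follows, under these assumptions: `R` is a scipy.sparse user-item matrix (missing ratings stored as zeros, matching the cosine definition above), and `b` and `user_items` are the hypothetical bias function and rated-item index from the previous sketch. For brevity, shrinkage of the similarities is omitted, and the dense item-item matrix is only practical for modest item counts.

```python
from sklearn.metrics.pairwise import cosine_similarity

def nncos_scorer(R, b, user_items, k=50):
    # Cosine between item columns, computed over all ratings (zeros included).
    sim = cosine_similarity(R.T)  # dense (n_items x n_items); fine for a sketch

    def score(u, i):
        rated = user_items[u]
        neigh = sorted(rated, key=lambda j: sim[i, j], reverse=True)[:k]
        # Eq. (2): non-normalized weighted sum of residual ratings
        return b(u, i) + sum(sim[i, j] * (R[u, j] - b(u, j)) for j in neigh)

    return score
```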
3.3 Latent Factor Models
Recently, several recommender algorithms based on latent factor models have been proposed. Most of them are based on factoring the user-item ratings matrix [12], and are also informally known as SVD models after the related Singular Value Decomposition.

The key idea of SVD models is to factorize the user-item rating matrix into a product of two lower-rank matrices, one containing the so-called 'user factors' and the other containing the so-called 'item factors'. Thus, each user u is represented by an f-dimensional user factor vector $p_u \in \Re^f$. Similarly, each item i is represented by an item factor vector $q_i \in \Re^f$. The prediction of the rating given by user u to item i is computed as the inner product between the related factor vectors (adjusted for biases), i.e.,

$$\hat{r}_{ui} = b_{ui} + p_u q_i^T \qquad (3)$$

Since conventional SVD is undefined in the presence of unknown values – i.e., the missing ratings – several solutions have been proposed. Earlier works addressed the issue by filling missing ratings with baseline estimates (e.g., [16]). However, this leads to a very large, dense user rating matrix, whose factorization becomes computationally infeasible. More recent works learn factor vectors directly on the known ratings through a suitable objective function which minimizes prediction error. The proposed objective functions are usually regularized in order to avoid overfitting (e.g., [14]). Typically, gradient descent is applied to minimize the objective function.

As with neighborhood methods, this article concentrates on methods which represent users as a combination of item features, without requiring any user-specific parameterization. The advantage of these methods is that they can create recommendations for users new to the system without re-evaluation of parameters. Likewise, they can immediately adjust their recommendations to newly entered ratings, providing users with immediate feedback on their actions. Finally, such methods can explain their recommendations in terms of items previously rated by the user.

Thus, we experimented with a powerful matrix factorization model which indeed represents users as a combination of item features. The method is known as Asymmetric-SVD (AsySVD) and is reported to reach an RMSE of 0.9000 on the Netflix dataset [10].

In addition, we have experimented with a beefed-up matrix factorization approach known as SVD++ [10], which represents the highest quality among RMSE-optimized factorization methods, albeit users are no longer represented as a combination of item features; see [10].
3.3.1 PureSVD
While pursuing a top-N recommendation task, we are interested only in a correct item ranking, not caring about exact rating prediction. This grants us some flexibility, like considering all missing values in the user rating matrix as zeros, despite being out of the 1-to-5 star rating range. In terms of predictive power, the choice of zero is not very important, and we have obtained similar results with higher imputed values. Importantly, we can now leverage existing highly optimized software packages for performing conventional SVD on sparse matrices, which becomes feasible since all matrix entries are now non-missing. Thus, the user rating matrix R is estimated by the factorization [2]:

$$\hat{R} = U \cdot \Sigma \cdot Q^T \qquad (4)$$

where U is an n × f orthonormal matrix, Q is an m × f orthonormal matrix, and Σ is an f × f diagonal matrix containing the first f singular values.

Dataset     Users     Items    Ratings   Density
Movielens   6,040     3,883    1M        4.26%
Netflix     480,189   17,770   100M      1.18%

Table 1: Statistical properties of Movielens and Netflix.

In order to demonstrate the ease of imputing zeroes, we should mention that we used a non-multithreaded SVD package (SVDLIBC, based on the SVDPACKC library [4]), which factorized the 480K-user by 17,770-movie Netflix dataset in under 10 minutes on an i7 PC (f = 150).

Let us define P = U · Σ, so that the u-th row of P represents the user factor vector $p_u$, while the i-th row of Q represents the item factor vector $q_i$. Accordingly, $\hat{r}_{ui}$ can be computed similarly to (3).

In addition, since U and Q have orthonormal columns, we can straightforwardly derive that

$$P = U \cdot \Sigma = R \cdot Q \qquad (5)$$

where R is the user rating matrix. Consequently, denoting by $r_u$ the u-th row of the user rating matrix – i.e., the vector of ratings of user u – we can rewrite the prediction rule as

$$\hat{r}_{ui} = r_u \cdot Q \cdot q_i^T \qquad (6)$$

Note that, similarly to (2), in a slight abuse of notation, the symbol $\hat{r}_{ui}$ is not exactly a valid rating value, but an association measure between user u and item i.

In the following we will refer to this model as PureSVD. As with item-item kNN and AsySVD, PureSVD offers all the benefits of representing users as a combination of item features (by Eq. (5)), without any user-specific parameterization. It also offers convenient optimization, which does not require tuning learning constants.
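A minimal sketch of PureSVD under stated assumptions: `R` is a scipy.sparse CSR user-item matrix with missing ratings stored as zeros, and SciPy's `svds` stands in for the SVDLIBC package used in the paper. Function and variable names are illustrative, not the authors'.

```python
import numpy as np
from scipy.sparse.linalg import svds

def pure_svd_item_factors(R, f=150):
    """Conventional sparse SVD with missing ratings treated as zeros (Eq. (4))."""
    U, sigma, Qt = svds(R.asfptype(), k=f)  # R ~ U * diag(sigma) * Qt
    return Qt.T                             # Q: one f-dimensional row per item

def recommend(R, Q, u, N=10):
    r_u = np.asarray(R[u].todense()).ravel()
    scores = (r_u @ Q) @ Q.T                # Eq. (6): r_u * Q * Q^T
    scores[r_u > 0] = -np.inf               # never recommend already-rated items
    return np.argsort(-scores)[:N]
```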
4. RESULTS
In this section we present the quality of the recommender algorithms introduced in Section 3 on two standard datasets: MovieLens [13] and Netflix [3]. Both are publicly available movie rating datasets, with ratings collected on a 1-to-5 star scale. Table 1 summarizes their statistical properties.

We used the methodology defined in Section 2 to evaluate six recommender algorithms. The first two – MovieAvg and TopPop – are non-personalized algorithms, and we would expect them to be outperformed by any recommender algorithm. The third prediction rule – CorNgbr – is a well-tuned neighborhood-based algorithm, probably the most popular in the collaborative filtering literature. The fourth algorithm – NNCosNgbr – is a variant of CorNgbr and one of the two proposed algorithms oriented to accuracy metrics. Fifth is the latent factor model AsySVD with 200 factors. Sixth is a 200-D SVD++, among the most powerful latent factor models in terms of RMSE. Finally, we consider our variant of latent factor models, PureSVD, which is shown in two configurations: one with fewer latent factors (50), and one with a larger number of latent factors (150 for Movielens and 300 for the larger Netflix dataset).

Three of the algorithms – TopPop, NNCosNgbr, and PureSVD – are not sensible from an error-minimization viewpoint and cannot be assessed by an RMSE measure. The other four algorithms were optimized to deliver the best RMSE results, and their RMSE scores on the Netflix test set are as follows: 1.053 for MovieAvg, 0.9406 for CorNgbr, 0.9000 for AsySVD, and 0.8911 for SVD++ [10].

For each dataset, we have performed one set of experiments on the full test set and one set of experiments on the long-tail test set. We report the recall as a function of N (i.e., the number of recommended items), and the precision as a function of the recall. As for recall(N), we have zoomed in on N in the range [1...20]. Larger values of N can be ignored for a typical top-N recommendation task. Indeed, there is no difference whether an appealing movie is placed within the top 100 or the top 200, because in neither case will it be presented to the user.
4.1 Movielens dataset
Figure 2 reports the performance of the algorithms on the Movielens dataset over the full test set. It is apparent that the algorithms have significant performance disparities in terms of top-N accuracy. For instance, the recall of AsySVD at N = 10 is about 0.28, i.e., the model has a probability of 28% of placing an appealing movie in the top 10. Surprisingly, the recall of the non-personalized TopPop is very similar to that of AsySVD (e.g., at N = 10 recall is about 0.29). The best algorithms in terms of accuracy are the non-RMSE-oriented NNCosNgbr and PureSVD, which reach, at N = 10, recalls of about 0.44 and 0.52, respectively. As for the latter, this means that about 50% of 5-star movies are presented in a top-10 recommendation. The best algorithm in the RMSE-oriented family is SVD++, with a recall close to that of NNCosNgbr.

Figure 2(b) confirms that PureSVD also outperforms the other algorithms in terms of precision metrics, followed by the other non-RMSE-oriented algorithm – NNCosNgbr. Each line represents the precision of the algorithm at a given recall. For example, when the recall is about 0.2, the precision of NNCosNgbr is about 0.12. Again, TopPop's performance is aligned with that of a state-of-the-art algorithm such as AsySVD. We should note the gross underperformance of the widely used CorNgbr algorithm, whose performance is in line with the very naive MovieAvg. We should also note that the precision of SVD++ is competitive with that of PureSVD50 for small values of recall.

The strange and somewhat unexpected result of TopPop motivates the second set of experiments, accomplished over the long-tail items, whose results are drawn in Figure 3. As a reminder, here we exclude the very popular items from consideration. Now the ordering among the several recommender algorithms aligns better with our expectations. In fact, the recall and precision of the non-personalized TopPop fall dramatically, and it is very unlikely to recommend a 5-star movie within the first 20 positions. However, even when focusing on the long tail, the best algorithm is still PureSVD, whose recall at N = 10 is about 40%. Note that while the best performance of PureSVD was obtained with 50 latent factors in the case of the full test set, here the best performance is reached with a larger number of latent factors, i.e., 150. The performance of NNCosNgbr now becomes significantly worse than PureSVD, while SVD++ is the best among the RMSE-oriented algorithms.
4.2 Netflix dataset
Analogously to the results presented for Movielens, Figures 4 and 5 show the performance of the algorithms on the Netflix dataset. As before, we focus on both the full test set and the long-tail test set.

Once again, the non-personalized TopPop shows surprisingly good results when including the 2% head items, outperforming the widely popular CorNgbr. However, the more powerful AsySVD and SVD++, which were possibly better tuned for the Netflix data, slightly outperform TopPop. Note also that AsySVD is now in line with SVD++.

Consistent with the Movielens experience, the best performing algorithm in terms of recall and precision is still the non-RMSE-oriented PureSVD. As for the other non-RMSE-oriented algorithm, NNCosNgbr, the picture becomes mixed. It still outperforms the RMSE-oriented algorithms when including the head items, but somewhat underperforms them when the top 2% most popular items are excluded.

The behavior of the commonly used CorNgbr on the Netflix dataset is very surprising. While it significantly underperforms the others on the full test set, it becomes among the top performers when concentrating on the longer tail. In fact, while all the algorithms lose precision and recall when passing from the full to the long-tail test set, CorNgbr appears more accurate in recommending long-tail items. After all, the wide acceptance of the CorNgbr approach might be for a reason, given the importance of long-tail items.
5. DISCUSSION — PureSVD
Over both the Movielens and Netflix datasets, regardless of the inclusion of head items, PureSVD is consistently the top performer, beating more detailed and sophisticated latent factor models. Given its simplicity, and its poor design in terms of RMSE optimization, we did not expect this result. In fact, we would view it as good news for practitioners of recommender systems, as PureSVD combines multiple advantages. First, it is very easy to code, with no need to tune learning constants, and it fully relies on off-the-shelf optimized SVD packages. This comes with good computational performance in both offline and online modes. PureSVD also has the convenience of representing users as a combination of item features (Eq. (6)), offering designers flexibility in handling new users and new ratings by existing users, and in explaining the reasoning behind the generated recommendations.

An interesting finding, observed on both the Movielens and Netflix datasets, is that when moving to longer-tail items, accuracy improves when raising the dimensionality of the PureSVD model. This may be related to the fact that the first latent factors of PureSVD capture properties of the most popular items, while the additional latent factors represent more refined features related to unpopular items. Hence, when practitioners use PureSVD, they should pick its dimensionality while accounting for the fact that the number of latent factors influences the quality on long-tail items differently than on head items.
We would like to offer an explanation as to why PureSVD could consistently deliver better top-N results than the best RMSE-refined latent factor models. This may have to do with a limitation of RMSE testing, which concentrates only on the ratings that the user provided to the system. This way, RMSE (or MAE, for that matter) is measured on a held-out test set containing only items that the user chose to rate, while completely missing any evaluation of the method on items that the user has never rated.
[Figure 2: Movielens: (a) recall-at-N and (b) precision-versus-recall on all items. Curves: MovieAvg, TopPop, CorNgbr, NNCosNgbr, AsySVD, SVD++, PureSVD50, PureSVD150.]
[Figure 3: Movielens: (a) recall-at-N and (b) precision-versus-recall on long-tail items (94% of items). Same curves as Figure 2.]
[Figure 4: Netflix: (a) recall-at-N and (b) precision-versus-recall on all items. Curves: MovieAvg, TopPop, CorNgbr, NNCosNgbr, AsySVD, SVD++, PureSVD50, PureSVD300.]
[Figure 5: Netflix: (a) recall-at-N and (b) precision-versus-recall on long-tail items (98% of items). Same curves as Figure 4.]
This testing mode suits the RMSE-oriented models, which are trained only on the known ratings while largely ignoring the missing entries. Yet, such a testing methodology misses much of the reality, where all items should count, not only those actually rated by the user in the past. The proposed top-N accuracy measures do better in this respect, by directly involving all possible items (including unrated ones) in the testing phase. This may explain the outperformance of PureSVD, which considers all possible user-item pairs (regardless of rating availability) in the training phase.
Our general advice to practitioners is to consider PureSVD as a recommender algorithm. Still, there are several unexplored ways that may improve PureSVD. First, one can optimize the value imputed at the missing entries. Direct usage of ready sparse SVD solvers (which usually assume a default value of zero) would still be possible by translating all given scores. For example, imputing a value of 3 instead of zero would be effectively achieved by translating the given star ratings from the [1...5] range into the [-2...2] range (see the sketch below). Second, one can grant a lower confidence to the imputed values, so that the SVD will spend more effort on the real ratings. For an explanation of how this can be accomplished, see [9]. However, we would expect such confidence weighting to significantly increase the time complexity of the offline training.
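The first idea amounts to a one-line shift of the stored ratings, assuming the sparse float matrix `R` from the earlier PureSVD sketch:

```python
# Imputing a value of 3 via a zero-default sparse solver: shift the stored
# ratings so that the implicit zeros now stand for "3".
R_shifted = R.asfptype().copy()
R_shifted.data = R_shifted.data - 3.0  # stars move from [1..5] to [-2..2]
```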
6. CONCLUSIONS
Evaluation of recommenders has long been divided between accuracy metrics (e.g., precision/recall) and error metrics (notably, RMSE and MAE). Their mathematical convenience and fitness with formal optimization methods have made error metrics like RMSE more popular, and they indeed dominate the literature. However, it is well recognized that accuracy measures may be a more natural yardstick, as they directly assess the quality of top-N recommendations.

This work shows, through an extensive empirical study, that the convenient assumption that an error metric such as RMSE can serve as a good proxy for top-N accuracy is questionable at best. There is no monotonic relation between error metrics and accuracy metrics. This may call for a re-evaluation of optimization goals for top-N systems. On the bright side, we have presented simple and efficient variants of known algorithms which are useless in RMSE terms and yet deliver superior results when pursuing top-N accuracy.

In passing, we have also discussed possible pitfalls in the design of a test set for conducting a top-N accuracy evaluation. In particular, a careless construction of the test set would make recall and precision strongly biased towards non-personalized algorithms. An easy solution, which we adopted, was excluding the extremely popular items from the test set (while retaining 98% of the items). The resulting test set, which emphasizes the rather important non-trivial items, seems to align better with our expectations. First, it correctly shows the lower value of non-personalized algorithms. Second, it shows a good behavior for the widely used correlation-based kNN approach, which otherwise (when evaluated on the full set of items) exhibits extremely poor results, strongly confronting the accepted practice.
7. REFERENCES
[1] C. Anderson. The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion, July 2006.
[2] R. Bambini, P. Cremonesi, and R. Turrin. A Recommender System for an IPTV Service Provider: a Real Large-Scale Production Environment. In Recommender Systems Handbook. Springer, 2010.
[3] J. Bennett and S. Lanning. The Netflix Prize. Proceedings of KDD Cup and Workshop, pages 3–6, 2007.
[4] M. W. Berry. Large-scale sparse singular value computations. The International Journal of Supercomputer Applications, 6(1):13–49, Spring 1992.
[5] O. Celma and P. Cano. From hits to niches? Or how popular artists can bias music recommendation and discovery. Las Vegas, USA, August 2008.
[6] P. Cremonesi, E. Lentini, M. Matteucci, and R. Turrin. An evaluation methodology for recommender systems. 4th Int. Conf. on Automated Solutions for Cross Media Content and Multi-channel Distribution, pages 224–231, Nov 2008.
[7] M. Deshpande and G. Karypis. Item-based top-N recommendation algorithms. ACM Transactions on Information Systems (TOIS), 22(1):143–177, 2004.
[8] J. Herlocker, J. Konstan, L. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS), 22(1):5–53, 2004.
[9] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In IEEE International Conference on Data Mining, pages 263–272, 2008.
[10] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426–434, New York, NY, USA, 2008. ACM.
[11] Y. Koren. Collaborative filtering with temporal dynamics. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 447–456, New York, NY, USA, 2009. ACM.
[12] Y. Koren, R. M. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
[13] B. Miller, I. Albert, S. Lam, J. Konstan, and J. Riedl. MovieLens unplugged: experiences with an occasionally connected recommender system. Proceedings of the 8th International Conference on Intelligent User Interfaces, pages 263–266, 2003.
[14] A. Paterek. Improving regularized singular value decomposition for collaborative filtering. Proceedings of KDD Cup and Workshop, 2007.
[15] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. 10th Int. Conf. on World Wide Web, pages 285–295, 2001.
[16] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of Dimensionality Reduction in Recommender System – A Case Study. Defense Technical Information Center, 2000.
As the Netflix Prize competition has demonstrated, matrix factorization models are superior to classic nearest neighbor techniques for producing product recommendations, allowing the incorporation of additional information such as implicit feedback, temporal effects, and confidence levels.