
Performance of Recommender Algorithms

on Top-N Recommendation Tasks

Paolo Cremonesi

Politecnico di Milano

Milan, Italy

paolo.cremonesi@polimi.it

Yehuda Koren

Yahoo! Research

Haifa, Israel

yehuda@yahoo-inc.com

Roberto Turrin

Neptuny

Milan, Italy

roberto.turrin@polimi.it

ABSTRACT

In many commercial systems, the ‘best bet’ recommenda-

tions are shown, but the predicted rating values are not.

This is usually referred to as a top-N recommendation task,

where the goal of the recommender system is to ﬁnd a few

speciﬁc items which are supposed to be most appealing to

the user. Common methodologies based on error metrics

(such as RMSE) are not a natural ﬁt for evaluating the top-

N recommendation task. Rather, top-N performance can

be directly measured by alternative methodologies based on

accuracy metrics (such as precision/recall).

An extensive evaluation of several state-of-the art recom-

mender algorithms suggests that algorithms optimized for

minimizing RMSE do not necessarily perform as expected

on the top-N recommendation task. Results show that

improvements in RMSE often do not translate into accu-

racy improvements. In particular, a naive non-personalized

algorithm can outperform some common recommendation

approaches and almost match the accuracy of sophisticated

algorithms. Another ﬁnding is that the very few top popular

items can skew the top-N performance. The analysis points

out that when evaluating a recommender algorithm on the

top-N recommendation task, the test set should be chosen

carefully in order to not bias accuracy metrics towards non-

personalized solutions. Finally, we oﬀer practitioners new

variants of two collaborative ﬁltering algorithms that, re-

gardless of their RMSE, signiﬁcantly outperform other rec-

ommender algorithms in pursuing the top-N recommenda-

tion task, while offering additional practical advantages. This comes as a surprise given the simplicity of these two methods.

Categories and Subject Descriptors

H.3.4 [Information Storage and Retrieval]: Systems

and Software—user proﬁles and alert services; performance

evaluation (eﬃciency and eﬀectiveness); H.3.3 [Information

Storage and Retrieval]: Information Search and Retrieval—

Information ﬁltering

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for proﬁt or commercial advantage and that copies

bear this notice and the full citation on the ﬁrst page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior speciﬁc

permission and/or a fee.

RecSys2010, September 26–30, 2010, Barcelona, Spain.

Copyright 2010 ACM 978-1-60558-906-0/10/09 ...$10.00.

General Terms

Algorithms, Experimentation, Measurement, Performance

1. INTRODUCTION

A common practice with recommender systems is to eval-

uate their performance through error metrics such as RMSE

(root mean squared error), which capture the average error

between the actual ratings and the ratings predicted by the

system. However, in many commercial systems only the

‘best bet’ recommendations are shown, while the predicted

rating values are not [8]. That is, the system suggests a few

speciﬁc items to the user that are likely to be very appealing

to him. While the majority of the literature is focused on

convenient error metrics (RMSE, MAE), such classical error

criteria do not really measure top-N performance. At most,

they can serve as proxies of the true top-N experience. Di-

rect evaluation of top-N performance must be accomplished

by means of alternative methodologies based on accuracy

metrics (e.g., recall and precision).

In this paper we evaluate – through accuracy metrics –

the performance of several collaborative ﬁltering algorithms

in pursuing the top-N recommendation task. Evaluation is

contrasted with performance of the same methods on the

RMSE metric. Tests have been performed on the Netﬂix

and Movielens datasets.

The contribution of the work is threefold: (i) we show

that there is no trivial relationship between error metrics

and accuracy metrics; (ii) we propose a careful construction

of the test set for not biasing accuracy metrics; (iii) we intro-

duce new variants of existing algorithms that improve top-N

performance together with other practical beneﬁts.

We ﬁrst compare some state-of-the-art algorithms (e.g.,

Asymmetric SVD) with a non-personalized algorithm based

on item popularity. The surprising result is that the perfor-

mance of the non-personalized algorithm on top-N recommendations is comparable to the performance of sophisti-

cated, personalized algorithms, regardless of their RMSE.

However, a non-personalized, popularity-based algorithm

can only provide trivial recommendations, interesting nei-

ther to users, who may get bored and disappointed by the recommender system, nor to content providers, who invest

in a recommender system for pushing up sales of less known

items. For such a reason, we run an additional set of experi-

ments in order to evaluate the performance of the algorithms

while excluding the extremely popular items. As expected,

the accuracy of all algorithms decreases, as it is more diﬃcult

to recommend non-trivial items. Yet, ranking of the diﬀer-

ent algorithms aligns better with our expectations, with the

non-personalized methods being ranked lower. Thus, when

evaluating algorithms in the top-N recommendation task,

we advise choosing the test set carefully; otherwise, accuracy metrics are strongly biased.

Finally, when pursuing a top-N recommendation task, ex-

act rating prediction is not required. We present new vari-

ants of two collaborative ﬁltering algorithms that are not

designed for minimizing RMSE, but consistently outperform

other recommender algorithms in top-N recommendations.

This finding becomes even more important when consider-

ing the simple and less conventional nature of the outper-

forming methods.

2. TESTING METHODOLOGY

The testing methodology adopted in this study is similar

to the one described in [6] and, in particular, in [10]. For each

dataset, known ratings are split into two subsets: training

set M and test set T. The test set T contains only 5-star ratings, so we can reasonably state that T contains items

relevant to the respective users.

The detailed procedure used to create M and T from the Netflix dataset is similar to the one set for the Netflix prize, maintaining compatibility with results published in other research papers [3]. Netflix released a dataset containing about 100M ratings, referred to as the training dataset. In addition to the training set, Netflix also provided a validation set, referred to as the probe set, containing 1.4M ratings. In this work, the training set M is the original Netflix training set, while the test set T contains all the 5-star ratings from the probe set (|T| = 384,573). As expected, the

probe set was not used for training.

We adopted a similar procedure for the Movielens dataset

[13]. We randomly sub-sampled 1.4% of the ratings from

the dataset in order to create a probe set. The training set

M contains the remaining ratings. The test set T contains

all the 5-star ratings from the probe set.

In order to measure recall and precision, we ﬁrst train the

model over the ratings in M. Then, for each item i rated 5 stars by user u in T:

(i) We randomly select 1000 additional items unrated by

user u. We may assume that most of them will not be

of interest to user u.

(ii) We predict the ratings for the test item i and for the

additional 1000 items.

(iii) We form a ranked list by ordering all the 1001 items

according to their predicted ratings. Let p denote the rank of the test item i within this list. The best result corresponds to the case where the test item i precedes all the random items (i.e., p = 1).

(iv) We form a top-N recommendation list by picking the

N top-ranked items from the list. If p ≤ N we have a hit (i.e., the test item i is recommended to the user). Otherwise we have a miss. The chance of a hit increases with N. When N = 1001 we always have a hit.

The computation of recall and precision proceeds as fol-

lows. For any single test case, we have a single relevant

item (the tested item i). By deﬁnition, recall for a single

test can assume either the value 0 (in case of miss) or 1 (in

case of hit). Similarly, precision can assume either the value

0 or 1/N. The overall recall and precision are defined by averaging over all test cases:

    recall(N) = \frac{\#\mathrm{hits}}{|T|}, \qquad precision(N) = \frac{\#\mathrm{hits}}{N \cdot |T|} = \frac{recall(N)}{N}

where |T| is the number of test ratings. Note that the hypothesis that all the 1000 random items are non-relevant to user u tends to underestimate the computed recall and precision with respect to the true recall and precision.

[Figure 1: Rating distribution for Netflix (solid line) and Movielens (dashed line). Items are ordered according to popularity (most popular at the bottom); the x-axis reports the cumulative % of ratings and the y-axis the % of items, separating the short-head (popular) from the long-tail (unpopular) items.]
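As an illustration of this protocol, the following Python sketch computes recall(N) and precision(N) with the 1000-random-items procedure. It is a minimal sketch under assumed inputs (a trained model exposing a predict(user, item) score, a list test_cases of 5-star (user, item) pairs from T, and a dictionary unrated_items mapping each user to his/her unrated items); these names are ours and are not part of the original study.

    import random

    def evaluate_topn(model, test_cases, unrated_items, N=10, n_random=1000, seed=42):
        """Recall(N) and precision(N) with the 1000-random-items protocol."""
        rng = random.Random(seed)
        hits = 0
        for u, i in test_cases:                          # each 5-star test rating in T
            # (i) sample 1000 additional items unrated by u (assumed non-relevant)
            candidates = rng.sample(unrated_items[u], n_random)
            # (ii) predict a score for the test item and for the random items
            scored = [(model.predict(u, j), j) for j in candidates + [i]]
            # (iii) rank the 1001 items by predicted score, best first
            scored.sort(reverse=True)
            rank = [j for _, j in scored].index(i) + 1   # position p of test item i
            # (iv) hit if the test item falls within the top-N positions
            if rank <= N:
                hits += 1
        recall = hits / len(test_cases)
        precision = recall / N
        return recall, precision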

2.1 Popular items vs. long-tail

According to the well known long-tail distribution of rated

items applicable to many commercial systems, the majority

of ratings are condensed in a small fraction of the most pop-

ular items [1].

Figure 1 plots the empirical rating distributions of the

Netflix and Movielens datasets. Items on the vertical axis

are ordered according to their popularity, most popular at

the bottom. We observe that about 33% of ratings collected

by Netflix involve only the 1.7% most popular items (i.e.,

302 items). We refer to this small set of very popular items

as the short-head, and to the remaining set of less popular

items – about 98% of the total – as the long-tail [5]. We

also note that Movielens’ rating distribution is slightly less

long-tailed than Netflix's: the short-head (33% of ratings) involves the 5.5% most popular items (i.e., 213 items).

Recommending popular items is trivial and does not bring much benefit to users and content providers. On the other

hand, recommending less known items adds novelty and

serendipity to the users but it is usually a more diﬃcult

task. In this study we aim at evaluating the accuracy of

recommender algorithms in suggesting non-trivial items. To

this purpose, the test set T has been further partitioned into two subsets, T_head and T_long, such that items in T_head are in the short-head while items in T_long are in the long-tail of the

distribution.
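For illustration, the short-head / long-tail partition can be derived directly from item popularity counts in the training data. The sketch below is one possible implementation of the 33%-of-ratings cut-off described above; the function and variable names are ours.

    from collections import Counter

    def split_head_tail(ratings, head_share=0.33):
        """Partition items into short-head and long-tail by rating popularity."""
        counts = Counter(item for _, item, _ in ratings)   # ratings: (user, item, r)
        total = sum(counts.values())
        short_head, covered = set(), 0
        # take items from most to least popular until ~33% of ratings are covered
        for item, cnt in counts.most_common():
            if covered >= head_share * total:
                break
            short_head.add(item)
            covered += cnt
        long_tail = set(counts) - short_head
        return short_head, long_tail

    # T_head / T_long: test ratings whose item falls in the respective partition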

3. COLLABORATIVE ALGORITHMS

Most recommender systems are based on collaborative ﬁl-

tering (CF), where recommendations rely only on past user

behavior (referred to here as ‘ratings’, though such behav-

ior can include other user activities on items like purchases,

rentals and clicks), regardless of domain knowledge. There

are two primary approaches to CF: (i) the neighborhood

approach and (ii) the latent factor approach.

Neighborhood models represent the most common appro-

ach to CF. They are based on the similarity among either

users or items. For instance, two users are similar if they have rated the same set of items similarly. A dual

concept of similarity can be deﬁned among items.

Latent factor approaches model users and items as vectors

in the same ‘latent factor’ space by means of a reduced num-

ber of hidden factors. In such a space, users and items are

directly comparable: the rating of user u on item i is pre-

dicted by the proximity (e.g., inner-product) between the

related latent factor vectors.

3.1 Non-personalized models

Non-personalized recommenders present to any user a pre-

deﬁned, ﬁxed list of items, regardless of his/her preferences.

Such algorithms serve as baselines for the more complex per-

sonalized algorithms.

A simple estimation rule, referred to as Movie Average

(MovieAvg), recommends top-N items with the highest av-

erage rating. The rating of user u on item i is predicted

as the mean rating expressed by the community on item i,

regardless of the ratings given by u.

A similar prediction schema, denoted by Top Popular (Top-

Pop), recommends top-N items with the highest popularity

(largest number of ratings). Notice that in this case the rat-

ing of user u about item i cannot be inferred, but the output

of this algorithm is only a ranked list of items. As a conse-

quence, RMSE or other error metrics are not applicable.
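Both baselines reduce to simple aggregate statistics over the training ratings. A minimal sketch of how they could be computed is given below; the (user, item, rating) tuple representation is an assumption of ours.

    from collections import defaultdict

    def movieavg_toppop(ratings, N=10):
        """MovieAvg and TopPop top-N lists from training tuples (user, item, rating)."""
        sums, counts = defaultdict(float), defaultdict(int)
        for _, item, r in ratings:
            sums[item] += r
            counts[item] += 1
        # MovieAvg: items with the highest mean rating (same list for every user)
        movie_avg = sorted(counts, key=lambda i: sums[i] / counts[i], reverse=True)[:N]
        # TopPop: items with the largest number of ratings
        top_pop = sorted(counts, key=lambda i: counts[i], reverse=True)[:N]
        return movie_avg, top_pop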

3.2 Neighborhood models

Neighborhood models base their prediction on the simi-

larity relationships among either users or items.

Algorithms centered on user-user similarity predict a user's rating of an item based on the ratings expressed for that item by similar users. On the other hand, algorithms

centered on item-item similarity compute the user preference

for an item based on his/her own ratings on similar items.

The latter is usually the preferred approach (e.g., [15]), as it

usually performs better in RMSE terms, while being more

scalable. Both advantages are related to the fact that the

number of items is typically smaller than the number of

users. Another advantage of item-item algorithms is that

reasoning behind a recommendation to a speciﬁc user can be

explained in terms of the items previously rated by him/her.

In addition, basing the system parameters on items (rather

than users) allows a seamless handling of users and ratings

new to the system. For such reasons, we focus on item-item

neighborhood algorithms.

The similarity between item i and item j is measured as the tendency of users to rate items i and j similarly. It is

typically based either on the cosine, the adjusted cosine, or

(most commonly) the Pearson correlation coeﬃcient [15].

Item-item similarity is computed on the common raters.

In the typical case of a very sparse dataset, it is likely that

some pairs of items have a poor support, leading to a non-

reliable similarity measure. For such a reason, if n_ij denotes the number of common raters and s_ij the similarity between item i and item j, we can define the shrunk similarity d_ij as the coefficient

    d_{ij} = \frac{n_{ij}}{n_{ij} + \lambda_1} \, s_{ij}

where \lambda_1 is a shrinking factor [10]. A typical value of \lambda_1 is 100.

Neighborhood models are further enhanced by means of a

kNN (k-nearest-neighborhood) approach. When predicting

rating r_ui, we consider only the k items rated by u that are the most similar to i. We denote this set of most similar items by D^k(u;i). The kNN approach discards the items poorly correlated to the target item, thus decreasing noise and improving the quality of recommendations.

Prior to comparing and summing diﬀerent ratings, it is

advised to remove diﬀerent biases which mask the more fun-

damental relations between items. Such biases include item effects, which represent the fact that certain items tend to receive higher ratings than others. They also include user effects, which represent the tendency of certain users to rate higher than others. A more delicate calculation of the biases would also estimate temporal effects [11], but this is beyond the scope of this work. We take as baselines the static item and user biases, following [10]. Formally, the bias associated with the rating of user u to item i is denoted by b_ui.

An item-item kNN method predicts the residual rating

r_ui − b_ui as the weighted average of the residual ratings of similar items:

    \hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in D^k(u;i)} d_{ij} (r_{uj} - b_{uj})}{\sum_{j \in D^k(u;i)} d_{ij}}    (1)

Hereinafter, we refer to this model as Correlation Neighborhood (CorNgbr), where s_ij is measured as the Pearson correlation coefficient.
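The following sketch shows how prediction rule (1) could be assembled from a precomputed shrunk similarity matrix and precomputed baselines. The inputs sim (the shrunk similarities d_ij), bias (the baselines b_ui), and the dense-matrix representation are simplifying assumptions of ours; computing them follows [10] and is omitted here.

    import numpy as np

    def predict_corngbr(u, i, R, sim, bias, k=50):
        """Prediction rule (1) for user u and item i.

        R    : user-item rating matrix with 0 marking missing entries.
        sim  : item-item matrix of shrunk similarities d_ij (Pearson-based).
        bias : matrix of baseline estimates b_ui (item and user effects).
        """
        rated = np.nonzero(R[u])[0]                        # items rated by user u
        # D^k(u;i): the k rated items most similar to item i
        neighbors = rated[np.argsort(sim[i, rated])[::-1][:k]]
        d = sim[i, neighbors]
        if d.sum() <= 0:
            return bias[u, i]                              # no usable neighbors
        residuals = R[u, neighbors] - bias[u, neighbors]
        return bias[u, i] + np.dot(d, residuals) / d.sum()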

3.2.1 Non-normalized Cosine Neighborhood

Notice that in (1) the denominator forces the predicted rating values to fall in the correct range, e.g., [1...5] for a typical star-rating system. However, for a top-N recommenda-

tion task, exact rating values are not necessary. We simply

want to rank items by their appeal to the user. In such a

case, we can simplify the formula by removing the denom-

inator. A benefit of this would be a higher ranking for items with many similar neighbors (that is, a high \sum_{j \in D^k(u;i)} d_{ij}), where we have higher confidence in the recommendation.

Therefore, we propose to rank items by the following coeﬃ-

cient, denoted by \hat{r}_{ui}:

    \hat{r}_{ui} = b_{ui} + \sum_{j \in D^k(u;i)} d_{ij} (r_{uj} - b_{uj})    (2)

Here \hat{r}_{ui} does not represent a proper rating, but is rather a metric for the association between user u and item i. We

should note that similar non-normalized neighborhood rules

were mentioned by others [7, 10].

In our experiments, the best results in terms of accuracy

metrics have been obtained by computing s_ij as the cosine similarity. Unlike the Pearson correlation, which is computed only on ratings shared by common raters, the cosine coefficient between items i and j is computed over all ratings (taking missing values as zeroes), that is:

    \cos(i, j) = \frac{\vec{i} \cdot \vec{j}}{\|\vec{i}\|_2 \, \|\vec{j}\|_2}

We denote such a model by Non-Normalized Cosine Neighborhood (NNCosNgbr).
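A sketch of the corresponding NNCosNgbr scores (rule (2), with plain cosine similarities computed over zero-imputed item vectors) might look as follows. The dense-matrix representation and the omission of similarity shrinkage are simplifications of ours, meant only to make the ranking rule concrete.

    import numpy as np

    def nncos_scores(u, R, bias, k=50):
        """Rule (2): non-normalized association scores of user u for all items.

        R    : user-item matrix with missing ratings taken as zeroes.
        bias : matrix of baseline estimates b_ui.
        """
        norms = np.linalg.norm(R, axis=0)
        norms[norms == 0] = 1.0                        # guard against empty columns
        sim = (R.T @ R) / np.outer(norms, norms)       # cos(i, j) over all ratings
        rated = np.nonzero(R[u])[0]                    # items rated by user u
        scores = np.empty(R.shape[1])
        for i in range(R.shape[1]):
            # D^k(u;i): the k rated items most similar to item i
            top = rated[np.argsort(sim[i, rated])[::-1][:k]]
            scores[i] = bias[u, i] + np.dot(sim[i, top], R[u, top] - bias[u, top])
        return scores                                  # rank items by these values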

3.3 Latent Factor Models

Recently, several recommender algorithms based on la-

tent factor models have been proposed. Most of them are

based on factoring the user-item ratings matrix [12], also

informally known as SVD models after the related Singular

Value Decomposition.

The key idea of SVD models is to factorize the user-item

rating matrix to a product of two lower rank matrices, one

containing the so-called ‘user factors’, while the other one

containing the so-called ‘item factors’. Thus, each user u is represented with an f-dimensional user-factors vector p_u ∈ ℜ^f. Similarly, each item i is represented with an item-factors vector q_i ∈ ℜ^f. Prediction of a rating given by user u for item i is computed as the inner product between the related factor vectors (adjusted for biases), i.e.,

    \hat{r}_{ui} = b_{ui} + p_u q_i^T    (3)

Since conventional SVD is undeﬁned in the presence of

unknown values – i.e., the missing ratings – several solu-

tions have been proposed. Earlier works addressed the issue

by filling missing ratings with baseline estimates (e.g.,

[16]). However, this leads to a very large, dense user rating

matrix, whose factorization becomes computationally infea-

sible. More recent works learn factor vectors directly on

known ratings through a suitable objective function which

minimizes prediction error. The proposed objective func-

tions are usually regularized in order to avoid overﬁtting

(e.g., [14]). Typically, gradient descent is applied to mini-

mize the objective function.

As with neighborhood methods, this article concentrates

on methods which represent users as a combination of item

features, without requiring any user-speciﬁc parameteriza-

tion. The advantages of these methods are that they can

create recommendations for users new to the system without

re-evaluation of parameters. Likewise, they can immediately adjust their recommendations to newly entered ratings, providing users with immediate feedback on their actions.

Finally, such methods can explain their recommendations in

terms of items previously rated by the user.

Thus, we experimented with a powerful matrix factoriza-

tion model, which indeed represents users as a combination

of item features. The method is known as Asymmetric-SVD

(AsySVD) and is reported to reach an RMSE of 0.9000 on

the Netﬂix dataset [10].

In addition, we have experimented with a beefed-up matrix-factorization approach known as SVD++ [10], which represents the highest quality among RMSE-optimized factorization methods, although users are no longer represented as a combination

of item features; see [10].

3.3.1 PureSVD

While pursuing a top-N recommendation task, we are in-

terested only in a correct item ranking, not caring about

exact rating prediction. This grants us some ﬂexibility, like

considering all missing values in the user rating matrix as

zeros, despite being out of the 1-to-5 star rating range. In

terms of predictive power, the choice of zero is not very im-

portant, and we have received similar results with higher

imputed values. Importantly, now we can leverage existing

highly optimized software packages for performing conven-

tional SVD on sparse matrices, which becomes feasible since

all matrix entries are now non-missing. Thus, the user rating

matrix R is estimated by the factorization [2]:

    \hat{R} = U \cdot \Sigma \cdot Q^T    (4)

where U is an n × f orthonormal matrix, Q is an m × f orthonormal matrix, and \Sigma is an f × f diagonal matrix containing the first f singular values.

    Dataset      Users      Items     Ratings   Density
    Movielens    6,040      3,883     1M        4.26%
    Netflix      480,189    17,770    100M      1.18%

Table 1: Statistical properties of Movielens and Netflix.

In order to demonstrate the ease of imputing zeroes, we

should mention that we used a non-multithreaded SVD pack-

age (SVDLIBC, based on the SVDPACKC library [4]), which

factorized the 480K-user by 17,770-movie Netflix dataset in under 10 minutes on an i7 PC (f = 150).

Let us define P = U·Σ, so that the u-th row of P represents the user-factors vector p_u, while the i-th row of Q represents the item-factors vector q_i. Accordingly, \hat{r}_{ui} can be computed similarly to (3).

In addition, since U and Q have orthonormal columns, we can straightforwardly derive that:

    P = U · \Sigma = R · Q    (5)

where R is the user rating matrix. Consequently, denoting by r_u the u-th row of the user rating matrix, i.e., the vector of ratings of user u, we can rewrite the prediction rule as

    \hat{r}_{ui} = r_u · Q · q_i^T    (6)

Note that, similarly to (2), in a slight abuse of notation, the symbol \hat{r}_{ui} is not exactly a valid rating value, but an association measure between user u and item i.

In the following we will refer to this model as PureSVD.

As with item-item kNN and AsySVD, PureSVD oﬀers all

the beneﬁts of representing users as a combination of item

features (by Eq. (5)), without any user-speciﬁc parameter-

ization. It also offers convenient optimization, which does not require tuning learning constants.
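To make the PureSVD recipe concrete, here is a minimal sketch that uses SciPy's sparse truncated SVD in place of SVDLIBC (our substitution, not the package used in the paper). It imputes missing entries as zeros, factorizes R as in (4), and scores items with rule (6), which also applies to a new user given only his/her rating vector r_u.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import svds

    def fit_puresvd(R, f=150):
        """Truncated SVD of the zero-imputed rating matrix R, as in (4)."""
        _, _, Qt = svds(csr_matrix(R, dtype=np.float64), k=f)
        return Qt.T                                    # Q: item-factors matrix (m x f)

    def puresvd_scores(r_u, Q):
        """Rule (6): scores of a user with rating vector r_u for all items."""
        return (r_u @ Q) @ Q.T                         # fold-in p_u = r_u Q, then p_u Q^T

    def recommend(r_u, Q, N=10):
        scores = puresvd_scores(r_u, Q)
        scores[np.nonzero(r_u)[0]] = -np.inf           # never recommend rated items
        return np.argsort(scores)[::-1][:N]            # top-N item indices

Note that rule (6) turns recommendation for a previously unseen user into a single matrix-vector product over his/her rating vector, which is what enables the immediate handling of new users and new ratings mentioned above.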

4. RESULTS

In this section we present the quality of the recommender

algorithms presented in Section 3 on two standard datasets:

MovieLens [13] and Netﬂix [3]. Both are publicly available

movie rating datasets. Collected ratings are in a 1-to-5 star

scale. Table 1 summarizes their statistical properties.

We used the methodology deﬁned in Section 2 for evalu-

ating seven recommender algorithms. The first two - MovieAvg

and TopPop - are non-personalized algorithms, and we would

expect them to be outperformed by any recommender algo-

rithm. The third prediction rule - CorNgbr - is a well tuned

neighborhood-based algorithm, probably the most popular

in the literature of collaborative filtering. The fourth algorithm is a variant of CorNgbr - NNCosNgbr - and it is one of the two proposed algorithms oriented to accuracy metrics.

Fifth is the latent factor model AsySVD with 200 factors.

Sixth is a 200-D SVD++, among the most powerful latent

factor models in terms of RMSE. Finally, we consider our

variant of latent factor models, PureSVD, which is shown in

two configurations: one with fewer latent factors (50), and

one with a larger number of latent factors (150 for Movielens

and 300 for the larger Netﬂix dataset).

Three of the algorithms – TopPop, NNCosNgbr, and PureSVD

– are not sensible from an error minimization viewpoint and

cannot be assessed by an RMSE measure. The other four

algorithms were optimized to deliver best RMSE results,

and their RMSE scores on the Netﬂix test set are as fol-

lows: 1.053 for MovieAvg, 0.9406 for CorNgbr, 0.9000 for

AsySVD, and 0.8911 for SVD++ [10].

For each dataset, we have performed one set of experi-

ments on the full test set and one set of experiments on the

long-tail test set. We report the recall as a function of N

(i.e., the number of items recommended), and the precision

as a function of the recall. As for recall(N), we have zoomed

in on N in the range [1...20]. Larger values of N can be ignored for a typical top-N recommendation task. Indeed, it makes no difference whether an appealing movie is placed within the top 100 or the top 200, because in neither case will it be presented to the user.

4.1 Movielens dataset

Figure 2 reports the performance of the algorithms on

the Movielens dataset over the full test set. It is apparent

that the algorithms have signiﬁcant performance disparity in

terms of top-N accuracy. For instance, the recall of AsySVD

at N = 10 is about 0.28, i.e., the model has a 28% probability of placing an appealing movie in the top-10. Surprisingly,

the recall of the non-personalized TopPop is very similar to

AsySVD (e.g., at N = 10 recall is about 0.29). The best

algorithms in terms of accuracy are the non-RMSE-oriented

NNCosNgbr and PureSVD, which reach, at N = 10, a recall of about 0.44 and 0.52, respectively. As for the latter,

this means that about 50% of 5-star movies are presented

in a top-10 recommendation. The best algorithm in the

RMSE-oriented family is SVD++, with a recall close to

that of NNCosNgbr.

Figure 2(b) conﬁrms that PureSVD also outperforms the

other algorithms in terms of precision metrics, followed by

the other non-RMSE-oriented algorithm – NNCosNgbr. Each

line represents the precision of the algorithm at a given re-

call. For example, when the recall is about 0.2, precision of

NNCosNgbr is about 0.12. Again, TopPop performance is

aligned with that of a state-of-the-art algorithm such as AsySVD.

We should note the gross underperformance of the widely

used CorNgbr algorithm, whose performance is in line with

the very naive MovieAvg. We should also note that the pre-

cision of SVD++ is competitive with that of PureSVD50 for

small values of recall.

The strange and somewhat unexpected result of TopPop

motivates the second set of experiments, accomplished over

the long-tail items, whose results are drawn in Figure 3. As

a reminder, now we exclude the very popular items from

consideration. Here the ordering among the several recom-

mender algorithms better aligns with our expectations. In

fact, the recall and precision of the non-personalized TopPop drop dramatically, and it becomes very unlikely for a 5-star movie to be recommended within the first 20 positions. However,

even when focusing on the long-tail, the best algorithm is still PureSVD, whose recall at N = 10 is about 40%. Note

that while the best performance of PureSVD was with 50

latent factors in the case of the full test set, here the best perfor-

mance is reached with a larger number of latent factors, i.e.,

150. Performance of NNCosNgbr now becomes signiﬁcantly

worse than PureSVD, while SVD++ is the best within the

RMSE-oriented algorithms.

4.2 Netﬂix dataset

Analogously to the results presented for Movielens, Fig-

ures 4 and 5 show the performance of the algorithms on the

Netﬂix dataset. As before, we focus on both the full test set

and the long-tail test set.

Once again, non-personalized TopPop shows surprisingly

good results when including the 2% head items, outper-

forming the widely popular CorNgbr. However, the more

powerful AsySVD and SVD++, which were possibly better

tuned for the Netﬂix data, are slightly outperforming Top-

Pop. Note also that AsySVD is now in line with SVD++.

Consistent with the Movielens experience, the best per-

forming algorithm in terms of recall and precision is still the

non-RMSE-oriented PureSVD. As for the other non-RMSE-

oriented NNCosNgbr, the picture becomes mixed. It is still

outperforming the RMSE-oriented algorithms when includ-

ing the head items, but somewhat underperforms them when

the top-2% most popular items are excluded.

The behavior of the commonly used CorNgbr on the Net-

ﬂix dataset is very surprising. While it signiﬁcantly under-

performs others on the full test set, it becomes among the

top-performers when concentrating on the longer tail. In

fact, while all the algorithms decrease their precision and

recall when passing from the full to the long-item test set,

CorNgbr appears more accurate in recommending long-tail

items. After all, the wide acceptance of the CorNgbr ap-

proach might be for a reason, given the importance of long-

tail items.

5. DISCUSSION — PureSVD

Over both Movielens and Netﬂix datasets, regardless of in-

clusion of head-items, PureSVD is consistently the top per-

former, beating more detailed and sophisticated latent fac-

tor models. Given its simplicity and its poor design in terms of RMSE optimization, we did not expect this result. In fact, we view it as good news for practitioners of

recommender systems, as PureSVD combines multiple ad-

vantages. First, it is very easy to code, without a need to

tune learning constants, and it fully relies on off-the-shelf optimized SVD packages. This comes with good computational

performance in both oﬄine and online modes. PureSVD

also has the convenience of representing the users as a com-

bination of item features (Eq. (6)), offering designers flexibility in handling new users and new ratings by existing users, and in explaining the reasoning behind the generated recom-

mendations.

An interesting finding, observed on both the Movielens and Netflix datasets, is that when moving to longer-tail items, accuracy improves when raising the dimensionality of the

PureSVD model. This may be related to the fact that

the ﬁrst latent factors of PureSVD capture properties of

the most popular items, while the additional latent factors

represent more reﬁned features related to unpopular items.

Hence, when practitioners use PureSVD, they should pick its

dimensionality while accounting for the fact that the num-

ber of latent factors inﬂuences the quality of long-tail items

diﬀerently than head items.

We would like to oﬀer an explanation as to why PureSVD

could consistently deliver better top-N results than the best RMSE-refined latent factor models. This may have to do with a

limitation of RMSE testing, which concentrates only on the

ratings that the user provided to the system.

[Figure 2: Movielens: (a) recall-at-N and (b) precision-versus-recall on all items.]

[Figure 3: Movielens: (a) recall-at-N and (b) precision-versus-recall on long-tail (94% of items).]

[Figure 4: Netflix: (a) recall-at-N and (b) precision-versus-recall on all items.]

[Figure 5: Netflix: (a) recall-at-N and (b) precision-versus-recall on long-tail (98% of items).]

This way, RMSE (or MAE, for that matter) is measured on a held-out test set containing only items that the user chose to rate, while completely missing any evaluation of the method on

items that the user has never rated. This testing mode fits well with the RMSE-oriented models, which are trained only

on the known ratings, while largely ignoring the missing en-

tries. Yet, such a testing methodology misses much of the

reality, where all items should count, not only those actually

rated by the user in the past. The proposed Top-N based

accuracy measures indeed do better in this respect, by di-

rectly involving all possible items (including unrated ones)

in the testing phase. This may explain the outperformance

of PureSVD, which considers all possible user-item pairs (re-

gardless of rating availability) in the training phase.

Our general advice to practitioners is to consider PureSVD

as a recommender algorithm. Still, there are several unex-

plored ways that may improve PureSVD. First, one can opti-

mize the value imputed at the missing entries. Direct usage

of ready sparse SVD solvers (which usually assume a default

value of zero) would still be possible by translating all given

scores. For example, imputing a value of 3 instead of zero

would be eﬀectively achieved by translating the given star

ratings from the [1...5] range into the [-2...2] range. Second,

one can grant a lower confidence to the imputed values, such that the SVD places more emphasis on the real ratings. For an explanation of how this can be accomplished, see [9]. However, we would expect such a confidence weighting

to signiﬁcantly increase the time complexity of the oﬄine

training.
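As a concrete illustration of the first suggestion, imputing a constant c (e.g., c = 3) instead of zero with an off-the-shelf sparse SVD solver amounts to shifting only the observed ratings before factorization. The sketch below reuses the PureSVD setup assumed earlier and is ours, not part of the original study.

    import numpy as np
    from scipy.sparse.linalg import svds

    def fit_puresvd_shifted(R_sparse, f=150, impute=3.0):
        """PureSVD where missing entries effectively take the value `impute`.

        Shifting only the observed ratings by -impute makes the solver's implicit
        zeros correspond to the desired imputed value ([1..5] becomes [-2..2]).
        """
        R_shifted = R_sparse.copy().astype(np.float64)
        R_shifted.data -= impute                       # shift only the known ratings
        _, _, Qt = svds(R_shifted, k=f)
        return Qt.T                                    # item-factors matrix Q

    def shifted_scores(r_u, Q, impute=3.0):
        """Rule (6) on a shifted dense user vector; higher score = better candidate."""
        r = np.asarray(r_u, dtype=np.float64).copy()
        r[np.nonzero(r)[0]] -= impute                  # apply the same shift as in training
        return (r @ Q) @ Q.T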

6. CONCLUSIONS

Evaluation of recommender systems has long been divided between

accuracy metrics (e.g., precision/recall) and error metrics

(notably, RMSE and MAE). The mathematical convenience

and fitness with formal optimization methods have made

error metrics like RMSE more popular, and they are indeed

dominating the literature. However, it is well recognized

that accuracy measures may be a more natural yardstick, as

they directly assess the quality of top-N recommendations.

This work shows, through an extensive empirical study,

that the convenient assumption that an error metric such as

RMSE can serve as a good proxy for top-N accuracy is ques-

tionable at best. There is no monotonic relation between

error metrics and accuracy metrics. This may call for a re-

evaluation of optimization goals for top-N systems. On the

bright side we have presented simple and eﬃcient variants

of known algorithms, which are useless in RMSE terms, and

yet deliver superior results when pursuing top-N accuracy.

In passing, we have also discussed possible pitfalls in the

design of a test set for conducting a top-N accuracy eval-

uation. In particular, a careless construction of the test

set would make recall and precision strongly biased towards

non-personalized algorithms. An easy solution, which we

adopted, was excluding the extremely popular items from

the test set (while retaining ∼98% of the items). The re-

sulting test set, which emphasizes the rather important non-

trivial items, seems to align better with our expectations.

First, it correctly shows the lower value of non-personalized

algorithms. Second, it shows good behavior for the widely used correlation-based kNN approach, which otherwise

(when evaluated on the full set of items) exhibits extremely

poor results, strongly contradicting the accepted practice.

7. REFERENCES

[1] C. Anderson. The Long Tail: Why the Future of

Business Is Selling Less of More. Hyperion, July 2006.

[2] R. Bambini, P. Cremonesi, and R. Turrin.

Recommender Systems Handbook, chapter A

Recommender System for an IPTV Service Provider:

a Real Large-Scale Production Environment. Springer,

2010.

[3] J. Bennett and S. Lanning. The Netﬂix Prize.

Proceedings of KDD Cup and Workshop, pages 3–6,

2007.

[4] M. W. Berry. Large-scale sparse singular value

computations. The International Journal of

Supercomputer Applications, 6(1):13–49, Spring 1992.

[5] O. Celma and P. Cano. From hits to niches? or how

popular artists can bias music recommendation and

discovery. Las Vegas, USA, August 2008.

[6] P. Cremonesi, E. Lentini, M. Matteucci, and

R. Turrin. An evaluation methodology for

recommender systems. 4th Int. Conf. on Automated

Solutions for Cross Media Content and Multi-channel

Distribution, pages 224–231, Nov 2008.

[7] M. Deshpande and G. Karypis. Item-based top-n

recommendation algorithms. ACM Transactions on

Information Systems (TOIS), 22(1):143–177, 2004.

[8] J. Herlocker, J. Konstan, L. Terveen, and J. Riedl.

Evaluating collaborative ﬁltering recommender

systems. ACM Transactions on Information Systems

(TOIS), 22(1):5–53, 2004.

[9] Y. Hu, Y. Koren, and C. Volinsky. Collaborative

ﬁltering for implicit feedback datasets. Data Mining,

IEEE International Conference on, 0:263–272, 2008.

[10] Y. Koren. Factorization meets the neighborhood: a

multifaceted collaborative ﬁltering model. In KDD ’08:

Proceeding of the 14th ACM SIGKDD international

conference on Knowledge discovery and data mining,

pages 426–434, New York, NY, USA, 2008. ACM.

[11] Y. Koren. Collaborative ﬁltering with temporal

dynamics. In KDD ’09: Proceedings of the 15th ACM

SIGKDD international conference on Knowledge

discovery and data mining, pages 447–456, New York,

NY, USA, 2009. ACM.

[12] Y. Koren, R. M. Bell, and C. Volinsky. Matrix

factorization techniques for recommender systems.

IEEE Computer, 42(8):30–37, 2009.

[13] B. Miller, I. Albert, S. Lam, J. Konstan, and J. Riedl.

MovieLens unplugged: experiences with an

occasionally connected recommender system.

Proceedings of the 8th international conference on

Intelligent user interfaces, pages 263–266, 2003.

[14] A. Paterek. Improving regularized singular value

decomposition for collaborative ﬁltering. Proceedings

of KDD Cup and Workshop, 2007.

[15] B. Sarwar, G. Karypis, J. Konstan, and J. Reidl.

Item-based collaborative ﬁltering recommendation

algorithms. 10th Int. Conf. on World Wide Web,

pages 285–295, 2001.

[16] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl.

Application of Dimensionality Reduction in

Recommender System-A Case Study. Defense

Technical Information Center, 2000.