Collaborative Filtering for Implicit Feedback Datasets
Yifan Hu
AT&T Labs – Research
Florham Park, NJ 07932
Yehuda Koren
Yahoo! Research
Haifa 31905, Israel
Chris Volinsky
AT&T Labs – Research
Florham Park, NJ 07932
Abstract
A common task of recommender systems is to improve
customer experience through personalized recommenda-
tions based on prior implicit feedback. These systems pas-
sively track different sorts of user behavior, such as pur-
chase history, watching habits and browsing activity, in or-
der to model user preferences. Unlike the much more ex-
tensively researched explicit feedback, we do not have any
direct input from the users regarding their preferences. In
particular, we lack substantial evidence on which products
consumers dislike. In this work we identify unique proper-
ties of implicit feedback datasets. We propose treating the
data as indication of positive and negative preference asso-
ciated with vastly varying confidence levels. This leads to a
factor model which is especially tailored for implicit feed-
back recommenders. We also suggest a scalable optimiza-
tion procedure, which scales linearly with the data size. The
algorithm is used successfully within a recommender system
for television shows. It compares favorably with well tuned
implementations of other known methods. In addition, we
offer a novel way to give explanations to recommendations
given by this factor model.
1 Introduction
As e-commerce is growing in popularity, an important
challenge is helping customers sort through a large variety
of offered products to easily find the ones they will enjoy
the most. One of the tools that address this challenge is rec-
ommender systems, which are attracting a lot of attention
recently [1, 4, 12]. These systems provide users with per-
sonalized recommendations for products or services, which
hopefully suit their unique taste and needs. The technology
behind those systems is based on profiling users and prod-
ucts, and finding how to relate them.
Broadly speaking, recommender systems are based on
two different strategies (or combinations thereof).
(Footnote: Work done while author was at AT&T Labs – Research.)
The content-based approach creates a profile for each user or prod-
uct to characterize its nature. As an example, a movie pro-
file could include attributes regarding its genre, the par-
ticipating actors, its box office popularity, etc. User pro-
files might include demographic information or answers to
a suitable questionnaire. The resulting profiles allow pro-
grams to associate users with matching products. However,
content based strategies require gathering external informa-
tion that might not be available or easy to collect.
An alternative strategy, our focus in this work, relies only
on past user behavior without requiring the creation of ex-
plicit profiles. This approach is known as Collaborative
Filtering (CF), a term coined by the developers of the first
recommender system - Tapestry [8]. CF analyzes relation-
ships between users and interdependencies among products,
in order to identify new user-item associations. For exam-
ple, some CF systems identify pairs of items that tend to be
rated similarly or like-minded users with similar history of
rating or purchasing to deduce unknown relationships be-
tween users and items. The only required information is the
past behavior of users, which might be their previous trans-
actions or the way they rate products. A major appeal of CF
is that it is domain free, yet it can address aspects of the data
that are often elusive and very difficult to profile using con-
tent based techniques. While generally being more accu-
rate than content based techniques, CF suffers from the cold
start problem, due to its inability to address products new
to the system, for which content based approaches would be
adequate.
Recommender systems rely on different types of in-
put. Most convenient is the high quality explicit feedback,
which includes explicit input by users regarding their inter-
est in products. For example, Netflix collects star ratings
for movies and TiVo users indicate their preferences for
TV shows by hitting thumbs-up/down buttons. However,
explicit feedback is not always available. Thus, recom-
menders can infer user preferences from the more abundant
implicit feedback, which indirectly reflect opinion through
observing user behavior [14]. Types of implicit feedback
include purchase history, browsing history, search patterns,
or even mouse movements. For example, a user that pur-
chased many books by the same author probably likes that
author.
The vast majority of the literature in the field is focused
on processing explicit feedback, probably owing to the
convenience of using this kind of pure information. However,
in many practical situations recommender systems need to
be centered on implicit feedback. This may reflect reluc-
tance of users to rate products, or limitations of the system
that is unable to collect explicit feedback. In an implicit
model, once the user gives approval to collect usage data,
no additional explicit feedback (e.g. ratings) is required on
the user’s part.
This work conducts an exploration into algorithms
specifically suitable for processing implicit feedback. It re-
flects some of the major lessons and developments that were
achieved while we built a TV shows recommender engine.
Our setup prevents us from actively gathering explicit feed-
back from users, so the system was solely based on implicit
feedback – analyzing watching habits of anonymized users.
It is crucial to identify the unique characteristics of im-
plicit feedback, which prevent the direct use of algorithms
that were designed with explicit feedback in mind. In the
following we list the prime characteristics:
1. No negative feedback. By observing users' behavior,
we can infer which items they probably like and
thus chose to consume. However, it is hard to reliably
infer which items a user did not like. For example, a
user that did not watch a certain show might have done
so because she dislikes the show or just because she
did not know about the show or was not available to
watch it. This fundamental asymmetry does not exist
in explicit feedback where users tell us both what they
like and what they dislike. It has several implications.
For example, explicit recommenders tend to focus on
the gathered information – those user-item pairs that
we know their ratings – which provide a balanced pic-
ture on the user preference. Thus, the remaining user-
item relationships, which typically constitute the vast
majority of the data, are treated as “missing data” and
are omitted from the analysis. This is impossible with
implicit feedback, as concentrating only on the gath-
ered feedback will leave us with the positive feedback,
greatly misrepresenting the full user profile. Hence,
it is crucial to address also the missing data, which is
where most negative feedback is expected to be found.
2. Implicit feedback is inherently noisy. While we passively
track users' behavior, we can only guess their
preferences and true motives. For example, we may
view purchase behavior for an individual, but this does
not necessarily indicate a positive view of the product.
The item may have been purchased as a gift, or per-
haps the user was disappointed with the product. We
may view that a television is on a particular channel at
a particular time, but the viewer might be asleep.
3. The numerical value of explicit feedback indicates
preference, whereas the numerical value of implicit
feedback indicates confidence. Systems based on ex-
plicit feedback let the user express their level of prefer-
ence, e.g. a star rating between 1 (“totally dislike”) and
5 (“really like”). On the other hand, numerical values
of implicit feedback describe the frequency of actions,
e.g., how much time the user watched a certain show,
how frequently a user is buying a certain item, etc. A
larger value does not indicate a higher preference. For
example, the most loved show may be a movie that the
user will watch only once, while there is a series that
the user quite likes and thus is watching every week.
However, the numerical value of the feedback is defi-
nitely useful, as it tells us about the confidence that we
have in a certain observation. A one time event might
be caused by various reasons that have nothing to do
with user preferences. However, a recurring event is
more likely to reflect the user opinion.
4. Evaluation of implicit-feedback recommenders requires
appropriate measures. In the traditional setting where a
user is specifying a numeric score, there are clear met-
rics such as mean squared error to measure success in
prediction. However with implicit models we have to
take into account availability of the item, competition
for the item with other items, and repeat feedback. For
example, if we gather data on television viewing, it is
unclear how to evaluate a show that has been watched
more than once, or how to compare two shows that are
on at the same time, and hence cannot both be watched
by the user.
2 Preliminaries
We reserve special indexing letters for distinguishing
users from items: u, v for users, and i, j for items. The input
data associate users and items through r_ui values, which we
henceforth call observations. For explicit feedback datasets,
those values would be ratings that indicate the preference
of user u for item i, where high values mean stronger pref-
erence. For implicit feedback datasets, those values would
indicate observations of user actions. For example, r_ui can
indicate the number of times u purchased item i or the time
u spent on webpage i. In our TV recommender case, r_ui
indicates how many times u fully watched show i. For ex-
ample, r_ui = 0.7 indicates that u watched 70% of the show,
while for a user that watched the show twice we will set
r_ui = 2.
Explicit ratings are typically unknown for the vast ma-
jority of user-item pairs, hence applicable algorithms work
with the relatively few known ratings while ignoring the
missing ones. However, with implicit feedback it would be
natural to assign values to all r_ui variables. If no action was
observed, r_ui is set to zero, meaning in our examples
zero watching time, or zero purchases on record.
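A minimal sketch of this convention (the user and show names are made up): observations can be stored sparsely, and any user-item pair without a recorded action defaults to r_ui = 0.

```python
# Implicit observations stored sparsely; missing pairs imply r_ui = 0,
# unlike explicit ratings, where missing entries are simply unknown.
watch_counts = {("u1", "show_a"): 0.7,   # u1 watched 70% of show_a
                ("u1", "show_b"): 2.0,   # u1 fully watched show_b twice
                ("u2", "show_a"): 1.0}

def r(u, i):
    # every user-item pair has a value; pairs with no recorded action are zero
    return watch_counts.get((u, i), 0.0)
```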
3 Previous work
3.1 Neighborhood models
The most common approach to CF is based on neigh-
borhood models. Its original form, which was shared by
virtually all earlier CF systems, is user-oriented; see [9]
for a good analysis. Such user-oriented methods estimate
unknown ratings based on recorded ratings of like minded
users. Later, an analogous item-oriented approach [13, 19]
became popular. In those methods, a rating is estimated us-
ing known ratings made by the same user on similar items.
Better scalability and improved accuracy make the item-
oriented approach more favorable in many cases [2, 19, 20].
In addition, item-oriented methods are more amenable to
explaining the reasoning behind predictions. This is be-
cause users are familiar with items previously preferred by
them, but usually do not know those allegedly like minded
users.
Central to most item-oriented approaches is a similarity
measure between items, where s_ij denotes the similarity of
items i and j. Frequently, it is based on the Pearson correlation
coefficient. Our goal is to predict r_ui, the unobserved value
for item i by user u. Using the similarity measure, we identify
the k items rated by u that are most similar to i. This
set of k neighbors is denoted by S^k(i; u). The predicted
value of r_ui is taken as a weighted average of the ratings for
neighboring items:

\[ \hat{r}_{ui} = \frac{\sum_{j \in S^k(i;u)} s_{ij} r_{uj}}{\sum_{j \in S^k(i;u)} s_{ij}} \tag{1} \]
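As an illustration of this weighted average (Eq. 1), the following sketch predicts a rating from made-up similarity and rating values; the similarities are assumed inputs, not computed from data here.

```python
# Item-oriented kNN prediction: weighted average over the k rated items
# most similar to the target item i (Eq. 1). Toy values for illustration.
def predict(u_ratings, sims_i, k=2):
    """u_ratings: {item: r_uj by user u}; sims_i: {item: s_ij to target i}."""
    # S^k(i; u): the k items rated by u that are most similar to target i
    neighbors = sorted(u_ratings, key=lambda j: sims_i.get(j, 0.0), reverse=True)[:k]
    num = sum(sims_i[j] * u_ratings[j] for j in neighbors)
    den = sum(sims_i[j] for j in neighbors)
    return num / den

u_ratings = {"a": 5.0, "b": 3.0, "c": 1.0}   # the user's known ratings
sims_i = {"a": 0.9, "b": 0.5, "c": 0.1}      # similarity of each item to i
print(predict(u_ratings, sims_i))            # (0.9*5 + 0.5*3) / (0.9+0.5)
```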
Some enhancements of this scheme are well practiced for
explicit feedback, such as correcting for biases caused by
varying mean ratings of different users and items. Those
modifications are less relevant to implicit feedback datasets,
where instead of having ratings which are all on the same
scale, we use frequencies in which items are consumed by
the same user. Frequencies for disparate users might have
very different scale depending on the application, and it is
less clear how to calculate similarities. A good discussion
on how to use an item-oriented approach with implicit feed-
back is given by Deshpande and Karypis [6].
All item-oriented models share a disadvantage with regard
to implicit feedback: they do not provide the flexibility to
make a distinction between user preferences and the confi-
dence we might have in those preferences.
3.2 Latent factor models
Latent factor models comprise an alternative approach
to Collaborative Filtering with the more holistic goal to un-
cover latent features that explain observed ratings; exam-
ples include pLSA [11], neural networks [16], and Latent
Dirichlet Allocation [5]. We will focus on models that are
induced by Singular Value Decomposition (SVD) of the
user-item observations matrix. Recently, SVD models have
gained popularity, thanks to their attractive accuracy and
scalability; see, e.g., [3, 7, 15, 17, 20]. A typical model
associates each user u with a user-factors vector x_u ∈ R^f, and
each item i with an item-factors vector y_i ∈ R^f. The prediction
is done by taking an inner product, i.e., \hat{r}_{ui} = x_u^T y_i.
The more involved part is parameter estimation. Many of
the recent works, applied to explicit feedback datasets, sug-
gested modeling directly only the observed ratings, while
avoiding overfitting through an adequate regularized model,
such as:

\[ \min_{x_\ast, y_\ast} \sum_{r_{ui}\ \text{is known}} \left( r_{ui} - x_u^T y_i \right)^2 + \lambda \left( \|x_u\|^2 + \|y_i\|^2 \right) \tag{2} \]
Here, λ is used for regularizing the model. Parameters are
often learnt by stochastic gradient descent; see, e.g., [7, 15,
20]. The results, as reported on the largest available dataset
– the Netflix dataset [4] – tend to be consistently superior
to those achieved by neighborhood models. In this work we
borrow this approach to implicit feedback datasets, which
requires modifications both in the model formulation and in
the optimization technique.
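As a rough sketch of this approach (not the implementations cited above), the objective in (2) can be minimized by stochastic gradient descent over the known ratings only; the tiny dataset, factor dimension, learning rate, and regularization constant below are arbitrary illustrative choices.

```python
# SGD sketch for the explicit-feedback objective (Eq. 2): loop over known
# ratings only, stepping factors against the regularized squared error.
import random

random.seed(0)
f, lr, lam = 2, 0.05, 0.02
ratings = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 4.0}   # (user, item) -> r_ui
x = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(2)]
y = [[random.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(2)]

for _ in range(2000):
    for (u, i), r_ui in ratings.items():
        err = r_ui - sum(a * b for a, b in zip(x[u], y[i]))
        for a in range(f):
            # simultaneous update of the user and item factor coordinates
            x[u][a], y[i][a] = (x[u][a] + lr * (err * y[i][a] - lam * x[u][a]),
                                y[i][a] + lr * (err * x[u][a] - lam * y[i][a]))
```

After convergence the model reproduces the known ratings up to the shrinkage induced by the regularization term.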
4 Our model
In this section we describe our model for implicit feed-
back. First, we need to formalize the notion of confidence
which the r_ui variables measure. To this end, let us introduce
a set of binary variables p_ui, which indicate the preference
of user u for item i. The p_ui values are derived by
binarizing the r_ui values:

\[ p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & r_{ui} = 0 \end{cases} \]

In other words, if a user u consumed item i (r_ui > 0),
then we have an indication that u likes i (p_ui = 1). On
the other hand, if u never consumed i, we believe there is no
preference (p_ui = 0). However, our beliefs are associated with
greatly varying confidence levels. First, by the nature of the
data zero values of pui are associated with low confidence,
as not taking any positive action on an item can stem from
many other reasons beyond not liking it. For example, the
user might be unaware of the existence of the item, or un-
able to consume it due to its price or limited availability. In
addition, consuming an item can also be the result of fac-
tors different from preferring it. For example, a user may
watch a TV show just because she is staying on the channel
of the previously watched show. Or a consumer may buy
an item as gift for someone else, despite not liking the item
for himself. Thus, we will have different confidence levels
also among items that are indicated to be preferred by the
user. In general, as rui grows, we have a stronger indication
that the user indeed likes the item. Consequently, we intro-
duce a set of variables, cui , which measure our confidence
in observing pui . A plausible choice for cui would be:
cui = 1 + αrui
This way, we have some minimal confidence in pui for ev-
ery user-item pair, but as we observe more evidence for pos-
itive preference, our confidence in pui = 1 increases ac-
cordingly. The rate of increase is controlled by the constant
α. In our experiments, setting α= 40 was found to produce
good results.
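The derivation of preferences and confidences from raw observations can be sketched in a few lines (α = 40 as in the experiments; the sample values are illustrative):

```python
def preference(r_ui):
    # p_ui: binarized indicator of any positive action
    return 1 if r_ui > 0 else 0

def confidence(r_ui, alpha=40.0):
    # c_ui = 1 + alpha * r_ui: minimal confidence 1 even with no observation
    return 1.0 + alpha * r_ui

print(preference(2.0), confidence(2.0))   # show watched twice: 1, 81.0
print(preference(0.0), confidence(0.0))   # no recorded action: 0, 1.0
```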
Our goal is to find a vector x_u ∈ R^f for each user u,
and a vector y_i ∈ R^f for each item i that will factor user
preferences. In other words, preferences are assumed to
be the inner products: p_ui = x_u^T y_i. These vectors will be
known as the user-factors and the item-factors, respectively.
Essentially, the vectors strive to map users and items into
a common latent factor space where they can be directly
compared. This is similar to matrix factorization techniques
which are popular for explicit feedback data, with two im-
portant distinctions: (1) We need to account for the varying
confidence levels, (2) Optimization should account for all
possible u, i pairs, rather than only those corresponding to
observed data. Accordingly, factors are computed by minimizing
the following cost function:

\[ \min_{x_\ast, y_\ast} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right) \tag{3} \]

The \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right) term is necessary for
regularizing the model such that it will not overfit the training
data. The exact value of the parameter λ is data-dependent and
determined by cross validation.
Notice that the cost function contains m · n terms, where
m is the number of users and n is the number of items.
For typical datasets m · n can easily reach a few billion.
This huge number of terms prevents most direct optimization
techniques such as stochastic gradient descent, which
was widely used for explicit feedback datasets. Thus, we
suggest an alternative efficient optimization process, as
follows.
Observe that when either the user-factors or the item-
factors are fixed, the cost function becomes quadratic so
its global minimum can be readily computed. This leads
to an alternating-least-squares optimization process, where
we alternate between re-computing user-factors and item-
factors, and each step is guaranteed to lower the value of
the cost function. Alternating least squares was used for
explicit feedback datasets [2], where unknown values were
treated as missing, leading to a sparse objective function.
The implicit feedback setup requires a different strategy to
overcome the dense cost function and to integrate the con-
fidence levels. We address these by exploiting the structure
of the variables so that this process can be implemented to
be highly scalable.
The first step is recomputing all user-factors. Let us assume
that all item-factors are gathered within an n × f matrix
Y. Before looping through all users, we compute the
f × f matrix Y^T Y in time O(f^2 n). For each user u, let us
define the diagonal n × n matrix C^u where C^u_ii = c_ui, and
also the vector p(u) ∈ R^n that contains all the preferences
of u (the p_ui values). By differentiation we find an analytic
expression for x_u that minimizes the cost function (3):

\[ x_u = \left( Y^T C^u Y + \lambda I \right)^{-1} Y^T C^u p(u) \tag{4} \]

A computational bottleneck here is computing Y^T C^u Y,
whose naive calculation will require time O(f^2 n) (for each
of the m users). A significant speedup is achieved by using
the fact that Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y. Now,
Y^T Y is independent of u and was already precomputed.
As for Y^T (C^u - I) Y, notice that C^u - I has only n_u non-zero
elements, where n_u is the number of items for which
r_ui > 0; typically n_u ≪ n. Similarly, C^u p(u) contains
just n_u non-zero elements. Consequently, recomputation of
x_u is performed in time O(f^2 n_u + f^3). Here, we assumed
O(f^3) time for the matrix inversion (Y^T C^u Y + \lambda I)^{-1};
more efficient algorithms exist, but they are probably less
relevant for the typically small values of f. This
step is performed over each of the m users, so the total running
time is O(f^2 N + f^3 m), where N is the overall number
of non-zero observations, that is, N = \sum_u n_u. Importantly,
the running time is linear in the size of the input. Typical
values of f lie between 20 and 200; see the experimental study in
Sec. 6.
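A sketch of one user-factor update, including the Y^T(C^u − I)Y speedup, under assumed toy values for Y, λ, and the user's watch counts. This illustrates Eq. (4), not the authors' implementation; the tiny pure-Python solver stands in for a real linear-algebra library.

```python
# One user-factor update (Eq. 4): A x_u = rhs, where A = Y^T C^u Y + lam*I is
# assembled as Y^T Y (precomputed) plus corrections from watched items only.
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting for a small f x f system
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                k = M[r][c] / M[c][c]
                M[r] = [x - k * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def user_factor(Y, r_u, alpha=40.0, lam=0.1):
    f = len(Y[0])
    YtY = matmul([list(col) for col in zip(*Y)], Y)   # precomputed once per sweep
    A = [[YtY[a][b] + (lam if a == b else 0.0) for b in range(f)] for a in range(f)]
    rhs = [0.0] * f
    for i, r in r_u.items():                # only the n_u items with r_ui > 0
        c = 1.0 + alpha * r                 # confidence c_ui
        for a in range(f):
            rhs[a] += c * Y[i][a]           # Y^T C^u p(u); p_ui = 1 here
            for b in range(f):
                A[a][b] += (c - 1.0) * Y[i][a] * Y[i][b]   # Y^T (C^u - I) Y
    return solve(A, rhs)

Y = [[0.5, 0.1], [0.2, 0.7], [0.9, 0.3]]    # 3 items, f = 2 (toy values)
x_u = user_factor(Y, {0: 2.0, 2: 1.0})      # user watched items 0 and 2
```

Note how the loop touches only the n_u watched items, which is what makes the update O(f^2 n_u + f^3) rather than O(f^2 n).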
A recomputation of the user-factors is followed by a recomputation
of all item-factors in a parallel fashion. We
arrange all user-factors within an m × f matrix X. First
we compute the f × f matrix X^T X in time O(f^2 m). For
each item i, we define the diagonal m × m matrix C^i where
C^i_uu = c_ui, and also the vector p(i) ∈ R^m that contains all
the preferences for i. Then we solve:

\[ y_i = \left( X^T C^i X + \lambda I \right)^{-1} X^T C^i p(i) \tag{5} \]

Using the same technique as with the user-factors, the running
time of this step would be O(f^2 N + f^3 n). We employ
a few sweeps of paired recomputation of user- and item-factors,
till they stabilize. A typical number of sweeps is 10.
The whole process scales linearly with the size of the data.
After computing the user- and item-factors, we recommend
to user u the K available items with the largest value of
\hat{p}_{ui} = x_u^T y_i, where \hat{p}_{ui} symbolizes the predicted
preference of user u for item i.
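Producing the top-K list from learned factors can be sketched as follows; the factor values are toy numbers rather than trained output, and excluding already-watched items follows the evaluation setup of Sec. 6.

```python
def recommend(x_u, Y, watched, K=2):
    # score p_hat_ui = x_u . y_i for every unwatched item, return the top K
    scores = {i: sum(a * b for a, b in zip(x_u, y_i))
              for i, y_i in Y.items() if i not in watched}
    return sorted(scores, key=scores.get, reverse=True)[:K]

x_u = [1.2, -0.4]                                   # toy user factors (f = 2)
Y = {"a": [0.9, 0.1], "b": [0.2, 0.8],
     "c": [0.7, -0.5], "d": [0.1, 0.1]}             # toy item factors
print(recommend(x_u, Y, watched={"a"}))             # ['c', 'd']
```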
Now that the basic description of our technique is complete,
we would like to discuss it further, as some of our
decisions can be modified. For example, one can derive p_ui
differently from r_ui, by setting a minimum threshold on r_ui
for the corresponding p_ui to be non-zero. Similarly, there
are many ways to transform r_ui into a confidence level c_ui.
One alternative method that also worked well for us is setting

\[ c_{ui} = 1 + \alpha \log(1 + r_{ui} / \epsilon). \tag{6} \]
Regardless of the exact variant of the scheme, it is impor-
tant to realize its main properties, which address the unique
characteristics of implicit feedback:
1. Transferring the raw observations (r_ui) into two separate
magnitudes with distinct interpretations: preferences
(p_ui) and confidence levels (c_ui). This better reflects
the nature of the data and is essential to improving
prediction accuracy, as shown in the experimental
study (Sec. 6).
2. An algorithm that addresses all possible (n · m) user-item
combinations in a linear running time, by exploiting
the algebraic structure of the variables.
5 Explaining recommendations
It is well accepted [10] that a good recommendation
should be accompanied by an explanation, which is a
short description of why a specific product was recommended
to the user. This helps improve the users'
trust in the system and their ability to put recommenda-
tions in the right perspective. In addition, it is an invalu-
able means for debugging the system and tracking down
the source of unexpected behavior. Providing explana-
tions with neighborhood-based (or, “memory-based”) tech-
niques is straightforward, as recommendations are directly
inferred from past users’ behavior. However, for latent fac-
tor models explanations become trickier, as all past user
actions are abstracted via the user factors thereby block-
ing a direct relation between past user actions and the out-
put recommendations. Interestingly, our alternating least
squares model enables a novel way to compute explanations.
The key is replacing the user-factors by using Eq. (4):
x_u = (Y^T C^u Y + \lambda I)^{-1} Y^T C^u p(u). Thus, the predicted
preference of user u for item i, \hat{p}_{ui} = y_i^T x_u, becomes
y_i^T (Y^T C^u Y + \lambda I)^{-1} Y^T C^u p(u). This expression
can be simplified by introducing some new notation. Let us
denote the f × f matrix (Y^T C^u Y + \lambda I)^{-1} as W^u, which
should be considered a weighting matrix associated with
user u. Accordingly, the weighted similarity between items
i and j from u's viewpoint is denoted by s^u_ij = y_i^T W^u y_j.
Using this new notation, the predicted preference of u for
item i is rewritten as:

\[ \hat{p}_{ui} = \sum_{j : r_{uj} > 0} s^u_{ij} c_{uj} \tag{7} \]
This reduces our latent factor model into a linear model
that predicts preferences as a linear function of past actions
(ruj >0), weighted by item-item similarity. Each past ac-
tion receives a separate term in forming the predicted ˆpui,
and thus we can isolate its unique contribution. The actions
associated with the highest contribution are identified as the
major explanation behind the recommendation. In addition,
we can further break the contribution of each individual past
action into two separate sources: the significance of the
relation to user u, c_uj, and the similarity to the target item i,
s^u_ij.
This bears much resemblance to item-oriented neighborhood
models, and it enables the desired ability to explain
computed predictions. If we further adopt this viewpoint,
we can consider our model as a powerful pre-
processor for a neighborhood based method, where item
similarities are learnt through a principled optimization pro-
cess. In addition, similarities between items become depen-
dent on the specific user in question, reflecting the fact that
different users do not completely agree on which items are
similar.
Giving an explanation through (7) involves solving a linear
system to obtain (Y^T C^u Y + \lambda I)^{-1} y_i, followed by
matrix-vector products, and can be done in time O(f^2 n_u + f^3),
assuming that Y^T Y is precomputed.
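A sketch of this explanation computation with f = 2, so that W^u can be inverted in closed form. The item factors and confidences are illustrative toy values, and the helper names are our own, not the production code.

```python
def inv2(M):
    # closed-form inverse of a 2x2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def explain(i, Y, conf_u, lam=0.1):
    # A = Y^T C^u Y + lam*I, with c_uj = 1 for unwatched items
    A = [[lam, 0.0], [0.0, lam]]
    for j, y in Y.items():
        c = conf_u.get(j, 1.0)
        for a in range(2):
            for b in range(2):
                A[a][b] += c * y[a] * y[b]
    W = inv2(A)                                  # W^u, the per-user weighting
    wy = [W[0][0] * Y[i][0] + W[0][1] * Y[i][1],
          W[1][0] * Y[i][0] + W[1][1] * Y[i][1]]
    # contribution of each past action j: s^u_ij * c_uj (W^u is symmetric)
    return {j: (Y[j][0] * wy[0] + Y[j][1] * wy[1]) * conf_u[j] for j in conf_u}

Y = {"a": [0.5, 0.1], "b": [0.2, 0.7], "c": [0.9, 0.3]}
contrib = explain("b", Y, conf_u={"a": 81.0, "c": 41.0})
p_hat = sum(contrib.values())   # equals x_u . y_b from the ALS solution
```

Sorting `contrib` identifies the past actions that contribute most to the recommendation.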
6 Experimental study
Data description Our analysis is based on data from a
digital television service. We were able to collect data on
about 300,000 set top boxes. All data was collected in ac-
cordance with appropriate end user agreements and privacy
policies. The analysis was done with data that was aggre-
gated and/or fully anonymized. No personally identifiable
information was collected in connection with this research.
We collected all channel tune events for these users, in-
dicating the channel the set-top box was tuned into, and a
time stamp. There are approximately 17,000 unique pro-
grams which aired during a four week period. The training
data contains r_ui values, for each user u and program i,
which represent how many times user u watched program
i (a related quantity is the number of minutes that a given
show was watched; for all of our analysis we focus on
show-length based units). Notice that r_ui is a real value, as users may
watch parts of shows. After aggregating multiple watches
of the same program, the number of non-zero r_ui values is
about 32 million.
In addition, we use a similarly constructed test set, which
is based on all channel tune events during the single week
following a 4-week training period. Our system is trained
using the recent 4 weeks of data in order to generate pre-
dictions of what users will watch in the ensuing week. The
training period of 4 weeks is chosen based on an experi-
mental study which showed that a shorter period tends to
deteriorate the prediction results, while a longer period does
not add much value (since television schedules change sea-
sonally, long training periods do not necessarily have an
advantage, even though we found that our core model is
robust enough to avoid being contaminated by the season-
ality). The observations in the test set are denoted by r^t_ui
(distinguished with a superscript t).
One characteristic of television watching is the tendency
to repetitively watch the same programs every week. It is
much more valuable to a user to be recommended programs
that she has not watched recently, or that she is not aware
of. Thus, in our default setting, for each user we remove
the “easy” predictions from the test set corresponding to the
shows that had been watched by that user during the training
period. To make the test set even more accurate, we toggle
to zero all entries with r^t_ui < 0.5, as watching less than half
of a program is not a strong indication that a user likes the
program. This leaves us with about 2 million non-zero r^t_ui
values in the test set.
The tendency to watch the same programs repeatedly
also makes r_ui vary significantly over a large range. While
there are a lot of viewing events close to 0 (channel flip-
ping), 1, 2 or 3 (watching a film or a couple of episodes of a
series), there are also some viewing events that accumulate
to hundreds (e.g., a DVR recording the same program for
many hours per day over a period of 4 weeks). Therefore
we employ the log scaling scheme (6) with ε = 10^{-8}.
One other important adjustment is needed. We observe
many cases where a single channel is watched for many
hours. It is likely that the initial show that was tuned into is
of interest to the viewer, while the subsequent shows are of
decreasing interest. The television might simply have been
left on while the viewer does something else (or sleeps!).
We call this a momentum effect, and programs watched due
to momentum are less expected to reflect real preference. To
overcome this effect, we down-weight the second and sub-
sequent shows after a channel tuning event. More specifi-
cally, for the t-th show after a channel tune, we assign it a
weighting of

\[ \frac{e^{-(at-b)}}{1 + e^{-(at-b)}}. \]

Experimentally we found a = 2 and b = 6 to work well, and
the effect is intuitive: the third show after the channel tune
gets its r_ui value halved, and by the fifth show without a
channel change, r_ui is reduced by 99%.
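The down-weighting rule is easy to check numerically (a = 2 and b = 6 as reported; `momentum_weight` is an assumed helper name):

```python
import math

def momentum_weight(t, a=2.0, b=6.0):
    # weight applied to the t-th show after a channel tune
    z = math.exp(-(a * t - b))
    return z / (1.0 + z)

print(momentum_weight(3))   # 0.5: the third show's r_ui is halved
print(momentum_weight(5))   # ~0.018: essentially discounted away
```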
Evaluation methodology We evaluate a scenario where
we generate for each user an ordered list of the shows,
sorted from the one predicted to be most preferred till the
least preferred one. Then, we present a prefix of the list to
the user as the recommended shows. It is important to real-
ize that we do not have a reliable feedback regarding which
programs are unloved, as not watching a program can stem
from multiple different reasons. In addition, we are cur-
rently unable to track user reactions to our recommenda-
tions. Thus, precision based metrics are not very appropri-
ate, as they require knowing which programs are undesired
to a user. However, watching a program is an indication of
liking it, making recall-oriented measures applicable.
We denote by rank_ui the percentile-ranking of program
i within the ordered list of all programs prepared for user
u. This way, rank_ui = 0% would mean that program i is
predicted to be the most desirable for user u, thus preceding
all other programs in the list. On the other hand, rank_ui =
100% indicates that program i is predicted to be the least
preferred for user u, thus placed at the end of the list. (We
opted for using percentile-ranks rather than absolute ranks
in order to make our discussion general and independent of
the number of programs.) Our basic quality measure is the
expected percentile ranking of a watching unit in the test
period, which is:

\[ \overline{\text{rank}} = \frac{\sum_{u,i} r^t_{ui}\, \text{rank}_{ui}}{\sum_{u,i} r^t_{ui}} \tag{8} \]
Lower values of rank are more desirable, as they indicate
ranking actually watched shows closer to the top of the rec-
ommendation lists. Notice that for random predictions, the
expected value of rankui is 50% (placing iin the middle of
the sorted list). Thus, rank >50% indicates an algorithm
no better than random.
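Eq. (8) can be sketched directly; the observation and percentile values below are made up for illustration.

```python
def expected_rank(test_obs, percentile):
    """test_obs and percentile: {(u, i): value}; percentile in [0, 100]."""
    # weighted average of percentile positions, weighted by r^t_ui
    num = sum(r * percentile[ui] for ui, r in test_obs.items())
    den = sum(test_obs.values())
    return num / den

test_obs = {("u1", "a"): 2.0, ("u1", "b"): 1.0}       # r^t_ui watch units
percentile = {("u1", "a"): 10.0, ("u1", "b"): 40.0}   # rank_ui in percent
print(expected_rank(test_obs, percentile))            # (2*10 + 1*40) / 3
```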
Evaluation results We implemented our model with different
numbers of factors (f), ranging from 10 to 200. In ad-
dition, we implemented two other competing models. The
first model is sorting all shows based on their popularity, so
that the top recommended shows are the most popular ones.
This naive measure is surprisingly powerful, as crowds tend
to heavily concentrate on few of the many thousands avail-
able shows. We take this as a baseline value.
The second model is neighborhood based (item-item), along the lines described in Sec. 3.1. We explored many variants of this scheme, and found the following two decisions to yield the best results: (1) take all items as "neighbors", not only a small subset of most similar items; (2) use cosine similarity for measuring item-item similarity. Formally, for an item i, let us arrange within $r_i \in \mathbb{R}^m$ the $r_{ui}$ values associated with all m users. Then

$$s_{ij} = \frac{r_i^T r_j}{\|r_i\|\,\|r_j\|}.$$

The predicted preference of user u for show i is $\sum_j s_{ij} r_{uj}$. As a side remark, we would like to mention that we recommend very different settings for neighborhood based techniques when applied to explicit feedback data.

[Figure 1. Comparing factor model with popularity ranking and neighborhood model.]
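The item-item baseline above can be sketched in a few lines of NumPy. This is an illustrative implementation under the stated choices (all items as neighbors, cosine similarity); the dense matrix `R` of r_ui values is an assumption made here for brevity, since a production system would use sparse representations.

```python
import numpy as np

def item_item_scores(R):
    """Neighborhood baseline: cosine similarity over all item pairs.

    R: (m users x n items) array of r_ui values (e.g., watching times).
    Returns an (m x n) array of predicted preferences sum_j s_ij * r_uj.
    """
    norms = np.linalg.norm(R, axis=0)        # per-item column norms
    norms[norms == 0] = 1.0                  # guard against all-zero items
    S = (R.T @ R) / np.outer(norms, norms)   # s_ij = r_i^T r_j / (|r_i||r_j|)
    return R @ S                             # predicted preference per (u, i)
```

Note that the sum over j includes the item itself (s_ii = 1), which is consistent with taking all items as neighbors in the formula above.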
Figure 1 shows the measured values of rank with different numbers of factors, along with the results of the popularity ranking (cyan, horizontal line) and the neighborhood model (red, horizontal line). We can see that based only on
popularity, we can achieve rank = 16.46%, which is much
lower than the rank = 50% that would be achieved by a
random predictor. However, a popularity based predictor is
clearly non-personalized and treats all users equally. The
neighborhood based method offers a significant improve-
ment (rank = 10.74%) achieved by personalizing recom-
mendations. Even better results are obtained by our fac-
tor model, which offers a more principled approach to the
problem. Results keep improving as the number of factors increases, until reaching rank = 8.35% for 200 factors. Thus, we recommend working with the highest number of factors feasible within computational limitations.
We further dig into the quality of recommendations, by
studying the cumulative distribution function of rankui.
Here, we concentrate on the model with 100 factors,
and compare results to the popularity-based and the
neighborhood-based techniques, as shown in Fig. 2. We
asked the following: what is the distribution of percentiles
for the shows that were actually watched in the test set? If
our model does well, all of the watched shows will have low
percentiles. From the figure, we see that a watched show
is in the top 1% of the recommendations from our model
about 27% of the time. These results compare favorably to
the neighborhood based approach, and are much better than
the baseline popularity-based model.
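The top-x% statistic underlying this cumulative distribution can be sketched as follows; `test_items` (watched shows per user in the test set) and `ranked_items` (each user's full recommendation list) are hypothetical structures assumed for illustration.

```python
def topx_hit_rate(test_items, ranked_items, x_pct):
    """Fraction of watched test shows ranked in the top x% (cf. Fig. 2).

    test_items:   dict mapping user -> list of shows watched in the test set
    ranked_items: dict mapping user -> full recommendation list, best first
    """
    hits, total = 0, 0
    for u, watched in test_items.items():
        items = ranked_items[u]
        cutoff = max(1, int(len(items) * x_pct / 100.0))
        top = set(items[:cutoff])            # the user's top x% of shows
        for i in watched:
            total += 1
            hits += i in top
    return hits / total
```

Evaluating this function at x = 1, 2, ..., 5 traces out a curve like those in Fig. 2.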
Here we would like to comment that results would be much better had we left all previously watched programs in the test set (that is, without removing the user-program events that already occurred in the training period).

[Figure 2. Cumulative distribution function of the probability that a show watched in the test set falls within top x% of recommended shows.]

Predicting re-watching of a program is much easier than predicting a first-time view of a program. This is shown by the black dotted line in the figure, which evaluates our algorithm when
Although suggesting a previously watched show might not be very exciting, it does prove useful. For example, our system informs users on which programs are running today that might interest them. Here, users are not looking to be surprised, but rather to be reminded not to miss a favorite show. The high predictive accuracy of retrieving previously watched shows comes in handy for this task.
We would also like to evaluate our decision to transform the raw observations (the rui values) into distinct preference-confidence pairs (pui, cui). Other possible models were studied as well. First, we consider a model which works directly on the given observations. Thus, our model (3) is replaced with a factor model that strives to minimize:
$$\min_{x,y} \sum_{u,i} \left(r_{ui} - x_u^T y_i\right)^2 + \lambda_1 \left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right) \qquad (9)$$
Notice that this is a regularized version of the dense SVD algorithm, which is an established approach to collaborative filtering [18]. Results without regularization (λ1 = 0) were very poor and could not improve upon the popularity-based model. Better results are achieved when the model is regularized – here, we used λ1 = 500, which proved to deliver the best recommendations.
While results consistently outperform those of the popu-
larity model, they were substantially poorer even than the
neighborhood model. For example, for 50 factors we got
rank = 13.63%, while 100 factors yield rank = 13.40%.
This relatively low quality is not surprising as earlier we
argued that taking rui as raw preferences is not sensible.
Therefore, we also tried another model, which factorizes
the derived binary preference values, resulting in:
$$\min_{x,y} \sum_{u,i} \left(p_{ui} - x_u^T y_i\right)^2 + \lambda_2 \left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right) \qquad (10)$$
The model was regularized with λ2 = 150. Results are indeed better than those of model (9), leading to rank = 10.72% with 50 factors and rank = 10.49% with 100 factors. This is slightly better than the results achieved with the neighborhood model. However, this is still materially inferior to our full model, which results in rank = 8.93% and 8.56% for 50 and 100 factors, respectively. This shows the importance of augmenting (10) with confidence levels as in (3).
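The preference-confidence transformation that distinguishes the full model (3) from the binary model (10) can be sketched as follows. The linear confidence schedule c_ui = 1 + α·r_ui follows the form proposed earlier in the paper; the dense matrix representation and the default α = 40 are used here purely for illustration.

```python
import numpy as np

def preference_confidence(R, alpha=40.0):
    """Transform raw observations r_ui into (p_ui, c_ui) pairs.

    p_ui = 1 if r_ui > 0 else 0   (binary preference, as in model (10))
    c_ui = 1 + alpha * r_ui       (confidence; alpha is a tunable rate)
    """
    P = (R > 0).astype(float)
    C = 1.0 + alpha * R
    return P, C
```

Model (10) fits P alone with uniform weights; the full model instead weights each squared error (p_ui - x_u^T y_i)^2 by c_ui, which is what produces the gap between rank ≈ 10.5% and rank ≈ 8.6% reported above.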
We now analyze the performance of the full model (with
100 factors) on different types of shows and users. Dif-
ferent shows receive significantly varying watching time in
the training data. Some shows are popular and watched a
lot, while others are barely watched. We split the positive
observations in the test set into 15 equal bins, based on increasing show popularity. We measured the performance of our model in each bin, ranging from bin 1 (least popular
shows) to bin 15 (most popular shows). As Fig. 3 (blue line)
shows, there is a big gap in the accuracy of our model, as
it becomes much easier to predict popular programs, while
it is increasingly difficult to predict watching a non popu-
lar show. To some extent, the model prefers to stay with
the safe recommendations of familiar shows, on which it
gathered enough data and can analyze well. Interestingly,
this effect is not carried over to partitioning users accord-
ing to their watching time. Now, we split users into bins
based on their overall watching time; see Fig. 3 (red line).
Except for the first bin, which represents users with almost
no watching history, the model performance is quite similar
for all other user groups. This was somewhat unexpected,
as our experience with explicit feedback datasets was that
as we gather more information on users, prediction qual-
ity significantly increases. The possible explanation to why
the model could not do much better for heavy watchers is
that those largely represent heterogeneous accounts, where
many different people are watching the same TV.
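The binning used in this analysis can be sketched as follows; `test_events` (positive test observations as (user, show) pairs) and `train_pop` (a per-show popularity score from the training data) are hypothetical structures assumed for illustration.

```python
def popularity_bins(test_events, train_pop, n_bins=15):
    """Split positive test observations into equal-sized bins by popularity.

    Returns a list of n_bins lists, from least popular (bin 1) to most
    popular (bin n_bins). Leftover events that do not fill a whole bin
    are dropped in this simple sketch.
    """
    # sort events by the training popularity of their show, least popular first
    events = sorted(test_events, key=lambda e: train_pop[e[1]])
    size = len(events) // n_bins
    return [events[k * size:(k + 1) * size] for k in range(n_bins)]
```

Computing the expected percentile ranking separately within each bin yields curves like those in Fig. 3.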
Finally, we demonstrate the utility of our recommenda-
tion explanations. Explanations for recommendations are
common for neighbor methods since the system can al-
ways return the nearest neighbors of the recommended item.
However, there is no previous work discussing how to do
explanations for matrix decomposition methods, which in
our experience outperform the neighbor based methods. Ta-
ble 1 shows three recommended shows for one user in our
study. Following the methods in Section 5, we show the top
five watched shows which explain the recommended show
(shown in bold). These explanations make sense: the reality
show So You Think You Can Dance is explained by other reality shows, while Spider-Man is explained by other comic-related shows and Life in the E.R. is explained by medical documentaries.

[Figure 3. Analyzing the performance of the factor model by segregating users/shows based on different criteria (show popularity vs. user watching time).]

These common-sense explanations help the
user understand why certain shows are recommended, and
are similar to explanations returned by neighbor methods.
We also report the total percent of the recommendation ac-
counted for by the top 5. In this case, the top five shows
only explain between 35 and 40% of the recommendation,
indicating that many other watched shows give input to the
recommendations.
7 Discussion
In this work we studied collaborative filtering on datasets
with implicit feedback, which is a very common situation.
One of our main findings is that implicit user observations
should be transformed into two paired magnitudes: pref-
erences and confidence levels. In other words, for each
user-item pair, we derive from the input data an estimate
to whether the user would like or dislike the item (“pref-
erence”) and couple this estimate with a confidence level.
This preference-confidence partition has no parallel in the
widely studied explicit-feedback datasets, yet serves a key
role in analyzing implicit feedback.
We provide a latent factor algorithm that directly addresses the preference-confidence paradigm. Unlike explicit datasets, here the model should take all user-item preferences as an input, including those which are not related to any input observation (thus hinting at a zero preference).
This is crucial, as the given observations are inherently bi-
ased towards a positive preference, and thus do not reflect
well the user profile. However, taking all user-item values
as an input to the model raises serious scalability issues –
the number of all those pairs tends to significantly exceed
the input size since a typical user would provide feedback
Table 1. Three recommendations with explanations for a single user in our study. Each recommended show is recommended due to a unique set of already-watched shows by this user.

So You Think You Can Dance: Hell's Kitchen, Access Hollywood, Judge Judy, Moment of Truth, Don't Forget the Lyrics (Total Rec = 36%)
Spider-Man: Batman: The Series, Superman: The Series, Pinky and The Brain, Power Rangers, The Legend of Tarzan (Total Rec = 40%)
Life In The E.R.: Adoption Stories, Deliver Me, Baby Diaries, I Lost It!, Bringing Home Baby (Total Rec = 35%)
only on a small fraction of the available items. We address
this by exploiting the algebraic structure of the model, lead-
ing to an algorithm that scales linearly with the input size
while addressing the full scope of user-item pairs without
resorting to any sub-sampling.
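The algebraic structure referred to here is the identity Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y: since C^u - I is zero except at the items the user watched, Y^T Y can be precomputed once per sweep and each user's factor vector solved from a small dense system touching only that user's observations. A sketch of one user-side alternating-least-squares pass, under the paper's confidence schedule c_ui = 1 + α·r_ui and p_ui = 1 for observed items (the data layout below is an illustrative assumption):

```python
import numpy as np

def als_user_step(Y, user_items, lam, alpha):
    """One ALS pass over users (sketch), scaling with the input size.

    Solves x_u = (Y^T C^u Y + lam*I)^{-1} Y^T C^u p(u) per user, using
    Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y so only watched items are touched.

    Y:          (n items x f) item-factor matrix
    user_items: list, per user, of (item_index, r_ui) pairs
    """
    n, f = Y.shape
    YtY = Y.T @ Y                        # precomputed once per sweep
    X = np.zeros((len(user_items), f))
    for u, items in enumerate(user_items):
        A = YtY + lam * np.eye(f)
        b = np.zeros(f)
        for i, r in items:               # only the user's observed items
            c = 1.0 + alpha * r          # confidence c_ui
            A += (c - 1.0) * np.outer(Y[i], Y[i])
            b += c * Y[i]                # p_ui = 1 for observed items
        X[u] = np.linalg.solve(A, b)
    return X
```

The item-side step is symmetric, with the roles of X and Y exchanged, and alternating the two steps for a few sweeps converges in practice.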
An interesting feature of the algorithm is that it allows
explaining the recommendations to the end user, which is
a rarity among latent factor models. This is achieved by
showing a surprising and hopefully insightful link into the
well known item-oriented neighborhood approach.
The algorithm was implemented and tested as a part of a
large scale TV recommender system. Our design method-
ology strives to find a right balance between the unique
properties of implicit feedback datasets and computational
scalability. We are currently exploring modifications with a
potential to improve accuracy at the expense of increasing
computational complexity. As an example, in our model we
decided to treat all user-item pairs associated with a zero
preference with the same uniform confidence level. Since
the vast majority of pairs is associated with a zero prefer-
ence, this decision saved a lot of computational effort. How-
ever, a more careful analysis would split those zero values
into different confidence levels, perhaps based on availabil-
ity of the item. In our television recommender example, the
fact that a user did not watch a program might mean that
the user was not aware of the show (it is on an ’unusual’
channel or time of day), or that there is another favorite
show on concurrently, or that the user is simply not interested. Each of these corresponds to a different scenario, and
each might warrant a distinctive confidence level in the “no
preference” assumption. This leads us to another possible
extension of the model – adding a dynamic time variable
addressing the tendency of a user to watch TV on certain
times. Likewise, we would like to model that certain pro-
gram genres are more popular in different times of the day.
This is part of an ongoing research, where the main chal-
lenge seems to be how to introduce an added flexibility into
the model while maintaining its good computational scala-
bility.
Finally, we note that the standard training and test setup
is designed to evaluate how well a model can predict fu-
ture user behavior. However, this is not the purpose of a
recommender system, which strives to point users to items
that they might not have otherwise purchased or consumed.
It is difficult to see how to evaluate that objective without
using an in-depth user study and surveying. In our example, we believe that by removing the "easy" cases of re-watched shows from the evaluation, we get somewhat closer to the ideal of capturing user discovery of new shows.
References
[1] G. Adomavicius and A. Tuzhilin, “Towards the Next
Generation of Recommender Systems: A Survey of
the State-of-the-Art and Possible Extensions”, IEEE
Transactions on Knowledge and Data Engineering 17
(2005), 634–749.
[2] R. Bell and Y. Koren, “Scalable Collaborative Fil-
tering with Jointly Derived Neighborhood Interpola-
tion Weights”, IEEE International Conference on Data
Mining (ICDM’07), pp. 43–52, 2007.
[3] R. Bell, Y. Koren and C. Volinsky, “Modeling Relation-
ships at Multiple Scales to Improve Accuracy of Large
Recommender Systems”, Proc. 13th ACM SIGKDD In-
ternational Conference on Knowledge Discovery and
Data Mining, 2007.
[4] J. Bennet and S. Lanning, "The Netflix Prize", KDD Cup and Workshop, 2007. www.netflixprize.com.
[5] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet Al-
location”, Journal of Machine Learning Research 3
(2003), 993–1022.
[6] M. Deshpande, G. Karypis, “Item-based top-N recom-
mendation algorithms”, ACM Trans. Inf. Syst. 22 (2004)
143-177.
[7] S. Funk, “Netflix Update: Try This At Home”,
http://sifter.org/˜simon/journal/20061211.html, 2006.
[8] D. Goldberg, D. Nichols, B. M. Oki and D. Terry, “Us-
ing Collaborative Filtering to Weave an Information
Tapestry”, Communications of the ACM 35 (1992), 61–
70.
[9] J. L. Herlocker, J. A. Konstan, A. Borchers and J. Riedl, "An Algorithmic Framework for Performing Collaborative Filtering", Proc. 22nd ACM SIGIR Conference on Information Retrieval, pp. 230–237, 1999.
[10] J. L. Herlocker, J. A. Konstan, and J. Riedl. “Explain-
ing collaborative filtering recommendations”, In Pro-
ceedings of the 2000 ACM Conference on Computer
Supported Cooperative Work, ACM Press, pp. 241-250,
2000.
[11] T. Hofmann, “Latent Semantic Models for Collabora-
tive Filtering”, ACM Transactions on Information Sys-
tems 22 (2004), 89–115.
[12] Z. Huang, D. Zeng and H. Chen, "A Comparison of Collaborative-Filtering Recommendation Algorithms for E-commerce", IEEE Intelligent Systems 22 (2007), 68–78.
[13] G. Linden, B. Smith and J. York, "Amazon.com Recommendations: Item-to-item Collaborative Filtering", IEEE Internet Computing 7 (2003), 76–80.
[14] D.W. Oard and J. Kim, “Implicit Feedback for Rec-
ommender Systems”, Proc. 5th DELOS Workshop on
Filtering and Collaborative Filtering, pp. 31–36, 1998.
[15] A. Paterek, “Improving Regularized Singular Value
Decomposition for Collaborative Filtering”, Proc. KDD
Cup and Workshop, 2007.
[16] R. Salakhutdinov, A. Mnih and G. Hinton, “Re-
stricted Boltzmann Machines for Collaborative Filter-
ing”, Proc. 24th Annual International Conference on
Machine Learning, pp. 791–798, 2007.
[17] R. Salakhutdinov and A. Mnih, “Probabilistic Matrix
Factorization”, Advances in Neural Information Pro-
cessing Systems 20 (NIPS’07), pp. 1257–1264, 2008.
[18] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl, "Application of Dimensionality Reduction in Recommender System – A Case Study", WEBKDD'2000.
[19] B. Sarwar, G. Karypis, J. Konstan and J. Riedl, “Item-
based Collaborative Filtering Recommendation Algo-
rithms”, Proc. 10th International Conference on the
World Wide Web, pp. 285-295, 2001.
[20] G. Takacs, I. Pilaszy, B. Nemeth and D. Tikk, "Major Components of the Gravity Recommendation System", SIGKDD Explorations 9 (2007), 80–84.