Active Learning Strategies for Rating Elicitation in Collaborative
Filtering: a System-Wide Perspective
MEHDI ELAHI and FRANCESCO RICCI, Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
NEIL RUBENS, University of Electro-Communications, Tokyo, Japan
The accuracy of collaborative-filtering recommender systems largely depends on three factors: the quality of the rating
prediction algorithm, and the quantity and quality of available ratings. While research in the field of recommender systems
often concentrates on improving prediction algorithms, even the best algorithms will fail if they are fed poor quality data
during training, i.e. garbage in, garbage out. Active learning aims to remedy this problem by focusing on obtaining better
quality data that more aptly reflects a user’s preferences. However, traditional evaluation of active learning strategies has two
major flaws, which have significant negative ramifications on accurately evaluating the system’s performance (prediction
error, precision, and quantity of elicited ratings): (1) performance has been evaluated for each user independently (ignoring system-wide improvements); (2) active learning strategies have been evaluated in isolation from unsolicited user ratings (natural acquisition).
In this paper we show that an elicited rating has effects across the system, so a typical user-centric evaluation which
ignores any changes of rating prediction of other users also ignores these cumulative effects, which may be more influential on
the performance of the system as a whole (system-centric). We propose a new evaluation methodology and use it to evaluate
some novel and state of the art rating elicitation strategies. We found that the system-wide effectiveness of a rating elicitation
strategy depends on the stage of the rating elicitation process, and on the evaluation measures (MAE, NDCG, and Precision).
In particular, we show that using some common user-centric strategies may actually degrade the overall performance of
a system. Finally, we show that the performance of many common active learning strategies changes significantly when
evaluated concurrently with the natural acquisition of ratings in recommender systems.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Algorithms, Experimentation, Performance
Additional Key Words and Phrases: Recommender Systems, Active Learning, Rating Elicitation, Cold Start
1. INTRODUCTION
Choosing the right product to consume or purchase is nowadays a challenging problem due to
the growing number of products and variety of eCommerce services. While an increasing number of choices provides an opportunity for a consumer to acquire the products satisfying her special personal needs, it may at the same time overwhelm her by providing too many choices [Anderson 2006]. Recommender Systems (RSs) tackle this problem by providing personalized suggestions for
digital content, products or services, that better match the user’s needs and constraints than the
mainstream products [Resnick and Varian 1997; Ricci et al. 2011; Jannach et al. 2010]. In this paper
we focus on the collaborative filtering recommendation approach, and on techniques that are aimed
at identifying what information about the user's tastes should be elicited by the system to generate
effective recommendations.
1.1. Recommender Systems and Social Networks
A collaborative filtering (CF) recommender system uses ratings for items provided by a network
of users to recommend items, which the target user has not yet considered but will likely enjoy
[Koren and Bell 2011; Desrosiers and Karypis 2011]. A collaborative filtering system computes
its recommendations by exploiting relationships and similarities between users. These relations are
defined by observing the users' access to the items managed by the system. For instance, consider the users actively using Amazon or Last.fm. They browse and search for items to buy or to listen to. They read other users' comments and get recommendations computed by the system using the access logs and the ratings of the users' community.
The recommendations computed by a CF system are based on the network structure created by
the users accessing the system. For instance, classical neighbor-based CF systems evaluate user-to-
user or item-to-item similarities based on the co-rating structure of the users or items. Two users
are considered similar if they rate items in a correlated way. Analogously, two items are considered similar if the network of users has rated them in a correlated way. The recommendations for an active user are then computed by suggesting the items that have a high average rating in the group of users similar to the active one (user-based CF). In item-based approaches, a user is recommended items that are similar to those she liked in the past, where similar means that users rated the two items similarly. Even the more novel matrix factorization recommendation techniques [Koren and Bell 2011], which are considered in this paper, model users and items with a vector of abstract
factors that are learned by mining the rating behavior of a network of users. In these factor models,
in addition to the similarity of users with users and items with items, it is also possible to establish
the similarity of users with items, since both of them are represented uniformly with a vector of
factors.
Recommender systems are often integrated into eCommerce applications to suggest items to buy
or consume, but RSs are now also frequently used in social networks, i.e., applications primarily
designed to support social interactions between their users. Moreover, in many popular social net-
works such as Facebook, Myspace, and Google Plus, several applications have been introduced for eliciting user preferences and characteristics, and then providing recommendations. For instance,
in Facebook, the largest social network, there are applications where users can rate friends, movies,
photos, or links. Examples of such applications are Rate My Friends (with more than 6,000 monthly
active users), Rate my Photo, My Movie Rating, and LinkR. Some of these applications collect user
preferences (ratings) only to create a user profile page. However, some of them use the data also
to make recommendations to the users. For instance using Rate My Friends, the user is requested
to rate her friends. Then the application ranks her friends based on the ratings and presents the top
scored users that she may be interested in connecting to. LinkR, another example of recommender
systems integrated into a social network, recommends a number of links and allows the user to rate
them. Moreover, Facebook itself also collects ratings by offering a "like" button on partner sites
and exploits its usage to discover which friends, groups, apps, links, or games a particular user may
like.
It is worth noting that all of the applications mentioned above must implement a rating elicitation
strategy, i.e., identify items to present to the user in order to collect her ratings. In this paper we
propose and evaluate some strategies for accomplishing this task. Hence, social networks can benefit from the techniques introduced here to generate better recommendations for establishing new social relationships, thus improving their core service.
1.2. Collaborative Filtering and Rating Acquisition
The CF rating prediction accuracy depends on the characteristics of the prediction algorithm. Hence, in recent years several variants of CF algorithms have been proposed; [Koren and Bell 2011; Desrosiers and Karypis 2011] provide an up-to-date survey of memory- and model-based methods.
In addition to the rating prediction technique, the number, the distribution, and the quality of the ratings known by the system can influence the system's performance. In general, the more informative the available ratings are about the users' preferences, the higher the recommendation accuracy. Therefore, it is important to keep acquiring new and useful ratings from the users, in order to
maintain or improve the quality of the recommendations. This is especially true for the cold start
stage, where a new user or a new item is added to the system [Schein et al. 2002; Liu et al. 2011;
Zhou et al. 2011; Golbandi et al. 2011].
It is worth noting that RSs usually deal with huge catalogues, e.g., Netflix, the popular American
provider of on-demand Internet streaming media, manages almost one million movies. Hence, if the
recommender system wants to explicitly ask the user to rate some items, this set must be carefully
chosen. First of all, the system should ask for ratings of items that the user has experienced, otherwise
no useful information can be acquired. This is not easy, especially when the user is new to the system
and there is not much knowledge that can be leveraged to predict what items the user actually
experienced in the past. Additionally, the system should exploit techniques to identify those items that, if rated by the user, would generate ratings data that improve the precision of future
recommendations, not only for the target user but for all of the system's users. Informative ratings can provide additional knowledge about the preferences of the users, as well as fix errors in the rating prediction model.
1.3. Approach and Goals of this Paper
In this work we focus on understanding the behavior of several ratings acquisition strategies, such
as “provide your ratings for these top ten movies”. The goal of a rating acquisition strategy is to
enlarge the set of available data in the optimal way for the whole system performance by eliciting
the most useful ratings from each user. In practice, an RS user interface can be designed so that
users browsing the existing items can rate them if they wish. But, new ratings can also be acquired
by explicitly asking users. In fact, some RSs ask the users to rate the recommended items: mixing
recommendations with users’ preference elicitation. We will show that this approach is feasible but
it must be used with care, since relying on just one single strategy, such as asking the user's opinion
only for the items that the system believes the user likes, has a potentially dangerous impact on the
system effectiveness. Hence a careful selection of the elicitation strategy is in order.
In this paper we extend our previous work ([Elahi et al. 2011; Elahi et al. 2011]) where we pro-
vided an initial evaluation of active learning rating elicitation strategies in collaborative filtering.
In this paper, in addition to the “pure” strategies i.e., those implementing a single heuristic, we
also consider “partially randomized” ones. Randomized strategies, in addition to asking (simulated)
users to rate the items selected by a “pure” strategy, also ask to rate some randomly selected items.
Randomized strategies can diversify the item list presented to the user. But, more importantly, randomized strategies make it possible to cope with the non-monotonic behavior of the system effectiveness that we observed during the simulation of certain "pure" strategies. In fact, we discovered (as hypothesized by [Rashid et al. 2002]) that certain strategies, for instance, requesting to rate the items with the highest predicted ratings, may generate a system-wide bias and inadvertently increase the system error.
RSs can be evaluated online and offline [Herlocker et al. 2004; Shani and Gunawardana 2010;
Cremonesi et al. 2010]. In the first case, one or more RSs are run and experiments on real users are
performed. This requires building or accessing one (or more) fully developed RSs with a large user community, which is expensive and time consuming. Moreover, it is hard to test several algorithms online, such as those proposed here. Therefore, similarly to many previous experimental analyses,
we performed offline experiments. We developed a program which simulates the real process of
rating elicitation in a community of users (Movielens and Netflix), the consequent rating database
growth starting from a relatively small one (cold-start), and the system adaptation (retraining) to the
new set of data. Moreover, in this paper we evaluate the proposed strategies in two scenarios: when
the simulated users are confined to rate only items that are presented to them by the active learning
strategy or when they can voluntarily add ratings on their own.
In the experiments performed here we used a state of the art Matrix Factorization rating prediction
algorithm [Koren and Bell 2011; Timely Development 2008]. Hence our results can provide useful
guidelines for managing real RSs that nowadays often rely on this technique. In factor models both users and items are assigned factor vectors of the same size. Those vectors are obtained from the user ratings matrix with optimization techniques that try to approximate the original rating
matrix. Each element of the factor vector assigned to an item reflects how well the item represents a
particular latent aspect [Koren and Bell 2011]. For our experiments we employed a gradient descent
optimization technique as proposed by Simon Funk [Funk 2006].
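To make the factor model setting concrete, the following is a minimal sketch of a Funk-style matrix factorization trained by stochastic gradient descent, in the spirit of the optimization proposed by Simon Funk [Funk 2006]; the function names, learning rate, and regularization weight are illustrative assumptions, not the exact configuration used in the experiments reported later (the default of 16 factors simply mirrors the setting described in Section 4).

```python
import numpy as np

def train_funk_svd(ratings, n_users, n_items, n_factors=16,
                   lr=0.005, reg=0.02, n_epochs=30, seed=0):
    """Gradient-descent matrix factorization in the style of [Funk 2006] (sketch).

    ratings: list of (user_index, item_index, rating) triples, i.e. the
    not-null entries of the known-rating matrix.
    Returns the user-factor matrix P and the item-factor matrix Q.
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, n_factors))
    Q = 0.1 * rng.standard_normal((n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                    # error on a known rating
            pu = P[u].copy()                         # snapshot before the coupled updates
            P[u] += lr * (err * Q[i] - reg * P[u])   # regularized SGD update for user factors
            Q[i] += lr * (err * pu - reg * Q[i])     # ... and for item factors
    return P, Q

def predict_rating(P, Q, u, i):
    # The dot product of the two factor vectors is the predicted rating.
    return float(P[u] @ Q[i])
```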
1.4. Paper Contribution
The main contribution of our research is the introduction and the empirical evaluation of a set of rat-
ing elicitation strategies for collaborative filtering with respect to their system-wide utility. Some of
these strategies are new and some come from the literature and the common practice. An important
differentiating aspect of our study is measuring the effect of each strategy on several RS evaluation measures and showing that the best strategy depends on the evaluation measure. Previous works fo-
cussed only on the rating prediction accuracy (Mean Absolute Error), and on the number of acquired
ratings. We analyze those aspects, but in addition we consider the recommendation precision, and
the goodness of the recommendations’ ranking, measured with normalized discounted cumulative
gain (NDCG). These measures are crucial for determining the value of the recommendations [Shani
and Gunawardana 2010].
Moreover, another major contribution of our work is the analysis of the performance of the elic-
itation strategies taking into account the size of the rating database. We show that different strate-
gies can improve different aspects of the recommendation quality at different stages of the rating
database development. We show that in some stages elicitation strategies may induce a bias on the
system and ultimately result in a decrease of the recommendation accuracy.
In summary, this paper provides a realistic, comprehensive evaluation of several applicable rating
elicitation strategies, providing guidelines and conclusions that could help with their deployment in
real RSs.
1.5. Novelty of the Proposed Approach
Rating elicitation has been also tackled in a few previous works [McNee et al. 2003; Rashid et al.
2002; Rashid et al. 2008; Carenini et al. 2003; Jin and Si 2004; Harpale and Yang 2008] that will be
surveyed in Section 2. But these papers focused on a problem that is different from what we consider
here. Namely, they measured the benefit of the rating elicited from one user, e.g., in the sign up stage,
for improving the quality of the recommendations for that user. Conversely, we consider the impact
of an elicitation strategy on the overall system behavior, e.g., the prediction accuracy averaged on
all the system’s users. In other words, we try to identify strategies that can elicit from a user ratings
that will contribute to the improvement of the system performance for all of the users, and not just
for the target user.
Previously conducted evaluations have assumed rather artificial conditions, i.e., all the users and items have some ratings from the beginning of the evaluation process, and the system asks the simulated users only for ratings that are present in the data set. In other words, previous studies did
not consider the new-item and the new-user problem. Moreover, only a few evaluations simulated
users with limited knowledge about the items (e.g. [Harpale and Yang 2008]). We generate initial
conditions for the rating data set based on the temporal evolution of the system, hence, in our
experiments, new users and new items are present in a similar manner as in real settings. Moreover,
the system does not know what items the simulated user has experienced, and may ask ratings for
items that the user will not be able to provide. This better simulates a realistic scenario where not
all rating requests can be satisfied by a user.
It is also important to note that previous analyses considered the situation where the active learn-
ing rating elicitation strategy was the only tool used to collect new ratings from the users. Hence,
elicitation strategies were evaluated in isolation from ongoing system usage, where users can freely
enter new ratings. We propose a more realistic evaluation setting, where in addition to the ratings
acquired by the elicitation strategies, ratings are also added by users on a voluntary basis. Hence,
for the validation experiments, we have also utilized a simulation process in which active learning
is combined with the natural acquisition of the users’ ratings.
The rest of the paper is structured as follows. In section 2 we review related works. In section 3
we introduce the rating elicitation strategies that we have analyzed. In section 4 we present the first
simulation procedure that we designed to more accurately evaluate the system’s recommendation
performance (MAE, NDCG, and Precision). The results of our experiments are shown in section 5
and 6. Then in Section 7 we present the analysis of the active learning strategies when active learning
is mixed with the natural acquisition of the user ratings. Finally in section 8 we summarize the results
of this research and outline directions for future work.
2. RELATED WORK
Active learning in RS aims at actively acquiring user preference data to improve the output of the RS
[Boutilier et al. 2003; Rubens et al. 2011]. Active learning for RS is a form of preference elicitation
[Bonilla et al. 2010; Pu and Chen 2008; Chen and Pu 2012; Braziunas and Boutilier 2010; Guo
and Sanner 2010; Birlutiu et al. 2012], but the current research on active learning for recommender
systems has focussed on collaborative filtering, and in particular on the new user problem. In this setting, it is assumed that a user has not rated any items, and the system is able to actively ask the user to rate some items in order to generate recommendations for that user. In this survey we will focus
on AL in collaborative filtering.
In many previous works, which we will describe below, the evaluation of a rating elicitation
strategy is performed by simulating the interaction with a new user while the system itself is not in
a cold start stage, i.e., it has already acquired many ratings from users.
Conversely, as we mentioned in the Introduction, in our work we simulate the application of
several rating elicitation strategies in a more diverse set of scenarios, beyond the typical setting in which the new user has not rated any items while the system already possesses many ratings provided by other users. We consider a more general scenario where the user repeatedly comes back to the
system for receiving recommendations, i.e., while the system has possibly elicited ratings from other
users. Moreover, we simulate a scenario where the system has initially a small overall knowledge
of the users’ preferences, i.e., has a small set of ratings to train the prediction model. Then, step
by step, as the users come to the system new ratings are elicited. Another important difference,
compared to the state of the art, is that we consider the impact of an elicitation strategy on the
overall system behavior. This aims to measure how the ratings elicited from one user can contribute
to the improvement of the system performance even when making recommendations for other users.
Finally, we have also investigated a more realistic evaluation scenario where active learning is
combined with natural addition of the ratings, i.e., some ratings are freely added by the users without
being requested. This scenario has not been considered in previous evaluations.
2.1. Rating Elicitation at Sign Up
The first research works in AL for recommender systems were motivated by the need to implement
more effective sign up processes and used the classical neighbor-based approaches to collaborative
filtering [Desrosiers and Karypis 2011]. In [Rashid et al. 2002] the authors focus explicitly on the
sign up process, i.e., when a new user starts using a collaborative filtering recommender system
and must rate some items in order to provide to the system some initial information about her
preferences. [Rashid et al. 2002] considered six techniques for explicitly determining the items to
ask a user to rate: entropy, where items with the largest rating entropy are preferred; random request;
popularity, which is measured as the number of ratings for an item, and hence the most frequently
rated items are selected; log(popularity) ∗ entropy where items that are both popular and have
diverse ratings are selected; and finally item-item personalized, where random items are proposed
until the user rates one. Then, a recommender is used to predict what items the user is likely to
have seen, based on the ratings already provided by the user. The user is then asked to rate these predicted items. Finally, the behavior of an item-to-item collaborative filtering system [Desrosiers and Karypis 2011] was evaluated with respect to MAE under an offline setting that simulated the
sign up process. The process was repeated multiple times and averaged for all the test users. In that
scenario the log(popularity) ∗ entropy strategy was found to be the best. For this reason we have
also evaluated log(popularity) ∗ entropy in our study. But it is worth noting that their result cannot be automatically extended to the scenario that we consider in this work, that is, the evolution of
the global system performance under the application of an active learning strategy applied to all the
users. In fact, as we mentioned above, in our experiments we simulate the simultaneous acquisition
of ratings from all the users, by asking each user in turn to rate some items, and we repeat this
process several times. This simulates the long-term usage of a recommender system, where users utilize the system repeatedly to get new recommendations, and the ratings provided by a user are also used to generate better recommendations for other users (system performance).
2.2. Conversational Approaches
Subsequently, researchers understood that in order to generate more effective rating elicitation
strategies the system should be conversational: it should better motivate the rating requests, fo-
cusing on the user preferences, and the user should be able to more freely enter her ratings, even
without being explicitly requested.
In [Carenini et al. 2003], a user-focussed approach is considered. They propose a set of tech-
niques to intelligently select items to rate when the user is particularly motivated to provide such
information. They present a conversational and collaborative interaction model that elicits ratings so
that the benefit of doing so is clear to the user, thus increasing the motivation to provide a rating.
Item-focused techniques that elicit ratings to improve the rating prediction for a specific item are
also proposed. Popularity, entropy and their combination are tested, as well as their item focused
modifications. The item focused techniques are different from the classical ones in that popularity
and entropy are not computed on the whole rating matrix, but only on the matrix of user’s neighbors
that have rated an item for which the prediction accuracy is being improved. Results have shown
that item-focused strategies are consistently better than unfocused ones.
[McNee et al. 2003] addresses an even more general problem, aiming at understanding which,
among the following methods, is the best solution for rating elicitation in the start up phase: a)
allowing a user to enter the items and their ratings freely, b) proposing to a user a list of items
and asking her to rate them, or c) combining the two approaches. They compare three interfaces
for eliciting information from new users that implement the above mentioned approaches. They
performed an online experiment, which shows that the two pure approaches produced more accurate
user models than the mixed model with respect to MAE.
2.3. Bayesian Approaches
In another group of approaches AL is modeled as a Bayesian reasoning process. [Harpale and Yang
2008] developed such an approach extending and criticizing a previous one introduced in [Jin and
Si 2004]. In fact, in [Jin and Si 2004], as it is rather common in most AL techniques and evaluation
studies, the unrealistic assumption that a user can provide a rating for any presented item is made. Conversely, they propose a revised Bayesian item selection approach, which does not make such an assumption, and introduces an estimate of the probability that a user has consumed an item in the
past and is able to provide a rating. Their results show that the personalized Bayesian selection
outperforms Bayesian selection and the random strategy with respect to MAE. Their simulation
setting is similar to that used in [Rashid et al. 2002], hence for the same reason their results are not
directly comparable with ours. There are other important differences between their experiment and
ours: their strategies elicit only one rating per request, while we assume that the system makes many
rating requests at the same time; they compare the proposed approach only with the random strategy,
while we study the performance of several strategies; they do not consider the new user problem,
since in their simulations all the users have at least three ratings at the beginning of the experiment,
whereas in our experiments, there are users that have no ratings at all in the initial stage of the
experiment; they use a different rating prediction algorithm (Bayesian vs. Matrix Factorization). All
these differences make the two sets of experiments, and their conclusions, hard to compare. Moreover,
in their simulations they assume that the system has a larger number of known ratings than in our
experiments.
2.4. Decision Trees Based Methods
Many recent approaches to rating elicitation in RSs identify the items to ask the user to rate
as those providing the most useful knowledge for reducing the prediction error of the recommender
system. Many of these approaches exploit decision trees to model the conditional selection of an
item to be rated, with regard to the ratings provided by the user for the items presented previously.
In [Rashid et al. 2008] the authors extend their former work [Rashid et al. 2002] using a rating
elicitation approach based on the usage of decision trees. The proposed technique is called IGCN,
and builds a tree where each node is labelled with a particular item that the user is asked to rate. According to the user's rating for that item, a different branch is followed, and a new node, labelled with another item to rate, is determined. In order to build this decision tree, they first
cluster the users into groups of similar profiles, assigning each user to one of
these clusters. The tree is incrementally extended by selecting for each node the item that provides
the highest information gain for correctly classifying the user in the right cluster. Hence, the items
whose ratings are more important to correctly classify the users in the right cluster are selected
earlier in the tree. They also considered two alternative strategies. The first one is entropy0, which differs from the more classical entropy strategy, mentioned above, in that the missing value is considered as a possible rating (category 0). The second one is called HELF, where the items with the largest value of the harmonic mean of the entropy and popularity are selected. They
have conducted offline and online simulations, and concluded that IGCN and entropy0 perform
the best with respect to MAE.
They evaluate the improvement of the rating prediction accuracy for the particular user whose
ratings are elicited, while, as we mentioned above, we measure the overall system effectiveness of a
rating elicitation strategy. Moreover, in their experiments they use a very different rating prediction
algorithm, i.e., a standard neighbor-based approach [Desrosiers and Karypis 2011], while we use
matrix factorization [Koren and Bell 2011].
In a more recent work [Golbandi et al. 2010] three strategies for rating elicitation in collaborative
filtering are proposed. For the first method, GreedyExtend, the items that minimize the root mean
square error (RMSE) of the rating prediction (on the training set) are selected. For the Var method the items with the largest √popularity ∗ variance are selected, i.e., items that have many ratings in the training set and with diverse values. Finally, for the Coverage method, the items with the largest
coverage are selected. They defined coverage for an item as the total number of users who co-
rated both the selected item and any other item. They evaluated the performance of these strategies
and compared them with previously proposed ones (popularity, entropy, entropy0, HELF, and
random). In their experiments, every strategy ranks and picks the top 200 items to be presented
to new users. Then, considering the ratings of the users for these items as the training set, they predict the ratings in the Netflix test set for every single user and compute the RMSE. They show that GreedyExtend outperforms the other strategies. In fact, this strategy is quite effective, as it obtains, after having acquired just 10 ratings, the same error rate that the second best strategy (Var) achieves after 26 ratings. However, despite this remarkable achievement, GreedyExtend is static, i.e., it selects the
items without considering the ratings previously entered by the user. Here too the authors focus on the new user problem. In our work we do not make such an assumption, and propose and evaluate strategies that can be used in all stages, and not only at the start up stage.
Even more recently, in [Golbandi et al. 2011] the same authors of the paper described above have developed an adaptive version of their approach. Here, the items selected for a user depend
on the previous ratings she has provided. They also propose a technique based on decision trees
where at each node there is a test based on a particular item (movie). The node divides the users
into three groups based on the rating of the user for that movie: lovers, who rated the movie high;
haters, who rated the movie low; and unknowns, who did not rate the movie. In order to build the
decision tree, at each node the movie whose rating knowledge produces the largest reduction of the RMSE is selected. The rating prediction is computed (approximated) as the weighted average
of the ratings given by the users that belong to that node. They have evaluated their approach using
the Netflix training data set (100M ratings) to construct the trees, and evaluated the performance
of the proposed strategy on the Netflix test set (2.8M ratings). The proposed strategy has shown a
significant reduction of RMSE compared with the GreedyExtend, Var, and HELF strategies. They were able to achieve with only 6 ratings the same accuracy that is achieved by the next best strategy, i.e., GreedyExtend, after acquiring over 20 ratings. Moreover, that accuracy is obtained by the Var and HELF strategies after acquiring more than 30 ratings.
It should be noted that their results are again rather difficult to compare with ours. They simulate
a scenario where the system is trained and the decision tree is constructed from a large training
dataset. So they assume a large initial knowledge of the system. Then, they focus on completely
new users, i.e., those without a single rating in the train set. In contrast, in our work, we assume that
the system has a very limited global knowledge of the users. In our experiments this is simulated by
giving the system only 2% of the rating dataset. Moreover, we analyze the system dynamics as
more users are repeatedly requested to enter their ratings.
2.5. Time-Dependent Evolution of a Recommender System
Finally we want to mention an interesting and related work that is not addressing the active learning
process of rating elicitation but is studying the time dependent evolution of a recommender system
as new ratings are acquired. In [Burke 2010] the authors analyze the temporal properties of a stan-
dard user-based collaborative filtering [Herlocker et al. 1999] and Influence Limiter [Resnick and
Sami 2007], a collaborative filtering algorithm developed for counteracting profile injection attacks
by considering the time at which a user has rated an item.
They evaluate the accuracy of these two prediction algorithms while the users are rating items and
the database is growing. This is radically different from the typical evaluations that we mentioned
above, where the rating dataset is decomposed into the training and testing sets without considering
the timestamp of the ratings. In [Burke 2010] it is argued that considering the time at which the
ratings were added to the system gives a better picture of the real user experience during the in-
teractions with the system in terms of recommendation accuracy. They conducted their analysis on
the large Movielens dataset (1M ratings), and discovered that, while using Influence Limiter, MAE does not decrease with the addition of more data, indicating that the algorithm is not effective in terms of
accuracy improvement. For the standard user-based collaborative filtering algorithm they observed
the presence of two time segments: the start-up period, until day 70, with MAE dropping gradually, and the remaining period, where MAE was dropping much more slowly.
This analysis is complementary to our study. This work analyzes the performance of a recom-
mendation algorithm while the users are adding their ratings in a natural manner, i.e., without being
explicitly requested to rate items selected by an active learning strategy. We have investigated the
situation where, in addition to this natural stream of ratings coming from the users, the system selectively chooses additional items and presents them to the users to get their ratings.
3. ELICITATION STRATEGIES
A rating dataset R is an n × m matrix of real values (ratings) with possible null entries. The variable r_ui denotes the entry of the matrix in position (u, i), and contains the rating assigned by user u to item i. r_ui could store a null value, representing the fact that the system does not know the opinion of the user on that item. In the Movielens and Netflix datasets the rating values are integers between 1 and 5 (inclusive).
A rating elicitation strategy S is a function S(u, N, K, U_u) = L which returns a list of items L = {i_1, . . . , i_M}, M ≤ N, whose ratings should be elicited from the user u, where N is the maximum number of items that the strategy should return, and K is the dataset of known ratings, i.e., the ratings (of all the users) that have already been acquired by the RS. K is also an n × m matrix containing entries with real or null values. The not null entries represent the knowledge of the system at a certain point of the RS evolution. Finally, U_u is the set of items whose ratings have not yet been elicited from u, and which are hence potentially interesting. The elicitation strategy enforces that L ⊂ U_u and will not repeatedly ask a user to rate the same item; i.e., after the items in L are shown to a user they are removed from U_u.
Every elicitation strategy analyzes the dataset of known ratings K and scores the items in U_u. If the strategy can score at least N different items, then the N items with the highest score are returned. Otherwise a smaller number of items M ≤ N is returned. It is important to note that the user may not have experienced the items whose ratings are requested; in this case the system will not increase the number of known ratings. In practice, following one strategy may result in collecting a larger number of ratings, while following another one may result in fewer but more informative ratings. These two properties (rating quantity & quality) play a fundamental role in rating elicitation.
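As an illustration of this formalization, a rating matrix with null entries can be represented using NaN values, and a strategy can be implemented as a scoring function over the candidate items in U_u followed by a top-N selection. The following is a hypothetical sketch of such an interface, not the code used in our experiments.

```python
import numpy as np

def elicit(score_fn, u, N, K, U_u):
    """Generic elicitation strategy S(u, N, K, U_u) -> L (sketch).

    K:    n x m matrix of known ratings, with np.nan marking null entries.
    U_u:  set of item indices whose ratings have not yet been requested from u.
    score_fn(K, u, i) -> float, or np.nan if the item cannot be scored.
    Returns a list of at most N items with the highest scores.
    """
    scored = [(score_fn(K, u, i), i) for i in U_u]
    scored = [(s, i) for s, i in scored if not np.isnan(s)]  # drop unscorable items
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:N]]
```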
3.1. Individual Strategies
We considered two types of strategies: pure and partially randomized. The first ones implement
a unique heuristic, whereas the second type hybridizes a pure strategy by adding some random requests for ratings that are still unknown to the system. As we mentioned in the introduction, these strategies add some diversity to the system requests and, as we will show later, can cope with an observed problem of the pure strategies, which may in some cases increase the system error.
The pure strategies that we have considered are:
— Popularity: for all the users the score for item i ∈ U_u is equal to the number of not null ratings for i contained in K, i.e., the number of known ratings for the item i. This strategy will rank the items according to the popularity score and then will select the top N items. Note that this strategy is not personalized, i.e., the same N items are proposed to be rated by any user. The rationale of this strategy is that more popular items are more likely to be known by the user, and hence it is more likely that a request for such a rating will increase the size of the rating database.
— log(popularity) * entropy: the score for the item i ∈ U_u is computed by multiplying the logarithm of the popularity of i with the entropy of the ratings for i in K. Also in this case, as for any strategy, the top N items according to the computed score are proposed to be rated by the user. This strategy tries to combine the effect of the popularity score, which is discussed above, with the heuristics that favors items with more diverse ratings (larger entropy), which may provide more useful (discriminative) information about the user's preferences [Carenini et al. 2003; Rashid et al. 2002].
— Binary Prediction: the matrix K is transformed into a matrix B with the same number of rows and columns, by mapping null entries in K to 0 and not null entries to 1. Hence, the matrix B models only whether a user rated (b_ui = 1) or not (b_ui = 0) an item, regardless of its value [Koren 2008]. A factor model, similar to what is done for standard rating prediction, is built using the matrix B as training data, to compute the predictions for the entries in B that are 0 [Koren and Bell 2011]. In this case the predictions are numbers between 0 and 1. The larger the predicted value for the entry b_ui, the larger the predicted probability that the user u has consumed the item i, and hence may be able to rate it. Finally, the score for the item i ∈ U_u corresponds to the predicted value for b_ui. Hence, by selecting the top N items with the highest score this strategy tries to select the items that the user has most likely experienced, in order to maximize the likelihood that the user can provide the requested rating. In that sense it is similar to the popularity strategy, but it tries to make a better prediction of what items the user can rate by exploiting the knowledge of the items the user has rated in the past. Note that the better the predictions of b_ui for the items in U_u, the larger the number of ratings that this strategy can acquire.
— Highest Predicted: a rating prediction r̂_ui, based on the ratings in K, is computed for all the items i ∈ U_u, and the score for i is this predicted value r̂_ui. Then, the top N items according to this score are selected. The idea is that the items with the highest predicted ratings are supposed to be the items that the user likes the most. Hence, it could also be more likely that the user has experienced these items. Moreover, their ratings could also reveal important information on what the user likes. We also note that this is the default strategy for RSs, i.e., enabling the user to rate the recommendations.
— Lowest Predicted: uses the opposite heuristic compared to highest predicted: for all the items i ∈ U_u the prediction r̂_ui is computed, but then the score for i is Max_r − r̂_ui, where Max_r is the maximum rating value (e.g., 5). This ensures that the items with the lowest predicted ratings r̂_ui will get the highest score and therefore will be selected for elicitation. Lowest predicted items are likely to reveal what the user dislikes, but are also likely to elicit few ratings, since users tend not to rate items that they do not like, as reflected by the distributions of the ratings voluntarily provided by the users [Marlin et al. 2011].
— Highest and Lowest Predicted: for all the items i ∈ U_u a prediction r̂_ui is computed. The score for an item is |(Max_r + Min_r)/2 − r̂_ui|, where Min_r is the minimum rating value (e.g., 1). This score is simply the distance of the predicted rating of i from the midpoint of the rating scale. Hence, this strategy selects the items with extreme ratings, i.e., the items that the user either hates or loves.
— Random: the score for an item i ∈ U_u is a random integer from 1 to 5. Hence the top N items in U_u according to this score are simply randomly chosen. This is a baseline strategy, used for comparison.
— Variance: the score for the item i ∈ U_u is equal to the variance of its ratings in the dataset K. Hence this strategy selects the items in U_u that have been rated in a more diverse way by the users. This is a representative of the strategies that try to collect more useful ratings, assuming that the opinion of the user on items with more diverse ratings is more useful for the generation of correct recommendations.
— Voting: the score for the item i is the number of votes given by a committee of strategies comprising popularity, variance, entropy, highest-lowest predicted, binary prediction, and random. Each of these strategies produces its top 10 candidates for rating elicitation, and then the items appearing more often in these lists are selected. The behavior of this strategy clearly depends on the selected committee of strategies. We have also included the random strategy so as to impose an exploratory behavior.
Finally, we would like to note that we have also evaluated other strategies, namely entropy and log(pop) ∗ variance. But, since their observed behaviors are very similar to those of some of the previously mentioned strategies, we did not include them.
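To illustrate how a few of the pure strategies above map onto the interface sketched at the beginning of Section 3, here are some possible scoring functions over the known-rating matrix K (with np.nan marking null entries); the prediction-based scorers are written as factories that close over a trained predictor predict(u, i). All names and details are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def popularity_score(K, u, i):
    # Number of not-null ratings for item i in K.
    return float(np.count_nonzero(~np.isnan(K[:, i])))

def log_pop_entropy_score(K, u, i):
    col = K[:, i][~np.isnan(K[:, i])]
    if col.size == 0:
        return float("nan")                        # item cannot be scored
    _, counts = np.unique(col, return_counts=True)
    p = counts / counts.sum()
    entropy = float(-(p * np.log2(p)).sum())       # entropy of the observed ratings
    return np.log(col.size) * entropy

def variance_score(K, u, i):
    col = K[:, i][~np.isnan(K[:, i])]
    return float(col.var()) if col.size > 0 else float("nan")

def make_highest_predicted(predict):
    # predict(u, i) -> estimated rating from a model trained on K.
    return lambda K, u, i: predict(u, i)

def make_lowest_predicted(predict, max_r=5):
    return lambda K, u, i: max_r - predict(u, i)

def make_highest_lowest_predicted(predict, max_r=5, min_r=1):
    # Distance of the prediction from the midpoint of the rating scale.
    return lambda K, u, i: abs((max_r + min_r) / 2 - predict(u, i))
```

With the elicit helper from the earlier sketch, a call such as elicit(popularity_score, u, 10, K, U_u) would return the ten most popular items not yet requested from user u.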
3.2. Partially Randomized Strategies
A pure strategy may not be able to return the requested number of items. For instance, there are
cases where no rating predictions can be computed by the RS for the user u. This happens for
instance when u is a new user and none of his ratings are known. Hence, in this situation the highest
predicted strategy is not able to score any of the items. In this case the randomized version of the
strategy can generate purely random items for the user to rate.
A partially randomized strategy modifies the list of items returned by a pure strategy by introducing some random items. As we mentioned in the introduction, the partially randomized strategies have been introduced to cope with some problems of the pure strategies (see section 5). More precisely, the randomized version Ran of the strategy S with randomness p ∈ [0, 1] is a function Ran(S(u, N, K, U_u), p) returning a new list of items L′ computed as follows:
(1) L = S(u, N, K, U_u) is obtained.
(2) If L is an empty list, i.e., the strategy S for some reason could not generate the elicitation list, then L′ is computed by taking N random items from U_u.
(3) If |L| < N, then L′ = L ∪ {i_1, . . . , i_{N−|L|}}, where i_j is a random item in U_u.
(4) If |L| = N, then L′ = {l_1, . . . , l_M, i_{M+1}, . . . , i_N}, where l_j is a random item in L, M = ⌈N ∗ (1 − p)⌉, and i_j is a random item in U_u.
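The following is a minimal sketch of the partial randomization defined above, assuming a strategy with the interface sketched earlier in this section; random.sample and the min(...) guards are implementation conveniences, not part of the formal definition.

```python
import math
import random

def randomized(strategy, u, N, K, U_u, p):
    """Partially randomized strategy Ran(S(u, N, K, U_u), p) (sketch)."""
    L = strategy(u, N, K, U_u)
    others = list(set(U_u) - set(L))              # random candidates outside L
    if not L:                                     # (2) pure strategy returned nothing
        return random.sample(others, min(N, len(others)))
    if len(L) < N:                                # (3) pad the list with random items
        return L + random.sample(others, min(N - len(L), len(others)))
    M = math.ceil(N * (1 - p))                    # (4) keep M items from L ...
    kept = random.sample(L, M)
    return kept + random.sample(others, min(N - M, len(others)))  # ... add N - M random ones
```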
4. EVALUATION APPROACH
In order to study the effects of the considered elicitation strategies we set up the following simulation
procedure. The goal is to simulate the influence of elicitation strategies on the evolution of a RS’s
performance. To achieve this, we partition all the available (not null) ratings in R into three different
matrices with the same number of rows and columns as R:
— K: contains the ratings that are considered to be known by the system at a certain point in time.
— X: contains the ratings that are considered to be known by the users but not by the system. These
ratings are incrementally elicited, i.e., they are transferred into K if the system asks for them
from the (simulated) users.
— T : contains a portion of the ratings that are known by the users but are withheld from X for
evaluating the elicitation strategies, i.e., to estimate the evaluation measures (defined later).
In Figure 1 (b) we illustrate graphically how the partition of the available ratings in a data set could look. As defined in the previous section, U_u is the set of items whose ratings are not known to the system and may therefore be selected by the elicitation strategies. That means that k_ui has a null value and the system has not yet asked u for it. In Figure 1 (b) these are the items that for a certain user have ratings that are not marked with grey boxes. In this setting, a request to rate an item, which is identified by a strategy S, may end up with a new (not null) rating k_ui inserted in K, if the user has experienced the item i, i.e., if x_ui is not null, or in no action, if x_ui has a null value in the matrix X. The first case corresponds to the situation where the item is marked with a black box for user u in Figure 1 (b). In any case, the system will remove the item i from U_u, so as to avoid asking to rate the same item again.

Fig. 1. Comparison of the ratings data configurations used for evaluating user-centered (a) and system-centered (b) active learning strategies. [Figure: users × items rating matrices; the legend distinguishes ratings known by the system (K), ratings known by the user but not by the system (X), which the system can elicit, and ratings used for evaluation (T); rating values are proportional to marker size.]
We will discuss later how the simulation is initialized, i.e., how the matrices K, X and T are built from the full rating dataset R. In any case, these three matrices partition the full dataset R: if r_ui has a not null value, then either k_ui or x_ui or t_ui is assigned that value, i.e., only one of these entries is not null.
The test of a strategy S proceeds in the following way:
(1) The not null ratings in R are partitioned into the three matrices K, X, T.
(2) MAE, Precision and NDCG are measured on T, training the rating prediction model on K.
(3) For each user u:
    (a) Only the first time that this step is executed, U_u, the unclear set of user u, is initialized to all the items i with a null value k_ui in K.
    (b) Using strategy S (pure or randomized) a set of items L = S(u, N, K, U_u) is computed.
    (c) The set L_e, containing only the items in L that have a not null rating in X, is created.
    (d) Assign to the corresponding entries in K the ratings for the items in L_e as found in X.
    (e) Remove the items in L from U_u (U_u = U_u \ L) and from X.
(4) MAE, Precision and NDCG are measured on T, and the prediction model is re-trained on the new set of ratings contained in K.
(5) Repeat steps 3-4 (Iteration) for I times.
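The simulation above can be summarized in the following sketch, which assumes the matrix representation and strategy interface sketched in Section 3 and placeholder train and evaluate functions; it is a compact restatement of steps (1)-(5), not the actual experimental code.

```python
import numpy as np

def simulate(strategy, K, X, T, n_iterations, N, train, evaluate):
    """Simulate system-wide rating elicitation (steps 1-5).

    K, X, T: n x m matrices partitioning the not-null ratings of R (np.nan = null).
    strategy(u, N, K, U_u) -> list of at most N items to ask user u to rate.
    train(K) -> prediction model; evaluate(model, T) -> dict with MAE, Precision, NDCG.
    """
    n_users = K.shape[0]
    # (3a) the unclear set of each user: items with a null value in K
    U = {u: set(np.flatnonzero(np.isnan(K[u]))) for u in range(n_users)}
    model = train(K)
    history = [evaluate(model, T)]                # (2) measures before any elicitation
    for _ in range(n_iterations):                 # (5) repeat steps 3-4 for I iterations
        for u in range(n_users):                  # (3) ask every user in turn
            L = strategy(u, N, K, U[u])           # (3b) items selected by the strategy
            for i in L:
                if not np.isnan(X[u, i]):         # (3c-d) the user knows this rating: move it to K
                    K[u, i] = X[u, i]
                    X[u, i] = np.nan
            U[u] -= set(L)                        # (3e) never ask for the same item again
        model = train(K)                          # (4) retrain on the enlarged K ...
        history.append(evaluate(model, T))        # ... and re-measure on T
    return history
```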
It is important to note here the peculiarity of this evaluation strategy, which has already been mentioned in Section 2. Traditionally, the evaluation of active learning strategies has been user-centered, i.e., the usefulness of an elicited rating was judged based on the improvement in the user's prediction error. This is illustrated in Figure 1 (a). In this scenario the system is supposed to have a large number
of ratings from several users, and focusing on a new user (the first one in Figure 1 (a)) it first elicits
ratings from this new user that are in X, and then system predictions for this user are evaluated on
the test set T . Hence these traditional evaluations focussed on the new user problem and measured
how the ratings elicited from a new user may help the system to generate good recommendations for
this particular user. We note that a rating elicited from a user may improve not only the rating predictions for that user, but also the predictions for the other users; this is what we evaluate in our experiments, as graphically illustrated in Figure 1 (b). To illustrate this point, let us consider
an extreme example in which a new item is added to the system. The traditional user-centered AL
strategy, when trying to identify the items that a particular user u should rate, may ignore obtaining
his rating for that new item. In fact, this item has not been rated by any other user and therefore its
ratings cannot contribute to improving the rating predictions for u. However, the rating of u for the new item would allow the system to bootstrap the predictions for the rest of the users, and hence
from the system’s perspective the elicited rating is indeed very informative.
The MovieLens [Miller et al. 2003] and Netflix rating databases were used for our experiments.
Movielens consists of 100,000 ratings from 943 users on 1682 movies. From the full Netflix data
set, which contains 1,000,000 ratings, we extracted the first 100,000 ratings that were entered into
the system. They come from 1491 users on 2380 items, so this sample of Netflix data is 2.24 times
sparser than Movielens data.
We also performed some experiments with the larger versions of both Movielens and Netflix
datasets (1,000,000 ratings) and obtained very similar results [Elahi et al. 2011]. However, using the
full set of Netflix data required much longer times to perform our experiments since we train and
test a rating prediction model at each iteration: every time we add to K new ratings elicited from
the simulated users. After having observed a very similar performance on some initial experiments
we focussed on the smaller data sets to be able to run more experiments.
When deciding how to split the available data into the three matrices K, X and T, an obvious
choice is to follow the actual time evolution of the dataset, i.e., to insert in K the first ratings
acquired by the system, then to use a second temporal segment to populate X and finally use the
remaining ratings for T . An approach that follows this idea is detailed in section 7.
But, it is not sufficient to test the performance of the proposed strategies for a particular evolution
of the rating dataset. Since we want to study the evolution of a rating data set under the application of
a new strategy we cannot test it only against the temporal distribution of the data that was generated
by a particular (unknown) previously used elicitation strategy. Hence we first followed the approach
also used in [Harpale and Yang 2008] to randomly split the rating data, but unlike [Harpale and
Yang 2008] we generated several random splits of the ratings into K, X and T . This allows us
to generate ratings configurations where there are users and items that initially have no ratings in
the known dataset K. We believe that this approach provided us with a very realistic experimental
setup, letting us address both the new user and the new item problems [Ricci et al. 2011].
Finally, for both data sets the experiments were conducted by partitioning (randomly) the 100,000
not null ratings of R in the following manner: 2,000 ratings in K (i.e., very limited knowledge at
the beginning), 68,000 ratings in X, and 30,000 ratings in T . Moreover, |L| = 10, which means
that at each iteration the system asks a user for his opinion on at most 10 items. The number of iterations was set to I = 170, since after that stage almost all the ratings are acquired and the system
performance is not changing anymore. Moreover, the number of factors in the SVD prediction model
was set to 16, which enabled the system to obtain a very good prediction accuracy, not very different
from configurations using hundreds of factors, as shown in [Koren and Bell 2011]. Note that, since the factor model is trained at each iteration and for each strategy, learning the factor model is the major computational bottleneck of the conducted experiments. For this reason we did not
use a very large number of factors. Moreover, in these experiments we wanted to compare the
system performance under the application of several strategies, hence, the key measure is the relative
performance of the system rather than its absolute value. All the experiments were performed 5 times and the results presented in the following section are obtained by averaging these five repetitions.
We considered three evaluation measures: mean absolute error (MAE), precision, and normalized
discounted cumulative gain (NDCG) [Herlocker et al. 2004; Shani and Gunawardana 2010; Manning 2008]. For computing precision we extracted, for each user, the top 10 recommended items
(whose ratings also appear in T ) and considered as relevant the items with true ratings equal to 4 or
5.
Discounted cumulative gain (DCG) is a measure originally used to evaluate the effectiveness of information retrieval systems [Järvelin and Kekäläinen 2002], and is also used for evaluating collaborative filtering RSs [Weimer et al. 2008; Liu and Yang 2008]. In RSs the relevance is measured by the rating value of the item in the predicted recommendation list. Assume that the recommendations for u are sorted according to the predicted rating values; then DCG_u is defined as:

    DCG_u = \sum_{i=1}^{N} \frac{r_u^i}{\log_2(i + 1)}    (1)

where r_u^i is the true rating (as found in T) for the item ranked in position i for user u, and N is the length of the recommendation list.
Normalized discounted cumulative gain for user u is then calculated in the following way:

    NDCG_u = \frac{DCG_u}{IDCG_u}    (2)

where IDCG_u stands for the maximum possible value of DCG_u, which would be obtained if the recommended items were ordered by decreasing value of their true ratings. We also measured the overall average NDCG by averaging NDCG_u over the full population of users.
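The following sketch shows how the per-user measures can be computed from the test matrix T following the definitions above (items with a true rating of 4 or 5 are considered relevant for precision, and DCG/NDCG follow Equations (1) and (2)); the helper names and the dictionary-based representation of a user's test ratings are illustrative assumptions. MAE is simply the mean of |t_ui − r̂_ui| over the not-null entries of T.

```python
import numpy as np

def precision_at_n(predicted, true_ratings, n=10, relevance_threshold=4):
    """Fraction of the top-n items (ranked by predicted rating) whose true rating is >= 4."""
    top = sorted(true_ratings, key=lambda i: predicted[i], reverse=True)[:n]
    return sum(true_ratings[i] >= relevance_threshold for i in top) / float(n)

def dcg(ratings_in_rank_order):
    # Equation (1): sum_i r_i / log2(i + 1), with ranks starting at i = 1.
    return sum(r / np.log2(pos + 1)
               for pos, r in enumerate(ratings_in_rank_order, start=1))

def ndcg_user(predicted, true_ratings, n=10):
    """Equation (2): DCG of the predicted ranking divided by the ideal DCG."""
    items = list(true_ratings)                    # this user's items with a rating in T
    ranked = sorted(items, key=lambda i: predicted[i], reverse=True)[:n]
    ideal = sorted(items, key=lambda i: true_ratings[i], reverse=True)[:n]
    idcg = dcg([true_ratings[i] for i in ideal])
    return dcg([true_ratings[i] for i in ranked]) / idcg if idcg > 0 else 0.0
```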
5. EVALUATION OF THE PURE STRATEGIES
In this section we present the results of a first set of experiments in which the pure strategies have
been evaluated. We first illustrate how the system MAE is changing as the system is acquiring new
ratings with the proposed strategies. Then we show how the NDCG and the system precision are
affected by the considered rating acquisition strategies.
5.1. Mean Absolute Error
The MAE computed on the test matrix T at the successive iterations of the application of the elic-
itation strategies (for all the users) is depicted in Figure 2. First of all, we can observe that the
considered strategies have a similar behavior on both data sets (Netflix and Movielens). Moreover,
there are two clearly distinct groups of strategies:
(1) Monotone error decreasing strategies: lowest-highest predicted, lowest predicted, voting, and
random.
(2) Non-monotone error decreasing strategies: binary predicted, highest predicted, popularity,
log(popularity)*entropy, and variance.
Strategies of the first group show an overall better performance (MAE) for the middle stage, but
not for the beginning and the end stage. At the very beginning, i.e., during the iterations 1-3 the best
performing strategy is binary-predicted. Then, during iterations 4-11 the voting strategy obtains
the lowest MAE on the Movielens data set. Then the random strategy becomes the best, and it is
overtaken by the lowest-highest-predicted strategy only at iteration 46. The results on the Netflix
data set differ as follows. The binary-predicted is the best strategy for a longer period, i.e., from the
beginning until iteration 7, and then voting outperforms this strategy till iteration 46, where lowest-
highest-predicted starts exhibiting the lowest error. At the iteration 80, the MAE stops changing for
all of the prediction-based strategies. This occurs because the known set K at that point already
reaches the largest possible size for those strategies, i.e., all the ratings in X, which can be elicited
by these strategies, have been transferred to K. Conversely, the MAE of the voting and random
strategies keeps decreasing, until all of the ratings in X are moved to K. It is important to note that
the prediction-based strategies (e.g., highest predicted) cannot elicit ratings for which the prediction
cannot be made, e.g., if a movie has no ratings in K.
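To illustrate this limitation, a minimal version of the highest-predicted selection could look like the following sketch (our own naming); it can only rank candidate items for which the model returns a prediction, so items without any rating in K are simply skipped.

def highest_predicted(user, candidates, predict, list_size=10):
    # Ask the user about the candidate items with the highest predicted ratings.
    # predict(user, item) is assumed to return None when no prediction can be made,
    # e.g. when the item has no ratings at all in the known matrix K.
    scored = [(item, predict(user, item)) for item in candidates]
    scored = [(item, score) for item, score in scored if score is not None]  # unpredictable items are skipped
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in scored[:list_size]]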
The behavior of the non-monotone strategies can be divided into three stages. Firstly, they all
decrease the MAE at the beginning (approximately iterations 1-5). Secondly, they slowly increase
it, up to the point where MAE reaches a peak (approximately iterations 6-35). Thirdly, they slowly
decrease MAE till the end of the experiment (approximately iterations 36-80). This behavior occurs
because the strategies in the second group have a strong selection bias with regard to the properties
of the items they select, which may negatively affect MAE. For instance, the highest predicted strategy
at the initial iterations (from 1 to 5) elicits primarily items with high ratings; however, this behavior
does not persist, as can be seen from the rating distribution for iterations 35 to 39 (Table I). As a
result, in the beginning stages this strategy adds to the known matrix K disproportionately more high
ratings than low ones, and this ultimately biases the rating prediction towards overestimating the
ratings.

Fig. 2. System MAE evolution under the effect of the pure rating elicitation strategies: (a) Movielens dataset, (b) Netflix dataset.

Table I. The distribution of the rating values for the ratings elicited by the Highest Predicted strategy at two stages of its application (percentage of elicited ratings of value r)

Iterations      r=1     r=2      r=3      r=4      r=5
from 1 to 5     2.06%   4.48%    16.98%   36.56%   39.90%
from 35 to 39   6.01%   13.04%   29.33%   34.06%   17.53%

Low rated movies are selected for elicitation by the highest predicted strategy in two cases: (1)
when a low rated item is predicted to have a high rating; (2) when all the highest predicted ratings
have already been elicited or marked as "not available" (they are not present in X and have been
removed from U_u). Looking into the data we discovered that at iteration 36 the highest-predicted
strategy has already elicited most of the highest ratings. Hence, the next ratings that are elicited
are actually average or low ratings, which reduces the bias in K and also the prediction error. The
random and lowest-highest predicted strategies do not introduce such a bias, and this results in a
constant decrease of MAE.
5.2. Number of Acquired Ratings
In addition to measuring the quality of the elicited ratings (as described in the previous section), it is
also important to measure the number of elicited ratings. In fact, certain strategies can acquire more
ratings by better estimating what items the user has actually experienced and is therefore able to
rate. We simulate the limited knowledge of users by making available only the ratings in the matrix
X. Conversely, a strategy may not be able to acquire many ratings, but those actually acquired
can be very useful for improving recommendations.
Figure 3 shows the number of ratings in K that are known to the system, as the strategies elicit
new ratings from the simulated users. It is worth noting, even in this case, the strong similarity of
the behavior of the elicitation strategies in both data sets. The only strategy that differs substantially
in the two data sets is random. This is clearly caused by the larger number of users and items that
are present in the Netflix data. In fact, while both data sets contain 100,000 ratings, the sparsity
of the Netflix sample is much higher: it contains only 2.8% of the possible ratings (1491*2380),
vs. 6.3% of the possible ratings (943*1682) contained in the Movielens data set. This larger sparsity
makes it more difficult for a pure random strategy to select items that are known to the user.
In general this is a major limitation of any random strategy, i.e., a very slow rate of addition of
new ratings. Hence for relatively small problems (with regard to the number of items and users)
the random strategy may be applicable, but for larger problems it is rather impractical. In fact,
observing Figure 3, one can see that in the Movielens simulations after 70 iterations, in which
70*10*943=660,100 rating requests were made (iterations * number-of-rating-requests * users),
the system has acquired on average only 28,000 new ratings (the system was initialized with 2,000
ratings, hence bringing the total number of ratings to 30,000). This means that only for one out
of 23 random rating requests a user is able to provide a rating. In the Netflix data set this ratio is
even worse. It is interesting to note that even the popularity strategy has a poor performance in terms
of number of elicited ratings; it elicited the first 28,000 ratings at a speed equal to one rating for
each 6.7 rating requests. We also observe that according to our results, quite surprisingly, the higher
sparsity of the Netflix sample has produced a substantially different impact only on the random
strategy.

Fig. 3. Evolution of the number of ratings elicited by the AL strategies: (a) Movielens, (b) Netflix.
It is also clear that certain strategies are not able to acquire all the ratings in X. For instance
lowest-highest-predicted, lowest-predicted and highest-predicted stop acquiring new ratings once
they have collected 50,000 ratings (Movielens). This is due to the fact that these strategies, in order
to acquire from a user her ratings for items, need the recommender system to generate rating predic-
tions for those items. Such predictions cannot be made when an item (or a user) has no ratings at all
in the known dataset K, and hence matrix factorization cannot derive any rating prediction for it.
Figure 4 illustrates a related aspect, i.e., how useful the acquired ratings are for the effectiveness
of the system: how much the same number of ratings, acquired by different strategies, can reduce
MAE. From Figure 4 it is clear that in the first stage of the process, i.e., when a small
number of ratings are present in the known matrix K, the random and lowest-predicted strategies
collect ratings that are more effective in reducing MAE. Successively, the lowest-highest-predicted
strategy acquires more useful ratings. This is an interesting result, showing that the items with the
lowest predicted ratings and random items are providing more useful information, even though these
ratings are difficult to acquire.
5.3. Normalized Discounted Cumulative Gain
In this section we analyze the results of the experiments with regards to the NDCG metric. As
discussed in section 4, in order to compute NDCG for a particular user, first the ratings for the
items in the recommendation list are predicted. Then, the normalized discounted cumulative gain
(NDCG) is computed by dividing the DCG of the ranked list of the recommendations with the
DCG obtained by the best ranking of the same items for that user. NDCG is computed on the top
10 recommendations for every user. Moreover, recommendation lists are created only with items
that have ratings in the testing dataset; this is necessary in order to compute DCG. We note that
sometimes the testing set contains fewer than 10 items for some users. In this case NDCG is computed
on this smaller set.
Moreover, when computing NDCG, in some cases the rating prediction algorithm (matrix factor-
ization) cannot generate rating predictions for all 10 items that are in test set of a user. This happens
when the user’s ratings in the test set T have no corresponding ratings anywhere in the known dataset
K, and hence matrix factorization can not derive any rating predictions for them. It is important to
notice that the ideal recommendation list for a user is rather stable during the experiments that use
the same dataset. Therefore, if an algorithm is not able to generate a predicted recommendation list
of size 10, a list of the available size is used, which results in smaller NDCG values.
Figure 5 depicts the NDCG curves for the pure strategies. A higher NDCG value corresponds to
higher rated items being present in the predicted recommendation lists. Popularity is the best strat-
egy at the beginning of the experiment. But at iteration 3, in the Movielens data set, and 9 in the
Netflix data set, the voting strategy passes the popularity strategy and then remains the best one. In
Movielens the random strategy overtakes the voting strategy at iteration 70, but this is not observed
in the Netflix data. Excluding the voting and random strategies, popularity, log(popularity)*entropy,
and variance are the best in both data sets. Lowest predicted is by far the worst, and this is quite sur-
prising considering how effective it is in reducing MAE. By further analyzing the experiment data
we discovered that the lowest predicted strategy is not effective for NDCG since it is eliciting more
ratings for the lowest ranked items which are useless for predicting the ranking of the top items.
Another striking difference from the MAE experiments is that all the strategies improve NDCG
monotonically. It is also important to note that here the random strategy is by far the best. This is
again different from its behavior in MAE experiments.
Fig. 4. System MAE evolution vs. the number of ratings elicited: (a) Movielens, (b) Netflix.

Fig. 5. System NDCG evolution under the application of the pure rating elicitation strategies: (a) Movielens, (b) Netflix.

5.4. Precision
As we have already observed with regard to MAE and NDCG, very similar results were observed
for the Netflix and Movielens datasets in the initial experiments. For this reason, in the rest of this
paper we use just the Movielens data set. Precision, as described in Section 4, measures the proportion
of items rated 4 and 5 that are found in the recommendation list. Figure 6 depicts the evolution of the
system precision when the elicitation strategies are applied.

Fig. 6. System precision under the application of the pure rating elicitation strategies (Movielens).

Here, highest predicted
is the best performing strategy for the largest part of the test. Starting from iteration 50 it is as good
as the binary predicted and the lowest-highest-predicted strategies. It is also interesting to note that
all the strategies monotonically increase the precision. Moreover, the random strategy, differently
from the NDCG case, does not perform so well when compared with the highest predicted strategy.
This is again related to the fact that the random strategy substantially increases the coverage by
introducing new users. But for new users the precision is significantly smaller, as the system does
not have enough ratings to produce good predictions.
In conclusion, these experiments show that among the evaluated strategies there is no single best
strategy that dominates the others for all the evaluation measures. The random and voting strategies
are the best for NDCG, whereas for MAE lowest-highest predicted performs quite well, and for
precision lowest-highest predicted, highest predicted, and voting work well.
6. EVALUATION OF THE PARTIALLY RANDOMIZED STRATEGIES
Among the pure strategies only the random one is able to elicit ratings for items that have not been
evaluated by the users already present in K. Partially randomized strategies address this problem
by asking new users to rate random items (see Section 3). In this section we have used partially
randomized strategies where p = 0.2, i.e., at least 2 of the 10 items that are requested to be rated by
the simulated users are chosen at random.
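One simple way to implement this mixing is sketched below (under our own naming): a fraction p of the elicitation list is filled with random items from the user's unclear set, and the remaining slots with the base strategy's top choices.

import random

def partially_randomized(user, candidates, base_strategy, predict, list_size=10, p=0.2):
    # Partially randomized strategy: at least round(p * list_size) items are chosen at random.
    n_random = max(1, int(round(p * list_size)))              # 2 random items when p=0.2 and list_size=10
    random_part = random.sample(candidates, min(n_random, len(candidates)))
    remaining = [i for i in candidates if i not in random_part]
    base_part = base_strategy(user, remaining, predict, list_size - len(random_part))
    return random_part + base_part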
Figure 7 depicts the system MAE evolution during the experimental process. We note that here
all the curves are monotone, i.e., it is sufficient to add just a small portion of randomly selected
ratings to the elicitation lists to reduce the bias of the pure, prediction-based, strategies.
Fig. 7. System MAE evolution under the application of the partially randomized strategies (Movielens).

It should be mentioned that we have not evaluated the partially randomized voting strategy, because
it already includes the random strategy as one of its voting strategies. The best performing partially
randomized strategies, with respect to MAE, are, at the beginning of the process, the partially
randomized binary-predicted, and subsequently the lowest-highest-predicted (similarly to the pure
strategies case).

Fig. 8. System NDCG evolution under the application of the partially randomized strategies (Movielens).
Figure 8 shows the NDCG evolution under the effect of the partially randomized strategies. During
iterations 1-6, the partially randomized popularity strategy obtains the best NDCG. During iterations
7-170, i.e., for the largest part of the test, the best strategy is the partially randomized highest
predicted. Again, as we observed for the pure strategy version, the worst is the lowest-predicted.
It is important to note that the strategies that show good performance at the beginning (partially
randomized highest and binary predicted strategies) are those aimed at finding items that a user may
know and therefore is able to rate. Hence, these strategies are very effective in the early stage, when
there are many users with very few items in the known dataset K.

Fig. 9. System precision under the application of the randomized strategies (Movielens).
Figure 9 shows the precision of the partially randomized strategies. The partially randomized
highest predicted strategy shows again the best results during most of the test, as for NDCG. During
the iterations 1-6 the best strategy with respect to precision is the partially randomized binary
predicted strategy, but then the classical approach of requesting the user to rate the items that the
system considers the best recommendations (highest-predicted) is the winner. During iterations
111-170 partially randomized variance, popularity, log(popularity)*entropy, highest predicted and
binary predicted have very similar precision values. Similarly to NDCG, the worst strategy is the
lowest-predicted, i.e., eliciting ratings for the items that the user dislikes does little to improve the
recommender's precision. Interestingly, this is not the case if the goal is to improve MAE.
7. COMBINING ACTIVE LEARNING AND NATURAL ACQUISITION OF RATINGS
For these experiments, we designed a procedure to simulate the evolution of a RS’s performance
by mixing the usage of active learning strategies with the natural acquisition of ratings. We are
interested in observing the temporal evolution of the quality of the recommendations generated by
the system when, in addition to exploiting an active learning strategy for requesting the user to rate
some items, the users were able to voluntarily add ratings without being explicitly requested, just
as it happens in actual settings. To accomplish this goal, we have used the larger version of the
Movielens dataset (1,000,000 ratings), for which we considered only the ratings of users that were
active and rated movies for at least 8 weeks (2 months). This subset consists of 377,302 ratings from
1,236 users on 3,574 movies. The ratings are timestamped with values ranging from 25/04/2000 to
28/02/2003. We measure the performance of the recommendation algorithm on a test set, as more
and more ratings are added to the known set K, while the simulated time advances from 25/04/2000
to 28/02/2003. We combined this natural acquisition of the ratings with active learning as described
below.
We split the available data into three matrices K, X and T , as we did previously, but now we
also consider the time stamp of the ratings. Hence, we initially insert in K the ratings acquired by
Movielens in the first week (3,705 ratings). Then we split randomly the remaining ratings to obtain
70% of the ratings in X (261,730) and 30% in T (111,867).
For these new experiments, we perform a simulated iteration every week. That is, each simulated
day (starting from the second week) an active learning strategy requests ratings for 40 items from
each user who already has some non-null ratings in K, i.e., ratings that are known by the
system at that point in time. If these ratings are present in X, they are added to K. This procedure
is repeated for 7 days (1 week). Then, all the ratings in the Movielens dataset that according to the
timestamps were acquired in that week are also added to K. Finally, the system is trained using
the ratings in K. To achieve a realistic setting for evaluating the predictive performance of the RS, we
use only the items in T that users actually experienced during the following week (according to the
timestamps). This procedure is repeated for I = 48 weeks (1 year).
In order to justify the large number of rating requests that the system makes each week, it is
important to note that the simulated application of an active learning strategy, as we do in our
experiments, is able to add far fewer ratings than could be elicited in a real setting. In fact,
the number of ratings that are supposed to be known by the users in the simulated process is limited
by the number of ratings that have been actually acquired in the Movielens dataset. In [Elahi et al.
2011] it has been estimated that the number of items that are really known by the user is more
than 4 times larger than what is typically observed in the simulations. Hence, many of our elicitation
requests would go unfulfilled, even though the user in actuality would have been able to rate the item.
Therefore, instead of asking for 10 items as typically done, we ask for 4 times as many items (40
items), to adjust for the discrepancy between the knowledge of the actual and simulated users.
In order to precisely describe the evaluation procedure, we use the following notations, where n
is the week index:
— K_n: the set of ratings known by the system at the end of week n. These are the ratings that
have been acquired up to week n. They are used to train the prediction model, to compute the active
learning rating elicitation strategies for week n+1, and to test the system's performance using the
ratings contained in the test set of the next week, T_{n+1}.
— T_{n+1}: the set of ratings time stamped during week n+1 that are used as test set to measure
the system performance after the ratings of the previous weeks have been added to K_n.
— AL_n: the set of ratings elicited by a particular elicitation strategy and added to the known set
K_n at week n. We note that these are ratings that are present in X but not in T. This is required
to assure that the active learning strategies are not modifying the test set and that the system
performance, under the application of the strategies, is consistently tested on the same set of
ratings.
— X_n: the set of ratings in X, time stamped in week n, that are not in the test set T_n. These ratings,
together with the ratings in T_n, are all of the ratings acquired in Movielens during week n,
and therefore are considered to have been naturally provided by the (simulated) users without
being asked by the system (natural acquisition). We note that it may happen that an elicitation
strategy has already acquired some of these ratings, i.e., the intersection of AL_n and X_n may be
not empty. In this case, only those not yet actively acquired are added to K_n.

The testing of an active learning strategy S now proceeds in the following way.

— System initialization: week 1
1. The full set of ratings is partitioned randomly into the two matrices X and T.
2. The not null ratings in X_1 and T_1 are added to K_1: K_1 = X_1 ∪ T_1.
3. U_u, the unclear set of user u, is initialized to all the items i with a null value k_{ui} in K_1.
4. The rating prediction model is trained on K_1, and MAE, Precision, and NDCG are measured on T_2.
— For all the weeks n starting from n = 2
1. Initialize K_n with all the ratings in K_{n-1}.
2. For each user u with at least 1 rating in K_{n-1}:
  — Using strategy S a set of items L = S(u, N, K_{n-1}, U_u) is computed.
  — The set L_e is created, containing only the items in L that have a non-null rating in X. The
    ratings for the items in L_e are added to AL_n.
  — Remove from U_u the items in L: U_u = U_u \ L.
3. Add to K_n the ratings time stamped in week n and those elicited by S: K_n = K_n ∪ AL_n ∪ X_n ∪ T_n.
4. Train the factor model on K_n.
5. Compute MAE, Precision, and NDCG on T_{n+1}.
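The weekly procedure can be summarized by a sketch like the following, written under our own naming and with some simplifications: the seven daily requests of a week are collapsed into a single request, and the factor-model training, evaluation, and elicitation strategy are abstracted away as callables passed in by the caller.

def weekly_simulation(X, T, strategy, train, evaluate, weeks=48, list_size=40):
    # X, T: dicts mapping (user, item) -> (rating, week); train(K) returns a prediction model;
    # evaluate(model, tests) computes the desired measures (MAE, precision, NDCG).
    K = {k: r for k, (r, w) in list(X.items()) + list(T.items()) if w == 1}   # K_1 = X_1 U T_1
    results = []
    for n in range(2, weeks + 1):
        model = train(K)
        AL_n = {}
        for u in {uu for (uu, _) in K}:                       # users already known to the system
            for item in strategy(u, model, list_size):
                if (u, item) in X and (u, item) not in K:     # only ratings in X (never in T) can be elicited
                    AL_n[(u, item)] = X[(u, item)][0]
        naturally_added = {k: r for k, (r, w) in list(X.items()) + list(T.items()) if w == n}
        K.update(AL_n)
        K.update(naturally_added)                             # K_n = K_{n-1} U AL_n U X_n U T_n
        model = train(K)
        T_next = {k: r for k, (r, w) in T.items() if w == n + 1}
        results.append(evaluate(model, T_next))
    return results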
7.1. Results
Figure 10 shows the MAE time evolution for the different strategies. It should be noted that there is
a huge fluctuation of MAE from week to week. This is caused by the fact that for every week we
train the system on the previous weeks' data and we test the system performance on the next week's
ratings in the test set. Hence, the difficulty of making good predictions may differ from week to
week. For this reason, in the figure we focus on a time range: weeks 1 to 17. In this figure the value
at week n is obtained after the system has acquired the ratings for that week, and this is the result of
evaluating the system’s performance on week n+1 (see the description of the simulation procedure
in the previous section). The natural acquisition curve shows the MAE of the system without using
items acquired by the AL strategies, i.e., the added ratings are only those that have been acquired
during that week in the Movielens data set.
The results show that in the second week the performance of all the strategies is very close. But
starting from the third week popularity and log(popularity)*entropy both perform better than the
others. These two strategies share similar characteristics and outperform all the other strategies over
the whole rating elicitation process. Voting, variance and random are the next best strategies in terms
of MAE.
In order to better show the results of our experiments, in Figure 11 we plot three strategies that
can be representative of other strategies. We have chosen log(popularity)*entropy since it is one of
the state of the art strategies, highest-predicted since it performs very similarly to other prediction-based
strategies, and voting, which is a novel strategy.
Considering the MAE obtained by the natural acquisition of ratings as a baseline we can observe
that the highest-predicted does not perform very differently from the baseline. The main reason is
that this strategy is not acquiring additional ratings besides those already collected by the natural
process, i.e., the user would rate these items on his own initiative. The other strategies, in addition to
these ratings, are capable to elicit more ratings, also those that the user will rate later on, i.e., in the
successive weeks. We observe that here, differently from the previous experiments, all the strategies
show a non-monotone behavior. But, in this case, it is due to the fact that the test set, every week,
is a subset of the ratings entered in Movielens during the following week. The predictive difficulty
of this test set can therefore change from week to week, and hence influence the performance of the
competing strategies.
Fig. 10. System MAE evolution under the simultaneous application of active learning strategies and natural acquisition of ratings (Movielens).

Fig. 11. System MAE evolution under the application of three selected active learning strategies and natural acquisition of ratings (Movielens).

In order to examine the results further, we have also plotted in Figure 12 the MAE of the strategies
normalized with respect to the MAE of the baseline, i.e., the system without the ratings obtained by
the active learning strategies: (MAE_Strategy / MAE_Baseline) - 1. We also plot in Figure 13 this
normalized behavior only for the three selected strategies. This figure shows more clearly the benefit
of an active learning strategy in comparison with the natural process. Moreover, in Figure 13 the
number of new users entering the system every week is also plotted, in order to understand the effect
of new users entering the system on the system performance under the application of the considered
strategies. The left y-axis in this figure shows the number of new users in the Known set K_n and the
right y-axis shows the MAE normalized by the baseline. The gray solid line depicts the number of
new users entering the system every week.

Fig. 12. System MAE evolution under the application of active learning strategies and natural acquisition of ratings (Movielens). MAE values are normalized with respect to the MAE of the system that acquires new ratings only using the natural acquisition. The number of new users entering the system every week is also shown.
Comparing the strategies in Figure 13, we can distinguish two types of strategies: the first type
corresponds to the highest-predicted strategy, whose normalized MAE is very close to the baseline.
The second type includes log(popularity)*entropy and voting strategies that express larger variations
of performance, and substantially differ from the baseline (excluding the week 10). The overall
performance of these strategies is better than the performance of the first type. Moreover, observing
the number of new users at each week we can see that the largest number of new users is entering
at weeks 9, 10, and 14. For these weeks the normalized MAE shows the worst performances, with
the largest value of MAE at week 10. Hence, the bad news is that in the presence of many new users
none of the strategies are effective, and better solutions need to be developed.
Despite the fact that new users are detrimental to the accuracy of the prediction, in the long
term, more users entering the system would result in a better recommender system. Thus, we have
computed the correlation coefficients between MAE curves of the strategies and the total number
of users in the Known set K_n. Table II shows these correlations as well as the corresponding p-values.
There is a clear negative correlation with the total number of users in the system. This means that
the more users enter the system, the lower the MAE becomes.
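These correlations can be reproduced with a standard Pearson test; for example, with SciPy (the arrays below are toy placeholders for the measured curves, not the values of the experiment):

from scipy.stats import pearsonr

# weekly MAE of one strategy and the total number of users in K_n at the same weeks (toy values)
mae_curve = [0.95, 0.91, 0.93, 0.88, 0.86, 0.84, 0.85, 0.82]
users_in_known_set = [120, 160, 210, 260, 320, 390, 450, 520]

r, p_value = pearsonr(mae_curve, users_in_known_set)
print(f"correlation = {r:.4f}, p-value = {p_value:.4f}")   # a negative r indicates MAE drops as users grow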
Table II. Correlation of MAE with the number of users in the Known set K_n.

Strategy                Correlation Coefficient   p-value
Natural Acquisition     -0.4430                   0.0016
Variance                -0.5021                   0.0003
Random                  -0.5687                   0.0000
Popularity              -0.5133                   0.0002
Lowest predicted        -0.4933                   0.0004
Low-high predicted      -0.5083                   0.0002
Highest predicted       -0.5153                   0.0002
Binary prediction       -0.5126                   0.0002
Voting                  -0.5215                   0.0001
Log(pop)*entropy        -0.5028                   0.0003

Fig. 13. System MAE evolution under the application of three selected active learning strategies and natural acquisition of ratings (Movielens). MAE values are normalized with respect to the MAE of the system that acquires new ratings only using the natural acquisition. The number of new users entering the system every week is also shown.

Another important aspect to consider is the number of ratings that are elicited by the considered
strategies in addition to the natural acquisition of ratings. As discussed before, certain strategies
can acquire more ratings by better estimating what items are likely to have been experienced by the
user. Figure 14 illustrates the size of the Known set K_n as the strategies acquire more ratings from
the simulated users. As shown in the figure, although the number of ratings added naturally is by far
larger than that of any strategy (more than 314,000 ratings in week 48), still the considered strategies
can elicit many ratings. Popularity and log(popularity)*entropy are the strategies that add the most
ratings, totaling more than 161,000 at the end of the experiment. On the other hand, voting is the
strategy that elicits overall the smallest number of ratings. This can be due to the fact that sometimes
most of the strategies vote for a similar set of items. The selected items would then mostly overlap
with the naturally acquired ratings, which could result in fewer ratings being added to the known set.
However, the remarkably good performance of voting may indicate that this strategy focuses more
on the informativeness of the items rather than on their ratability.
Fig. 14. Size evolution of the Known set under the application of rating elicitation strategies (Movielens).

Table III. Strategies performance summary: for each metric (MAE, NDCG, number of elicited ratings, informativeness, and precision) the table marks which strategies perform well or badly at the early stage, at the late stage, in the partially randomized variant, and when combined with the natural acquisition of ratings.

8. CONCLUSIONS AND FUTURE WORK
In this work we have addressed the problem of selecting the items to present to the users for acquiring
their ratings, which is also known as the rating elicitation problem. We have proposed and evaluated
a set of rating elicitation strategies. Some of them have been proposed in a previous work [Rashid
et al. 2002] (popularity, log(popularity)*entropy, random, variance), and some, which we define
as prediction-based strategies, are new: binary-prediction, highest-predicted, lowest-predicted, and
highest-lowest-predicted. Moreover, we have studied the behavior of other novel strategies: partially
randomized, which adds randomly selected items to the elicitation lists computed by the aforementioned
strategies; and voting, which asks the user to rate the items that are selected by the largest number of voting
strategies. We have evaluated these strategies with regards to their system-wide effectiveness by
implementing a simulation loop that models the day-by-day process of rating elicitation and rating
database growth. We have taken into account the limited knowledge of the users, which means that
the users may not be able to rate all the items that the system proposes to them. During the
simulation we have measured several metrics at different phases of the rating database growth. The
metrics include: MAE to measure the improvements in prediction accuracy, precision to measure
the relevance of recommendations, normalized discounted cumulative gain (NDCG) to measure the
quality of produced ranking, and coverage to measure the proportion of items over which the system
can form predictions.
The evaluation (summarized in Table III) has shown that different strategies can improve different
aspects of the recommendation quality, and at different stages of the rating database development.
Moreover, we have discovered that some pure strategies may incur the risk of increasing the system
MAE if they keep adding only ratings with a certain value, e.g., the highest ones, as done by the
highest-predicted strategy, an approach that is often adopted in real RSs. In addition, prediction-based
strategies are able to address neither the problem of new users nor that of new items. Popularity
and variance strategies are able to select items for new users, but cannot select items that have no
ratings.
Partially randomized strategies experience fewer problems, because they elicit ratings for random
items that had no ratings at all. In this case, the lowest-highest (highest) predicted is a good alternative
if MAE (precision) is the targeted effectiveness measure. These strategies are easy to implement
and, as the experiments have shown, can produce considerable benefits.
Moreover, our results have shown that mixing active learning strategies with natural acquisition of
ratings influences the performance of the strategies. This is an important conclusion and no previous
experiments have addressed and illustrated this issue. In this situation we show that the popularity
and log(popularity)*entropy strategies outperform the other strategies. Our proposed voting strategy
has shown good performance, in terms of MAE and especially NDCG, both with and without the
natural acquisition.
This research identified a number of new problems that would definitely need to be studied fur-
ther. First of all, it is important to note that the results presented in this work clearly depend, as
in any experimental study, on the chosen simulation setup, which can only partially reflect the real
evolution of a recommender system. In our work we assume that a randomly chosen set of ratings,
among those that the user really gave to the system, represents the ratings known by the user, but
yet unknown by the system. However, this set does not completely reflect all the user's knowledge; it
contains only the ratings acquired using the specific recommender system. For instance, Movielens
used a combined random and popularity technique for rating elicitation. In reality, many more items
are known by the user, but his ratings are not included in the data set. This is a common problem
of any off-line evaluation of a recommender system, where the performance of the recommenda-
tion algorithm is estimated on a test set that is never coincident with the recommendation set. The
recommendation set is composed of the items with the largest predicted ratings. But if such an item
is not present in the test set, an off-line evaluation will never be able to check whether that prediction is
correct.
Moreover, we have already observed that the performance of some strategies (e.g., random and
voting) depends on the sparsity of the rating data. The Movielens data and the Netflix sample that
we used still have a considerably low sparsity compared to other, larger datasets. For example, if the
data sparsity were higher, there would be only a very low probability for the random strategy to select an
item that a user has consumed in the past and can provide a rating for. So the partially randomized
strategies may perform worse in reality.
Furthermore, there remain many unexplored possibilities for sequentially applying several strate-
gies that use different approaches depending on the state of the system [Elahi 2011]. For instance, one
may ask a user to rate popular items when the system does not yet know any of the user's ratings, and
use another strategy at a later stage.
REFERENCES
ANDERSON, C. 2006. The Long Tail. Random House Business.
BIRLUTIU, A., GROOT, P., AND HESKES, T. 2012. Efficiently learning the preferences of people. Machine Learning, 1–28.
BONILLA, E. V., GUO, S., AND SANNER, S. 2010. Gaussian process preference elicitation. In Advances in Neural Infor-
mation Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. December
6-9 2010, Vancouver, British Columbia, Canada. 262–270.
BOUTILIER, C., ZEMEL, R. S., AND MARLIN, B. M. 2003. Active collaborative filtering. In UAI ’03, Proceedings of the
19th Conference in Uncertainty in Artificial Intelligence, Acapulco, Mexico, August 7-10 2003. 98–106.
BRAZIUNAS, D. AND BOUTILIER, C. 2010. Assessing regret-based preference elicitation with the utpref recommendation
system. In Proceedings 11th ACM Conference on Electronic Commerce (EC-2010), Cambridge, Massachusetts, USA,
June 7-11, 2010. 219–228.
BURKE, R. 2010. Evaluating the dynamic properties of recommendation algorithms. In Proceedings of the fourth ACM
conference on Recommender systems. RecSys ’10. ACM, New York, NY, USA, 225–228.
CARENINI, G., SMITH, J., AND POOLE, D. 2003. Towards more conversational and collaborative recommender systems.
In Proceedings of the 2003 International Conference on Intelligent User Interfaces, January 12-15, 2003, Miami, FL,
USA. 12–18.
CHEN, L. AND PU, P. 2012. Critiquing-based recommenders: survey and emerging trends. User Model. User-Adapt. Inter-
act. 22, 1-2, 125–150.
CREMONESI, P., KOREN, Y., AND TURRIN, R. 2010. Performance of recommender algorithms on top-n recommenda-
tion tasks. In Proceedings of the 2010 ACM Conference on Recommender Systems, RecSys 2010, Barcelona, Spain,
September 26-30, 2010. 39–46.
DESROSIERS, C. AND KARYPIS, G. 2011. A comprehensive survey of neighborhood-based recommendation methods. In
Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Eds. Springer, 107–144.
ELAHI, M. 2011. Adaptive active learning in recommender systems. In User Modeling, Adaption and Personalization - 19th
International Conference, UMAP 2011, Girona, Spain, July 11-15, 2011. Proceedings. 414–417.
ELAHI, M., REPSYS, V., AND RICCI, F. 2011. Rating elicitation strategies for collaborative filtering. In E-Commerce and
Web Technologies - 12th International Conference, EC-Web 2011, Toulouse, France, August 30 - September 1, 2011.
Proceedings. 160–171.
ELAHI, M., RICCI, F., AND REPSYS, V. 2011. System-wide effectiveness of active learning in collaborative filtering. In
International Workshop on Social Web Mining, Co-located with IJCAI, F. Bonchi, W. Buntine, R. Gavald, and S. Gu,
Eds. Universitat de Barcelona, Spain.
FUNK, S. 2006. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html.
GOLBANDI, N., KOREN, Y., AND LEMPEL, R. 2010. On bootstrapping recommender systems. In Proceedings of the 19th
ACM international conference on Information and knowledge management. CIKM ’10. ACM, New York, NY, USA,
1805–1808.
GOLBANDI, N., KOREN, Y., AND LEMPEL, R. 2011. Adaptive bootstrapping of recommender systems using decision trees.
In Proceedings of the Forth International Conference on Web Search and Web Data Mining, WSDM 2011, Hong Kong,
China, February 9-12, 2011. 595–604.
GUO, S. AND SANNER, S. 2010. Multiattribute Bayesian preference elicitation with pairwise comparison queries. In Pro-
ceedings of the 7th international conference on Advances in Neural Networks - Volume Part I. Springer-Verlag, Berlin,
Heidelberg, 396–403.
HARPALE, A. S. AND YANG, Y. 2008. Personalized active learning for collaborative filtering. In SIGIR ’08: Proceedings
of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM,
New York, NY, USA, 91–98.
HERLOCKER, J. L., KONSTAN, J. A., BORCHERS, A., AND RIEDL, J. 1999. An algorithmic framework for performing
collaborative filtering. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and devel-
opment in information retrieval. SIGIR ’99. ACM, New York, NY, USA, 230–237.
HERLOCKER, J. L., KONSTAN, J. A., TERVEEN, L. G., AND RIEDL, J. T. 2004. Evaluating collaborative filtering recom-
mender systems. ACM Trans. Inf. Syst. 22, 1, 5–53.
JANNACH, D., ZANKER, M., FELFERNIG, A., AND FRIEDRICH, G. 2010. Recommender Systems: An Introduction. Cam-
bridge University Press.
JÄRVELIN, K. AND KEKÄLÄINEN, J. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4, 422–446.
JIN, R. AND SI, L. 2004. A Bayesian approach toward active learning for collaborative filtering. In UAI ’04, Proceedings of
the 20th Conference in Uncertainty in Artificial Intelligence, July 7-11 2004, Banff, Canada. 278–285.
KOREN, Y. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD ’08: Proceed-
ing of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York,
NY, USA, 426–434.
KOREN, Y. AND BELL, R. 2011. Advances in collaborative filtering. In Recommender Systems Handbook, F. Ricci,
L. Rokach, B. Shapira, and P. Kantor, Eds. Springer Verlag, 145–186.
LIU, N. N., MENG, X., LIU, C., AND YANG, Q. 2011. Wisdom of the better few: cold start recommendation via represen-
tative based rating elicitation. In Proceedings of the 2011 ACM Conference on Recommender Systems, RecSys 2011,
Chicago, IL, USA, October 23-27, 2011. 37–44.
LIU, N. N. AND YANG, Q. 2008. Eigenrank: a ranking-oriented approach to collaborative filtering. In SIGIR ’08: Proceed-
ings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval.
ACM, New York, NY, USA, 83–90.
MANNING, C. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge.
MARLIN, B. M., ZEMEL, R. S., ROWEIS, S. T., AND SLANEY, M. 2011. Recommender systems, missing data and statisti-
cal model estimation. In IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence,
Barcelona, Catalonia, Spain, July 16-22, 2011. 2686–2691.
MCNEE, S. M., LAM, S. K., KONSTAN, J. A., AND RIEDL, J. 2003. Interfaces for eliciting new user preferences in
recommender systems. In Proceedings of the 2003 International Conference on User Modeling. 178–187.
MILLER, B. N., ALBERT, I., LAM, S. K., KONSTAN, J. A., AND RIEDL, J. 2003. Movielens unplugged: experiences
with an occasionally connected recommender system. In IUI ’03: Proceedings of the 8th international conference on
Intelligent user interfaces. ACM, New York, NY, USA, 263–266.
PU, P. AND CHEN, L. 2008. User-involved preference elicitation for product search and recommender systems. AI Maga-
zine 29, 4, 93–103.
RASHID, A. M., ALBERT, I., COSLEY, D., LAM, S. K., MCNEE, S. M., KONSTAN, J. A., AND RIEDL, J. 2002. Get-
ting to know you: Learning new user preferences in recommender systems. In Proceedings of the 2002 International
Conference on Intelligent User Interfaces, IUI 2002. ACM Press, 127–134.
RASHID, A. M., KARYPIS, G., AND RIEDL, J. 2008. Learning preferences of new users in recommender systems: an
information theoretic approach. SIGKDD Explor. Newsl. 10, 90–100.
RESNICK, P. AND SAMI, R. 2007. The influence limiter: provably manipulation-resistant recommender systems. In Pro-
ceedings of the 2007 ACM conference on Recommender systems. RecSys ’07. ACM, New York, NY, USA, 25–32.
RESNICK, P. AND VARIAN, H. R. 1997. Recommender systems. Commun. ACM 40, 3, 56–58.
RICCI, F., ROKACH, L., AND SHAPIRA, B. 2011. Introduction to recommender systems handbook. In Recommender Sys-
tems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. Kantor, Eds. Springer Verlag, 1–35.
RICCI, F., ROKACH, L., SHAPIRA, B., AND KANTOR, P. B., Eds. 2011. Recommender Systems Handbook. Springer.
RUBENS, N., KAPLAN, D., AND SUGIYAMA, M. 2011. Active learning in recommender systems. In Recommender Systems
Handbook, F. Ricci, L. Rokach, B. Shapira, and P. Kantor, Eds. Springer Verlag, 735–767.
SCHEIN, A. I., POPESCUL, A., UNGAR, L. H., AND PENNOCK, D. M. 2002. Methods and metrics for cold-start rec-
ommendations. In SIGIR ’02: Proceedings of the 25th annual international ACM SIGIR conference on Research and
development in information retrieval. ACM, New York, NY, USA, 253–260.
SHANI, G. AND GUNAWARDANA, A. 2010. Evaluating recommendation systems. In Recommender Systems Handbook,
F. Ricci, L. Rokach, and B. Shapira, Eds. Springer Verlag, 257–298.
TIMELY DEVELOPMENT, L. 2008. Netflix prize. http://www.timelydevelopment.com/demos/NetflixPrize.aspx.
WEIMER, M., KARATZOGLOU, A., AND SMOLA, A. 2008. Adaptive collaborative filtering. In RecSys ’08: Proceedings of
the 2008 ACM conference on Recommender systems. ACM, New York, NY, USA, 275–282.
ZHOU, K., YANG, S.-H., AND ZHA, H. 2011. Functional matrix factorizations for cold-start recommendation. In Proceed-
ing of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR
2011, Beijing, China, July 25-29, 2011. 315–324.