
Active Learning Strategies for Rating Elicitation in Collaborative

Filtering: a System-Wide Perspective

MEHDI ELAHI and FRANCESCO RICCI, Free University of Bozen-Bolzano, Bozen-Bolzano, Italy

NEIL RUBENS, University of Electro-Communications, Tokyo, Japan

The accuracy of collaborative-filtering recommender systems largely depends on three factors: the quality of the rating prediction algorithm, and the quantity and quality of the available ratings. While research in the field of recommender systems often concentrates on improving prediction algorithms, even the best algorithms will fail if they are fed poor quality data during training, i.e., garbage in, garbage out. Active learning aims to remedy this problem by focusing on obtaining better quality data that more aptly reflects a user's preferences. However, the traditional evaluation of active learning strategies has two major flaws, which have significant negative ramifications for accurately evaluating the system's performance (prediction error, precision, and quantity of elicited ratings): (1) performance has been evaluated for each user independently (ignoring system-wide improvements), and (2) active learning strategies have been evaluated in isolation from unsolicited user ratings (natural acquisition).

In this paper we show that an elicited rating has effects across the whole system, so a typical user-centric evaluation, which ignores changes in the rating predictions for other users, also ignores these cumulative effects, which may be more influential on the performance of the system as a whole (system-centric). We propose a new evaluation methodology and use it to evaluate some novel and state-of-the-art rating elicitation strategies. We found that the system-wide effectiveness of a rating elicitation strategy depends on the stage of the rating elicitation process and on the evaluation measure (MAE, NDCG, and Precision). In particular, we show that using some common user-centric strategies may actually degrade the overall performance of a system. Finally, we show that the performance of many common active learning strategies changes significantly when they are evaluated concurrently with the natural acquisition of ratings in recommender systems.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, Experimentation, Performance

Additional Key Words and Phrases: Recommender Systems, Active Learning, Rating Elicitation, Cold Start

1. INTRODUCTION

Choosing the right product to consume or purchase is nowadays a challenging problem, due to the growing number of products and the variety of eCommerce services. While an increasing number of choices gives a consumer the opportunity to acquire products satisfying her special personal needs, it may at the same time overwhelm her with too many options [Anderson 2006]. Recommender Systems (RSs) tackle this problem by providing personalized suggestions for digital content, products, or services that better match the user's needs and constraints than the mainstream products [Resnick and Varian 1997; Ricci et al. 2011; Jannach et al. 2010]. In this paper we focus on the collaborative filtering recommendation approach, and on techniques aimed at identifying what information about the user's tastes should be elicited by the system to generate effective recommendations.

1.1. Recommender Systems and Social Networks

A collaborative filtering (CF) recommender system uses ratings for items, provided by a network of users, to recommend items that the target user has not yet considered but will likely enjoy [Koren and Bell 2011; Desrosiers and Karypis 2011]. A collaborative filtering system computes its recommendations by exploiting relationships and similarities between users. These relations are defined by observing how the users access the items managed by the system. For instance, consider the users actively using Amazon or Last.fm. They browse and search for items to buy or to listen to. They read other users' comments and get recommendations computed by the system using the access logs and the ratings of the users' community.

The recommendations computed by a CF system are based on the network structure created by the users accessing the system. For instance, classical neighbor-based CF systems evaluate user-to-user or item-to-item similarities based on the co-rating structure of the users or items. Two users

ACM Transactions on Intelligent Systems and Technology, Vol. 9, No. 4, Article 39, Publication date: March 2010.


are considered similar if they rate items in a correlated way. Analogously, two items are considered similar if the network of users has rated them in a correlated way. The recommendations for an active user are then computed by suggesting the items that have a high average rating in the group of users similar to the active one (user-based CF). In item-based approaches, a user is recommended items that are similar to those she liked in the past, where similar means that users rated the two items similarly.

the two items. Even more novel matrix factorization recommendation techniques [Koren and Bell

2011], which are considered in this paper, are modeling users and items with a vector of abstract

factors that are learned by mining the rating behavior of a network of users. In these factor models,

in addition to the similarity of users with users and items with items, it is also possible to establish

the similarity of users with items, since both of them are represented uniformly with a vector of

factors.
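To make the last point concrete, the following sketch (with made-up three-dimensional factor vectors; the numbers are purely illustrative) shows why user-item similarity is well defined in a factor model: users and items live in the same latent space, so all three kinds of similarity reduce to the same vector operation.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two factor vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-factor vectors, as if learned by a matrix factorization model.
user_u = np.array([0.9, 0.1, 0.4])
item_i = np.array([0.8, 0.2, 0.3])
item_j = np.array([0.1, 0.9, 0.2])

# user-item similarity is computed exactly like user-user or item-item similarity.
print(cosine(user_u, item_i))  # high: the item loads on the same factors as the user
print(cosine(user_u, item_j))  # lower: a different dominant factor
```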

Recommender systems are often integrated into eCommerce applications to suggest items to buy or consume, but RSs are now also frequently used in social networks, i.e., applications primarily designed to support social interactions between their users. Moreover, in many popular social networks, such as Facebook, Myspace, and Google+, several applications have been introduced for eliciting user preferences and characteristics, and then providing recommendations. For instance, in Facebook, the largest social network, there are applications where users can rate friends, movies, photos, or links. Examples of such applications are Rate My Friends (with more than 6,000 monthly active users), Rate my Photo, My Movie Rating, and LinkR. Some of these applications collect user preferences (ratings) only to create a user profile page, but some also use the data to make recommendations. For instance, in Rate My Friends the user is asked to rate her friends; the application then ranks her friends based on the ratings and presents the top-scored users she may be interested in connecting to. LinkR, another example of a recommender system integrated into a social network, recommends a number of links and allows the user to rate them. Moreover, Facebook itself collects ratings, by offering a "like" button on partner sites, and exploits its usage to discover which friends, groups, apps, links, or games a particular user may like.

It is worth noting that all of the applications mentioned above must implement a rating elicitation strategy, i.e., identify items to present to the user in order to collect her ratings. In this paper we propose and evaluate some strategies for accomplishing this task. Hence, social networks can benefit from the techniques introduced here to generate better recommendations for establishing new social relationships, thus improving their core service.

1.2. Collaborative Filtering and Rating Acquisition

The accuracy of CF rating prediction depends on the characteristics of the prediction algorithm; hence, in recent years several variants of CF algorithms have been proposed. [Koren and Bell 2011; Desrosiers and Karypis 2011] provide up-to-date surveys of memory- and model-based methods. In addition to the rating prediction technique, the number, the distribution, and the quality of the ratings known to the system can influence the system's performance. In general, the more informative the available ratings are about the users' preferences, the higher the recommendation accuracy. Therefore, it is important to keep acquiring new and useful ratings from the users, in order to maintain or improve the quality of the recommendations. This is especially true in the cold-start stage, when a new user or a new item is added to the system [Schein et al. 2002; Liu et al. 2011; Zhou et al. 2011; Golbandi et al. 2011].

It is worth noting that RSs usually deal with huge catalogues; e.g., Netflix, the popular American provider of on-demand Internet streaming media, manages almost one million movies. Hence, if the recommender system wants to explicitly ask the user to rate some items, this set must be carefully chosen. First of all, the system should ask for ratings of items that the user has experienced, otherwise no useful information can be acquired. This is not easy, especially when the user is new to the system and there is little knowledge that can be leveraged to predict which items the user actually experienced in the past. In addition, the system should exploit techniques to identify those items that, if rated by the user, would generate rating data that improve the precision of future



recommendations, not only for the target user but for all of the system's users. Informative ratings can provide additional knowledge about the users' preferences, as well as fix errors in the rating prediction model.

1.3. Approach and Goals of this Paper

In this work we focus on understanding the behavior of several rating acquisition strategies, such as "provide your ratings for these top ten movies". The goal of a rating acquisition strategy is to enlarge the set of available data in the way that is optimal for the whole system's performance, by eliciting the most useful ratings from each user. In practice, an RS user interface can be designed so that users browsing the existing items can rate them if they wish, but new ratings can also be acquired by explicitly asking the users. In fact, some RSs ask the users to rate the recommended items, mixing recommendations with preference elicitation. We will show that this approach is feasible but must be used with care, since relying on one single strategy, such as asking the user's opinion only about the items that the system believes she likes, can have a dangerous impact on the system's effectiveness. Hence a careful selection of the elicitation strategy is in order.

In this paper we extend our previous work ([Elahi et al. 2011; Elahi et al. 2011]), where we provided an initial evaluation of active learning rating elicitation strategies in collaborative filtering. Here, in addition to the "pure" strategies, i.e., those implementing a single heuristic, we also consider "partially randomized" ones. Randomized strategies, in addition to asking the (simulated) users to rate the items selected by a "pure" strategy, also ask them to rate some randomly selected items. Randomized strategies can diversify the item list presented to the user. But, more importantly, they make it possible to cope with the non-monotonic improvement of the system effectiveness that we observed during the simulation of certain "pure" strategies. In fact, we discovered (as hypothesized by [Rashid et al. 2002]) that certain strategies, for instance requesting to rate the items with the highest predicted ratings, may generate a system-wide bias and inadvertently increase the system error.
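The partially randomized scheme can be sketched as follows. The function name and the seven-plus-three split are purely illustrative, not the exact parameters used in our experiments:

```python
import random

def partially_randomized_request(pure_ranking, catalog, n=10, n_random=3, seed=42):
    """Build one rating request: take the top items from a pure strategy's
    ranking, then fill a few slots with randomly chosen catalog items to
    diversify the list and limit the bias a single heuristic can induce."""
    rng = random.Random(seed)
    picks = list(pure_ranking[: n - n_random])
    remaining = [item for item in catalog if item not in picks]
    picks.extend(rng.sample(remaining, n_random))
    return picks

catalog = [f"movie_{i}" for i in range(50)]
ranking = catalog  # stand-in for the ranking produced by a pure strategy
request = partially_randomized_request(ranking, catalog)
print(len(request))  # 10 items: 7 from the pure strategy, 3 random
```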

RSs can be evaluated online and offline [Herlocker et al. 2004; Shani and Gunawardana 2010; Cremonesi et al. 2010]. In the first case, one or more RSs are run and experiments on real users are performed. This requires building, or having access to, one or more fully developed RSs with a large user community, which is expensive and time-consuming. Moreover, it is hard to test several algorithms online, such as those proposed here. Therefore, similarly to many previous experimental analyses, we performed offline experiments. We developed a program that simulates the real process of rating elicitation in a community of users (MovieLens and Netflix), the consequent growth of the rating database starting from a relatively small one (cold start), and the system's adaptation (retraining) to the new data. Moreover, in this paper we evaluate the proposed strategies in two scenarios: when the simulated users are confined to rating only the items presented to them by the active learning strategy, and when they can also voluntarily add ratings on their own.

In the experiments performed here we used a state-of-the-art Matrix Factorization rating prediction algorithm [Koren and Bell 2011; Timely Development 2008]. Hence our results can provide useful guidelines for managing real RSs, which nowadays often rely on this technique. In factor models, both users and items are assigned factor vectors of the same size. These vectors are obtained from the user rating matrix with optimization techniques that try to approximate the original rating matrix. Each element of the factor vector assigned to an item reflects how well the item represents a particular latent aspect [Koren and Bell 2011]. For our experiments we employed a gradient descent optimization technique, as proposed by Simon Funk [Funk 2006].
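A minimal sketch of the gradient descent update for such a factor model is shown below. It nudges both factor vectors toward reducing the squared prediction error, with L2 regularization. Unlike Funk's original incremental, one-factor-at-a-time training, this toy version updates all factors at once, and the hyperparameters are illustrative:

```python
import numpy as np

def sgd_matrix_factorization(ratings, n_users, n_items, k=2,
                             lr=0.02, reg=0.02, epochs=500, seed=0):
    """ratings: list of (user_index, item_index, rating) triples."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_users, k))  # user factor vectors
    Q = rng.normal(0, 0.1, (n_items, k))  # item factor vectors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                 # prediction error
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q

data = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 2.0)]
P, Q = sgd_matrix_factorization(data, n_users=2, n_items=2)
print(round(float(P[0] @ Q[0]), 2))  # close to the observed rating 5.0
```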

1.4. Paper Contribution

The main contribution of our research is the introduction and empirical evaluation of a set of rating elicitation strategies for collaborative filtering with respect to their system-wide utility. Some of these strategies are new, and some come from the literature and common practice. An important differentiating aspect of our study is measuring the effect of each strategy on several RS evaluation measures and showing that the best strategy depends on the evaluation measure. Previous works focused only on the rating prediction accuracy (Mean Absolute Error) and on the number of acquired ratings. We analyze those aspects, but in addition we consider the recommendation precision and the goodness of the recommendations' ranking, measured with the normalized discounted cumulative gain (NDCG). These measures are crucial for determining the value of the recommendations [Shani and Gunawardana 2010].
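For reference, NDCG compares the discounted gain of the produced ranking with that of the ideal ordering. A minimal sketch, using the standard log2 discount:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each relevance value is discounted by the
    # log of its 1-based rank, i.e. log2(rank + 1).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# True ratings of three recommended items, in the order they were ranked:
print(round(ndcg([3, 5, 4]), 3))  # imperfect order, so below 1.0
print(ndcg([5, 4, 3]))            # the ideal order gives exactly 1.0
```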

Another major contribution of our work is the analysis of the performance of the elicitation strategies taking into account the size of the rating database. We show that different strategies can improve different aspects of the recommendation quality at different stages of the rating database's growth. We also show that at some stages elicitation strategies may induce a bias in the system and ultimately decrease the recommendation accuracy.

In summary, this paper provides a realistic and comprehensive evaluation of several applicable rating elicitation strategies, offering guidelines and conclusions that can help with their deployment in real RSs.

1.5. Novelty of the Proposed Approach

Rating elicitation has also been tackled in a few previous works [McNee et al. 2003; Rashid et al. 2002; Rashid et al. 2008; Carenini et al. 2003; Jin and Si 2004; Harpale and Yang 2008], which are surveyed in Section 2. But these papers focused on a problem different from the one we consider here: they measured the benefit of the ratings elicited from one user, e.g., in the sign-up stage, for improving the quality of the recommendations for that same user. Conversely, we consider the impact of an elicitation strategy on the overall system behavior, e.g., the prediction accuracy averaged over all the system's users. In other words, we try to identify strategies that elicit from a user ratings that will contribute to the improvement of the system performance for all of the users, and not just for the target user.

Previously conducted evaluations have assumed rather artificial conditions: all the users and items have some ratings from the beginning of the evaluation process, and the system asks the simulated user only for ratings that are present in the data set. In other words, previous studies did not consider the new-item and new-user problems. Moreover, only a few evaluations simulated users with limited knowledge of the items (e.g., [Harpale and Yang 2008]). We generate the initial conditions of the rating data set based on the temporal evolution of the system; hence, in our experiments, new users and new items appear much as they do in real settings. Moreover, the system does not know which items the simulated user has experienced, and may ask for ratings of items that the user cannot provide. This better simulates a realistic scenario in which not all rating requests can be satisfied by a user.

It is also important to note that previous analyses considered a situation where the active learning rating elicitation strategy was the only tool used to collect new ratings from the users. Hence, elicitation strategies were evaluated in isolation from ongoing system usage, where users can freely enter new ratings. We propose a more realistic evaluation setting, where, in addition to the ratings acquired by the elicitation strategies, ratings are also added by users on a voluntary basis. Hence, for the validation experiments, we have also used a simulation process in which active learning is combined with the natural acquisition of the users' ratings.

The rest of the paper is structured as follows. In Section 2 we review related work. In Section 3 we introduce the rating elicitation strategies that we have analyzed. In Section 4 we present the first simulation procedure, designed to more accurately evaluate the system's recommendation performance (MAE, NDCG, and Precision). The results of our experiments are shown in Sections 5 and 6. In Section 7 we present the analysis of the active learning strategies when active learning is mixed with the natural acquisition of user ratings. Finally, in Section 8 we summarize the results of this research and outline directions for future work.

2. RELATED WORK

Active learning in RSs aims at actively acquiring user preference data to improve the output of the RS [Boutilier et al. 2003; Rubens et al. 2011]. Active learning for RSs is a form of preference elicitation


[Bonilla et al. 2010; Pu and Chen 2008; Chen and Pu 2012; Braziunas and Boutilier 2010; Guo and Sanner 2010; Birlutiu et al. 2012], but current research on active learning for recommender systems has focused on collaborative filtering, and in particular on the new-user problem. In this setting it is assumed that a user has not yet rated any items, and the system can actively ask her to rate some items in order to generate recommendations for her. In this survey we focus on AL in collaborative filtering.

In many previous works, which we describe below, the evaluation of a rating elicitation strategy is performed by simulating the interaction with a new user while the system itself is not in a cold-start stage, i.e., it has already acquired many ratings from the users.

Conversely, as mentioned in the Introduction, in our work we simulate the application of several rating elicitation strategies in a more diverse set of scenarios, beyond the typical setting in which the new user has not rated any items while the system already possesses many ratings provided by other users. We consider a more general scenario where the user repeatedly comes back to the system to receive recommendations, while the system has possibly elicited ratings from other users in the meantime. Moreover, we simulate a scenario where the system initially has only a small overall knowledge of the users' preferences, i.e., a small set of ratings to train the prediction model; then, step by step, as the users come to the system, new ratings are elicited. Another important difference, compared to the state of the art, is that we consider the impact of an elicitation strategy on the overall system behavior. This aims to measure how the ratings elicited from one user can contribute to the improvement of the system performance even when making recommendations for other users.

Finally, we have also investigated a more realistic evaluation scenario where active learning is combined with the natural addition of ratings, i.e., some ratings are freely added by the users without being requested. This scenario has not been considered previously.

2.1. Rating Elicitation at Sign Up

The first research works on AL for recommender systems were motivated by the need to implement more effective sign-up processes, and used the classical neighbor-based approaches to collaborative filtering [Desrosiers and Karypis 2011]. In [Rashid et al. 2002] the authors focus explicitly on the sign-up process, i.e., when a new user starts using a collaborative filtering recommender system and must rate some items in order to give the system some initial information about her preferences. [Rashid et al. 2002] considered six techniques for explicitly selecting the items a user is asked to rate: entropy, where items with the largest rating entropy are preferred; random request; popularity, measured as the number of ratings for an item, so that the most frequently rated items are selected; log(popularity) ∗ entropy, where items that are both popular and have diverse ratings are selected; and finally item-item personalized, where random items are proposed until the user rates one, and then a recommender predicts which items the user is likely to have seen, based on the ratings she has already provided, and she is asked to rate those items. Finally, the behavior of an item-to-item collaborative filtering system [Desrosiers and Karypis 2011] was evaluated with respect to MAE in an offline setting that simulated the sign-up process. The process was repeated multiple times and the results averaged over all the test users. In that scenario the log(popularity) ∗ entropy strategy was found to be the best, and for this reason we have also evaluated log(popularity) ∗ entropy in our study. But it is worth noting that their result cannot be automatically extended to the scenario we consider in this work, that is, the evolution of the global system performance under an active learning strategy applied to all the users. In fact, as mentioned above, in our experiments we simulate the simultaneous acquisition of ratings from all the users, by asking each user in turn to rate some items, and we repeat this process several times. This simulates the long-term usage of a recommender system, where users utilize the system repeatedly to get new recommendations, and the ratings provided by one user are also used to generate better recommendations for other users (system performance).
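The three non-personalized scores discussed above can be sketched in a few lines. The toy data below is invented; a real implementation would score the whole catalog:

```python
import math
from collections import Counter

def entropy(values):
    # Shannon entropy (bits) of the observed rating distribution of an item.
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def strategy_scores(item_ratings):
    """item_ratings: {item: list of ratings observed so far}."""
    scores = {}
    for item, vals in item_ratings.items():
        pop, ent = len(vals), entropy(vals)
        scores[item] = {"popularity": pop,
                        "entropy": ent,
                        "log_pop_entropy": math.log(pop) * ent}
    return scores

data = {"a": [5, 5, 5, 5],   # popular but uncontroversial
        "b": [1, 5, 1, 5],   # popular and divisive
        "c": [3, 4]}         # rarely rated
scores = strategy_scores(data)
print(max(scores, key=lambda i: scores[i]["log_pop_entropy"]))  # "b"
```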


2.2. Conversational Approaches

Subsequently, researchers understood that, in order to design more effective rating elicitation strategies, the system should be conversational: it should better motivate its rating requests, focusing on the user's preferences, and the user should be able to enter her ratings more freely, even without being explicitly asked.

In [Carenini et al. 2003] a user-focused approach is considered. The authors propose a set of techniques to intelligently select items to rate when the user is particularly motivated to provide such information. They present a conversational and collaborative interaction model that elicits ratings in such a way that the benefit of providing them is clear to the user, thus increasing her motivation. Item-focused techniques, which elicit ratings to improve the rating prediction for a specific item, are also proposed. Popularity, entropy, and their combination are tested, as well as their item-focused modifications. The item-focused techniques differ from the classical ones in that popularity and entropy are not computed on the whole rating matrix, but only on the sub-matrix of the user's neighbors that have rated the item whose prediction accuracy is being improved. The results show that the item-focused strategies are consistently better than the unfocused ones.

[McNee et al. 2003] address an even more general problem, aiming to understand which among the following methods is the best solution for rating elicitation in the start-up phase: a) allowing a user to enter items and their ratings freely; b) proposing a list of items to the user and asking her to rate them; or c) combining the two approaches. They compare three interfaces for eliciting information from new users, implementing the above-mentioned approaches. They performed an online experiment, which showed that the two pure approaches produced more accurate user models than the mixed model with respect to MAE.

2.3. Bayesian Approaches

In another group of approaches, AL is modeled as a Bayesian reasoning process. [Harpale and Yang 2008] developed such an approach, extending and criticizing an earlier one introduced in [Jin and Si 2004]. In fact, in [Jin and Si 2004], as is rather common in AL techniques and evaluation studies, the unrealistic assumption is made that a user can provide a rating for any presented item. Conversely, [Harpale and Yang 2008] propose a revised Bayesian item selection approach, which does not make this assumption, and introduces an estimate of the probability that a user has consumed an item in the past and is able to rate it. Their results show that personalized Bayesian selection outperforms Bayesian selection and the random strategy with respect to MAE. Their simulation setting is similar to that used in [Rashid et al. 2002]; hence, for the same reason, their results are not directly comparable with ours. There are other important differences between their experiments and ours: their strategies elicit only one rating per request, while we assume that the system makes many rating requests at the same time; they compare the proposed approach only with the random strategy, while we study the performance of several strategies; they do not consider the new-user problem, since in their simulations all the users have at least three ratings at the beginning of the experiment, whereas in our experiments there are users with no ratings at all in the initial stage; and they use a different rating prediction algorithm (Bayesian vs. Matrix Factorization). All these differences make the two sets of experiments, and their conclusions, hard to compare. Moreover, in their simulations they assume that the system knows a larger number of ratings than in our experiments.

2.4. Decision Trees Based Methods

Many recent approaches to rating elicitation in RSs identify the items the user should be asked to rate as those providing the most useful knowledge for reducing the prediction error of the recommender system. Many of these approaches exploit decision trees to model the conditional selection of an item to be rated, based on the ratings the user provided for the previously presented items. In [Rashid et al. 2008] the authors extend their former work [Rashid et al. 2002] with a rating elicitation approach based on decision trees. The proposed technique is called IGCN,


and builds a tree where each node is labelled with a particular item that the user is asked to rate. According to the user's rating for that item, a different branch is followed and a new node, labelled with another item to rate, is reached. To build the decision tree, they first cluster the users into groups with similar profiles, assigning each user to one of these clusters. The tree is then incrementally extended by selecting, for each node, the item that provides the highest information gain for correctly classifying the user into the right cluster. Hence, the items whose ratings are most important for this classification are selected earlier in the tree. They also considered two alternative strategies. The first is entropy0, which differs from the classical entropy strategy mentioned above in that the missing value is considered as a possible rating (category 0). The second is called HELF, where the items with the largest harmonic mean of entropy and popularity are selected. They conducted offline and online simulations, and concluded that IGCN and entropy0 perform the best with respect to MAE.
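The two alternative scores can be sketched as follows. The normalizations inside HELF (dividing log-popularity by the log of the user count, and entropy by its maximum on a 5-point scale) are one plausible reading of the description, not necessarily the exact formulation of the original paper:

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def entropy0(item_ratings, n_users):
    # entropy0: treat "missing" as an extra rating category (0), so items
    # rated by few users are not rewarded merely for balanced known ratings.
    missing = n_users - len(item_ratings)
    return entropy(list(item_ratings) + [0] * missing)

def helf(item_ratings, n_users):
    # HELF: harmonic mean of (normalized) log-popularity and entropy.
    pop = len(item_ratings)
    if pop <= 1:
        return 0.0
    p = math.log(pop) / math.log(n_users)
    e = entropy(item_ratings) / math.log2(5)  # max entropy of a 5-point scale
    return 0.0 if p + e == 0 else 2 * p * e / (p + e)

# A divisive item rated by 50 of 100 users beats one rated by only 2 users:
print(helf([1, 5] * 25, 100) > helf([1, 5], 100))  # True
```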

They evaluate the improvement of the rating prediction accuracy for the particular user whose ratings are elicited, while, as mentioned above, we measure the overall system-wide effectiveness of a rating elicitation strategy. Moreover, in their experiments they use a very different rating prediction algorithm, i.e., a standard neighbor-based approach [Desrosiers and Karypis 2011], while we use matrix factorization [Koren and Bell 2011].

In a more recent work [Golbandi et al. 2010], three strategies for rating elicitation in collaborative filtering are proposed. In the first method, GreedyExtend, the items that minimize the root mean square error (RMSE) of the rating prediction (on the training set) are selected. In the Var method the items with the largest √popularity ∗ variance are selected, i.e., items that have many ratings in the training set and with diverse values. Finally, in the Coverage method, the items with the largest coverage are selected, where the coverage of an item is defined as the total number of users who co-rated both the selected item and any other item. They evaluated the performance of these strategies and compared them with previously proposed ones (popularity, entropy, entropy0, HELF, and random). In their experiments, every strategy ranks the items and picks the top 200 to be presented to new users. Then, taking the users' ratings for these items as the training set, they predict the ratings in the Netflix test set for every single user and compute the RMSE. They show that GreedyExtend outperforms the other strategies. In fact, this strategy is quite effective: it obtains, after having acquired just 10 ratings, the same error rate that the second best strategy (Var) achieves after 26 ratings. However, despite this remarkable achievement, GreedyExtend is static, i.e., it selects the items without considering the ratings previously entered by the user. Here too the authors focus on the new-user problem. In our work we do not make such an assumption, and we propose and evaluate strategies that can be used at all stages, not only at start-up.
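A sketch of the Var and Coverage scores follows. The scope of the square root in the Var formula is ambiguous in the extracted text, so this version applies it to popularity only; the coverage computation is likewise a straightforward reading of the description, not the paper's exact code:

```python
import math
from statistics import pvariance

def var_score(item_ratings):
    # Var: combine popularity and rating variance so that frequently rated,
    # divisive items score highest (here: sqrt(popularity) * variance).
    pop = len(item_ratings)
    return math.sqrt(pop) * pvariance(item_ratings) if pop >= 2 else 0.0

def coverage_score(user_profiles, item):
    # Coverage of an item: number of users who rated it together with at
    # least one other item (i.e., co-rated it with some other item).
    return sum(1 for profile in user_profiles.values()
               if item in profile and len(profile) > 1)

profiles = {"u1": {"a": 5, "b": 3}, "u2": {"a": 4}, "u3": {"b": 2, "c": 1}}
print(var_score([1, 5, 1, 5]))        # sqrt(4) * 4.0 = 8.0
print(coverage_score(profiles, "a"))  # only u1 co-rates "a" with another item
```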

Even more recently, in [Golbandi et al. 2011] the same authors of the paper described above have developed an adaptive version of their approach. Here, the items selected for a user depend on the previous ratings she has provided. They also propose a technique based on decision trees where at each node there is a test based on a particular item (movie). The node divides the users into three groups based on the rating of the user for that movie: lovers, who rated the movie high; haters, who rated the movie low; and unknowns, who did not rate the movie. In order to build the decision tree, at each node the movie whose rating knowledge produces the largest reduction of the RMSE is selected. The rating prediction is computed (approximated) as the weighted average of the ratings given by the users that belong to that node. They have evaluated their approach using the Netflix training data set (100M ratings) to construct the trees, and evaluated the performance of the proposed strategy on the Netflix test set (2.8M ratings). The proposed strategy has shown a significant reduction of RMSE compared with the GreedyExtend, Var, and HELF strategies. They were able to achieve with only 6 ratings the same accuracy that is achieved by the next best strategy, i.e., GreedyExtend, after acquiring over 20 ratings. Moreover, that accuracy is obtained by the Var and HELF strategies only after acquiring more than 30 ratings.
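The node-splitting idea just described can be illustrated with a small sketch. This is a simplified, hypothetical rendering (the dictionary-based data layout, the lovers threshold of 4, and the exhaustive group-mean error computation are our assumptions for the example; [Golbandi et al. 2011] use a far more efficient error approximation on the Netflix data):

```python
def split_error(ratings, movie):
    """Squared error left after splitting users on `movie` into lovers
    (rating >= 4), haters (rating < 4) and unknowns, when each group's
    remaining ratings are predicted by the group mean.
    `ratings` maps user -> {item: rating}."""
    groups = {"lovers": [], "haters": [], "unknowns": []}
    for prefs in ratings.values():
        r = prefs.get(movie)
        key = "unknowns" if r is None else ("lovers" if r >= 4 else "haters")
        groups[key].append(prefs)
    err = 0.0
    for members in groups.values():
        # pool the group's ratings for all items other than the split movie
        vals = [r for prefs in members
                for item, r in prefs.items() if item != movie]
        if vals:
            mean = sum(vals) / len(vals)
            err += sum((v - mean) ** 2 for v in vals)
    return err

def best_split(ratings, candidate_movies):
    """Pick the movie whose split yields the smallest error."""
    return min(candidate_movies, key=lambda m: split_error(ratings, m))
```

A movie that cleanly separates users with similar tastes yields a low split error and would be placed near the root of the tree.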

39:8 M. Elahi et al.

It should be noted that their results are again rather difficult to compare with ours. They simulate a scenario where the system is trained and the decision tree is constructed from a large training dataset. So they assume a large initial knowledge of the system. Then, they focus on completely new users, i.e., those without a single rating in the train set. In contrast, in our work we assume that the system has a very limited global knowledge of the users. In our experiments this is simulated by giving the system only 2% of the rating dataset. Moreover, we analyze the system dynamics as more users are repeatedly requested to enter their ratings.

2.5. Time-Dependent Evolution of a Recommender System

Finally, we want to mention an interesting related work that does not address the active learning process of rating elicitation but studies the time-dependent evolution of a recommender system as new ratings are acquired. In [Burke 2010] the authors analyze the temporal properties of a standard user-based collaborative filtering algorithm [Herlocker et al. 1999] and of Influence Limiter [Resnick and Sami 2007], a collaborative filtering algorithm developed for counteracting profile injection attacks by considering the time at which a user has rated an item.

They evaluate the accuracy of these two prediction algorithms while the users are rating items and the database is growing. This is radically different from the typical evaluations that we mentioned above, where the rating dataset is decomposed into the training and testing sets without considering the timestamps of the ratings. In [Burke 2010] it is argued that considering the time at which the ratings were added to the system gives a better picture of the real user experience during the interactions with the system in terms of recommendation accuracy. They conducted their analysis on the large Movielens dataset (1M ratings), and discovered that when using Influence Limiter, MAE does not decrease with the addition of more data, indicating that the algorithm is not effective in terms of accuracy improvement. For the standard user-based collaborative filtering algorithm they observed the presence of two time segments: the start-up period, until day 70, with MAE dropping gradually, and the remaining period, where MAE was dropping much more slowly.

This analysis is complementary to our study. That work analyzes the performance of a recommendation algorithm while the users are adding their ratings in a natural manner, i.e., without being explicitly requested to rate items selected by an active learning strategy. We have investigated the situation where, in addition to this natural stream of ratings coming from the users, the system selectively chooses additional items and presents them to the users to get their ratings.

3. ELICITATION STRATEGIES

A rating dataset R is an n × m matrix of real values (ratings) with possible null entries. The variable r_ui denotes the entry of the matrix in position (u, i), and contains the rating assigned by user u to item i. r_ui could store a null value, representing the fact that the system does not know the opinion of the user on that item. In the Movielens and Netflix datasets the rating values are integers between 1 and 5 (inclusive).

A rating elicitation strategy S is a function S(u, N, K, U_u) = L which returns a list of items L = {i_1, . . . , i_M}, M ≤ N, whose ratings should be elicited from the user u. Here N is the maximum number of items that the strategy should return, and K is the dataset of known ratings, i.e., the ratings (of all the users) that have already been acquired by the RS. K is also an n × m matrix containing entries with real or null values; the not null entries represent the knowledge of the system at a certain point of the RS evolution. Finally, U_u is the set of items whose ratings have not yet been elicited from u, and which are hence potentially interesting. The elicitation strategy enforces that L ⊂ U_u and will not repeatedly ask a user to rate the same item; i.e., after the items in L are shown to a user they are removed from U_u.

Every elicitation strategy analyzes the dataset of known ratings K and scores the items in U_u. If the strategy can score at least N different items, then the N items with the highest score are returned. Otherwise a smaller number of items M ≤ N is returned. It is important to note that the user may not have experienced the items whose ratings are requested; in this case the system will not increase the number of known ratings. In practice, following one strategy may result in collecting a larger number of ratings, while following another one may result in fewer but more informative ratings. These two properties (rating quantity & quality) play a fundamental role in rating elicitation.
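Concretely, a strategy of this kind can be implemented as a scoring function over the candidate items, with the top-N items returned. The sketch below is our own illustration of the definition above (the NaN-encoded rating matrix and the `score_fn` callback are conventions we introduce for the example, not notation from the paper):

```python
import numpy as np

def elicit(score_fn, u, N, K, U_u):
    """Generic strategy S(u, N, K, U_u) = L: score each candidate item
    in U_u with score_fn and return (at most) the N best ones.
    K is an n x m rating matrix with np.nan marking null entries."""
    scored = [(i, score_fn(u, i, K)) for i in U_u]
    scored = [(i, s) for i, s in scored if s is not None]  # drop unscorable items
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:N]]

# Example scoring function: the popularity of item i in K.
def popularity_score(u, i, K):
    return int(np.sum(~np.isnan(K[:, i])))
```

For instance, `elicit(popularity_score, u, 10, K, U_u)` returns the 10 most-rated candidate items for user u; if fewer than 10 items can be scored, a shorter list is returned, matching the M ≤ N case.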

Active Learning Strategies for Rating Elicitation in Collaborative Filtering 39:9

3.1. Individual Strategies

We considered two types of strategies: pure and partially randomized. The first ones implement a unique heuristic, whereas the second type hybridizes a pure strategy by adding some requests for randomly chosen items whose ratings are still unknown to the system. As we mentioned in the introduction, these strategies add some diversity to the system requests and, as we will show later, can cope with an observed problem of the pure strategies, which may in some cases increase the system error.

The pure strategies that we have considered are:

— Popularity: for all the users the score for item i ∈ U_u is equal to the number of not null ratings for i contained in K, i.e., the number of known ratings for the item i. This strategy ranks the items according to the popularity score and then selects the top N items. Note that this strategy is not personalized, i.e., the same N items are proposed to be rated by every user. The rationale of this strategy is that more popular items are more likely to be known by the user, and hence it is more likely that a request for such a rating will increase the size of the rating database.
— log(popularity) * entropy: the score for the item i ∈ U_u is computed by multiplying the logarithm of the popularity of i with the entropy of the ratings for i in K. Also in this case, as for any strategy, the top N items according to the computed score are proposed to be rated by the user. This strategy tries to combine the effect of the popularity score, which is discussed above, with the heuristic that favors items with more diverse ratings (larger entropy), which may provide more useful (discriminative) information about the user's preferences [Carenini et al. 2003; Rashid et al. 2002].

— Binary Prediction: the matrix K is transformed into a matrix B with the same number of rows and columns, by mapping null entries in K to 0, and not null entries to 1. Hence, the matrix B models only whether a user rated (b_ui = 1) or not (b_ui = 0) an item, regardless of its value [Koren 2008]. A factor model, similar to what is done for standard rating prediction, is built using the matrix B as training data, to compute the predictions for the entries in B that are 0 [Koren and Bell 2011]. In this case predictions are numbers between 0 and 1. The larger the predicted value for the entry b_ui, the larger the predicted probability that the user u has consumed the item i, and hence may be able to rate it. Finally, the score for the item i ∈ U_u corresponds to the predicted value for b_ui. Hence, by selecting the top N items with the highest score this strategy tries to select the items that the user has most likely experienced, in order to maximize the likelihood that the user can provide the requested ratings. In that sense it is similar to the popularity strategy, but it tries to make a better prediction of which items the user can rate by exploiting the knowledge of the items the user has rated in the past. Note that the better the predictions of b_ui for the items in U_u, the larger the number of ratings that this strategy can acquire.

— Highest Predicted: a rating prediction r̂_ui, based on the ratings in K, is computed for all the items i ∈ U_u, and the score for i is this predicted value r̂_ui. Then, the top N items according to this score are selected. The idea is that the items with the highest predicted ratings are supposed to be the items that the user likes the most. Hence, it could also be more likely that the user has experienced these items. Moreover, their ratings could also reveal important information on what the user likes. We also note that this is the default strategy for RSs, i.e., enabling the user to rate the recommendations.
— Lowest Predicted: uses the opposite heuristic compared to highest predicted: for all the items i ∈ U_u the prediction r̂_ui is computed, but then the score for i is Max_r − r̂_ui, where Max_r is the maximum rating value (e.g., 5). This ensures that the items with the lowest predicted ratings r̂_ui will get the highest score and therefore will be selected for elicitation. Lowest predicted items are likely to reveal what the user dislikes, but are also likely to elicit few ratings, since users tend not to rate items that they do not like, as reflected by the distributions of the ratings voluntarily provided by the users [Marlin et al. 2011].

— Highest and Lowest Predicted: for all the items i ∈ U_u a prediction r̂_ui is computed. The score for an item is |(Max_r + Min_r)/2 − r̂_ui|, where Min_r is the minimum rating value (e.g., 1). This score is simply the distance of the predicted rating of i from the midpoint of the rating scale. Hence, this strategy selects the items with extreme ratings, i.e., the items that the user either hates or loves.

— Random: the score for an item i ∈ U_u is a random integer from 1 to 5. Hence the top N items in U_u according to this score are simply randomly chosen. This is a baseline strategy, used for comparison.
— Variance: the score for the item i ∈ U_u is equal to the variance of its ratings in the dataset K. Hence this strategy selects the items in U_u that have been rated in a more diverse way by the users. This is a representative of the strategies that try to collect more useful ratings, assuming that the opinions of the users on items with more diverse ratings are more useful for the generation of correct recommendations.
— Voting: the score for the item i is the number of votes given by a committee of strategies including popularity, variance, entropy, highest-lowest predicted, binary prediction, and random. Each of these strategies produces its top 10 candidates for rating elicitation, and then the items appearing more often in these lists are selected. This strategy depends on the selected voting strategies. We have also included the random strategy so as to impose an exploratory behavior.

Finally, we would like to note that we have also evaluated other strategies: entropy and log(pop) ∗ variance. But, since their observed behaviors are very similar to those of some of the previously mentioned strategies, we did not include them.
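To make the heuristics above concrete, the item-scoring functions for some of the pure strategies can be sketched as follows. This is an illustrative sketch under our own conventions (a NumPy rating matrix with np.nan for null entries); the prediction-based strategies would additionally need a trained factor model to supply the predictions r̂_ui:

```python
import numpy as np

def popularity(K, i):
    """Number of known (not null) ratings for item i."""
    return int(np.sum(~np.isnan(K[:, i])))

def entropy(K, i):
    """Shannon entropy of the known rating values of item i."""
    vals = K[:, i][~np.isnan(K[:, i])]
    if vals.size == 0:
        return 0.0
    _, counts = np.unique(vals, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def log_pop_entropy(K, i):
    """log(popularity) * entropy score."""
    pop = popularity(K, i)
    return np.log(pop) * entropy(K, i) if pop > 0 else 0.0

def variance(K, i):
    """Variance of the known rating values of item i."""
    vals = K[:, i][~np.isnan(K[:, i])]
    return float(np.var(vals)) if vals.size else 0.0

def highest_lowest_predicted(r_hat, min_r=1, max_r=5):
    """Distance of a predicted rating from the rating-scale midpoint."""
    return abs((max_r + min_r) / 2 - r_hat)
```

An item rated by everyone with the same value has high popularity but zero entropy and variance, which is exactly why the combined log(popularity) * entropy score discards it.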

3.2. Partially Randomized Strategies

A pure strategy may not be able to return the requested number of items. For instance, there are cases where no rating predictions can be computed by the RS for the user u. This happens, for instance, when u is a new user and none of his ratings are known. Hence, in this situation the highest predicted strategy is not able to score any of the items. In this case the randomized version of the strategy can generate purely random items for the user to rate.

A partially randomized strategy modifies the list of items returned by a pure strategy by introducing some random items. As we mentioned in the introduction, the partially randomized strategies have been introduced to cope with some problems of the pure strategies (see Section 5). More precisely, the randomized version Ran of the strategy S with randomness p ∈ [0, 1] is a function Ran(S(u, N, K, U_u), p) returning a new list of items L′ computed as follows:

(1) L = S(u, N, K, U_u) is obtained.
(2) If L is an empty list, i.e., the strategy S for some reason could not generate the elicitation list, then L′ is computed by taking N random items from U_u.
(3) If |L| < N, then L′ = L ∪ {i_1, . . . , i_{N−|L|}}, where i_j is a random item in U_u.
(4) If |L| = N, then L′ = {l_1, . . . , l_M, i_{M+1}, . . . , i_N}, where l_j is a random item in L, M = ⌈N ∗ (1 − p)⌉, and i_j is a random item in U_u.
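A minimal sketch of this randomization wrapper follows. It is our own illustration of the four cases above (how ties are broken and how the M retained items are sampled from L are assumptions consistent with, but not fixed by, the definition):

```python
import math
import random

def randomized(S, u, N, K, U_u, p):
    """Partially randomized strategy Ran(S(u, N, K, U_u), p):
    mixes the pure strategy's list with random unrated items."""
    L = S(u, N, K, U_u)
    pool = [i for i in U_u if i not in L]   # random candidates outside L
    if not L:                               # case (2): S returned nothing
        return random.sample(pool, min(N, len(pool)))
    if len(L) < N:                          # case (3): pad with random items
        return L + random.sample(pool, min(N - len(L), len(pool)))
    M = math.ceil(N * (1 - p))              # case (4): keep M items from L
    kept = random.sample(L, M)
    return kept + random.sample(pool, min(N - M, len(pool)))
```

For example, with p = 0.2 and N = 10, M = ⌈8⌉ = 8 items come from the pure strategy and the remaining 2 are random.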

4. EVALUATION APPROACH

In order to study the effects of the considered elicitation strategies we set up the following simulation procedure. The goal is to simulate the influence of elicitation strategies on the evolution of an RS's performance. To achieve this, we partition all the available (not null) ratings in R into three different matrices with the same number of rows and columns as R:

— K: contains the ratings that are considered to be known by the system at a certain point in time.
— X: contains the ratings that are considered to be known by the users but not by the system. These ratings are incrementally elicited, i.e., they are transferred into K if the system asks the (simulated) users for them.
— T: contains a portion of the ratings that are known by the users but are withheld from X for evaluating the elicitation strategies, i.e., to estimate the evaluation measures (defined later).

In Figure 1 (b) we illustrate graphically how the partition of the available ratings in a data set could look. As defined in the previous section, U_u is the set of items whose ratings are not known to the system and may therefore be selected by the elicitation strategies. That means that k_ui has a null value and the system has not yet asked u for it. In Figure 1 (b) these are the items that for a certain user have ratings that are not marked with grey boxes. In this setting, a request to rate an item, which is identified by a strategy S, may end up with a new (not null) rating k_ui inserted in K, if the user has experienced the item i, i.e., if x_ui is not null, or in no action, if x_ui has a null value in the matrix X. The first case corresponds to the situation where the item is marked with a black box for user u in Figure 1 (b). In any case, the system will remove the item i from U_u, so as to avoid asking the user to rate the same item again.

[Figure 1: two panels, (a) User-centered Active Learning and (b) System-centered Active Learning, showing users × items rating matrices. Legend: rating values (proportional to size); ratings known by the system (K); ratings known by the user but not the system (X), which the system can elicit; ratings used for evaluation (T).]

Fig. 1. Comparison of the ratings data configurations used for evaluating user-centered and system-centered active learning strategies.

We will discuss later how the simulation is initialized, i.e., how the matrices K, X and T are built from the full rating dataset R. In any case, these three matrices partition the full dataset R: if r_ui has a not null value, then exactly one of k_ui, x_ui and t_ui is assigned that value, i.e., only one of these entries is not null.

The test of a strategy S proceeds in the following way:

(1) The not null ratings in R are partitioned into the three matrices K, X, T.
(2) MAE, Precision and NDCG are measured on T, training the rating prediction model on K.
(3) For each user u:
(a) Only the first time that this step is executed, U_u, the candidate set of user u, is initialized to all the items i with a null value k_ui in K.
(b) Using strategy S (pure or randomized) a set of items L = S(u, N, K, U_u) is computed.
(c) The set L_e, containing only the items in L that have a not null rating in X, is created.
(d) The ratings for the items in L_e, as found in X, are assigned to the corresponding entries in K.
(e) The items in L are removed from U_u (U_u = U_u \ L) and their ratings are removed from X.
(4) MAE, Precision and NDCG are measured on T, and the prediction model is re-trained on the new set of ratings contained in K.
(5) Steps 3-4 (one iteration) are repeated I times.
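The simulation procedure above can be condensed into the following sketch. This is our own reading of the steps (the NaN-encoded matrices and the pluggable `train`/`evaluate` callbacks are conventions introduced for the sketch; in the paper's experiments `train` fits an SVD factor model and `evaluate` computes MAE, Precision and NDCG on T):

```python
import numpy as np

def simulate(K, X, strategy, train, evaluate, iterations, N=10):
    """System-wide active learning simulation: at every iteration each
    user is asked to rate N items chosen by `strategy`; the ratings the
    simulated user knows (not NaN in X) move into K, then the model is
    retrained and evaluated. Returns the evaluation history."""
    n_users, n_items = K.shape
    # candidate sets U_u: items not yet known to the system for user u
    U = {u: {i for i in range(n_items) if np.isnan(K[u, i])}
         for u in range(n_users)}
    history = [evaluate(train(K))]
    for _ in range(iterations):
        for u in range(n_users):
            L = strategy(u, N, K, U[u])
            for i in L:
                if not np.isnan(X[u, i]):    # user can provide this rating
                    K[u, i] = X[u, i]
                    X[u, i] = np.nan
            U[u] -= set(L)                   # never ask for the same item twice
        history.append(evaluate(train(K)))
    return history
```

Note that a rating request for an item the simulated user does not know (NaN in X) produces no new rating, which is exactly how the quantity/quality trade-off of the strategies manifests itself in the simulation.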

It is important to note here the peculiarity of this evaluation strategy, which has been mentioned already in Section 2. Traditionally, the evaluation of active learning strategies has been user-centered, i.e., the usefulness of an elicited rating was judged based on the improvement in the user's prediction error. This is illustrated in Figure 1 (a). In this scenario the system is supposed to have a large number of ratings from several users; focusing on a new user (the first one in Figure 1 (a)), it first elicits from this new user the ratings that are in X, and then the system predictions for this user are evaluated on the test set T. Hence these traditional evaluations focused on the new user problem and measured how the ratings elicited from a new user may help the system to generate good recommendations for this particular user. We note that a rating elicited from a user may improve not only the rating predictions for that user, but also the predictions for the other users, which is what we are evaluating in our experiments, as graphically illustrated in Figure 1 (b). To illustrate this point, let us consider an extreme example in which a new item is added to the system. The traditional user-centered AL strategy, when trying to identify the items that a particular user u should rate, may ignore obtaining his rating for that new item. In fact, this item has not been rated by any other user and therefore its ratings cannot contribute to improving the rating predictions for u. However, the rating of u for the new item would allow the system to bootstrap the predictions for the rest of the users, and hence from the system's perspective the elicited rating is indeed very informative.

The MovieLens [Miller et al. 2003] and Netflix rating databases were used for our experiments. Movielens consists of 100,000 ratings from 943 users on 1682 movies. From the full Netflix data set, which contains 1,000,000 ratings, we extracted the first 100,000 ratings that were entered into the system. They come from 1491 users on 2380 items, so this sample of Netflix data is 2.24 times sparser than the Movielens data.

We also performed some experiments with the larger versions of both the Movielens and Netflix datasets (1,000,000 ratings) and obtained very similar results [Elahi et al. 2011]. However, using the full set of Netflix data required much longer times to perform our experiments, since we train and test a rating prediction model at each iteration, i.e., every time we add to K new ratings elicited from the simulated users. Having observed a very similar performance in some initial experiments, we focused on the smaller data sets to be able to run more experiments.

When deciding how to split the available data into the three matrices K, X and T, an obvious choice is to follow the actual time evolution of the dataset, i.e., to insert in K the first ratings acquired by the system, then to use a second temporal segment to populate X, and finally to use the remaining ratings for T. An approach that follows this idea is detailed in Section 7.

But it is not sufficient to test the performance of the proposed strategies for a particular evolution of the rating dataset. Since we want to study the evolution of a rating data set under the application of a new strategy, we cannot test it only against the temporal distribution of the data that was generated by a particular (unknown) previously used elicitation strategy. Hence we first followed the approach also used in [Harpale and Yang 2008] of randomly splitting the rating data, but unlike [Harpale and Yang 2008] we generated several random splits of the ratings into K, X and T. This allows us to generate rating configurations where there are users and items that initially have no ratings in the known dataset K. We believe that this approach provided us with a very realistic experimental setup, letting us address both the new user and the new item problems [Ricci et al. 2011].

Finally, for both data sets the experiments were conducted by partitioning (randomly) the 100,000 not null ratings of R in the following manner: 2,000 ratings in K (i.e., very limited knowledge at the beginning), 68,000 ratings in X, and 30,000 ratings in T. Moreover, |L| = 10, which means that at each iteration the system asks a user for his opinion on at most 10 items. The number of iterations was set to I = 170, since after that stage almost all the ratings have been acquired and the system performance does not change anymore. Moreover, the number of factors in the SVD prediction model was set to 16, which enabled the system to obtain a very good prediction accuracy, not very different from configurations using hundreds of factors, as is shown in [Koren and Bell 2011]. Note that, since the factor model is trained at each iteration and for each strategy, learning the factor model is the major computational bottleneck of the conducted experiments. For this reason we did not use a very large number of factors. Moreover, in these experiments we wanted to compare the system performance under the application of several strategies; hence, the key measure is the relative performance of the system rather than its absolute value. All the experiments were performed 5 times and the results presented in the following section are obtained by averaging these five repetitions.

We considered three evaluation measures: mean absolute error (MAE), precision, and normalized discounted cumulative gain (NDCG) [Herlocker et al. 2004; Shani and Gunawardana 2010; Manning 2008]. For computing precision we extracted, for each user, the top 10 recommended items (whose ratings also appear in T) and considered as relevant the items with true ratings equal to 4 or 5.
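This per-user precision computation can be sketched as follows (a minimal illustration; the dictionary-based test-set lookup and the `threshold` parameter are conventions of our own):

```python
def precision_at_k(ranked_items, test_ratings, k=10, threshold=4):
    """Fraction of the top-k recommended items whose true rating in
    the test set T is at least `threshold` (i.e., 4 or 5 stars)."""
    top = ranked_items[:k]
    relevant = sum(1 for i in top if test_ratings.get(i, 0) >= threshold)
    return relevant / len(top)
```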


Discounted cumulative gain (DCG) is a measure originally used to evaluate the effectiveness of information retrieval systems [Järvelin and Kekäläinen 2002], and is also used for evaluating collaborative filtering RSs [Weimer et al. 2008; Liu and Yang 2008]. In RSs the relevance is measured by the rating value of the item in the predicted recommendation list. Assume that the recommendations for u are sorted according to the predicted rating values; then DCG_u is defined as:

DCG_u = Σ_{i=1}^{N} r_u^i / log_2(i + 1)    (1)

where r_u^i is the true rating (as found in T) for the item ranked in position i for user u, and N is the length of the recommendation list.

The normalized discounted cumulative gain for user u is then calculated in the following way:

NDCG_u = DCG_u / IDCG_u    (2)

where IDCG_u stands for the maximum possible value of DCG_u, which would be obtained if the recommended items were ordered by decreasing value of their true ratings. We also measured the overall average normalized discounted cumulative gain NDCG by averaging NDCG_u over the full population of users.
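Equations (1) and (2) can be implemented directly, as in this minimal sketch of our own (the input is the list of true ratings of the recommended items, already ordered by predicted rating):

```python
import math

def dcg(true_ratings):
    """Eq. (1): DCG of true ratings ordered by predicted rank
    (position i is 1-based, so the discount is log2(i + 1))."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(true_ratings))

def ndcg(true_ratings):
    """Eq. (2): DCG of the predicted order divided by the ideal DCG,
    i.e., the DCG of the same ratings sorted in decreasing order."""
    ideal = dcg(sorted(true_ratings, reverse=True))
    return dcg(true_ratings) / ideal if ideal > 0 else 0.0
```

For a list already in ideal order, e.g. [5, 4, 3], ndcg returns 1.0; any misordering lowers the score.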

5. EVALUATION OF THE PURE STRATEGIES

In this section we present the results of a first set of experiments in which the pure strategies have been evaluated. We first illustrate how the system MAE changes as the system acquires new ratings with the proposed strategies. Then we show how the NDCG and the system precision are affected by the considered rating acquisition strategies.

5.1. Mean Absolute Error

The MAE computed on the test matrix T at the successive iterations of the application of the elicitation strategies (for all the users) is depicted in Figure 2. First of all, we can observe that the considered strategies have a similar behavior on both data sets (Netflix and MovieLens). Moreover, there are two clearly distinct groups of strategies:

(1) Monotone error decreasing strategies: lowest-highest predicted, lowest predicted, voting, and random.
(2) Non-monotone error decreasing strategies: binary predicted, highest predicted, popularity, log(popularity)*entropy, and variance.

Strategies of the first group show an overall better performance (MAE) in the middle stage, but not in the beginning and end stages. At the very beginning, i.e., during iterations 1-3, the best performing strategy is binary-predicted. Then, during iterations 4-11, the voting strategy obtains the lowest MAE on the Movielens data set. Then the random strategy becomes the best, and it is overtaken by the lowest-highest-predicted strategy only at iteration 46. The results on the Netflix data set differ as follows. Binary-predicted is the best strategy for a longer period, i.e., from the beginning until iteration 7, and then voting outperforms it till iteration 46, where lowest-highest-predicted starts exhibiting the lowest error. At iteration 80, the MAE stops changing for all of the prediction-based strategies. This occurs because the known set K at that point has already reached the largest possible size for those strategies, i.e., all the ratings in X which can be elicited by these strategies have been transferred to K. Conversely, the MAE of the voting and random strategies keeps decreasing until all of the ratings in X are moved to K. It is important to note that the prediction-based strategies (e.g., highest predicted) cannot elicit ratings for items for which a prediction cannot be made, e.g., if a movie has no ratings in K.

The behavior of the non-monotone strategies can be divided into three stages. Firstly, they all decrease the MAE at the beginning (approximately iterations 1-5). Secondly, they slowly increase it, up to a point where the MAE reaches a peak (approximately iterations 6-35). Thirdly, they slowly decrease the MAE till the end of the experiment (approximately iterations 36-80). This behavior occurs because the strategies in the second group have a strong selection bias with regards to the properties of the items, which may negatively affect MAE. For instance, the highest predicted strategy at the initial iterations (from 1 to 5) elicits primarily items with high ratings; however, this behavior does not persist, as can be seen from the rating distribution for the iterations 35 to 40 (Table I). As a result, in the beginning stages this strategy adds to the known matrix (K) disproportionately more high ratings than low ones, and this ultimately biases the rating prediction towards overestimating the ratings.

[Figure 2: two panels, (a) Movielens Dataset and (b) Netflix Dataset, plotting MAE against the number of iterations (0-180) for the nine pure strategies: variance, random, popularity, lowest-pred, lo-hi-pred, highest-pred, binary-pred, voting, and log(pop)*entropy.]

Fig. 2. System MAE evolution under the effect of the pure rating elicitation strategies.

Table I. The distribution of the rating values for the ratings elicited by the Highest Predicted strategy at two stages of its application

                 Percentage of elicited ratings of value r
Iterations       r=1      r=2      r=3      r=4      r=5
from 1 to 5      2.06%    4.48%    16.98%   36.56%   39.90%
from 35 to 39    6.01%    13.04%   29.33%   34.06%   17.53%

Low rated movies are selected for elicitation by the highest predicted strategy in two cases: 1) when a low rated item is predicted to have a high rating; 2) when all the highest predicted ratings have already been elicited or marked as "not available" (they are not present in X and are removed from U_u). Looking into the data, we discovered that at iteration 36 the highest-predicted strategy has already elicited most of the highest ratings. Hence, the next ratings that are elicited are actually average or low ratings, which reduces the bias in K and also the prediction error. The random and lowest-highest predicted strategies do not introduce such a bias, and this results in a constant decrease of MAE.

5.2. Number of Acquired Ratings

In addition to measuring the quality of the elicited ratings (as described in the previous section), it is also important to measure the number of elicited ratings. In fact, certain strategies can acquire more ratings by better estimating which items the user has actually experienced and is therefore able to rate. We simulate the limited knowledge of the users by making available only the ratings in the matrix X. Conversely, a strategy may not be able to acquire many ratings, but those actually acquired can be very useful for improving the recommendations.

Figure 3 shows the number of ratings in K that are known to the system, as the strategies elicit new ratings from the simulated users. It is worth noting, even in this case, the strong similarity of the behavior of the elicitation strategies in both data sets. The only strategy that differs substantially in the two data sets is random. This is clearly caused by the larger number of users and items that are present in the Netflix data. In fact, while both data sets contain 100,000 ratings, the sparsity of the Netflix data set is much higher: it contains only 2.8% of the possible ratings (1491*2380), vs. 6.3% of the possible ratings (943*1682) contained in the Movielens data set. This larger sparsity makes it more difficult for a pure random strategy to select items that are known to the user. In general this is a major limitation of any random strategy, i.e., a very slow rate of addition of new ratings. Hence for relatively small problems (with regards to the number of items and users) the random strategy may be applicable, but for larger problems it is rather impractical. In fact, observing Figure 3, one can see that in the Movielens simulations after 70 iterations, in which 70*10*943 = 660,100 rating requests were made (iterations * rating requests per iteration * users), the system has acquired on average only 28,000 new ratings (the system was initialized with 2,000 ratings, hence bringing the total number of ratings to 30,000). This means that only for one out of 23 random rating requests is the user able to provide a rating. In the Netflix data set this ratio is even worse. It is interesting to note that even the popularity strategy has a poor performance in terms of the number of elicited ratings: it elicited the first 28,000 ratings at a speed equal to one rating for each 6.7 rating requests. We also observe that, according to our results, quite surprisingly, the higher sparsity of the Netflix sample has produced a substantially different impact only on the random strategy.

[Figure 3: two panels, (a) Movielens and (b) Netflix, plotting the number of ratings in the known set K (x 10^4) against the number of iterations (0-180) for the nine pure strategies.]

Fig. 3. Evolution of the number of ratings elicited by the AL strategies.

It is also clear that certain strategies are not able to acquire all the ratings in X. For instance

lowest-highest-predicted, lowest-predicted and highest-predicted stop acquiring new ratings once

they have collected 50,000 ratings (Movielens). This is due to the fact that these strategies, in order to acquire from a user her ratings for some items, need the recommender system to generate rating predictions for those items. Such predictions cannot be generated when the user's ratings in the test set T have no corresponding ratings anywhere in the known dataset K, and hence matrix factorization cannot derive any rating predictions for them.
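As an illustration, the selection step of these prediction-based strategies can be sketched as follows. This is a minimal sketch with our own function and variable names, not the authors' code; it only shows why the strategies stall, since candidates are restricted to items for which a prediction exists.

```python
def prediction_based_requests(predictions, n_items=10, mode="lo-hi"):
    """Pick the items whose ratings are requested from one user.

    `predictions` maps item -> predicted rating and, by construction,
    only contains items the model can actually predict and that the
    user has not yet rated in K; when no predictions can be generated,
    the candidate pool is empty and the strategy stops eliciting.
    """
    order = sorted(predictions, key=predictions.get)  # ascending prediction
    if mode == "lowest":
        return order[:n_items]
    if mode == "highest":
        return order[::-1][:n_items]
    # "lo-hi": half from the bottom and half from the top of the ranking
    half = n_items // 2
    return order[:half] + order[::-1][:half]

preds = {"i1": 4.5, "i2": 1.2, "i3": 3.3, "i4": 2.1, "i5": 4.9}
print(prediction_based_requests(preds, n_items=4))  # 2 lowest, then 2 highest
```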

Figure 4 illustrates a related aspect: how useful the acquired ratings are for the effectiveness of the system, i.e., how the same number of ratings, acquired by different strategies, can reduce MAE. From Figure 4 it is clear that in the first stage of the process, i.e., when a small number of ratings are present in the known matrix K, the random and lowest-predicted strategies collect ratings that are more effective in reducing MAE. Subsequently, the lowest-highest-predicted strategy acquires more useful ratings. This is an interesting result, showing that the items with the lowest predicted ratings and random items provide more useful information, even though these ratings are difficult to acquire.

5.3. Normalized Discounted Cumulative Gain

In this section we analyze the results of the experiments with regard to the NDCG metric. As discussed in Section 4, in order to compute NDCG for a particular user, first the ratings for the items in the recommendation list are predicted. Then, the normalized discounted cumulative gain (NDCG) is computed by dividing the DCG of the ranked list of recommendations by the DCG obtained by the best ranking of the same items for that user. NDCG is computed on the top 10 recommendations for every user. Moreover, recommendation lists are created only with items that have ratings in the testing dataset; this is necessary in order to compute DCG. We note that sometimes the testing set contains fewer than 10 items for some users; in this case NDCG is computed on this smaller set.

Moreover, when computing NDCG, in some cases the rating prediction algorithm (matrix factorization) cannot generate rating predictions for all 10 items in the test set of a user. This happens when the user's ratings in the test set T have no corresponding ratings anywhere in the known dataset K, and hence matrix factorization cannot derive any rating predictions for them. It is important to notice that the ideal recommendation list for a user is rather stable during the experiments that use the same dataset. Therefore, if an algorithm is not able to generate a predicted recommendation list of size 10, a list of the available size is used, which results in smaller NDCG values.
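The NDCG@10 computation described above can be condensed into the following sketch. This reflects our reading of the procedure; function names and the dictionary-based data layout are our assumptions, not the authors' code.

```python
import math

def dcg(relevances):
    # Standard DCG: the gain at rank i (0-based) is discounted by log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(predicted, actual, k=10):
    """NDCG over a user's test-set items.

    `predicted` maps item -> predicted rating (only items the model can
    predict), `actual` maps item -> true test rating.  If fewer than k
    predictions exist, the shorter list is used, which lowers NDCG, as
    described in the text.
    """
    ranked = sorted(predicted, key=predicted.get, reverse=True)[:k]
    gains = [actual[item] for item in ranked]          # true ratings, predicted order
    ideal = sorted(actual.values(), reverse=True)[:k]  # best possible ordering
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```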

Figure 5 depicts the NDCG curves for the pure strategies. A higher NDCG value corresponds to higher-rated items being present in the predicted recommendation lists. Popularity is the best strategy at the beginning of the experiment, but at iteration 3 in the Movielens data set, and at iteration 9 in the Netflix data set, the voting strategy passes the popularity strategy and then remains the best one. In Movielens the random strategy overtakes the voting strategy at iteration 70, but this is not observed in the Netflix data. Excluding the voting and random strategies, popularity, log(popularity)*entropy,

and variance are the best in both data sets. Lowest-predicted is by far the worst, which is quite surprising considering how effective it is in reducing MAE. By further analyzing the experiment data we discovered that the lowest-predicted strategy is not effective for NDCG since it elicits more ratings for the lowest-ranked items, which are useless for predicting the ranking of the top items.

Another striking difference from the MAE experiments is that all the strategies improve NDCG monotonically. It is also important to note that here the random strategy performs remarkably well, eventually overtaking the other strategies in Movielens; this again differs from its behavior in the MAE experiments.

5.4. Precision

As we have already observed with regard to MAE and NDCG, very similar results were obtained for the Netflix and Movielens datasets in the initial experiments. For this reason, in the rest of this paper we use just the Movielens data set.

Fig. 4. System MAE evolution vs. the number of ratings elicited: (a) Movielens; (b) Netflix.

Fig. 5. System NDCG evolution under the application of the pure rating elicitation strategies: (a) Movielens; (b) Netflix.

Fig. 6. System precision under the application of the pure rating elicitation strategies (Movielens).

Precision, as described in Section 4, measures the proportion of items rated 4 and 5 that are found in the recommendation list. Figure 6 depicts the evolution of the system precision when the elicitation strategies are applied. Here, highest-predicted is the best performing strategy for the largest part of the test; starting from iteration 50 it is as good as the binary-predicted and the lowest-highest-predicted strategies. It is also interesting to note that all the strategies monotonically increase the precision. Moreover, the random strategy, differently from NDCG, does not perform as well as the highest-predicted strategy. This is again related to the fact that the random strategy substantially increases the coverage by introducing new users. But for new users the precision is significantly smaller, as the system does not have enough ratings to produce good predictions.
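The precision measure used here can be sketched as follows. This is our reading of the metric described above (the share of recommended test items whose true rating is 4 or 5); the function and argument names are ours, not the authors' code.

```python
def precision_at_k(predicted, actual, k=10, threshold=4):
    """Precision over a user's test-set items.

    `predicted` maps item -> predicted rating, `actual` maps
    item -> true test rating; an item counts as relevant when its
    true rating is at least `threshold` (4 and 5 in the paper).
    """
    ranked = sorted(predicted, key=predicted.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    relevant = sum(1 for item in ranked if actual[item] >= threshold)
    return relevant / len(ranked)
```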

In conclusion, these experiments show that among the evaluated strategies there is no single best strategy that dominates the others for all the evaluation measures. The random and voting strategies are the best for NDCG, whereas for MAE lo-hi-predicted performs quite well, and finally for Precision lo-hi-predicted, highest-predicted, and voting work well.

6. EVALUATION OF THE PARTIALLY RANDOMIZED STRATEGIES

Among the pure strategies only the random one is able to elicit ratings for items that have not been

evaluated by the users already present in K. Partially randomized strategies address this problem

by asking new users to rate random items (see Section 3). In this section we use partially randomized strategies with p = 0.2, i.e., at least 2 of the 10 items that the simulated users are asked to rate are chosen at random.
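The construction of such a partially randomized elicitation list can be sketched as follows. The value p = 0.2 and the list length follow the text; the function and argument names are our assumptions, not the authors' code.

```python
import random

def partially_randomized(ranked_candidates, unrated_pool, n=10, p=0.2, rng=None):
    """Build an elicitation list of n items: the strategy's top picks,
    plus at least round(p * n) items drawn uniformly at random from the
    user's unrated items (which the base strategy may never reach)."""
    rng = rng or random.Random(0)
    n_random = max(1, int(round(p * n)))
    chosen = list(ranked_candidates[: n - n_random])      # strategy's top picks
    pool = [item for item in unrated_pool if item not in chosen]
    chosen += rng.sample(pool, min(n_random, len(pool)))  # random fill-in
    return chosen
```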

Figure 7 depicts the system MAE evolution during the experimental process. We note that here

all the curves are monotone, i.e., it is sufﬁcient to add just a small portion of randomly selected

ratings to the elicitation lists to reduce the bias of the pure, prediction-based strategies.

It should be mentioned that we have not evaluated the partially randomized voting strategy be-

cause it already includes the random strategy as one of the voting strategies. The best performing

partially randomized strategies, with respect to MAE, are, at the beginning of the process, the partially randomized binary-predicted and, subsequently, the low-high-predicted (similarly to the pure strategies case).

Fig. 7. System MAE evolution under the application of the partially randomized strategies (Movielens).

Fig. 8. System NDCG evolution under the application of the partially randomized strategies (Movielens).

Fig. 9. System precision under the application of the randomized strategies (Movielens).

Figure 8 shows the NDCG evolution under the effect of the partially randomized strategies. Dur-

ing iterations 1-6, the partially randomized popularity strategy obtains the best NDCG. During iter-

ations 7-170, i.e., for the largest part of the test, the best strategy is the partially randomized highest

predicted. Again, as we observed for the pure strategy version, the worst is the lowest-predicted.

It is important to note that the strategies that show good performance at the beginning (partially

randomized highest and binary predicted strategies) are those aimed at ﬁnding items that a user may

know and therefore is able to rate. Hence, these strategies are very effective in the early stage when

there are many users with very few items in the known dataset K.

Figure 9 shows the precision of the partially randomized strategies. The partially randomized

highest-predicted strategy again shows the best results during most of the test, as for NDCG. During iterations 1-6 the best strategy with respect to precision is the partially randomized binary-predicted strategy, but then the classical approach of asking the user to rate the items that the system considers the best recommendations (highest-predicted) is the winner. During iterations 111-170 the partially randomized variance, popularity, log(popularity)*entropy, highest-predicted, and binary-predicted strategies have very similar precision values. Similarly to NDCG, the worst strategy is the lowest-predicted, i.e., eliciting ratings for the items that the user dislikes does little to improve the recommender's precision. Interestingly, this is not the case if the goal is to improve MAE.

7. COMBINING ACTIVE LEARNING AND NATURAL ACQUISITION OF RATINGS

For these experiments, we designed a procedure to simulate the evolution of a RS’s performance

by mixing the usage of active learning strategies with the natural acquisition of ratings. We are

interested in observing the temporal evolution of the quality of the recommendations generated by

the system when, in addition to exploiting an active learning strategy for requesting the user to rate

some items, the users were able to voluntarily add ratings without being explicitly requested, just

as it happens in actual settings. To accomplish this goal, we have used the larger version of the Movielens dataset (1,000,000 ratings), for which we considered only the ratings of users that were active and rated movies for at least 8 weeks (2 months). The resulting subset consists of 377,302 ratings from


1,236 users on 3,574 movies. The ratings are timestamped with values ranging from 25/04/2000 to

28/02/2003. We measure the performance of the recommendation algorithm on a test set, as more

and more ratings are added to the known set K, while the simulated time advances from 25/04/2000

to 28/02/2003. We combined this natural acquisition of the ratings with active learning as described

below.

We split the available data into three matrices K, X and T , as we did previously, but now we

also consider the time stamp of the ratings. Hence, we initially insert in K the ratings acquired by

Movielens in the ﬁrst week (3,705 ratings). Then we split randomly the remaining ratings to obtain

70% of the ratings in X (261,730) and 30% in T (111,867).

For these new experiments, we perform a simulated iteration every week. That is, on each simulated day (starting from the second week) an active learning strategy asks each user who already has some non-null ratings in K, i.e., whose ratings are known by the system at that point in time, to rate 40 items. If these ratings are present in X, they are added to K. This procedure is repeated for 7 days (1 week). Then, all the ratings in the Movielens dataset that according to the timestamps were acquired in that week are also added to K. Finally, the system is trained using the ratings in K. To achieve a realistic setting for evaluating the predictive performance of the RS, we use only the items in T that users actually experienced during the following week (according to the timestamps). This procedure is repeated for I = 48 weeks (1 year).

In order to justify the large number of rating requests that the system makes each week, it is important to note that the simulated application of an active learning strategy, as done in our experiments, is able to add far fewer ratings than what could be elicited in a real setting. In fact, the number of ratings that are supposed to be known by the users in the simulated process is limited by the number of ratings that have been actually acquired in the Movielens dataset. In [Elahi et al. 2011] it has been estimated that the number of items that are really known by the user is more than 4 times larger than what is typically observed in the simulations. Hence, many of our elicitation requests would be unfulfilled, even though the user in actuality would have been able to rate the item. Therefore, instead of asking for 10 items as typically done, we ask for 4 times as many items (40 items), to adjust for the discrepancy between the knowledge of the actual and simulated users.

In order to precisely describe the evaluation procedure, we use the following notation, where n is the week index:

— K_n: the set of ratings known by the system at the end of week n, i.e., the ratings that have been acquired up to week n. They are used to train the prediction model, to compute the active learning rating elicitation strategies for week n + 1, and to test the system's performance using the ratings contained in the test set of the next week, T_{n+1}.

— T_{n+1}: the set of ratings time-stamped during week n + 1, which are used as the test set to measure the system performance after the ratings of the previous weeks have been added to K_n.

— AL_n: the set of ratings elicited by a particular elicitation strategy and added to the known set K_n at week n. We note that these are ratings that are present in X but not in T. This is required to ensure that the active learning strategies do not modify the test set, and that the system performance, under the application of the strategies, is consistently tested on the same set of ratings.

— X_n: the set of ratings in X time-stamped in week n that are not in the test set T_n. These ratings, together with the ratings in T_n, are all of the ratings acquired in Movielens during week n, and are therefore considered to have been naturally provided by the (simulated) users without being asked by the system (natural acquisition). We note that an elicitation strategy may have already acquired some of these ratings, i.e., the intersection of AL_n and X_n may be non-empty. In this case, only the ratings not yet actively acquired are added to K_n.

The testing of an active learning strategy S now proceeds in the following way.

— System initialization: week 1
1. All the ratings are partitioned randomly into the two matrices X and T.
2. The non-null ratings in X_1 and T_1 are added to K_1: K_1 = X_1 ∪ T_1.
3. U_u, the unclear set of user u, is initialized to all the items i with a null value k_ui in K_1.
4. The rating prediction model is trained on K_1, and MAE, Precision, and NDCG are measured on T_2.

— For all the weeks n starting from n = 2:
1. Initialize K_n with all the ratings in K_{n-1}.
2. For each user u with at least 1 rating in K_{n-1}:
— Using strategy S, a set of items L = S(u, N, K_{n-1}, U_u) is computed.
— The set L_e is created, containing only the items in L that have a non-null rating in X. The ratings for the items in L_e are added to AL_n.
— Remove from U_u the items in L: U_u = U_u \ L.
3. Add to K_n the ratings time-stamped in week n and those elicited by S: K_n = AL_n ∪ X_n ∪ T_n.
4. Train the factor model on K_n.
5. Compute MAE, Precision, and NDCG on T_{n+1}.
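The weekly procedure can be condensed into the following sketch. It is heavily simplified and uses our own names throughout: `X` and `T` are reduced to dictionaries mapping a week to its `(user, item) -> rating` entries, the per-user unclear sets U_u are folded into the `strategy` callable, and `train`/`evaluate` stand in for the matrix factorization training and the MAE/NDCG/precision measurements.

```python
def simulate(strategy, train, evaluate, X, T, weeks, n_requests=40):
    """Weekly mix of active learning and natural rating acquisition."""
    K = dict(X[1])
    K.update(T[1])                                 # week 1: K_1 = X_1 ∪ T_1
    results = []
    for n in range(2, weeks + 1):
        for u in {user for (user, _item) in K}:    # users with ratings in K
            for item in strategy(u, K, n_requests):
                rating = X[n].get((u, item))       # answered only if the simulated
                if rating is not None:             # user "knows" the rating
                    K[(u, item)] = rating          # AL_n ratings enter K_n
        K.update(X[n])                             # natural acquisition of week n
        K.update(T[n])
        model = train(K)                           # e.g. matrix factorization
        if n + 1 in T:                             # test on the following week
            results.append(evaluate(model, T[n + 1]))
    return results
```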

7.1. Results

Figure 10 shows the MAE time evolution for the different strategies. It should be noted that there is a huge fluctuation of MAE from week to week. This is caused by the fact that every week we train the system on the previous weeks' data and test the system performance on the next week's ratings in the test set; hence, the difficulty of making good predictions may differ from week to week. For this reason, in the figure we focus on a time range: weeks 1 to 17. In this figure the value at week n is obtained after the system has acquired the ratings for that week, and is the result of evaluating the system's performance on week n+1 (see the description of the simulation procedure in the previous section). The natural acquisition curve shows the MAE of the system without using items acquired by the AL strategies, i.e., the added ratings are only those that have been acquired during that week in the Movielens data set.

The results show that in the second week the performance of all the strategies is very close. But starting from the third week, popularity and log(popularity)*entropy both perform better than the others. These two strategies share similar characteristics and outperform all the other strategies over the whole rating elicitation process. Voting, variance, and random are the next best strategies in terms of MAE.
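For reference, the log(popularity)*entropy score from Rashid et al. [2002], as described in this paper, can be sketched as follows. This is our implementation of the formula; the base-2 entropy and natural log of popularity are our assumptions, and names are ours.

```python
import math
from collections import Counter

def log_pop_entropy_scores(ratings_by_item):
    """Score each item by log(popularity) * entropy of its rating distribution.

    Items that have been rated often AND with diverse rating values score
    highest, which is why the strategy finds items that are both ratable
    and informative."""
    scores = {}
    for item, ratings in ratings_by_item.items():
        counts = Counter(ratings)
        total = len(ratings)
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        scores[item] = math.log(total) * entropy
    return scores
```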

In order to better present the results of our experiments, in Figure 11 we plot three strategies that are representative of the others. We have chosen log(popularity)*entropy since it is one of the state-of-the-art strategies, highest-predicted since it performs very similarly to the other prediction-based strategies, and voting, which is a novel strategy.

Considering the MAE obtained by the natural acquisition of ratings as a baseline, we can observe that highest-predicted does not perform very differently from the baseline. The main reason is that this strategy does not acquire additional ratings besides those already collected by the natural process, i.e., the user would rate these items on his own initiative. The other strategies, in addition to these ratings, are capable of eliciting more ratings, including those that the user would rate later on, i.e., in the successive weeks. We observe that here, differently from the previous experiments, all the strategies show a non-monotone behavior. But in this case it is due to the fact that the test set, every week, is a subset of the ratings entered in Movielens during the following week. The predictive difficulty of this test set can therefore change from week to week, and hence influence the performance of the competing strategies.

In order to examine the results further, we have also plotted in Figure 12 the MAE of the strategies normalized with respect to the MAE of the baseline, i.e., of the system without the ratings obtained by active learning strategies: (MAE_strategy / MAE_baseline) − 1. We also plot in Figure 13 this normalized behavior only for the

three selected strategies. This ﬁgure more clearly shows the beneﬁt of an active learning strategy in

comparison with the natural process.

Fig. 10. System MAE evolution under the simultaneous application of active learning strategies and natural acquisition of ratings (Movielens).

Fig. 11. System MAE evolution under the application of three selected active learning strategies and natural acquisition of ratings (Movielens).

Fig. 12. System MAE evolution under the application of active learning strategies and natural acquisition of ratings (Movielens). MAE values are normalized with respect to the MAE of the system that acquires new ratings only through the natural acquisition. The number of new users entering the system every week is also shown.

Moreover, in Figure 13 the number of new users entering the system every week is also plotted, in order to understand the effect of new users entering the system on the system performance under the application of the considered strategies. The left y-axis in the figure shows the number of new users in the known set K_n and the right y-axis shows the MAE normalized by the baseline. The gray solid line depicts the number of new users entering the system every week.

Comparing the strategies in Figure 13, we can distinguish two types of strategies. The first type corresponds to the highest-predicted strategy, whose normalized MAE is very close to the baseline. The second type includes the log(popularity)*entropy and voting strategies, which express larger variations of performance and substantially differ from the baseline (excluding week 10). The overall performance of these strategies is better than the performance of the first type. Moreover, observing the number of new users at each week, we can see that the largest numbers of new users enter at weeks 9, 10, and 14. For these weeks the normalized MAE shows the worst performances, with the largest value of MAE at week 10. Hence, the bad news is that in the presence of many new users none of the strategies is effective, and better solutions need to be developed.

Despite the fact that new users are detrimental to the accuracy of the prediction, in the long

term, more users entering the system would result in a better recommender system. Thus, we have

computed the correlation coefﬁcients between MAE curves of the strategies and the total number

of users in the known set K_n. Table II shows these correlations as well as the corresponding p-values. There is a clear negative correlation with the total number of users in the system, i.e., the more users enter the system, the lower the MAE becomes.
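A correlation of this kind can be computed as in the following sketch. This is our re-implementation of the Pearson coefficient used for Table II; the p-values in the table would come from the associated t-test (e.g., `scipy.stats.pearsonr`), and the numbers below are toy values, not the paper's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy example: total users in K_n vs. weekly MAE; the correlation is
# strongly negative, mirroring the trend reported in Table II.
users = [100, 130, 170, 220, 280]
mae = [0.95, 0.92, 0.90, 0.88, 0.85]
print(round(pearson_r(users, mae), 3))
```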

Another important aspect to consider is the number of ratings that are elicited by the considered

strategies in addition to the natural acquisition of ratings. As discussed before, certain strategies

can acquire more ratings by better estimating what items are likely to have been experienced by the

user. Figure 14 illustrates the size of the known set K_n as the strategies acquire more ratings from the simulated users. As shown in the figure, although the number of ratings added naturally is by far

larger than that of any strategy (more than 314,000 ratings in week 48), still the considered strategies

can elicit many ratings.

Fig. 13. System MAE evolution under the application of three selected active learning strategies and natural acquisition of ratings (Movielens). MAE values are normalized with respect to the MAE of the system that acquires new ratings only through the natural acquisition. The number of new users entering the system every week is also shown.

Table II. Correlation of MAE with the number of users in the known set K_n.

Strategy              Correlation Coefficient   p-value
Natural Acquisition   -0.4430                   0.0016
Variance              -0.5021                   0.0003
Random                -0.5687                   0.0000
Popularity            -0.5133                   0.0002
Lowest predicted      -0.4933                   0.0004
Low-high predicted    -0.5083                   0.0002
Highest predicted     -0.5153                   0.0002
Binary prediction     -0.5126                   0.0002
Voting                -0.5215                   0.0001
Log(pop)*entropy      -0.5028                   0.0003

Popularity and log(popularity)*entropy are the strategies that add the most

ratings, totaling more than 161,000 at the end of the experiment. On the other hand, voting is the strategy that elicits overall the smallest number of ratings. This may be due to the fact that the individual strategies often vote for similar sets of items; the selected items would then mostly overlap with the naturally acquired ratings, resulting in fewer ratings being added to the known set. However, the remarkably good performance of voting may indicate that this strategy focuses more on the informativeness of the items rather than on their ratability.

8. CONCLUSIONS AND FUTURE WORK

In this work we have addressed the problem of selecting items to present to the users for acquiring their ratings, also known as the rating elicitation problem. We have proposed and evaluated a set of rating elicitation strategies. Some of them were proposed in a previous work [Rashid et al. 2002] (popularity, log(popularity)*entropy, random, variance), and some, which we define

as prediction-based strategies, are new: binary-prediction, highest-predicted, lowest-predicted, and highest-lowest-predicted.

Fig. 14. Size evolution of the Known set under the application of rating elicitation strategies (Movielens).

Table III. Strategies Performance Summary: for each strategy (variance, popularity, lowest-pred, lo-hi-pred, highest-pred, binary-pred, voting, log(pop)*entropy, random, natural), good or bad performance is marked for MAE, NDCG, number of elicited ratings, informativeness, and Precision, broken down by early stage, late stage, randomized variant, and combination with natural acquisition.

Moreover, we have studied the behavior of other novel strategies: partially

randomized, which adds random items to the elicitation lists computed by the aforementioned strategies; and voting, which requests the user to rate the items that are selected by the largest number of voting strategies. We have evaluated these strategies with regard to their system-wide effectiveness by implementing a simulation loop that models the day-by-day process of rating elicitation and rating database growth. We have taken into account the limited knowledge of the users, which means that the users may not be able to rate all the items that the system proposes to them. During the

simulation we have measured several metrics at different phases of the rating database growth. The

metrics include: MAE to measure the improvements in prediction accuracy, precision to measure

the relevance of recommendations, normalized discounted cumulative gain (NDCG) to measure the


quality of produced ranking, and coverage to measure the proportion of items over which the system

can form predictions.
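The voting combination summarized above can be sketched as follows. This is our illustrative implementation of the idea as described (each base strategy casts one vote per proposed item); the names and the tie-breaking behavior are our assumptions, not the authors' code.

```python
from collections import Counter

def voting_strategy(user, strategies, K, n=10):
    """Ask for the n items proposed by the largest number of base strategies.

    `strategies` is a list of callables with the same interface,
    e.g. popularity, variance, random, and the prediction-based ones;
    each returns its own elicitation list for `user` given the known
    ratings K."""
    votes = Counter()
    for s in strategies:
        votes.update(s(user, K, n))  # one vote per strategy per item
    return [item for item, _count in votes.most_common(n)]
```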

The evaluation (summarized in Table III) has shown that different strategies can improve different aspects of the recommendation quality and at different stages of the rating database development. Moreover, we have discovered that some pure strategies incur the risk of increasing the system MAE if they keep adding only ratings with a certain value, e.g., the highest ones, as the highest-predicted strategy does; an approach that is often adopted in real RSs. In addition, prediction-based strategies are able to address neither the problem of new users nor that of new items. The popularity and variance strategies are able to select items for new users, but cannot select items that have no ratings.

Partially randomized strategies experience fewer problems because they elicit ratings for random items that had no ratings at all. In this case, the lowest-highest-predicted (respectively, highest-predicted) strategy is a good alternative if MAE (respectively, precision) is the targeted effectiveness measure. These strategies are easy to implement and, as the experiments have shown, can produce considerable benefits.

Moreover, our results have shown that mixing active learning strategies with the natural acquisition of ratings influences the performance of the strategies. This is an important conclusion, and no previous experiments have addressed and illustrated this issue. In this situation we show that the popularity and log(popularity)*entropy strategies outperform the others. Our proposed voting strategy has shown good performance, i.e., for MAE but especially for NDCG, both with and without the natural acquisition.

This research identified a number of new problems that would need to be studied further. First of all, it is important to note that the results presented in this work clearly depend, as in any experimental study, on the chosen simulation setup, which can only partially reflect the real evolution of a recommender system. In our work we assume that a randomly chosen set of ratings, among those that the user really gave to the system, represents the ratings known by the user but not yet known by the system. However, this set does not completely reflect all the user's knowledge; it contains only the ratings acquired using the specific recommender system. For instance, Movielens used a combined random and popularity technique for rating elicitation. In reality, many more items are known by the user, but his ratings for them are not included in the data set. This is a common problem of any off-line evaluation of a recommender system, where the performance of the recommendation algorithm is estimated on a test set that never coincides with the recommendation set. The recommendation set is composed of the items with the largest predicted ratings, but if such an item is not present in the test set, an off-line evaluation will never be able to check whether that prediction is correct.

Moreover, we have already observed that the performance of some strategies (e.g., random and voting) depends on the sparsity of the rating data. The MovieLens data and the Netflix sample that we used still have a considerably low sparsity compared to other, larger datasets. For example, if the data sparsity were higher, there would be only a very low probability for the random strategy to select an item that a user has consumed in the past and can provide a rating for. So the partially randomized strategies may perform worse in reality.
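The effect can be quantified with a back-of-the-envelope estimate: if a user can rate only the items they have actually consumed, a uniformly random request hits a rateable item with probability roughly (consumed items)/(catalogue size). The catalogue sizes below are illustrative, not drawn from our datasets.

```python
def hit_probability(ratable_items, catalog_size, requests=1):
    """Probability that at least one of `requests` uniformly random,
    distinct item requests hits an item the user can actually rate."""
    p_miss = 1.0
    for i in range(requests):
        p_miss *= (catalog_size - ratable_items - i) / (catalog_size - i)
    return 1.0 - p_miss

# Denser data: 200 consumed items in a 4,000-item catalogue.
dense = hit_probability(200, 4_000, requests=10)
# Sparser data: the same 200 consumed items in a 1,000,000-item catalogue.
sparse = hit_probability(200, 1_000_000, requests=10)
```

With ten random requests, the probability of obtaining at least one rating is around 40% in the denser setting but well under 1% in the sparser one, which is why partially randomized strategies may degrade on very sparse data.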

Furthermore, there remain many unexplored possibilities for sequentially applying several strategies that use different approaches depending on the state of the system [Elahi 2011]. For instance, one may ask a user to rate popular items when the system does not yet know any of the user's ratings, and use another strategy at a later stage.
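A minimal sketch of such a sequential scheme follows, under the assumption that the switch is triggered by a simple threshold on the number of ratings the system holds for the user; the threshold value and the second-stage strategy are hypothetical choices, not prescriptions from our experiments.

```python
def choose_strategy(num_user_ratings, cold_start_threshold=10):
    """Return which elicitation strategy to apply at the current stage.

    While the system knows fewer than `cold_start_threshold` ratings for
    this user, ask about popular items (which almost every user is able
    to rate); afterwards, switch to a more informative strategy.
    """
    if num_user_ratings < cold_start_threshold:
        return "popularity"
    return "log(popularity)*entropy"
```

In practice the trigger could also be a system-wide condition, e.g., overall data sparsity, rather than a per-user rating count.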

REFERENCES

ANDERSON, C. 2006. The Long Tail. Random House Business.

BIRLUTIU, A., GROOT, P., AND HESKES, T. 2012. Efﬁciently learning the preferences of people. Machine Learning, 1–28.

BONILLA, E. V., GUO, S., AND SANNER, S. 2010. Gaussian process preference elicitation. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, December 6-9 2010, Vancouver, British Columbia, Canada. 262–270.


BOUTILIER, C., ZEMEL, R. S., AND MARLIN, B. M. 2003. Active collaborative ﬁltering. In UAI ’03, Proceedings of the

19th Conference in Uncertainty in Artiﬁcial Intelligence, Acapulco, Mexico, August 7-10 2003. 98–106.

BRAZIUNAS, D. AND BOUTILIER, C. 2010. Assessing regret-based preference elicitation with the UTPref recommendation system. In Proceedings 11th ACM Conference on Electronic Commerce (EC-2010), Cambridge, Massachusetts, USA, June 7-11, 2010. 219–228.

BURKE, R. 2010. Evaluating the dynamic properties of recommendation algorithms. In Proceedings of the fourth ACM

conference on Recommender systems. RecSys ’10. ACM, New York, NY, USA, 225–228.

CARENINI, G., SMITH, J., AND POOLE, D. 2003. Towards more conversational and collaborative recommender systems.

In Proceedings of the 2003 International Conference on Intelligent User Interfaces, January 12-15, 2003, Miami, FL,

USA. 12–18.

CHEN, L. AND PU, P. 2012. Critiquing-based recommenders: survey and emerging trends. User Model. User-Adapt. Interact. 22, 1-2, 125–150.

CREMONESI, P., KOREN, Y., AND TURRIN, R. 2010. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the 2010 ACM Conference on Recommender Systems, RecSys 2010, Barcelona, Spain, September 26-30, 2010. 39–46.

DESROSIERS, C. AND KARYPIS, G. 2011. A comprehensive survey of neighborhood-based recommendation methods. In

Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Eds. Springer, 107–144.

ELAHI, M. 2011. Adaptive active learning in recommender systems. In User Modeling, Adaption and Personalization - 19th

International Conference, UMAP 2011, Girona, Spain, July 11-15, 2011. Proceedings. 414–417.

ELAHI, M., REPSYS, V., AND RICCI, F. 2011. Rating elicitation strategies for collaborative ﬁltering. In E-Commerce and

Web Technologies - 12th International Conference, EC-Web 2011, Toulouse, France, August 30 - September 1, 2011.

Proceedings. 160–171.

ELAHI, M., RICCI, F., AND REPSYS, V. 2011. System-wide effectiveness of active learning in collaborative filtering. In International Workshop on Social Web Mining, Co-located with IJCAI, F. Bonchi, W. Buntine, R. Gavaldà, and S. Gu, Eds. Universitat de Barcelona, Spain.

FUNK, S. 2006. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html.

GOLBANDI, N., KOREN, Y., AND LEMPEL, R. 2010. On bootstrapping recommender systems. In Proceedings of the 19th

ACM international conference on Information and knowledge management. CIKM ’10. ACM, New York, NY, USA,

1805–1808.

GOLBANDI, N., KOREN, Y., AND LEMPEL, R. 2011. Adaptive bootstrapping of recommender systems using decision trees. In Proceedings of the Fourth International Conference on Web Search and Web Data Mining, WSDM 2011, Hong Kong, China, February 9-12, 2011. 595–604.

GUO, S. AND SANNER, S. 2010. Multiattribute Bayesian preference elicitation with pairwise comparison queries. In Proceedings of the 7th international conference on Advances in Neural Networks - Volume Part I. Springer-Verlag, Berlin, Heidelberg, 396–403.

HARPALE, A. S. AND YANG, Y. 2008. Personalized active learning for collaborative ﬁltering. In SIGIR ’08: Proceedings

of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM,

New York, NY, USA, 91–98.

HERLOCKER, J. L., KONSTAN, J. A., BORCHERS, A., AND RIEDL, J. 1999. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. SIGIR '99. ACM, New York, NY, USA, 230–237.

HERLOCKER, J. L., KONSTAN, J. A., TERVEEN, L. G., AND RIEDL, J. T. 2004. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. 22, 1, 5–53.

JANNACH, D., ZANKER, M., FELFERNIG, A., AND FRIEDRICH, G. 2010. Recommender Systems: An Introduction. Cambridge University Press.

JÄRVELIN, K. AND KEKÄLÄINEN, J. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4, 422–446.

JIN, R. AND SI, L. 2004. A Bayesian approach toward active learning for collaborative ﬁltering. In UAI ’04, Proceedings of

the 20th Conference in Uncertainty in Artiﬁcial Intelligence, July 7-11 2004, Banff, Canada. 278–285.

KOREN, Y. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, NY, USA, 426–434.

KOREN, Y. AND BELL, R. 2011. Advances in collaborative ﬁltering. In Recommender Systems Handbook, F. Ricci,

L. Rokach, B. Shapira, and P. Kantor, Eds. Springer Verlag, 145–186.

LIU, N. N., MENG, X., LIU, C., AND YANG, Q. 2011. Wisdom of the better few: cold start recommendation via representative based rating elicitation. In Proceedings of the 2011 ACM Conference on Recommender Systems, RecSys 2011, Chicago, IL, USA, October 23-27, 2011. 37–44.


LIU, N. N. AND YANG, Q. 2008. EigenRank: a ranking-oriented approach to collaborative filtering. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, New York, NY, USA, 83–90.

MANNING, C. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge.

MARLIN, B. M., ZEMEL, R. S., ROWEIS, S. T., AND SLANEY, M. 2011. Recommender systems, missing data and statistical model estimation. In IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011. 2686–2691.

MCNEE, S. M., LAM, S. K., KONSTAN, J. A., AND RIEDL, J. 2003. Interfaces for eliciting new user preferences in

recommender systems. In Proceedings of the 2003 International Conference on User Modeling. 178–187.

MILLER, B. N., ALBERT, I., LAM, S. K., KONSTAN, J. A., AND RIEDL, J. 2003. MovieLens unplugged: experiences with an occasionally connected recommender system. In IUI '03: Proceedings of the 8th international conference on Intelligent user interfaces. ACM, New York, NY, USA, 263–266.

PU, P. AND CHEN, L. 2008. User-involved preference elicitation for product search and recommender systems. AI Magazine 29, 4, 93–103.

RASHID, A. M., ALBERT, I., COSLEY, D., LAM, S. K., MCNEE, S. M., KONSTAN, J. A., AND RIEDL, J. 2002. Getting to know you: Learning new user preferences in recommender systems. In Proceedings of the 2002 International Conference on Intelligent User Interfaces, IUI 2002. ACM Press, 127–134.

RASHID, A. M., KARYPIS, G., AND RIEDL, J. 2008. Learning preferences of new users in recommender systems: an

information theoretic approach. SIGKDD Explor. Newsl. 10, 90–100.

RESNICK, P. AND SAMI, R. 2007. The influence limiter: provably manipulation-resistant recommender systems. In Proceedings of the 2007 ACM conference on Recommender systems. RecSys '07. ACM, New York, NY, USA, 25–32.

RESNICK, P. AND VARIAN, H. R. 1997. Recommender systems. Commun. ACM 40, 3, 56–58.

RICCI, F., ROKACH, L., AND SHAPIRA, B. 2011. Introduction to recommender systems handbook. In Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. Kantor, Eds. Springer Verlag, 1–35.

RICCI, F., ROKACH, L., SHAPIRA, B., AND KANTOR, P. B., Eds. 2011. Recommender Systems Handbook. Springer.

RUBENS, N., KAPLAN, D., AND SUGIYAMA, M. 2011. Active learning in recommender systems. In Recommender Systems

Handbook, F. Ricci, L. Rokach, B. Shapira, and P. Kantor, Eds. Springer Verlag, 735–767.

SCHEIN, A. I., POPESCUL, A., UNGAR, L. H., AND PENNOCK, D. M. 2002. Methods and metrics for cold-start recommendations. In SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, New York, NY, USA, 253–260.

SHANI, G. AND GUNAWARDANA, A. 2010. Evaluating recommendation systems. In Recommender Systems Handbook,

F. Ricci, L. Rokach, and B. Shapira, Eds. Springer Verlag, 257–298.

TIMELY DEVELOPMENT, L. 2008. Netflix prize. http://www.timelydevelopment.com/demos/NetflixPrize.aspx.

WEIMER, M., KARATZOGLOU, A., AND SMOLA, A. 2008. Adaptive collaborative ﬁltering. In RecSys ’08: Proceedings of

the 2008 ACM conference on Recommender systems. ACM, New York, NY, USA, 275–282.

ZHOU, K., YANG, S.-H., AND ZHA, H. 2011. Functional matrix factorizations for cold-start recommendation. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011. 315–324.