Improving Serendipity and Accuracy in Cross-Domain Recommender Systems

Denis Kotkov, Shuaiqiang Wang, and Jari Veijalainen
University of Jyvaskyla, Dept. of Computer Science and Information Systems,
P.O.Box 35, FI-40014 University of Jyvaskyla, Jyvaskyla, Finland
deigkotk@student.jyu.fi, {shuaiqiang.wang, jari.veijalainen}@jyu.fi
Abstract. Cross-domain recommender systems use information from source domains to improve recommendations in a target domain, where the term domain refers to a set of items that share attributes and/or user ratings. Most works on this topic focus on accuracy but disregard other properties of recommender systems. In this paper, we attempt to improve serendipity and accuracy in the target domain with datasets from source domains. Due to the lack of publicly available datasets, we collect datasets from two domains related to music, involving user ratings and item attributes. We then conduct experiments using collaborative filtering and content-based filtering approaches for the purpose of validation. According to our results, the source domain can improve serendipity in the target domain for both approaches. The source domain decreases accuracy for content-based filtering and increases accuracy for collaborative filtering. The improvement of accuracy decreases with the growth of non-overlapping items in different domains.
Keywords: Recommender Systems, Serendipity, Cross-Domain Recommendations, Collaborative Filtering, Content-Based Filtering, Data Collection
1 Introduction
Recommender systems use past user behavior to suggest items interesting to users [17].
An item is “a piece of information that refers to a tangible or digital object, such as
a good, a service or a process that a recommender system suggests to the user in an
interaction through the Web, email or text message” [12]. Recommender systems use
algorithms to generate recommendations.
Traditional recommendation algorithms mainly aim to improve accuracy, which indicates how good an algorithm is at suggesting items a user usually consumes. In this paper, they are referred to as accuracy-oriented algorithms. Generally speaking, accuracy-oriented algorithms often suggest popular items, as these items are widely consumed by individuals. To improve accuracy, recommendation algorithms also tend to suggest items similar to a user profile (a set of items rated by the user [12]), as these items match previous user tastes. As a result, a user is recommended (1) items that are popular and therefore familiar to the user [6] and (2) items that the user can easily find him/herself, which is referred to as the overspecialization problem [21]. In particular, of the two main categories of recommendation algorithms, collaborative filtering algorithms often suggest popular items due to the popularity bias in most datasets, while content-based filtering algorithms often suffer from the overspecialization problem due to insufficient information regarding attributes of items.
* This is the final draft of the paper published in the journal. The final publication is available at https://link.springer.com/chapter/10.1007/978-3-319-66468-2_6
Typically, the main reason why a user joins a recommender system is to find novel and interesting items the user would not find him/herself [21]. To improve user satisfaction, a recommender system should suggest serendipitous items [12]. In this paper, we follow the definitions of [10, 2, 12], which indicate that serendipitous items must be relevant, novel and unexpected to a user.
The mentioned problems can be tackled by cross-domain recommender systems, which could predict serendipitous items by enriching the training data from the target domain with additional datasets from other domains. Here the term domain refers to “a set of items that share certain characteristics that are exploited by a particular recommender system” [9]. These characteristics are item attributes and user ratings. Recommender systems that take advantage of multiple domains are called cross-domain recommender systems [9, 4, 13].
In this paper, we explore the cross-domain recommendation task [4, 13], which requires one target domain and at least one source domain. The former refers to the domain from which suggested items are picked, and the latter refers to the domain that contains auxiliary information.
In this work, we seek to address the following research question: Can the source domain improve serendipity in the target domain? Due to the lack of publicly available datasets for cross-domain recommender systems [3, 11, 13], we collected data from Vkontakte¹ (VK), a Russian online social network (OSN), and Last.fm² (FM), a music recommender service. We then matched VK and FM audio recordings and developed a cross-domain recommender system that suggests VK recordings to VK users based on data from both domains. Each audio recording is represented by its metadata, excluding the actual audio file. VK recordings thus represent the target domain, while the source domain consists of FM recordings. VK and FM recordings share titles and artists, but have different user ratings and other attributes.
We regard items that share certain attributes and belong to different domains as overlapping, while those that do not as non-overlapping. In our case, VK and FM recordings that have the same titles and artists are overlapping items.
To address the research question and illustrate the potential of additional data, we chose simple but popular recommendation algorithms to conduct experiments for validation: collaborative filtering based on user ratings and content-based filtering based on the descriptions of the items.
Our results indicate that the source domain can improve serendipity in the target domain for both collaborative filtering and content-based filtering algorithms:
- Traditional collaborative filtering algorithms tend to suggest popular items, as most datasets contain rich information regarding these items in terms of user ratings. Combining datasets of different domains decreases the popularity bias.
- Content-based filtering algorithms often suffer from the overspecialization problem due to poor data regarding item attributes. Enriching item attributes alleviates the problem and increases serendipity.
¹ http://vk.com/
² http://last.fm/
According to our results, the source domain has a negative impact on accuracy for content-based filtering, and a positive impact on accuracy of collaborative filtering. Furthermore, with the growth of non-overlapping items in different domains, the improvement of accuracy for collaborative filtering decreases.
This paper has the following contributions:
- We initially investigate the cross-domain recommendation problem in terms of serendipity.
- We collect a novel dataset to conduct the experiments for addressing the research question.
The rest of the paper is organized as follows. Section 2 overviews related works. Section 3 describes the datasets used to conduct experiments. Section 4 is dedicated to recommendation approaches, while section 5 describes conducted experiments. Finally, section 6 draws final conclusions.
2 Related Works
In this section, we survey state-of-the-art efforts regarding serendipity and cross-domain
recommendations.
2.1 Serendipity in Recommender Systems
According to the dictionary³, serendipity is “the faculty of making fortunate discoveries by accident”. The term was coined by Horace Walpole, who referenced the fairy tale, “The Three Princes of Serendip”, to describe his unexpected discovery [16].
Currently, there is no agreement on the definition of serendipity in recommender systems. Researchers employ different definitions in their studies. In this paper, we employ the most common definition, which indicates that serendipitous items are relevant, novel and unexpected [10, 2, 12].
Given the importance of serendipity, researchers have proposed different serendipity-oriented recommendation algorithms. For example, Lu et al. presented a serendipitous personalized ranking algorithm [15]. The algorithm is based on matrix factorization with an objective function that incorporates relevance and popularity of items. Another matrix factorization based algorithm was proposed by Zheng, Chan and Ip [24]. The authors proposed the unexpectedness-augmented utility model, which takes into account relevance, popularity and similarity of items to a user profile. In contrast, Zhang et al. provided the recommendation algorithm Full Auralist [23]. It consists of three algorithms, responsible for relevance, diversity and unexpectedness, respectively. To the best of our knowledge, studies that focus on improving serendipity using source domains are scarce.
³ http://www.thefreedictionary.com/serendipity
2.2 Cross-Domain Recommendations
Cross-domain recommender systems use multiple domains to generate recommendations. These systems can be categorized based on domain levels [4, 5]:
- Attribute level. Items have the same type and attributes. Two items are assigned to different domains if they have different values of a particular attribute. A pop song and a jazz song might belong to different domains.
- Type level. Items have similar types and share some common attributes. Two items are assigned to different domains if they have different subsets of attributes. A photograph and an animated picture might belong to different domains. Even though both items have common attributes, such as a title, publisher and tags, other attributes might be different (for example, the duration attribute of animated pictures).
- Item level. Items have different types and differ in all or almost all attributes. Two items are assigned to different domains if they have different types. A song and a book might belong to different domains, as almost all attributes of the items are different.
- System level. Two items are assigned to different domains if they belong to different systems. For example, movies from IMDb⁴ and MovieLens⁵ might belong to different domains.
Depending on whether overlapping occurs in the set of users or items [7], there are
four situations that enable cross-domain recommendations: a) no overlap between items
and users, b) user sets of different domains overlap, c) item sets overlap, and d) item
and user sets overlap.
Most efforts on cross-domain recommendations focus on the situation when users or both users and items overlap [13]. For example, Sang demonstrated the feasibility of utilizing the source domain. The study was conducted on a dataset collected from Twitter⁶ and YouTube⁷. The author established relationships between items from different domains using topics [19]. Similarly to Sang, Shapira, Rokach and Freilikhman also linked items from different domains, where 95 participants rated movies and allowed the researchers to collect data from their Facebook pages [20]. The results suggested that source domains improve the recommendation performance [20]. Another study with positive results was conducted by Abel et al. The dataset contained information related to the same users from 7 different OSNs [1]. Sahebi and Brusilovsky demonstrated the usefulness of recommendations based on source domains to overcome the cold start problem [18].
⁴ http://www.imdb.com/
⁵ https://movielens.org/
⁶ https://twitter.com/
⁷ https://www.youtube.com/
Most works on cross-domain recommendations focus on accuracy. To the best of our knowledge, the efforts on the impact of source domains on the target domain in terms of serendipity involving a real cross-domain dataset are very limited. In this paper, we investigate whether source domains can improve serendipity in the target domain when only items overlap at the system level.
3 Datasets
Due to the lack of publicly available datasets for cross-domain recommender systems with overlapping items [3, 11], we collected data from VK and FM. The construction of the dataset included three phases (Figure 1): 1) VK recordings collection, 2) duplicates matching, and 3) FM recordings collection.
Fig. 1: Data collection chart.
3.1 VK Recordings Collection
The VK interface provides the functionality to add favored recordings to users’ pages. By generating random user IDs, we collected disclosed VK users’ favored audio recordings using the VK API. Each audio recording is represented by its metadata, excluding the actual audio file. Our VK dataset consists of 97,737 (76,177 unique) audio recordings added by 864 users.
Each VK user is allowed to share any audio or video recording. VK users are allowed not only to add favored audio recordings to their pages, but also to rename them. The dataset thus contains a noticeable number of duplicates with different names. To assess this number, we randomly selected 100 VK recordings and manually split them into three categories:
- Correct names: the name of the recording is correctly written without any grammatical mistakes or redundant symbols.
- Misspelled names: the name is guessable, even if the name of the recording is replaced with the combination of artist and recording name or lyrics.
- Meaningless names: the name does not contain any information about the recording, for example, “unknown” artist and “the song” recording.
Out of 100 randomly selected recordings, we detected 14 misspelled and 2 meaningless names. Examples are shown in table 1.
Table 1: Examples of recordings.
Category       Artist name    Recording name
Correct        Beyonce        Halo
Correct        Madonna        Frozen
Misspelled     Alice DJ       Alice DJ - Better of Alone.mp3
Misspelled     Reamonn        Oh, tonight you kill me with your smile
Misspelled     lLady Gaga     Christmas Tree
Meaningless    Unknown        classic
Meaningless    Unknown        party
3.2 Duplicates Matching
To match misspelled recordings, we developed a duplicate matching algorithm that detects duplicates based on recordings’ names, mp3 links and durations. The algorithm compares recordings’ names based on the Levenshtein distance and the number of common words excluding stop words.
We then removed some popular meaningless recordings such as “Unknown”, “1” or “01”, because they represent different recordings and do not indicate user preferences. Furthermore, some users assign wrong popular artists’ names to the recordings. To restrict the growth of these kinds of mistakes, the matching algorithm considers artists of the duplicate recordings to be different. By using the presented matching approach, the number of unique recordings decreased from 76,177 to 68,699.
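The name comparison described in this section can be sketched in Python as follows. This is only an illustration: it covers recording names and omits the mp3 links and durations that the matching algorithm also uses. The stop-word list, the thresholds `max_distance` and `min_common`, and the rule that either signal is sufficient are assumptions made for the example, not values reported in this paper.

```python
import re

# Hypothetical stop-word list; the list actually used is not given in the paper.
STOP_WORDS = {"the", "a", "an", "of", "and", "feat", "mp3"}


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for row, char_a in enumerate(a, start=1):
        current = [row]
        for col, char_b in enumerate(b, start=1):
            current.append(min(
                previous[col] + 1,                       # deletion
                current[col - 1] + 1,                    # insertion
                previous[col - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def content_words(name: str) -> set:
    """Lower-cased words of a recording name without stop words."""
    return {w for w in re.findall(r"\w+", name.lower()) if w not in STOP_WORDS}


def looks_like_duplicate(name_a: str, name_b: str,
                         max_distance: int = 3, min_common: int = 2) -> bool:
    """Flag two names as likely duplicates if they are close in edit distance
    or share enough non-stop words (both thresholds are assumptions)."""
    a, b = name_a.lower().strip(), name_b.lower().strip()
    close = levenshtein(a, b) <= max_distance
    shared = len(content_words(a) & content_words(b)) >= min_common
    return close or shared
```

For instance, `looks_like_duplicate("Better Off Alone", "Alice DJ - Better of Alone.mp3")` is true under these assumptions because the two names share the words "better" and "alone".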
3.3 FM Recordings Collection
To utilize the source domain we collected FM recordings that correspond to 48,917
selected VK recordings that were added by at least two users or users that have testing
data. Each FM recording contains descriptions such as FM tags added by FM users.
FM tags indicate additional information such as genre, language or mood. Overall, we
collected 10,962 overlapping FM recordings and 20,214 (2,783 unique) FM tags.
It is also possible to obtain FM users who like a certain recording (top fans). For
each FM recording, we collected FM users who like at least one more FM recording
from our dataset according to the distribution of VK users among those recordings. In
fact, some unpopular FM recordings are missing top fans. We thus collected 17,062 FM
users, where 7,083 of them like at least two recordings from our database. FM users
liked 4,609 FM recordings among those collected.
3.4 The Statistics of the Datasets
In this work, we constructed three datasets. Each of them includes the collected FM data and different parts of the VK data (the percentage indicates the fraction of overlapping items):
- 100%: the dataset contains only overlapping recordings picked by VK and FM users;
- 50%: the dataset contains an equal number of overlapping and non-overlapping recordings;
- 7%: the dataset contains all collected VK and FM recordings. The fraction of overlapping recordings is 6.7%.
The 7% dataset contains all the collected and processed data. We present results for the 50% and 100% datasets to demonstrate how serendipity and accuracy change when a dataset contains different fractions of overlapping items.
Table 2: The Statistics of the Datasets.
           100%                50%                 7%
           VK        FM        VK        FM        VK        FM
Users      665       7,083     795       7,083     864       7,083
Ratings    14,526    40,782    33,680    40,782    96,737    40,782
Items      4,609     4,609     9,218     4,609     68,699    4,609
Artists    1,986     1,986     4,595     1,986     31,861    1,986
Tags       -         20,167    -         20,167    -         20,167
The statistics of the datasets are presented in table 2. The number of VK users
varies in different datasets, due to the lack of ratings after removing non-overlapping
VK recordings.
[Figure 2: three panels, (a) 100% dataset, (b) 50% dataset and (c) 7% dataset, each plotting Popularity against Recording ID for the FM and VK datasets.]
Fig. 2: Popularity distributions of VK and FM datasets
According to figure 2, each recording has different popularity among VK and FM users. The FM dataset contains rich information in terms of user ratings regarding recordings unpopular in the VK dataset. In the figure, popularity is based on the number of users who picked a particular item:
$$Popularity_i = \frac{Freq(i)}{Freq_{max}}, \tag{1}$$
where $Freq(i)$ is the number of users who picked recording $i$, while $Freq_{max}$ corresponds to the maximum number of users who picked the same recording in a dataset.
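As an illustration of Eq. (1), the following Python sketch normalizes the pick count of every recording by the largest pick count in the dataset. The input format (a dictionary from user IDs to picked recordings) and the function name are assumptions made for the example.

```python
from collections import Counter


def popularity(user_picks: dict) -> dict:
    """Popularity_i = Freq(i) / Freq_max, as in Eq. (1).

    user_picks maps a user ID to the recordings that user picked
    (hypothetical input format).
    """
    freq = Counter()
    for picks in user_picks.values():
        freq.update(set(picks))  # count each user at most once per recording
    freq_max = max(freq.values())
    return {item: count / freq_max for item, count in freq.items()}
```

With three users who picked `["a", "b"]`, `["a"]` and `["a", "c"]`, recording `a` receives popularity 1 and recordings `b` and `c` receive 1/3.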
4 Recommendation Approaches
In this section, we describe the simple but popular collaborative filtering and content-based filtering algorithms that we implemented to demonstrate the impact of the data from source domains.
4.1 Item-Based Collaborative Filtering
We chose item-based collaborative filtering as the first experimental algorithm. It is a representative recommendation algorithm that has been widely used in industry due to its scalability [8]. In item-based collaborative filtering, each audio recording (item) is represented as a vector in a multidimensional feature space, where each feature is a user’s choice (rating). A VK recording is represented as follows: $i^{vk} = (u^{vk}_{1,i}, u^{vk}_{2,i}, \ldots, u^{vk}_{n,i})$, where each element $u^{vk}_{k,i} \in \{0, 1\}$ for $k = 1, \ldots, ||U||$, where $U$ is a set of users, while $u^{vk}_{k,i}$ equals 1 if VK user $k$ picks VK recording $i^{vk}$ and 0 otherwise. To integrate the source domain (FM) with our target domain (VK), we included FM users as follows: $i^{vkfm} = (u^{vk}_{1,i}, u^{vk}_{2,i}, \ldots, u^{vk}_{n,i}, u^{fm}_{1,i}, u^{fm}_{2,i}, \ldots, u^{fm}_{n,i})$.
To generate recommendations, item-based collaborative filtering first detects
recordings that are most similar to recordings picked by the target user. The algorithm
then ranks recordings based on the obtained similarities.
To measure similarity, we used conditional probability, which is a common similarity measure for situations in which users only indicate items they like without specifying how much they like these items (unary data) [8]. Conditional probability is calculated as follows:
$$p(i, j) = \frac{Freq(i \wedge j)}{Freq(i) \cdot Freq(j)^{\alpha}}, \tag{2}$$
where $Freq(i)$ is the number of users that picked item $i$, while $Freq(i \wedge j)$ is the number of users that picked both items $i$ and $j$. The parameter $\alpha$ is a damping factor to decrease the similarity for popular items. In our experiments $\alpha = 1$.
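Eq. (2) can be sketched in Python by representing each item as the set of users who picked it, so that the co-occurrence count is the size of a set intersection. The function name, the set-based representation and the convention of returning 0 for an item without ratings are assumptions made for the example.

```python
def conditional_probability(users_i: set, users_j: set, alpha: float = 1.0) -> float:
    """p(i, j) = Freq(i and j) / (Freq(i) * Freq(j) ** alpha), as in Eq. (2).

    users_i and users_j are the sets of users who picked items i and j.
    Returns 0.0 when either item has no ratings (an assumed convention).
    """
    if not users_i or not users_j:
        return 0.0
    both = len(users_i & users_j)
    return both / (len(users_i) * len(users_j) ** alpha)
```

For example, if item `i` was picked by users `{u1, u2}` and item `j` by `{u2, u3, u4}`, one user picked both, so with `alpha = 1` the similarity is 1 / (2 * 3) = 1/6.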
Item vectors based on FM users contain remarkably more dimensions than vectors based on VK users. To alleviate the problem, we compared recordings using the following rule:
$$sim(i, j) = \begin{cases}
p(i^{vk}, j^{vk}), & \exists i^{vk} \wedge \exists j^{vk} \wedge (\nexists i^{fm} \vee \nexists j^{fm}) \\
p(i^{fm}, j^{fm}), & \exists i^{fm} \wedge \exists j^{fm} \wedge (\nexists i^{vk} \vee \nexists j^{vk}) \\
p(i^{vkfm}, j^{vkfm}), & \exists i^{vk} \wedge \exists j^{vk} \wedge \exists i^{fm} \wedge \exists j^{fm}
\end{cases} \tag{3}$$
We compared items in each pair using domains that contain user ratings for both items. To rank items in the suggested list, we used the sum of similarities of recordings [8]:
$$score(u, i) = \sum_{j \in I_u} sim(i, j), \tag{4}$$
where $I_u$ is the set of items picked by user $u$ (user profile).
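A possible reading of Eqs. (3) and (4) is sketched below in Python. Each domain is a dictionary from a recording to the set of users who picked it, and an item "exists" in a domain when that set is non-empty; this data layout, the function names and the returned 0 for pairs with no common domain are assumptions made for the example. The combined vector $i^{vkfm}$ is modelled by tagging user IDs with their domain before taking the union.

```python
def conditional_probability(users_i: set, users_j: set, alpha: float = 1.0) -> float:
    """Eq. (2); returns 0.0 if either item has no ratings (assumed convention)."""
    if not users_i or not users_j:
        return 0.0
    return len(users_i & users_j) / (len(users_i) * len(users_j) ** alpha)


def sim(i, j, vk: dict, fm: dict, alpha: float = 1.0) -> float:
    """Eq. (3): compare two items using the domains that have ratings for both."""
    vk_i, vk_j = vk.get(i, set()), vk.get(j, set())
    fm_i, fm_j = fm.get(i, set()), fm.get(j, set())
    in_vk = bool(vk_i) and bool(vk_j)
    in_fm = bool(fm_i) and bool(fm_j)
    if in_vk and in_fm:
        # Concatenated VK+FM vectors: tag user IDs so the two domains stay disjoint.
        both_i = {("vk", u) for u in vk_i} | {("fm", u) for u in fm_i}
        both_j = {("vk", u) for u in vk_j} | {("fm", u) for u in fm_j}
        return conditional_probability(both_i, both_j, alpha)
    if in_vk:
        return conditional_probability(vk_i, vk_j, alpha)
    if in_fm:
        return conditional_probability(fm_i, fm_j, alpha)
    return 0.0  # no domain contains ratings for both items


def score(profile: set, candidate, vk: dict, fm: dict, alpha: float = 1.0) -> float:
    """Eq. (4): sum of similarities between a candidate and the user profile."""
    return sum(sim(candidate, j, vk, fm, alpha) for j in profile)
```

Candidates would then be ranked for a user by `score` in descending order.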
4.2 Content-Based Filtering
We also chose a content-based filtering algorithm, as this algorithm uses item attributes instead of user ratings to generate recommendations. In our case, these attributes are VK-FM artists and FM tags. Each FM artist corresponds to a particular VK artist.
To represent items, we used a common weighting scheme, term frequency-inverse document frequency (TF-IDF). The TF-IDF weight consists of two parts:
$$tfidf_{attr,i} = tf_{attr,i} \cdot idf_{attr}, \tag{5}$$
where $tf_{attr,i}$ corresponds to the frequency of attribute $attr$ for item $i$ (term frequency), while $idf_{attr}$ corresponds to the inverse frequency of attribute $attr$ (inverse document frequency). The term frequency is based on the number of times an attribute appears among attributes of an item with respect to the number of item attributes:
$$tf_{attr,i} = \frac{n_{attr,i}}{n_i}, \tag{6}$$
where $n_i$ is the number of attributes of item $i$, while $n_{attr,i}$ is the number of times attribute $attr$ appears among attributes of item $i$. In our case, $n_{attr,i} = 1$ for each item, while $n_i$ varies depending on the item. The term frequency increases with the decrease of the number of item attributes. The inverse document frequency is based on the number of items with an attribute in the dataset:
$$idf_{attr} = \ln \frac{||I||}{||I_{attr}||}, \tag{7}$$
where $I$ is a set of all the items, while $I_{attr}$ is a set of items that have attribute $attr$. The inverse document frequency is high for rare attributes and low for popular ones. The TF-IDF weighting scheme assigns high weights to rare attributes that appear in items with a low number of attributes.
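Eqs. (5)-(7) can be illustrated with the following Python sketch. It assumes that every item is described by a set of attributes (artists or tags), so that $n_{attr,i} = 1$ as stated above; the dictionary-based input format and the function name are assumptions made for the example.

```python
import math


def tf_idf(item_attrs: dict) -> dict:
    """TF-IDF weights per item and attribute, following Eqs. (5)-(7).

    item_attrs maps an item ID to the set of its attributes
    (hypothetical input format), so every attribute occurs once per item.
    """
    n_items = len(item_attrs)
    # ||I_attr||: number of items that have each attribute
    items_with = {}
    for attrs in item_attrs.values():
        for attr in attrs:
            items_with[attr] = items_with.get(attr, 0) + 1
    weights = {}
    for item, attrs in item_attrs.items():
        weights[item] = {
            attr: (1 / len(attrs)) * math.log(n_items / items_with[attr])
            for attr in attrs
        }
    return weights
```

With three items described by `{"rock", "indie"}`, `{"rock"}` and `{"jazz"}`, the tag "jazz" receives the weight ln 3 for the third item, whereas "rock" receives only 0.5 · ln 1.5 for the first item, because it is more common and the item has two attributes.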
An audio recording is represented as follows: $i^{a} = (a_{1,i}, a_{2,i}, \ldots, a_{d,i})$, where $a_{k,i}$ corresponds to the TF-IDF weight of artist $a_k$ [14]. The user is represented as follows: $u^{a} = (a_{1,u}, a_{2,u}, \ldots, a_{d,u})$, where $a_{k,u}$ corresponds to the number of recordings picked by user $u$ performed by artist $a_k$.
To integrate FM data, we considered FM tags as follows: $i^{at} = (a_{1,i}, a_{2,i}, \ldots, a_{d,i}, t_{1,i}, t_{2,i}, \ldots, t_{q,i})$, where $t_{k,i}$ corresponds to the TF-IDF weight of tag $t_k$ [14]. The user vector is then denoted as follows: $u^{at} = (a_{1,u}, a_{2,u}, \ldots, a_{d,u}, t_{1,u}, t_{2,u}, \ldots, t_{q,u})$, where $t_{k,u}$ is the number of recordings picked by user $u$ having tag $t_k$.
The recommender system compares audio recordings’ vectors and a user vector using cosine similarity [8]:
$$cos(u, i) = \frac{u \cdot i}{||u|| \, ||i||}, \tag{8}$$
where $u$ and $i$ are user and item vectors. To suggest recordings, content-based filtering ranks recordings according to $cos(u, i)$. In our experiments, we used $cos(u^{a}, i^{a})$ for VK data and $cos(u^{at}, i^{at})$ for VK and FM data.
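The ranking step based on Eq. (8) might look as follows in Python. Vectors are stored sparsely as dictionaries from an attribute (artist or tag) to its weight; this representation, the function names and the returned 0 for empty vectors are assumptions made for the example.

```python
import math


def cosine(u: dict, i: dict) -> float:
    """cos(u, i) = (u . i) / (||u|| ||i||), as in Eq. (8), for sparse vectors."""
    dot = sum(weight * i.get(attr, 0.0) for attr, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_i = math.sqrt(sum(w * w for w in i.values()))
    if norm_u == 0 or norm_i == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_u * norm_i)


def rank(user_vec: dict, item_vecs: dict) -> list:
    """Order item IDs by decreasing cosine similarity to the user vector."""
    return sorted(item_vecs,
                  key=lambda item: cosine(user_vec, item_vecs[item]),
                  reverse=True)
```

Here the user vector holds counts of picked recordings per artist or tag, while the item vectors hold TF-IDF weights, as described above.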
5 Experiments
In this section, we detail experiments conducted to demonstrate whether the source domain improves serendipity and accuracy in the target domain when only items overlap.
5.1 Evaluation Metrics
To assess the performance of algorithms, we used two metrics: (1) Precision@K to measure accuracy and (2) a traditional serendipity metric Ser@K.
Precision@K is a commonly used metric to assess quality of recommended lists with binary relevance. In our datasets, recordings added by a user to his/her page are relevant, while the rest of the recordings are irrelevant to the user. Precision@K reflects the fraction of relevant recordings retrieved by a recommender system in the first K results. The metric is calculated as follows:
$$Precision@K = \frac{1}{||U||} \sum_{u \in U} \frac{||RS_u(K) \cap REL_u||}{K}, \tag{9}$$
where $U$ is a set of users, while $RS_u(K)$ is a set of top-K suggestions for user $u$. Recordings from the test set (ground truth) for user $u$ are represented by $REL_u$.
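Eq. (9) can be sketched in Python as follows. The input format (a ranked list of suggestions and a set of relevant test recordings per user) and the function name are assumptions made for the example.

```python
def precision_at_k(recommended: dict, relevant: dict, k: int) -> float:
    """Precision@K averaged over users, as in Eq. (9).

    recommended maps a user to a ranked list of suggestions (RS_u),
    relevant maps a user to the set of test-set recordings (REL_u).
    """
    total = 0.0
    for user, ranked in recommended.items():
        hits = len(set(ranked[:k]) & relevant.get(user, set()))
        total += hits / k
    return total / len(recommended)
```

For two users with one hit and no hits in their top-2 lists, the metric equals (1/2 + 0/2) / 2 = 0.25.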
The traditional serendipity metric is based on (1) a primitive recommender system, which suggests items known and expected by a user, and (2) a set of items similar to a user profile. Evaluated recommendation algorithms are penalized for suggesting items that are irrelevant, generated by a primitive recommender system or included in the set of items similar to a user profile. Similarly to [2], we used a slight modification of the serendipity metric:
$$Ser@K = \frac{1}{||U||} \sum_{u \in U} \frac{||(RS_u(K) \setminus PM \setminus E_u) \cap REL_u||}{K}, \tag{10}$$
where $PM$ is a set of suggestions generated by the primitive recommender system, while $E_u$ is a set of recordings similar to recordings picked by user $u$. We selected the 10 most popular recordings for $PM$, following one of the most common strategies [24, 15]. The set of items similar to a user profile, $E_u$, represents all the recordings that have common artists with recordings user $u$ picked. User $u$ can easily find recordings from set $E_u$ by artist name; we therefore regard these recordings as obvious.
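Eq. (10) differs from Precision@K only in that suggestions from the primitive recommender and obvious recordings are removed before counting hits. A minimal Python sketch under the same assumed input format (the function and parameter names are hypothetical) is:

```python
def serendipity_at_k(recommended: dict, relevant: dict,
                     primitive: set, expected: dict, k: int) -> float:
    """Ser@K averaged over users, as in Eq. (10).

    primitive is PM (e.g. the most popular recordings), expected maps a
    user to E_u (recordings sharing an artist with the user's profile).
    """
    total = 0.0
    for user, ranked in recommended.items():
        unexpected = set(ranked[:k]) - primitive - expected.get(user, set())
        total += len(unexpected & relevant.get(user, set())) / k
    return total / len(recommended)
```

If a user's top-4 list is `["a", "b", "c", "d"]`, the relevant recordings are `{a, b, c}`, `PM = {a}` and `E_u = {b}`, only `c` counts as serendipitous, giving 1/4.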
5.2 Results
Following the datasets sampling strategy in [8], we split each of our datasets into training and test datasets and applied 3-fold cross-validation. We selected 40% of the users who picked the most VK recordings, and chose 30% of their ratings as the testing dataset. We then regarded the rest of the ratings as the training dataset.
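One fold of the split described above could be sketched in Python as follows. The random choice of which ratings of a test user are held out, the fixed seed and the function name are assumptions made for the example; the 3-fold cross-validation itself is not shown.

```python
import random


def split_ratings(user_picks: dict, user_fraction: float = 0.4,
                  test_fraction: float = 0.3, seed: int = 0):
    """Hold out part of the ratings of the most active users.

    user_picks maps a user to a list of picked recordings (hypothetical
    input format). Returns (train, test) dictionaries of sets.
    """
    rng = random.Random(seed)
    by_activity = sorted(user_picks, key=lambda u: len(user_picks[u]), reverse=True)
    test_users = set(by_activity[:int(len(by_activity) * user_fraction)])
    train, test = {}, {}
    for user, picks in user_picks.items():
        picks = list(picks)
        if user in test_users:
            rng.shuffle(picks)  # assumed: held-out ratings are chosen at random
            n_test = int(len(picks) * test_fraction)
            test[user] = set(picks[:n_test])
            train[user] = set(picks[n_test:])
        else:
            train[user] = set(picks)
    return train, test
```

Users outside the selected 40% contribute all of their ratings to the training dataset.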
To compare the results of various baselines, we used offline evaluation. The recommender system suggested 30 popular VK recordings to each testing VK user, excluding recordings that the user has already added in the training set. In each approach the recommendation list consists of the same items. We chose popular items for evaluation, as the users are likely to be familiar with those items.
In this study, we demonstrate serendipity and accuracy improvements resulting from the source domain with three simple but popular algorithms: (1) POP, (2) Collaborative Filtering (CF), and (3) Content-Based Filtering (CBF). It is important to note that POP is a non-personalized recommendation algorithm, which orders items in the suggested list according to their popularity in the VK dataset. For the CF and the CBF algorithms, we obtained two performance results based on (1) data collected from VK and (2) data collected from both VK and FM.
- POP: ordering items according to their popularity using the VK dataset.
- CF(VK): item-based collaborative filtering using the VK dataset.
- CF(VKFM): item-based collaborative filtering using VK and FM datasets.
- CBF(VK): content-based filtering using the VK dataset.
- CBF(VKFM): content-based filtering using VK and FM datasets.
Figure 3 demonstrates the experimental results based on three datasets presented in section 3. From the figure we can observe that:
1. The source domain can improve serendipity in the target domain. On all datasets, CBF based on VK and FM data outperforms CBF based on only VK data in terms of serendipity. For collaborative filtering the situation is very similar, except for the decrease of serendipity for recommendation lists of length 10 and 15 on the 7% dataset. For the 50% dataset, the CF algorithm achieves 0.0156, 0.0147 and 0.0142 in terms of Ser@5, Ser@10 and Ser@15 based on VK data, while these numbers are 0.0190, 0.0164 and 0.0146 based on VK and FM data, making the improvement of 22.2%, 11.7% and 2.7%, respectively.
2. For collaborative filtering, the source domain can improve accuracy in the target domain when only items overlap. For the 100% dataset, the CF algorithm achieves 0.0208, 0.0196 and 0.0189 in terms of Precision@5, Precision@10 and Precision@15 based on VK data, while these numbers are 0.0271, 0.0260 and 0.0252 based on VK and FM data, making the improvement of 30.6%, 32.4% and 33.7%, respectively.
3. The improvement of accuracy declines with the growth of non-overlapping items for collaborative filtering. The improvement of CF in terms of Precision@5 decreases as follows: 30.6%, 6.1% and 6.0% using 100%, 50% and 7% datasets, respectively.
4. The source domain decreases accuracy of content-based filtering. For the 100% dataset, CBF based on VK and FM data decreases Precision@5, Precision@10 and Precision@15 by 31.9%, 24.0% and 11.2%, respectively.
5. Despite being accurate, the popularity baseline has very low serendipity. POP outperforms other algorithms in terms of accuracy on the 100% dataset. Meanwhile, the algorithm fails to suggest any serendipitous items in top-5 recommendations on each dataset.
[Figure 3: six panels plotting Precision@K and Ser@K against K (5, 10, 15) for POP, CF(VK), CF(VKFM), CBF(VK) and CBF(VKFM): (a) Precision@K, 100% dataset; (b) Ser@K, 100% dataset; (c) Precision@K, 50% dataset; (d) Ser@K, 50% dataset; (e) Precision@K, 7% dataset; (f) Ser@K, 7% dataset.]
Fig. 3: Precision@K and Ser@K for experiments conducted using datasets with different fractions of non-overlapping items.
According to observations 1 and 2, CF(VKFM) outperforms CF(VK) in terms of both serendipity and accuracy. The improvement of accuracy illustrates the global correlation of user preferences in different domains [22, 9]. Although the data belongs to different domains, user ratings from the source domain indicate similarities between items that improve the recommendation performance in the target domain. The improvement of serendipity is caused by the growth of accuracy and by different popularity distributions in VK and FM datasets.
Observation 3 supports the claim [9] that the improvement caused by the source domain rises with the growth of the overlap between target and source domains. The decrease of accuracy for the CF algorithm with the FM data is caused by the different lengths of item vectors in source and target domains, where vectors of FM items contain significantly more dimensions than vectors of VK items.
Observations 1 and 4 indicate that the FM data positively contributes to serendipity and negatively affects accuracy of the content-based filtering algorithm. As users tend to add recordings of the same artist, CBF(VK) significantly outperforms CBF(VKFM) in terms of accuracy. However, most recordings suggested by CBF(VK) are obvious to a user, as the user can find these recordings him/herself. As a result, the serendipity of CBF(VK) is very low. FM tags help recommend similar recordings of artists novel to the user. Recordings that share the same FM tags do not necessarily share the same artists, which results in the decrease of accuracy and increase of serendipity.
Observation 5 indicates that POP has very low serendipity, despite being accurate. Popular recommendations are likely to be accurate, as users tend to add familiar recordings. However, popular recordings are widely recognized by users and therefore regarded as obvious.
6 Conclusion
In this paper, we first initially investigated the cross-domain recommendation problem
in terms of serendipity. We collected data from VK and FM and built three datasets that
contain different fractions of non-overlapping items from source and target domains.
We then conducted extensive experiments with collaborative filtering and content-based
filtering algorithms to demonstrate the impact of source domains on performance gains
of the target domain.
According to our results, the source domain can improve serendipity in the target
domain for both collaborative filtering and content-based filtering algorithms when
only items overlap at the system level. The integration of the source domain decreased
accuracy for content-based filtering and increased accuracy for collaborative filtering.
Similarly to [9], our results indicate that the more items overlap between the source and
target domains relative to the whole dataset, the higher the improvement of accuracy
for collaborative filtering.
7 Acknowledgement
The research at the University of Jyväskylä was performed in the MineSocMed project,
partially supported by the Academy of Finland, grant #268078. The communication of
this research was supported by Daria Wadsworth.
References
1. Abel, F., Herder, E., Houben, G.J., Henze, N., Krause, D.: Cross-system user modeling and
personalization on the social web. User Modeling and User-Adapted Interaction 23, 169–209
(2013)
2. Adamopoulos, P., Tuzhilin, A.: On unexpectedness in recommender systems: Or how to
better expect the unexpected. ACM Transactions on Intelligent Systems and Technology 5,
1–32 (2014)
3. Berkovsky, S., Kuflik, T., Ricci, F.: Mediation of user models for enhanced personalization
in recommender systems. User Modeling and User-Adapted Interaction 18, 245–286 (2008)
4. Cantador, I., Cremonesi, P.: Tutorial on cross-domain recommender systems. In: Proceedings
of the 8th ACM Conference on Recommender Systems. pp. 401–402. New York, NY, USA
(2014)
5. Cantador, I., Fernández-Tobías, I., Berkovsky, S., Cremonesi, P.: Cross-domain
recommender systems, pp. 919–959. Springer, Boston, MA (2015)
6. Celma Herrada, Ò.: Music recommendation and discovery in the long tail. Ph.D. thesis,
Universitat Pompeu Fabra (2009)
7. Cremonesi, P., Tripodi, A., Turrin, R.: Cross-domain recommender systems. In: 11th IEEE
International Conference on Data Mining Workshops. pp. 496–503 (2011)
8. Ekstrand, M.D., Riedl, J.T., Konstan, J.A.: Collaborative filtering recommender systems.
Foundations and Trends in Human-Computer Interaction 4, 81–173 (2011)
9. Fernández-Tobías, I., Cantador, I., Kaminskas, M., Ricci, F.: Cross-domain recommender
systems: A survey of the state of the art. In: Proceedings of the 2nd Spanish Conference on
Information Retrieval. pp. 187–198 (2012)
10. Iaquinta, L., Semeraro, G., de Gemmis, M., Lops, P., Molino, P.: Can a recommender system
induce serendipitous encounters? IN-TECH (2010)
11. Kille, B., Hopfgartner, F., Brodt, T., Heintz, T.: The plista dataset. In: Proceedings of the 2013
International News Recommender Systems Workshop and Challenge. pp. 16–23. ACM, New
York, NY, USA (2013)
12. Kotkov, D., Veijalainen, J., Wang, S.: Challenges of serendipity in recommender systems. In:
Proceedings of the 12th International conference on web information systems and technolo-
gies. SCITEPRESS (2016)
13. Kotkov, D., Wang, S., Veijalainen, J.: Cross-domain recommendations with overlapping
items. In: Proceedings of the 12th International Conference on Web Information Systems
and Technologies. vol. 2, pp. 131–138. SCITEPRESS (2016)
14. Lops, P., de Gemmis, M., Semeraro, G.: Recommender Systems Handbook, chap. Content-
based Recommender Systems: State of the Art and Trends, pp. 73–105. Springer US, Boston,
MA (2011)
15. Lu, Q., Chen, T., Zhang, W., Yang, D., Yu, Y.: Serendipitous personalized ranking for top-n
recommendation. In: Proceedings of the IEEE/WIC/ACM International Joint Conferences
on Web Intelligence and Intelligent Agent Technology. pp. 258–265. IEEE Computer
Society, Washington, DC, USA (2012)
16. Remer, T.G.: Serendipity and the three princes: From the Peregrinaggio of 1557, p. 20. Uni-
versity of Oklahoma Press (1965)
17. Ricci, F., Rokach, L., Shapira, B.: Recommender Systems Handbook, chap. Introduction to
Recommender Systems Handbook, pp. 1–35. Springer US (2011)
18. Sahebi, S., Brusilovsky, P.: Cross-domain collaborative recommendation in a cold-start con-
text: The impact of user profile size on the quality of recommendation. In: User Modeling,
Adaptation, and Personalization, pp. 289–295. Lecture Notes in Computer Science, Springer
Berlin Heidelberg (2013)
19. Sang, J.: Cross-network social multimedia computing. In: User-centric Social Multimedia
Computing, pp. 81–99. Springer Theses, Springer Berlin Heidelberg (2014)
20. Shapira, B., Rokach, L., Freilikhman, S.: Facebook single and cross domain data for recom-
mendation systems. User Modeling and User-Adapted Interaction 23, 211–247 (2013)
21. Tacchini, E.: Serendipitous mentorship in music recommender systems. Ph.D. thesis,
Università degli Studi di Milano (2012)
22. Winoto, P., Tang, T.: If you like the devil wears prada the book, will you also enjoy the
devil wears prada the movie? a study of cross-domain recommendations. New Generation
Computing 26, 209–225 (2008)
23. Zhang, Y.C., Séaghdha, D.O., Quercia, D., Jambor, T.: Auralist: Introducing serendipity into
music recommendation. In: Proceedings of the 5th ACM International Conference on Web
Search and Data Mining. pp. 13–22. ACM, New York, NY, USA (2012)
24. Zheng, Q., Chan, C.K., Ip, H.H.: An unexpectedness-augmented utility model for making
serendipitous recommendation. In: Advances in Data Mining: Applications and Theoretical
Aspects, vol. 9165, pp. 216–230. Springer International Publishing (2015)