
Similarity Measures for Recommender Systems:

Drawbacks and Neighbors Formation

Mohammad Al-Shamri ( mohamad.alshamri@gmail.com )

King Khalid University

Research Article

Keywords: Web-based services, Collaborative recommender system, Similarity measures, User profile, Web Personalization

Posted Date: September 23rd, 2022

DOI: https://doi.org/10.21203/rs.3.rs-2091938/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.


Similarity Measures for Recommender Systems:

Drawbacks and Neighbors Formation

Mohammad Yahya H. Al-Shamri1,2

1Computer Engineering Department, College of Computer Science, King Khalid University, Abha, Saudi Arabia

2Electrical Engineering Department, Faculty of Engineering, Ibb University, Ibb, Yemen.

E-mail: mohamad.alshamri@gmail.com.

Abstract

Similarity measures are crucial for electing neighbors for the users of recommender systems. However, the massive amount of data processed in such applications may hide the inner nature of the utilized similarity measures. This paper studies three standard similarity measures in depth over many synthetic and real datasets. The aim is to uncover the hidden nature of these measures and assess their suitability for recommender systems under different scenarios. Moreover, we propose a novel similarity measure called the normalized sum of multiplications (NSM) and two variants of it. For experimentation, we examine all measures at three levels: a toy example, synthetic datasets, and real-world datasets. The results show that the Pearson correlation coefficient and cosine similarity sometimes exclude similar neighbors in favor of less valuable ones. The former measures the direction of the correlation, while the latter measures the angle between the rating vectors. However, neither direction nor angle is similarity itself; each is only an indication of it and can take the same value for two far-apart vectors. The proposed similarity measure, in contrast, consistently reveals the exact similarity and tracks the closest neighbors. The results prove its robustness and its very good predictive accuracy compared to the traditional measures.

Key Words — Web-based services, Collaborative recommender system, Similarity measures,

User profile, Web Personalization.

1. Introduction

The web has become an inevitable way of handling many services. One such important service is the recommendation of online items and services, which is growing very fast due to its ability to shrink the search space and avoid overwhelming users with a massive amount of irrelevant information [1]. In fact, a recommender system builds a bidirectional relationship between users and service providers. The success of such a relationship benefits both sides and pushes researchers toward developing effective strategies. The suggested items are likely to interest the active user, and this gratified user may then become a loyal user of the system. For recommender systems to succeed, powerful and efficient matching tools are needed to compare users and guide them toward close neighbors [2,3]. Since the matching process is essential for the success of such systems, many efforts have been directed toward proposing new types of similarity measures. Some measures are inherited from the information retrieval domain, while others were designed solely for recommender systems [4].

Typically, a similarity function condenses a multitude of scores into a single value representing the degree of resemblance between two users or items. This value is significant because it determines how strongly that user or item influences the predicted ratings. Hence, it helps active users find relevant and exciting information in the form of suggestions, and it therefore has a direct effect on the correctness of the generated list. Obtaining a representative similarity value enhances the system's personalization level and achieves the primary goal of such systems [5].

This paper explores the most common similarity measures for collaborative filtering, namely the Pearson correlation coefficient (PCC), cosine similarity (COS), and mean square difference-based similarity (MSD), and studies their performance in the recommendation field. Given the massive amount of information available for recommendation, many similarity measures may appear to perform well. However, once their behavior is analyzed deeply, one may find conflicting cases and sometimes unacceptable performance. These problems may bias the system's performance and limit its diversity toward different situations and moods. It is not easy to identify such issues by examining similarity measures on a massive real-world dataset, as such data is usually biased toward users' common sense. For example, almost everyone likes the film Titanic and scores it highly, so suggesting it to anyone has a high chance of being a correct prediction. Once a user profile contains a reasonable ratio of such items, the similarity measure's role deteriorates and becomes minor; any random value may present good results. We need to dig into the user vector and assume many synthetic cases to uncover this matter. A general similarity measure should handle all possible kinds of users. Therefore, we test the examined similarity measures on several synthetic datasets with various characteristics. First, we assume uniformly distributed samples; then we augment the same dataset with one exact copy and then with five identical copies of each active user. This arrangement allows us to investigate the hidden behavior of each measure under different scenarios. Finally, we assume a single-mood synthetic dataset, which is very important for those users who prefer to rate only one type of item, possibly what they like, and give those items the same value. Hence, we study the performance of the basic similarity measures on synthetic and real datasets, both for individual users and for the system as a whole. Based on the findings of this study, we go one step further and propose a novel similarity measure that alleviates the identified problems, handles all users fairly, and achieves good performance for recommender systems. The contributions of this work are:

1. Studying and analyzing many similarity measures to highlight the weaknesses of each one, if any, and identify the reasons for such behavior.

2. Creating several synthetic datasets to determine the suitability of many similarity measures for recommendation applications.

3. Proposing a novel similarity measure that positively affects recommendation performance.

The remainder of this paper discusses related work in Section 2. The most common similarity measures for recommender systems are discussed in Section 3, which covers the range of similarity scores and its relationship to the number of stars in the system, and walks through a toy example to give an initial view of the performance of each similarity function under different scenarios. Section 4 presents the proposed similarity measure and its variants and spotlights their effect on system performance. For further verification, we devote two sections to experiments: Section 5 tests the similarity measures on synthetic data to observe their performance on data with known characteristics, covering the experimental setup, experiments, and evaluation procedure, while Section 6 tests them on two real datasets. We conclude the paper in Section 7 with some directions for future work.

2. Related Work

Matching users or items is the cornerstone of a successful recommendation process. Without an efficient matching process, the system may follow unrelated users or items and hence provide misleading information as an output. Bag et al. [6] classified similarity measures into two types: traditional and heuristic. Under this classification, all the similarity measures examined in this paper are traditional. Bagchi [2] used Apache Mahout to investigate the quality of many similarity measures used to match collaborative filtering users, concluding that the system's performance is highly dependent on the metric used for similarity measurement and that Euclidean-based similarity outperformed the others in error, recall, and precision.

The literature takes three major directions in dealing with similarity measures. One direction tries to identify the limitations and drawbacks of the traditional similarity measures [7-14], as summarized in Table 1, which lists each problem's code, name, a brief description, and its references. The second direction merges more than one similarity measure to compensate for some drawbacks. The third direction finds it easier to propose new similarity measures.

Some drawbacks of the existing similarity measures, especially PCC and COS, are highlighted by Patra et al. [10]. These drawbacks are related to the common ratings and their number, local information, and the utilization of the user ratings. They concluded that these measures cannot generate trustworthy neighbors if the rating matrix is sparse. Jain et al. [15] noticed that PCC shows low similarity scores for similar patterns and high similarity scores for different patterns. In addition, they found that MSD degrades accuracy because it ignores the number of shared histories between users, and that Jaccard MSD cannot capture the actual preference difference between users. Many references that studied the limitations of traditional similarity measures are listed in [11,13]. They argued that PCC and COS focus on the direction and ignore the length of the user vectors. Guo et al. [11] highlighted four problems of traditional similarity measures, while Tan and He [13] added three more. This work identifies three further problems, concerning COS, MSD, and NSM, and adds them to the list: COS gives full marks to two opposite ratings if they appear consistently in the profile, while MSD considers only the difference without regard to its position on the rating scale.

Morozov and Zhong [8] listed many drawbacks of traditional similarity measures. They inferred that COS is not trustworthy because it gives a total similarity value for a single common rating. Moreover, they illustrated that PCC is undefined for any uniform vector, and that, directions apart, MSD emphasizes distance with no rating adjustment. They concluded that similarity functions relying on covariance give a value of 0 whenever there are only two common ratings, regardless of those ratings. Hence, they suggested combining PCC and MSD to overcome some of the drawbacks. He and Luo [7] argued that the linear correlation between users causes many weaknesses in traditional similarity metrics. To solve this, they proposed a new similarity measure based on mutual information to exploit the nonlinear dependency between items. Similarly, Sheugh and Alizadeh [9] investigated PCC as a similarity measure, highlighted its limitations, and proposed a new measure to mitigate them.

Table 1: Identified drawbacks/problems for traditional similarity measures.


Some authors proposed and explored different approaches for calculating the similarity degree between users. Karypis [16] discussed variants of similarity measures that use conditional probability to compute the similarity between items or users. A simple asymmetric similarity measure based on the cardinalities of the common set and the users' own sets was proposed by Millan et al. [17]. Bobadilla et al. [18] used hidden attributes of the recommendation process to propose a singularity measure that improves predictions; still, this measure is limited by the number of singularities in the shared history between users. Guo et al. [11] proposed a Bayesian similarity that considers both the direction and the length of the user vectors. Tan and He [13] exported the principle of physical resonance into the similarity process to overcome the low number of shared histories between users; they proposed a parameterized similarity measure that considers the distance between users, common ratings, and uncommon ratings. Mu et al. [19] used the Hellinger distance to mitigate the effect of the sparsity problem on the similarity computation. This measure yields a global similarity, and hence a weighted sum of local and global similarities is used for predictions.

Nonlinear similarity models with three semantic heuristics, proximity, impact, and popularity (PIP), were proposed by Ahn [20]. Later, Liu et al. [21] argued that the PIP model may frequently degrade the ratings in some situations and therefore proposed another model that uses a different set of heuristics: proximity, significance, and singularity. These models show adaptive behavior, but they are over-dependent on the co-rated items. Al-Shamri and Bharadwaj [22] used the fuzzy concordance/discordance principle as a similarity measure for memory-based recommenders; it again relies on the common set of ratings between two users. Bag et al. [6] argued that considering only common ratings is inappropriate for generating trustworthy neighbors, so they proposed variants of the Jaccard index that consider all user ratings irrespective of whether they are common. Ayub et al. [14] proposed a similarity measure based on the Jaccard index, introducing another term to account for the users' average ratings. Schwarz et al. [23] inverted the Euclidean distance into a similarity measure for finding the coincidence between users. Pirasteh et al. [24] proposed an asymmetric similarity function based on the common ratings, normalized by the cardinality of the active user's set. Gazdar and Hidri [3] introduced a new similarity function based on the common ratings and the number of ratings given by the two users, using nonlinear systems, linear differential equations, and integrals to transform user preferences into a similarity value.

Finally, some authors explore performance under other parameters, such as sparsity or the cardinality of the common set. For example, Jain et al. [4] reviewed multifarious similarity measures and analyzed their performance for different sparsity percentages. They found that PCC and COS may give false indications of similarity, while the best results are obtained with measures relying on the Minkowski distance. Hassanieh et al. [25] studied the error performance of many similarity measures and deduced that system performance differs across sparsity levels. Moreover, Stephen et al. [26] addressed sparse recommendation data and its effect on similarity calculations, arguing that finding two vectors with common ratings is even sparser; they therefore studied the hierarchical categorization of items to obtain a clear overview of user interests. Al-Shamri [12] studied the effect of the common set cardinality on different similarity measures, examined this effect using empirical examples, and inferred some weaknesses of PCC and COS.

3. Similarity Measures for Recommendation Process

Similarity measures guide the system in identifying a set of neighbors for the active user at hand. PCC, the most common similarity measure for collaborative recommendation, measures how linearly two users are related to each other [4,6,10]:

$$PCC(\mathbf{u}_x, \mathbf{u}_y) = \frac{\sum_{s_k \in S_{x,y}} (r_{x,k} - m_x)(r_{y,k} - m_y)}{\sqrt{\sum_{s_k \in S_{x,y}} (r_{x,k} - m_x)^2}\,\sqrt{\sum_{s_k \in S_{x,y}} (r_{y,k} - m_y)^2}} \qquad (1)$$

where $r_{x,k}$ is the rating of user $\mathbf{u}_x$ for item $s_k$, $m_x$ is the mean rating of $\mathbf{u}_x$, and $S_{x,y}$ is the set of items rated in common by $\mathbf{u}_x$ and $\mathbf{u}_y$. On the other hand, COS finds the dot product between the vectors of two users [4,10]. It is used by many applications, including YouTube and Amazon:

$$COS(\mathbf{u}_x, \mathbf{u}_y) = \frac{\sum_{s_k \in S_{x,y}} r_{x,k}\, r_{y,k}}{\sqrt{\sum_{s_k \in S_{x,y}} r_{x,k}^2}\,\sqrt{\sum_{s_k \in S_{x,y}} r_{y,k}^2}} \qquad (2)$$

In fact, COS finds the alignment between two vectors, not the rating agreement [8]. Finally, the mean square distance measures the distance between two vectors [6,9,10,27]. This measure puts more emphasis on significant differences than on minor ones:

$$dis(\mathbf{u}_x, \mathbf{u}_y) = \frac{\sum_{s_k \in S_{x,y}} (r_{x,k} - r_{y,k})^2}{|S_{x,y}| \times R_{\max}^2} \qquad (3)$$

We divide the sum of squared differences by $|S_{x,y}| \times R_{\max}^2$, where $R_{\max}$ is the maximum rating scale value, to keep the result normalized. The distance can be transformed into a similarity value by subtracting it from one. Accordingly, the MSD similarity measure becomes [6,10]:

$$MSD(\mathbf{u}_x, \mathbf{u}_y) = 1 - dis(\mathbf{u}_x, \mathbf{u}_y) \qquad (4)$$

The following subsections discuss the range values of the similarity score and then present a toy example as an initial test of the performance of the different similarity measures on different profiles.
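To make the three definitions concrete, the sketch below implements Eqs. (1)-(4) in Python over the common items of two rating lists. This is a minimal sketch; the function names, the `None`-for-unrated convention, and the 5-star default are our own assumptions, not part of the paper.

```python
import math

def common_items(ux, uy):
    """Indices of items rated by both users (None marks an unrated item)."""
    return [k for k in range(len(ux)) if ux[k] is not None and uy[k] is not None]

def pcc(ux, uy):
    """Pearson correlation over the common ratings (Eq. 1); None if undefined."""
    S = common_items(ux, uy)
    mx = sum(ux[k] for k in S) / len(S)
    my = sum(uy[k] for k in S) / len(S)
    num = sum((ux[k] - mx) * (uy[k] - my) for k in S)
    den = math.sqrt(sum((ux[k] - mx) ** 2 for k in S)) * \
          math.sqrt(sum((uy[k] - my) ** 2 for k in S))
    return num / den if den else None  # undefined for uniform vectors

def cos_sim(ux, uy):
    """Cosine similarity over the common ratings (Eq. 2)."""
    S = common_items(ux, uy)
    num = sum(ux[k] * uy[k] for k in S)
    den = math.sqrt(sum(ux[k] ** 2 for k in S)) * math.sqrt(sum(uy[k] ** 2 for k in S))
    return num / den

def msd_sim(ux, uy, r_max=5):
    """MSD similarity (Eqs. 3-4): one minus the normalized mean squared distance."""
    S = common_items(ux, uy)
    dis = sum((ux[k] - uy[k]) ** 2 for k in S) / (len(S) * r_max ** 2)
    return 1 - dis

# Two single-mood users on a 5-star scale: PCC is undefined, COS gives full marks.
u_all_fives, u_all_ones = [5, 5, 5, 5], [1, 1, 1, 1]
print(pcc(u_all_fives, u_all_ones))      # None (zero deviation from the mean)
print(cos_sim(u_all_fives, u_all_ones))  # 1.0 (parallel vectors)
print(msd_sim(u_all_fives, u_all_ones))  # 0.36 = 1 - 16/25
```

The example already previews the single-mood behavior discussed in the next subsection: PCC breaks down, COS cannot tell the two extremes apart, and only MSD reflects the distance.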

3.1 Range Values of the Similarity Score

The range of the generated similarity score is significant for understanding the behavior of different similarity measures. Usually, similarity measures generate values ranging from 0 to 1 for unipolar measures and from -1 to 1 for bipolar ones. However, recommender systems rarely use 0 as a rating scale value, and hence many direct similarity measures start from some value greater than 0, as with MSD. Logically, the maximum range value is obtained when comparing identical cases, while the minimum is obtained between the two extreme cases, i.e., the maximum and the minimum of the rating scale.

To study this range for different measures, let us assume a single-mood user $U_{R_i} = (R_i, R_i, \ldots, R_i)$, where $R_i$ is a rating scale value. Table 2 lists the similarity results between $U_{R_{\max}}$ (here $R_{\max} = 5$) and all other single-mood users for three different rating scales. As is evident from the results, PCC cannot help in such cases, while COS generates the same value for all of them. On the other hand, MSD gives roughly correct values for the maximum and the minimum. The minimum value changes with the number of stars on the rating scale, approaching zero as the number of stars increases.
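The shrinking minimum can be checked directly: from Eq. (3), the distance between the two extreme single-mood users on an N-star scale is (N-1)²/N², so the minimum MSD similarity is 1 - (N-1)²/N². A small sketch, assuming a 1-to-N rating scale:

```python
# Minimum MSD similarity between the two extreme single-mood users
# (all ratings N vs. all ratings 1) on an N-star scale: 1 - (N-1)^2 / N^2.
for n_stars in (5, 7, 10):
    min_msd = 1 - (n_stars - 1) ** 2 / n_stars ** 2
    print(n_stars, round(min_msd, 3))  # 5 -> 0.36, 7 -> 0.265, 10 -> 0.19
```

The value decreases with the number of stars but never reaches zero, matching the observation above.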

3.2 Toy Example

This example generates a dataset of 15 users rating 15 items on a 5-star rating scale, as shown in Table 3. This small dataset contains users with different rating histories. For clarity, we consider 4 and 5 to be pure positive ratings and 2 and 1 to be pure negative ratings. The rating 3 may be considered neutral, but it is usually added to the broad category of positive ratings. The toy example includes many opposite profiles so that the performance of the similarity measures on opposite users can be compared. The inclusion of such users allows us to observe the behavior of each similarity measure with users of contrasting moods.

In the toy example, user 1, $u_1$, gives uniformly distributed ratings, while user 2, $u_2$, behaves oppositely. These two users represent the ideal case, where users utilize the rating scale evenly. For generating the opposite profile, we calculate the counter-value of each rating value as:

$$c(r_{x,k}) = N + 1 - r_{x,k} \qquad (5)$$

where $N$ is the maximum star on the rating scale. User 3, $u_3$, has positive ratings (3, 4, 5) only, representing the dominant case of recommender systems where users navigate and then rate what they liked in advance. The opposite user for this case is user 4, $u_4$. At the extreme points, user 5, $u_5$, has pure negative ratings (1, 2), while user 6, $u_6$, is the opposite with pure positive ratings (4, 5). User 7, $u_7$, is collinear with $u_5$ by a factor of 2. A user with only one rated item is given by user 8, $u_8$. User 9, $u_9$, is designed to study the cross-value problem. Five more users are added to represent single-mood ratings, one for each rating scale value.

Table 2: Range values of different similarity measures for 5-, 7-, and 10-star rating scales.
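Eq. (5) applies item by item across a whole profile; the helper below is a sketch (the name `counter_profile` and the `None`-for-unrated convention are our own):

```python
def counter_profile(ratings, n_stars=5):
    """Opposite profile via the counter-value c(r) = N + 1 - r (Eq. 5);
    unrated items (None) stay unrated."""
    return [None if r is None else n_stars + 1 - r for r in ratings]

print(counter_profile([1, 2, 3, 4, 5]))  # [5, 4, 3, 2, 1]
```

Note that on a 5-star scale the midpoint rating 3 is its own counter-value, which is why countering flips only the non-neutral ratings.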

Results and Discussions

The similarity values between different users of the toy example are listed in Table 4. Since all examined similarity measures are symmetric, we allocate the upper triangle of Table 4 to three similarity measures and the lower triangle to the other three. We put 2 for an undefined (incomputable) similarity value. The following points illustrate the principal observations and conclusions:

Table 3: Dataset of Example 1: 15 users rating 15 items on a 5-star rating scale.

1. The bottom right-corner red subtable illustrates the similarity values for inter- and intra-single-mood users, where:

• PCC cannot match these users because it calculates the rating deviation from the user mean, which is always zero (PB4). Hence, PCC is not suitable for these cases.

• COS always gives full marks, even for opposite moods. This behavior is due to the nature of COS, which measures the cosine of the angle between the two vectors; if both have a single positive or negative mood, they are parallel, with a zero angle.

• MSD gives reasonable, gradual values that depend on how large the difference is, because it measures the rating difference, not an angle or direction.

2. The results for the opposite cases $(u_1, u_2)$, $(u_3, u_4)$, and $(u_5, u_6)$ reveal that PCC quickly identified them, while COS and MSD gave them high values even though they are opposite. COS assigned high values because the two vectors are nearly parallel, so only a minimal angle separates them.

Table 4: Similarity values for users of the toy example for different similarity measures.

3. The common ratings between $(u_2, u_6)$ are (5,4,3,5) and (4,2,4,4), indicating a positive relationship. Similarly, the common ratings between $(u_1, u_5)$, (1,2,3,1) and (2,1,2,2), indicate a negative relationship. However, PCC gave both pairs -0.493, treating them as opposite users, which is misleading. This confirms the cross-value problem (PB9) of PCC.

4. The common ratings between $(u_1, u_3)$ are (4,4,5,5) and (5,5,4,4), which indicate a strong positive relationship. However, PCC degraded the similarity to 0.222, which again confirms the sensitivity of PCC to cross values. On the other hand, COS and MSD gave almost full marks.

5. MSD cannot grasp user preferences if the difference is the same (PB11). For example, the MSD values for the pairs $(u_{12}, u_{15})$ and $(u_{11}, u_{14})$ are both 0.64. Similarly, the values for $(u_{12}, u_{14})$, $(u_{11}, u_{13})$, and $(u_{13}, u_{15})$ are all 0.84. Some of these pairs belong to the same category (positive or negative) while others belong to different categories, so treating them identically is not logical.

6. PB3 is evident for $u_8$, which got an undefined value from PCC and a full value from COS.

7. The PCC and COS values of all users with the collinear users $u_5$ and $u_7$ are equal, demonstrating problem PB5.

8. MSD assigned a moderate value to the two extreme users $u_{11}$ and $u_{15}$, which is better than PCC and COS; however, it should be zero.

9. The common history between $u_2$ and $u_5$ is (2,1,2,2) and (5,4,3,5), showing opposite-mood users. However, COS gave them a very high mark of 0.961 due to the free-value problem (PB10). A similar situation occurs for $(u_6, u_7)$, $(u_{11}, u_{15})$, and all opposite users.

10. The single co-rated item problem (PB2) is very clear from the results of $(u_1, u_9)$ and $(u_2, u_9)$, where PCC gave +1 or -1 and COS always gave 1.

11. The MSD results for $(u_6, u_9)$ and $(u_6, u_{14})$ demonstrate the unequal-length problem (PB6), where a user with a dense history receives an unfair evaluation in favor of a user with a shallow history.

12. Two users, $u_1$ and $u_2$, have uniformly distributed ratings. However, they got different MSD values with $u_{10}$ because of the common snap of ratings (PB8).

From the results of this example, we can conclude the following main points:

• PCC shows erratic behavior due to its dependency on the ratings' means.

• COS calculates the cosine of the angle between the vectors of the two users and therefore cannot always differentiate between the user categories.

• MSD treats equal differences identically, which is inappropriate for the recommendation process.

The listed pitfalls and the results of the toy example encouraged us to go further and propose a new similarity measure in the following section.

4. Normalized Sum of Multiplications Similarity Measures

PCC and COS mainly indicate the direction of the correlation between users, not their exact similarity. Both measures rely on different combinations of a sum of multiplications: PCC multiplies the rating deviations from each user's mean, while COS multiplies the ratings directly, and both normalize the result by the norms of their respective combinations. These combinations, for example, make COS an indicator of the angle between the users, which may be the same for two far-apart users. To resolve this, we propose a normalized sum of multiplications that modifies the COS similarity to solve this issue.

Definition 1: The normalized sum of multiplications (NSM) between two vectors (users' profiles) $\mathbf{u}_x$ and $\mathbf{u}_y$, having $S_{x,y}$ common items, is defined by:

$$NSM(\mathbf{u}_x, \mathbf{u}_y) = \frac{\sum_{s_k \in S_{x,y}} r_{x,k}\, r_{y,k}}{\sum_{s_k \in S_{x,y}} \big(\max(r_{x,k}, r_{y,k})\big)^2} \qquad (6)$$


Here the denominator is the sum of the squares of the maximum of the individual ratings, which preserves the exact similarity between different users. The range values of NSM are also listed in Table 2. It starts from some value greater than 0 and gives correct values for the maximum. NSM differs from MSD in how the minimum develops: the minimum value of both MSD and NSM changes with the number of stars on the rating scale, approaching zero as the number of stars increases, but NSM always generates a smaller minimum than MSD, making it better at capturing opposite cases.
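A minimal sketch of Eq. (6) in Python; the function name and the `None`-for-unrated convention are our own assumptions:

```python
def nsm(ux, uy):
    """Normalized sum of multiplications (Eq. 6) over the common ratings."""
    S = [k for k in range(len(ux)) if ux[k] is not None and uy[k] is not None]
    num = sum(ux[k] * uy[k] for k in S)
    den = sum(max(ux[k], uy[k]) ** 2 for k in S)
    return num / den

# Extreme single-mood users on a 5-star scale: NSM reaches its minimum, 0.2.
print(nsm([5, 5, 5], [1, 1, 1]))  # 0.2 = 15/75
# Close positive users: 4*5 over max(4,5)^2 per item gives 0.8.
print(nsm([4, 4], [5, 5]))        # 0.8 = 40/50
```

Unlike COS, the denominator grows with the larger rating of each pair, so the score drops whenever the two ratings disagree, whatever the direction of the disagreement.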

4.1 Normalized Sum of Multiplications Similarity Measure for the Users of the Toy Example

The results of NSM for the users of the toy example are also shown in Table 4. We trace the same points as in the previous analysis using the NSM similarity for this example.

1. The NSM similarity values in the bottom right-corner red subtable of Table 4 are gradual across users, ranging from 0.5 for close users in the negative category to 0.8 for close users in the positive category.

2. The similarity values between opposite users are lower than with MSD. The similarity value between the two extreme users, $u_{11}$ and $u_{15}$, is 0.2, the minimum possible value for this rating scale under the NSM similarity measure (Table 2).

3. The similarity value for both $(u_2, u_6)$ and $(u_1, u_5)$ is 0.571, representing the degree of the positive/negative relationship between these users.

4. The similarity value between $u_1$ and $u_3$, which have a strong positive relationship, is 0.821, indicating the degree of positiveness between them.

5. Unlike MSD, NSM grasps user preferences even when the difference is the same, avoiding PB11. For example, the NSM values for $(u_{12}, u_{15})$ and $(u_{11}, u_{14})$ are 0.4 and 0.25, respectively. Similarly, the NSM values for $(u_{12}, u_{14})$, $(u_{11}, u_{13})$, and $(u_{13}, u_{15})$ are 0.5, 0.33, and 0.6. This behavior preserves the category difference between users.

6. NSM and MSD gave the same full similarity for $u_8$ with $u_3$. However, they behave differently for $u_8$ and $u_4$, as these have opposite ratings; the value given by NSM is more accurate than that of MSD.

7. The NSM and MSD values of users with the collinear users $u_5$ and $u_7$ are not equal. Again, the value given by NSM is more realistic than that of MSD.

8. NSM assigned a meager value to the two extreme users $u_{11}$ and $u_{15}$, which is better than MSD.

9. NSM does not suffer from the free-value problem (PB10). It gave logical values for the different cases; the similarities between opposite users are far smaller than with COS or MSD.

10. NSM and MSD do not suffer from PB2.

11. The unequal-length problem (PB6) is common to many similarity measures due to the common-history restriction in calculating the similarity score.

12. The common snap of ratings between users makes NSM suffer from PB8.

From the results of this example on NSM, we can conclude that:

• All opposite cases have similarity values less than 0.5.

• The minimum value depends on the rating scale.

• NSM can easily differentiate between the user categories by treating equal differences according to their classes.

4.2 Upper-scale and Lower-scale Normalized Sum of Multiplications Similarity Measures

NSM emphasizes the similarity between close positive users and downgrades the similarity between close negative users, due to the multiplication and maximum operators applied to top or bottom values of the rating scale. For example, 4×5=20 with a maximum of 5, while 1×2=2 with a maximum of 2. Both pairs are one unit apart, yet their NSM values differ: 0.8 and 0.5. Hence, NSM encourages the recommendation of liked items by promoting the similarity of positive users and degrading the similarity of negative users. This behavior may be acceptable for some applications with a huge amount of data, but not for others. To resolve this problem, we have to consider the counter-values of the ratings when performing the multiplication. In fact, the previous toy example represents counter-values, where 5 is the counter-value of 1 and 4 is the counter-value of 2 on the 5-star rating scale. This discriminatory behavior encourages us to propose two other versions of NSM: the first shifts the results toward the top, while the second shifts them downward.

Definition 2: The counter-value normalized sum of multiplications between two vectors (users' profiles) $\mathbf{u}_x$ and $\mathbf{u}_y$, having $S_{x,y}$ common items, is defined by:

$$NSM^{c}(\mathbf{u}_x, \mathbf{u}_y) = NSM\big(c(\mathbf{u}_x), c(\mathbf{u}_y)\big) \qquad (7)$$

Definition 3: The upper-scale normalized sum of multiplications (UNSM) between two vectors (users' profiles) $\mathbf{u}_x$ and $\mathbf{u}_y$, having $S_{x,y}$ common items, is defined by:

$$UNSM(\mathbf{u}_x, \mathbf{u}_y) = \max\big(NSM(\mathbf{u}_x, \mathbf{u}_y),\; NSM^{c}(\mathbf{u}_x, \mathbf{u}_y)\big) \qquad (8)$$

Definition 4: The lower-scale normalized sum of multiplications (LNSM) between two vectors (users' profiles) $\mathbf{u}_x$ and $\mathbf{u}_y$, having $S_{x,y}$ common items, is defined by:

$$LNSM(\mathbf{u}_x, \mathbf{u}_y) = \min\big(NSM(\mathbf{u}_x, \mathbf{u}_y),\; NSM^{c}(\mathbf{u}_x, \mathbf{u}_y)\big) \qquad (9)$$
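Definitions 2-4 compose directly with Eqs. (5) and (6). The sketch below assumes fully rated, equal-length profiles on a 5-star scale; all names are our own:

```python
def nsm(ux, uy):
    """NSM (Eq. 6) over fully rated, equal-length profiles."""
    num = sum(a * b for a, b in zip(ux, uy))
    den = sum(max(a, b) ** 2 for a, b in zip(ux, uy))
    return num / den

def counter(u, n_stars=5):
    """Counter-value profile c(r) = N + 1 - r (Eq. 5)."""
    return [n_stars + 1 - r for r in u]

def unsm(ux, uy, n_stars=5):
    """Upper-scale NSM (Eq. 8): max of NSM and its counter-value form (Eq. 7)."""
    return max(nsm(ux, uy), nsm(counter(ux, n_stars), counter(uy, n_stars)))

def lnsm(ux, uy, n_stars=5):
    """Lower-scale NSM (Eq. 9): min of the two."""
    return min(nsm(ux, uy), nsm(counter(ux, n_stars), counter(uy, n_stars)))

# Close negative users (1,2) mirror close positive users (4,5):
# plain NSM scores them differently (0.5 vs. 0.8); UNSM lifts both to 0.8,
# while LNSM pulls both down to 0.5.
print(unsm([1, 1], [2, 2]))  # 0.8
print(lnsm([4, 4], [5, 5]))  # 0.5
```

This mirrors the 4×5 versus 1×2 example above: countering maps the negative pair onto the positive one, so UNSM and LNSM remove the scale-position bias in opposite directions.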

4.3 UNSM and LNSM for the Users of the Toy Example

Table 4 lists the results of UNSM and LNSM for the users of the toy example, which reveal the following remarks:

1. All NSM variants give the same score if one user is the counter of the other, as in all opposite cases.

2. If the countering operation flips both values, the score may decrease or increase considerably, as for $(u_8, u_{15})$ and $(u_9, u_{15})$.

3. If the countering operation flips only one value, the score changes only slightly, as for $(u_{12}, u_{13})$, $(u_8, u_9)$, and $(u_4, u_{15})$.

4. The maximum difference between the UNSM and LNSM values depends on the rating scale; it is 0.3 for the scale used here.

5. Examined Similarity Measures with Synthetic Datasets

Usually, similarity measures are examined using real-world datasets such as MovieLens and Jester. However, these datasets are biased due to the inherent tendency of recommendation systems to suggest mainly liked items, which therefore have a high chance of being rated. In the following, we discuss the generated synthetic datasets and then analyze the results for the various similarity measures.

5.1 Generated Synthetic Dataset

To observe the performance of different similarity measures on data with predefined statistical characteristics, we randomly generated four different synthetic datasets, each with four splits for cross-validation, as discussed below. The numbers of ratings for all examined datasets are listed in Table 5; Dataset_2 and Dataset_3 are omitted from the table as they are different combinations of Dataset_1.

Dataset_1: (Uniformly Distributed Dataset)

This dataset consists of four splits of 125 users, each rating 1200 movies at random on a 9-star rating scale. Both the number of ratings per user and the ratings themselves are uniformly distributed, so each rating-scale value has an equal chance of appearing in the user-item matrix. The aim is to verify the performance of the similarity measures on a properly uniform dataset rather than a biased one. Moreover, we use a large and symmetric rating scale to give a clear picture of the effect of uniformly distributed ratings on system performance. For each experiment, one split is taken as the set of active users and the remaining splits form the training set. Hence, there are 125 test users and 375 training users.
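The generation of one such split can be sketched as follows. The per-user rating-count range (20 to 200) is an assumption for illustration, as the paper does not state it.

```python
import random

# Sketch of generating one split of the uniformly distributed Dataset_1:
# 125 users rating 1200 movies on a 9-star scale, with both the number of
# ratings per user and the rating values drawn uniformly at random.


def make_uniform_split(n_users=125, n_items=1200, scale=9, seed=0):
    rng = random.Random(seed)
    split = {}
    for u in range(n_users):
        n_ratings = rng.randint(20, 200)  # assumed range of ratings per user
        items = rng.sample(range(n_items), n_ratings)
        split[u] = {i: rng.randint(1, scale) for i in items}
    return split


split = make_uniform_split()
print(len(split))  # 125 users
```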


Dataset_2:

Dataset_1 has uniformly distributed random ratings. Here we added the same set of test users to the training users of each split, so the training users become 500 while the test users remain 125. The aim is to explore the effect of having one similar user in the training dataset for each active user. For each experiment, one split is taken as the set of active users, while all splits, including the one used for testing, form the training set.

Dataset_3:

To extend the notion of similar users in the training dataset, we duplicated each test user of Dataset_1 five times and added the copies to the training dataset. For each experiment, one split is taken as the set of active users, while the training set consists of all splits, including the one used for testing, duplicated five times. Hence there are 125 test users, and the training users become 2500.

Dataset_4: (Single Mood Dataset)

To study the effect of single-mood users on the system's performance, we randomly generated four splits of 125 users, each rating 1200 movies at random with a single random mood rating per user. For experimentation, one split is taken as the set of active users and the remaining splits form the training set. Hence, there are 125 test users and 375 training users.
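A single-mood split, where every rating of a user equals that user's one mood value, can be sketched in the same style; the per-user rating count is again an assumed parameter.

```python
import random

# Sketch of one split of the single-mood Dataset_4: each user draws one random
# rating value (a "mood") and rates every movie in their list with it, which is
# exactly the case that makes PCC undefined (zero rating variance per user).


def make_single_mood_split(n_users=125, n_items=1200, scale=9, seed=0):
    rng = random.Random(seed)
    split = {}
    for u in range(n_users):
        mood = rng.randint(1, scale)                       # the user's single mood
        items = rng.sample(range(n_items), rng.randint(20, 200))
        split[u] = {i: mood for i in items}                # every rating == mood
    return split


split = make_single_mood_split()
# Every user's ratings collapse to a single distinct value:
assert all(len(set(ratings.values())) == 1 for ratings in split.values())
```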

5.2 Experimental Setup

Four folds are formed by reserving one split for testing the system and merging the remaining splits to form the fold's training dataset. Hence, the training users will be 375 and the test users 125 for Dataset_1 and Dataset_4 (Table 5: Number of ratings for each dataset). The ratings of each test user

are divided randomly into training ratings (80%) and test ratings (20%). The performance of the various similarity measures is studied over a range of neighborhood sets to cover many conditions: we set the neighborhood set size (NSS) to 1, 2, 3, 4, 5, 10, 20, 30, 40, and 50. NSS thus grows in steps of one, then five, then ten, covering very small, small, and normal neighborhood sizes.
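The per-user 80/20 split of ratings described above can be sketched as follows; `split_user_ratings` is a hypothetical helper name for illustration, not code from the paper.

```python
import random

# The NSS values swept in the experiments, as listed in the setup above.
NSS_VALUES = [1, 2, 3, 4, 5, 10, 20, 30, 40, 50]


def split_user_ratings(ratings, train_frac=0.8, seed=0):
    """Randomly divide one test user's ratings into training (80%) and test (20%)."""
    rng = random.Random(seed)
    items = list(ratings)
    rng.shuffle(items)
    cut = int(train_frac * len(items))
    train = {i: ratings[i] for i in items[:cut]}
    test = {i: ratings[i] for i in items[cut:]}
    return train, test
```

The training portion is what a similarity measure sees when electing neighbors; the held-out portion is what the evaluation metrics are computed on, once per NSS value in `NSS_VALUES`.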

5.3 Evaluation Metrics

The experiments are evaluated using the percentage of correct prediction (PCP), mean absolute error (MAE), and user coverage (UC) [12,28]. Since we deal with neighbors and their effect on system performance, UC is a fundamental metric: it inspects how many active users the system can successfully help. The results are averaged over all folds.
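Minimal sketches of the three metrics might look as follows. Since the paper reports MAE as a percentage, the normalization by the rating-scale width is an assumption about the exact definition used.

```python
# Illustrative metric sketches; the MAE normalization is an assumption.


def pcp(predicted, actual):
    """Percentage of predictions that match the true rating after rounding."""
    hits = sum(round(p) == a for p, a in zip(predicted, actual))
    return 100.0 * hits / len(actual)


def mae(predicted, actual, scale=9):
    """Mean absolute error, expressed as a percentage of the rating-scale width."""
    err = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
    return 100.0 * err / (scale - 1)


def user_coverage(helped_users, active_users):
    """Share of active users for whom the system produced any correct prediction."""
    return 100.0 * helped_users / active_users
```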

5.4 Analysis of the Results

This subsection discusses the results of the four conducted experiments.

Experiment_1:

The results of this experiment for all NSS values are given in Fig. 1. They show comparably low performance for all similarity measures, as these face the ideal case of uniformly distributed ratings. PCP is shown in part (a); the results lie between 11% and 12% for all similarity measures at high NSS values, with a clear advantage for LNSM. In fact, NSM outperforms MSD in almost all cases, and LNSM transcends both; UNSM is the worst among the NSM variants. PCP starts very low for all approaches at low NSS and increases as NSS grows, reaching its maximum of 12% at NSS = 50. The results improve as more neighbors contribute to the prediction process. This performance is expected, as we have highly random, equiprobable ratings, which are rare in real datasets.


In terms of MAE, all measures reveal comparable results with a slight advantage for LNSM. Note that the MAE values are high and show some improvement with increased NSS, where the maximum is 26% at NSS = 50. Somewhat stable performance occurs at NSS = 30, 40, and 50. Fig. 1(c) shows the user coverage for the different similarity measures. UC increases with the NSS value. LNSM surpasses the other similarity measures by a small margin for low NSS values, up to NSS = 10. However, UC is very small at low NSS, only around 10% for all measures; thus only 10% of the active users can benefit from the system at such values.

Fig. 1. Evaluation metrics for Dataset_1: (a) PCP, (b) MAE, (c) User Coverage.

Fig. 2. Evaluation metrics for Dataset_2: (a) PCP, (b) MAE, (c) User Coverage.

Experiment_2:

The results' pattern in this experiment (Fig. 2) is almost the opposite of the previous one, because there is now one exact copy of each test user in the training dataset. Due to this doping, the lowest PCP value is now around 16%, exceeding the highest value of the previous experiment. This emphasizes the power of similarity measures in identifying similar users once they exist in the training dataset; the system uses this information efficiently for further enhancement.

PCP starts very high even with only one neighbor, as shown in Fig. 2(a). This advantage shrinks as the number of neighbors increases, reaching its minimum at NSS = 50 as more less-valuable neighbors contribute to the prediction process. The results indicate that many neighbors are not always desirable, especially for datasets with similar moods; at most, they somewhat benefit users with odd attitudes. The results are comparable for all similarities over the lower part of NSS, up to NSS = 5. Then PCC shows the best performance by a small margin over the others. MSD is inferior to NSM and UNSM in terms of all evaluation metrics. UNSM is the best among the NSM variants, and LNSM is the worst for all metrics.

In terms of MAE, MSD and COS show the lowest performance, especially at high NSS. However, MAE improves greatly compared to experiment 1: it is zero at NSS = 1 for all similarity measures and does not exceed 20% at NSS = 50. The UC results are depicted in Fig. 2(c). All similarities reveal high UC with a minimum value of 91%, confirming the importance of having close neighbors in the training dataset. UC values decrease after NSS = 10, again indicating that many neighbors are not always desirable.

Experiment_3:

The results of this experiment (Fig. 3) improve further compared to experiment 2, since the training dataset now includes five similar users for each active user. Therefore, the results for NSS = 1, 2, 3, 4, and 5 are identical for all similarity measures. Note that this dataset does not include conflicting cases like single-mood users. The PCP results are very high and start decaying after NSS = 5, where PCC keeps its advantage over the others while MSD keeps the lowest performance in this range. All reach their minimum at NSS = 50, where the minimum PCP value is now 44%. All NSM variants outperform MSD and COS over the same range. This conclusion becomes apparent with MAE, which confirms the PCP findings. One critical point here is that MAE is very low, not exceeding 10% for any measure, with a clear advantage for PCC from NSS = 10 to 50. Again, having similar users in the dataset is very important for the system's success, and their number strongly affects the performance.

Here UC shows stable performance of 99.5% over the lower part of NSS. It achieves 100% for PCC and the NSM variants at NSS = 10 and begins to decay slowly after that. After NSS = 5, UC changes because the set of neighbors changes as NSS increases. Note that the performance decreases monotonically after NSS = 5; selecting the best NSS value is thus crucial for the system's success and depends on the similarity measure and the dataset at hand.

Fig. 3. Evaluation metrics for Dataset_3: (a) PCP, (b) MAE, (c) User Coverage.

Fig. 4. Evaluation metrics for Dataset_4: (a) PCP, (b) MAE, (c) User Coverage.

Experiment_4:

This dataset is unique in representing users with single-mood ratings. It allows us to assess performance for such cases, which are always hidden in real-world datasets since those contain various moods. Here, the means of the training and test users' ratings are identical due to the single-mood nature of the dataset, and hence PCC is undefined. The other similarity measures can handle this situation and reach the maximum possible value at NSS = 40 (Fig. 4); increasing NSS beyond this value does not affect any evaluation metric.

Apparently, the PCP of this experiment shows that COS is the best for such a dataset, even though, as discussed earlier, it cannot differentiate between single-mood users exactly. This advantage vanishes as NSS goes up, until all measures perform almost the same at NSS = 40. At this value, PCP reaches 100%, so increasing NSS further adds no value to the system and only consumes resources without actual benefit. Consistent findings appear for MAE. However, the error values are high over the lower part of NSS, reaching 50% for some similarities. This value decreases monotonically to zero for all similarities except PCC, where it stands fixed at about 56% because no neighbors are found.

UC is above 90% for all similarities except PCC, where it is zero. Except for PCC, UC is full for NSS greater than 3, as the measures can select close neighbors in this case. COS is better than the others for NSS = 1 and 2.

6. Examined Similarity Measures with Real Datasets

This section examines the similarity measures with two real datasets, Movielens [29] and Jester [30]. For this purpose, we randomly selected four folds of 125 users reflecting the dataset distribution, as discussed in Al-Shamri [12]. The rating scale of Jester is digitized to 10 stars, and three mean groups are defined to create the dataset categories in analogy to those of Movielens. The ratings of each test user are divided randomly, 80/20 percent, between training and test ratings. The training dataset includes the remaining folds for each split, and the results are averaged over all splits to ensure cross-validation of our results.
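Digitizing Jester's continuous ratings (which lie in the range -10.00 to +10.00) onto a 10-star scale could be done with equal-width bins, as sketched below; the paper does not specify the exact binning, so this is an assumption.

```python
# Sketch of mapping Jester's continuous rating range onto a 10-star scale
# using equal-width bins (an assumed binning, not the paper's exact method).


def to_stars(rating, lo=-10.0, hi=10.0, stars=10):
    """Map a continuous rating in [lo, hi] onto {1, ..., stars}."""
    frac = (rating - lo) / (hi - lo)          # position in [0, 1]
    return min(stars, int(frac * stars) + 1)  # clamp the upper edge


print([to_stars(r) for r in (-10.0, 0.0, 10.0)])
```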

6.1 MovieLens Dataset

The results of this experiment are shown in Fig. 5. PCP is very low over the lower part of NSS and increases monotonically over the upper part, reaching 37% for LNSM. PCC, NSM, and UNSM alternate in superiority between the lower and upper parts of NSS; however, all remain below LNSM in all cases. Moreover, MSD outperforms NSM and UNSM by a small margin over the upper part of NSS, whereas COS shows the lowest performance. Note that the results do not reach a stable level, and hence they may increase further once NSS grows beyond 50. This indicates that NSS should be selected carefully based on the dataset and its characteristics.

MAE exhibits behavior similar to PCP, starting very low with a 66% error. The randomness of this dataset is very high, which improves as NSS increases and more neighbors join the process. The improvement between the LNSM values at NSS = 1 and NSS = 50 is around 275%. The UC results confirm the random nature of the dataset: UC is only 40% at NSS = 1, meaning the selected neighbors cannot help 60% of the active users predict any item correctly. This value improves greatly at NSS = 10, where it reaches 95%.

6.2 Jester Dataset

This dataset is denser than Movielens: each user has rated at least 35 jokes out of 101. PCP (Fig. 6(a)) starts high at NSS = 1 compared to the Movielens dataset, which has a very high sparsity level. However, the strong start does not continue: PCP achieves only 20% as its maximum, at NSS = 20, and the results decay slightly after that. LNSM outperforms the others for NSS = 1-5, 10, 20, and 40. PCC performs the worst for NSS = 1-10 and as well as NSM for the remaining values, except NSS = 30, where it stands the best.

Here, the system can assist users from the beginning because the number of jokes is small. The Movielens results are very low at the beginning (2%) and increase gradually; in contrast, the Jester results are high initially, stabilize early, and then decrease slightly from their maximum, the opposite of Movielens. Of course, a PCP of 10% is not numerically high, but compared to Movielens it is.

MSD is the best over the lower part of NSS in terms of MAE. The MAE values start high at NSS = 1, go down as NSS goes up, and stabilize at 16% from NSS = 30; increasing NSS further is not suitable for such a dataset. UC is high for all measures (between 75% and 90%). LNSM shows the best values for five NSS values, while PCC shows the best results for four high NSS values.

Fig. 5. Evaluation metrics for the Movielens dataset: (a) PCP, (b) MAE, (c) User Coverage.

Fig. 6. Evaluation metrics for the Jester dataset: (a) PCP, (b) MAE, (c) User Coverage.

7. Conclusions

A similarity measure for a collaborative recommender system should reflect the actual similarity between users. Usually, RSs include many users in the prediction process, which hides the fine details revealed by the similarity measures. However, the most-used similarity measures may fail, as illustrated by the results of the toy example and the synthetic data. Hence we propose NSM and its variants as efficient solutions for such systems. NSM inherits the power of COS while avoiding its weakness of indicating the direction only: NSM directly manipulates each vector value and grasps the exact differences between them. However, strict differentiation between users is sometimes not required in recommender systems, so we propose two variants of NSM that exhibit lenient and strict behavior based on the position of the rating value on the scale.

We examined the proposed similarity variants on synthetic and real datasets with different characteristics. The results of MSD and NSM are robust in many cases, with an advantage for NSM as it does not suffer from PB10. Moreover, the results show that the similarity measures give comparable results for high numbers of neighbors as the dataset becomes denser.

The basic similarities for CRSs rely on the common set only. However, this is not enough, as more parameters should be considered. Future work will examine the effect of including such parameters, besides studying the performance of the proposed similarity measures with different neighbor sets.

Data Availability Statement

The datasets generated during and/or analyzed during the current study are available

from the corresponding author on reasonable request.

Acknowledgment

The author extends his appreciation to the Deanship of Scientific Research at King

Khalid University for funding this work through the Research Group Project under

grant number (RGP. 2/165/43).

References

1. Kluver D, Ekstrand MD, Konstan JA (2018) Rating-Based Collaborative Filtering: Algorithms and Evaluation. Social Information Access: Systems and Technologies, 344-390.
2. Bagchi S (2015) Performance and Quality Assessment of Similarity Measures in Collaborative Filtering Using Mahout. Procedia Computer Science 50:229-234. https://doi.org/10.1016/j.procs.2015.04.055.
3. Gazdar A, Hidri L (2020) A new similarity measure for collaborative filtering based recommender systems. Knowledge-Based Systems 188:105058. https://doi.org/10.1016/j.knosys.2019.105058.
4. Jain G, Mahara T, Tripathi KN (2020a) A Survey of Similarity Measures for Collaborative Filtering-Based Recommender System. In: Pant M et al (eds) Soft Computing: Theories and Applications, Advances in Intelligent Systems and Computing, vol 1053. Springer. https://doi.org/10.1007/978-981-15-0751-9_32.
5. Wischenbart M, Firmenich S, Rossi G et al (2021) Engaging end-user driven recommender systems: personalization through web augmentation. Multimed Tools Appl 80:6785-6809. https://doi.org/10.1007/s11042-020-09803-8.
6. Bag S, Kumar SK, Tiwari MK (2019a) An efficient recommendation generation using relevant Jaccard similarity. Information Sciences 483. https://doi.org/10.1016/j.ins.2019.01.023.
7. He X, Luo Y (2010) Mutual Information Based Similarity Measure for Collaborative Filtering. Proc. Progress in Informatics and Computing (PIC), 2010 IEEE International Conference, pp. 1117-1121.
8. Morozov S, Zhong X (2013) The evaluation of similarity metrics in collaborative filtering recommenders. Proc. of the 2013 Hawaii University International Conferences Education & Technology Math & Engineering Technology, Honolulu, HI, USA.
9. Sheugh L, Alizadeh SH (2015) A note on Pearson correlation coefficient as a metric of similarity in recommender system. Proc. 2015 AI & Robotics (IRANOPEN), pp. 1-6. doi: 10.1109/RIOS.2015.7270736.
10. Patra BKr, Launonen R, Ollikainen V, Nandi S (2015) A new similarity measure using Bhattacharyya coefficient for collaborative filtering in sparse data. Knowledge-Based Systems 82:163-177. https://doi.org/10.1016/j.knosys.2015.03.001.
11. Guo G, Zhang J, Yorke-Smith N (2016) A novel evidence-based Bayesian similarity measure for recommender systems. ACM Trans. Web 10(2):1-30. doi: 10.1145/2856037.
12. Al-Shamri MYH (2016) Effect of Collaborative Recommender System Parameters: Common Set Cardinality and the Similarity Measure. Advances in Artificial Intelligence 2016, 10 pages.
13. Tan Z, He L (2017) An Efficient Similarity Measure for User-Based Collaborative Filtering Recommender Systems Inspired by the Physical Resonance Principle. IEEE Access 5:27211-27228. doi: 10.1109/ACCESS.2017.2778424.
14. Ayub M, Ghazanfar MA, Maqsood M, Saleem A (2018) A Jaccard base similarity measure to improve performance of CF based recommender systems. Proc. 2018 International Conference on Information Networking (ICOIN), pp. 1-6. doi: 10.1109/ICOIN.2018.8343073.
15. Jain A, Nagar S, Singh PK, Dhar J (2020b) EMUCF: Enhanced multistage user-based collaborative filtering through nonlinear similarity for recommendation systems. Expert Systems with Applications 161:113724. https://doi.org/10.1016/j.eswa.2020.113724.
16. Karypis G (2001) Evaluation of item-based top-N recommendation algorithms. Proc. the International Conference on Information and Knowledge Management (CIKM '01), pp. 247-254, Atlanta, GA, USA.
17. Millan M, Trujillo M, Ortiz E (2007) A Collaborative Recommender System Based on Asymmetric User Similarity. In: Yin H, Tino P, Corchado E, Byrne W, Yao X (eds) Intelligent Data Engineering and Automated Learning - IDEAL 2007. Lecture Notes in Computer Science, vol 4881. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77226-2_67.
18. Bobadilla J, Ortega F, Hernando A (2012) A collaborative filtering similarity measure based on singularities. Information Processing & Management 48(2):204-217. doi: 10.1016/j.ipm.2011.03.007.
19. Mu Y, Xiao N, Tang R, Luo L, Yin X (2019) An Efficient Similarity Measure for Collaborative Filtering. Procedia Computer Science 147:416-421. https://doi.org/10.1016/j.procs.2019.01.258.
20. Ahn HJ (2008) A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem. Information Sciences 178(1):37-51.
21. Liu H, Hu Z, Mian A, Tian H, Zhu X (2014) A new user similarity model to improve the accuracy of collaborative filtering. Knowledge-Based Systems 56:156-166.
22. Al-Shamri MYH, Bharadwaj KK (2007) A hybrid preference-based recommender system based on fuzzy concordance/discordance principle. Proc. 3rd Indian International Conference on Artificial Intelligence (IICAI'07), pp. 301-314, Pune, India.
23. Schwarz M, Lobur M, Stekh Y (2017) Analysis of the effectiveness of similarity measures for recommender systems. Proc. 14th International Conference: The Experience of Designing and Application of CAD Systems in Microelectronics (CADSM), pp. 275-277. doi: 10.1109/CADSM.2017.7916133.
24. Pirasteh P, Jung JJ, Hwang D (2015) An Asymmetric Weighting Schema for Collaborative Filtering. In: Camacho D, Kim SW, Trawiński B (eds) New Trends in Computational Collective Intelligence. Studies in Computational Intelligence, vol 572. Springer. https://doi.org/10.1007/978-3-319-10774-5_7.
25. Hassanieh LA, Jaoudeh CA, Abdo JB, Demerjian J (2018) Similarity measures for collaborative filtering recommender systems. Proc. IEEE Middle East and North Africa Communications Conference (MENACOMM), pp. 1-5. doi: 10.1109/MENACOMM.2018.8371003.
26. Stephen SC, Xie H, Rai S (2017) Measures of Similarity in Memory-Based Collaborative Filtering Recommender System: A Comparison. Proc. 4th Multidisciplinary International Social Networks Conference (MISNC '17), ACM, New York, NY, USA, pp. 1-8. https://doi.org/10.1145/3092090.3092105.
27. Ning X, Desrosiers C, Karypis G (2015) A Comprehensive Survey of Neighborhood-Based Recommendation Methods. In: Ricci F, Rokach L, Shapira B (eds) Recommender Systems Handbook. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7637-6_2.
28. Bag S, Ghadge A, Tiwari MK (2019b) An integrated recommender system for improved accuracy and aggregate diversity. Computers & Industrial Engineering 130:187-197. https://doi.org/10.1016/j.cie.2019.02.028.
29. https://grouplens.org/datasets/movielens/
30. http://www.ieor.berkeley.edu/~goldberg/jester-data/