All content in this area was uploaded by Felix Beierle on May 19, 2018


Do You Like What I Like? Similarity Estimation in Proximity-based Mobile Social Networks

Felix Beierle

Service-centric Networking

Telekom Innovation Laboratories

Technische Universität Berlin

Berlin, Germany

beierle@tu-berlin.de

Abstract—While existing social networking services tend to

connect people who know each other, people show a desire to also

connect to yet unknown people in physical proximity. Existing

research shows that people tend to connect to similar people.

Utilizing technology in order to stimulate human interaction

between strangers, we consider the scenario of two strangers

meeting. Using similarity in musical taste as an example, we develop a solution for the problem of similarity estimation in proximity-based mobile social networks. We show that a single exchange

of a probabilistic data structure between two devices can closely

estimate the similarity of two users – without the need to contact

a third-party server. We introduce metrics for fast and space-

efﬁcient approximation of the Dice coefﬁcient of two multisets

– based on the comparison of two Counting Bloom Filters or

two Count-Min Sketches. Our analysis shows that utilizing a

single hash function minimizes the error when comparing these

probabilistic data structures. The size that should be chosen for

the data structure depends on the expected average number of

unique input elements. Using real user data, we show that a

Counting Bloom Filter with a single hash function and a length

of 128 is sufﬁcient to accurately estimate the similarity between

two multisets representing the musical tastes of two users. Our

approach is generalizable for any other similarity estimation of

frequencies represented as multisets.

Index Terms—Mobile Social Networking, Device-to-Device

Communication, Similarity, Dice Coefﬁcient, Counting Bloom

Filter, Recommender Systems

I. INTRODUCTION

In January of 2018, the prime minister of the UK appointed

one of her ministers to focus on issues related to loneliness.¹

This acknowledges the basic human need to form connections

with other people. From a technological point of view, Online

Social Networks (OSNs) are typically used to reﬂect real-

world social connections and establish new ones. Besides

established global OSNs, there is a rising market for services

that explicitly focus on the connections between people in

proximity, i.e., neighborhood networks like Nextdoor² and its

competitors. We note the apparent desire and need for people

to form connections with those in physical proximity. How

can technology assist this need?

¹ https://www.theguardian.com/society/2018/jan/16/may-appoints-minister-tackle-loneliness-issues-raised-jo-cox

² http://www.nextdoor.com/

First, we observe that people use their smartphones to access social networking services. OSNs have shifted to Mobile Social Networks (MSNs). Facebook, for instance, lists 1.15 billion mobile daily active users on average for 2016.³ Among

the most frequent concerns with established OSNs or MSNs

is the potential misuse of data and loss of user privacy. This

concern is heightened by the highly sensitive user and sensor

data that modern smartphones provide (cf. [1], [2]). One of the

key reasons for these concerns is the centralized architecture of

the systems – the social networking service provider has all the

data and can use it beyond the level to which users intended to

share it with the service provider [3]. By utilizing short range

wireless interfaces, the need for centralized servers in social

networking scenarios can be reduced. Utilizing Bluetooth,

WiFi Direct, or NFC, smartphones can communicate directly

with each other.

Consider the stimulation of connections between people

who do not know each other yet, utilizing a proximity-based

MSN that takes into account the described privacy concerns

of centralized social networking architectures. Previous work suggests an approach such a system can take to foster user interaction: psychological studies point out that any

social network is structured by homophily, which means that

people who are similar to each other tend to connect with each

other [4]. How can we determine the similarity of two users

with mobile devices without contacting a central server?

When answering that question, we have to consider the

applicability of the solution in a quickly changing mobile

device-to-device context, as well as implications imposed by

short-range technologies like NFC. This implies the use of

only small amounts of data and a limited number of necessary

data exchanges. A lot of the information that is relevant for the

social proﬁle of a user is already available on the smartphone

itself [5]. As the needed data is present and connectivity

between devices is also given, we focus on the question of

how to process and compare the data under the constraints of

bandwidth and computing limitations of mobile devices.

³ https://investor.fb.com/investor-news/press-release-details/2017/Facebook-Reports-Fourth-Quarter-and-Full-Year-2016-Results/default.aspx

Based on the use case of two strangers meeting, we develop a method of similarity estimation for proximity-based MSNs. Two users can quickly approximate their similarity when meeting, without exchanging clear-text data or contacting any central server. While our approach is applicable to any

other data that can be represented as a multiset, in this paper,

we focus on one of the most typical features of social proﬁles,

the musical taste of the user. Not only is listening to music

one of the most typical usages for smartphones [6]; musical

taste is, after gender, the most commonly disclosed proﬁle

feature in Facebook [7]. Musical taste is a common feature

that users tend to identify with, that thus can serve as an

appropriate feature for similarity estimation or can be used

in the recommendation of new contacts.

In this paper, we present our approach that allows the

estimation of the similarity of two users’ musical tastes based

on probabilistic data structures. While approaches for set sim-

ilarity estimation exist for the Bloom Filter (BF), we develop

an approach for Counting Bloom Filters (CBFs) and Count-

Min Sketches (CMSs), suitable for multisets. We discuss our

approach based on experiments done with synthetic data and

real user music listening history data. We conclude with a

concrete approach that is applicable for multiset similarity

estimations in device-to-device scenarios.

Hence, the main contributions of this paper are:

•An approach to similarity estimations in proximity-based

MSNs.

•The introduction of new comparison metrics for CBFs

and for CMSs.

•An evaluation of the introduced metrics based on both

synthetic and real data sets – showing support for our

approach to space-efﬁcient similarity estimations.

II. RELATED WORK

Before the advent of smartphones, a similar idea of social

networking was described in [8]. Here, users exchange identi-

ﬁers of existing OSNs with each other via Bluetooth and can

look up each other’s information on the OSN. Having the data

already available on the smartphone gives us the possibility to

directly compare data instead of relying on existing OSNs.

Some other papers present similar ideas, utilizing Bluetooth

and central servers [9] or manually entered interests to ﬁnd

user similarities [10]. E-Smalltalker also follows the idea to

share data via Bluetooth and describes a so-called Iterative

Bloom Filter for ﬁnding the intersection of two sets of topics

of interest [11]. More recent work deals with proximity-based

mobile social networking: with E-Shadow, the user can see

proﬁles of other users in proximity and has to evaluate manu-

ally if he/she is interested in another user without support for

automatic similarity estimation [12]. In SANE, the devices of

users with similar interests are used for forwarding messages

[13], while there is no system to stimulate interactions between

users or offer recommendations for new contacts.

Papers that focus more on the algorithmic side of this topic

often deal with the research areas of private set intersection

or secure multi-party computation. Here, the application sce-

narios usually require a much higher level of privacy than

estimating similarity in the proximity-based social networking

scenario. In particular, the requirement of physical proximity reduces potential attack vectors. Often, multiple data exchanges are necessary

for handshakes, key-exchanges, etc. [14]. Furthermore, some

approaches rely on third parties to perform homomorphic

encryption [15] or need other peers to perform computations

[16]. While these third parties usually do not learn anything

about the two users, they are not needed in our approach, with

which it is possible to estimate the similarity of two users with

one single device-to-device data exchange.

III. BACKGROUND

Multisets. A multiset is a generalization of a set, allowing

for multiple instances of each of its elements. A series of

events for which the frequency is important – but the order

is not – can be described as a multiset. Take for example

the visited locations of a user. Every location or area can be

described by a unique string. Each visit adds one element of

the corresponding string to the multiset. Users with similar

movement patterns will generate similar multisets. In this

paper, we focus on the musical taste of users. Without the

need to have the user explicitly enter this information, we can

just collect data about the songs a user listened to [17]. Storing

a unique string representation for each song for each time it

is played yields a multiset that represents the musical taste

of the user. In order to reduce the amount of data that needs

to be exchanged, we want to avoid sending clear text music

playlists between clients.
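As a minimal sketch of this representation (not taken from the paper), a listening history can be held as a multiset using Python's `collections.Counter`. The song strings below are hypothetical placeholders; any unique string per song would do.

```python
from collections import Counter

# Hypothetical listening history: one string per playback event.
plays = [
    "artist_a|song_1", "artist_a|song_1", "artist_b|song_2",
    "artist_a|song_1", "artist_c|song_3", "artist_b|song_2",
]

# A Counter behaves like a multiset: key = song string, value = play count.
taste = Counter(plays)

total_plays = sum(taste.values())  # cardinality of the multiset
unique_songs = len(taste)          # number of distinct elements
```

The order of playback events is discarded, while the play count of each song is preserved.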

Bloom Filter (BF). Probabilistic data structures are able

to represent large amounts of data space-efﬁciently. Querying

the data yields results with a certain probability. The trade-

off is between used memory and precision. A BF yields

probabilistic set membership [18]. It consists of a bit vector with $n$ bits, initialized with zeros, and $k$ pairwise independent hash functions, each of which yields one position in the bit vector when hashing one element. When adding an element to the set, all hash functions are applied, yielding $k$ positions in the bit array, which are then set to 1. We visualize a BF in Figure 1. When querying for set membership, the BF can answer whether the element is definitely not in the set (when the query element hashes to at least one 0 in the bit vector) or is likely in the set (when all hash positions in the bit vector are 1). In the latter case, the queried element is either in the set or there is a collision with another element.

Fig. 1. Visualization of a Bloom filter with a length of $n = 7$ and two hash functions ($k = 2$).
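The add and query behavior described above can be sketched as follows. This is an illustrative sketch only: deriving the $k$ hash functions from salted SHA-256 digests is an assumption made here for reproducibility, and the class and function names are hypothetical.

```python
import hashlib


def positions(item: str, n: int, k: int) -> list[int]:
    """Map an item to k positions in [0, n) via salted SHA-256 (an assumption)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % n
        for i in range(k)
    ]


class BloomFilter:
    def __init__(self, n: int, k: int) -> None:
        self.n, self.k = n, k
        self.bits = [0] * n  # bit vector, initialized with zeros

    def add(self, item: str) -> None:
        for pos in positions(item, self.n, self.k):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False: definitely not in the set. True: in the set, or a collision.
        return all(self.bits[pos] for pos in positions(item, self.n, self.k))


bf = BloomFilter(n=64, k=2)
bf.add("song_1")
```

A negative answer is always reliable, whereas a positive answer may be caused by other elements that set the same bits.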

Counting Bloom Filter (CBF). An extension to the BF to

adapt it to work with multisets is to make each ﬁeld in the bit

vector not binary but a counter [19]. An example is shown in

Figure 2. The resulting CBF can yield a close estimation of the

cardinality of the queried element in a multiset. Analogously

to the false positives due to collisions in the original BF, the

estimated cardinality from the CBF represents an upper bound

of an element’s cardinality in the original multiset. In order

to achieve a closer approximation of the cardinality of an

element, using multiple hash functions can be useful. A single

hash function can have collisions and there can be collisions

between hash functions.

Fig. 2. Visualization of a Counting Bloom filter with a length of $n = 7$ and two hash functions ($k = 2$).

In this paper, we use collision instead of hash collision,

because the collisions that occur do not have to be hash

collisions: as each element has to be mapped to the length of

the (C)BF and not the entire namespace of the hash function,

there may be collisions even if there is no hash collision. For

example, two different hashes from hash function $h_1$ could

be mapped to the same position of the (C)BF, resulting in a

collision but strictly speaking without having a hash collision.

Utilizing multiple hash functions can help reduce the impact of

collisions: when querying a CBF for the cardinality of an item

e, it is hashed with each hash function and the lowest counter is

returned. Because of the described collisions, the returned value is equal to or higher than the true count.
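A small sketch can illustrate why the minimum counter is an upper bound on the true count. The salted SHA-256 hashing and all names below are assumptions made for illustration; the paper does not prescribe a concrete hash function.

```python
import hashlib
from collections import Counter


def positions(item: str, n: int, k: int) -> list[int]:
    """Map an item to k positions in [0, n) via salted SHA-256 (an assumption)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % n
        for i in range(k)
    ]


class CountingBloomFilter:
    def __init__(self, n: int, k: int = 1) -> None:
        self.n, self.k = n, k
        self.counters = [0] * n  # counters instead of bits

    def add(self, item: str, count: int = 1) -> None:
        for pos in positions(item, self.n, self.k):
            self.counters[pos] += count

    def query(self, item: str) -> int:
        # Collisions can only add to a counter, so the smallest counter
        # is the tightest available upper bound on the true count.
        return min(self.counters[pos] for pos in positions(item, self.n, self.k))


plays = Counter({"song_a": 4, "song_b": 1, "song_c": 3})
cbf = CountingBloomFilter(n=16, k=2)
for song, count in plays.items():
    cbf.add(song, count)
```

Querying `cbf.query("song_a")` returns at least 4; it returns exactly 4 unless every position of `song_a` is shared with another song.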

Count-Min Sketch (CMS). Another probabilistic data

structure that works with multisets is the Count-Min Sketch

[20]. It is often used to provide information about the frequency of events in streams of data. A CMS consists of $w$ columns and $d$ rows (cf. Figure 3). Each field is initialized with 0. Each row is associated with one hash function. Adding an element increments, in each row, the counter at the position indicated by that row's hash function.

Fig. 3. Visualization of a Count-Min Sketch with a width of $w = 7$ and depth of $d = 2$.

It is worth stating the relationship between CBF and CMS: summing the rows of a CMS position by position yields a CBF. A CBF with length $n$ and one single ($k = 1$) hash function $h$ is equal to a CMS with width $w = n$ and depth 1, given that the same hash function $h$ is used in the CMS:

$$\mathrm{CBF}_{n,k} = \mathrm{CMS}_{w,d} \quad \text{for } n = w \wedge k = d = 1 \tag{1}$$
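The relationship between the two data structures can be checked with a short sketch. It assumes that row $i$ of the CMS and hash function $i$ of the CBF are the same salted SHA-256 function; this choice and all function names are illustrative assumptions.

```python
import hashlib
from collections import Counter


def position(item: str, row: int, width: int) -> int:
    """Hash function number `row`, mapped to [0, width) (salted SHA-256, an assumption)."""
    digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % width


def build_cms(multiset: Counter, w: int, d: int) -> list[list[int]]:
    rows = [[0] * w for _ in range(d)]
    for item, count in multiset.items():
        for i in range(d):  # one hash function per row
            rows[i][position(item, i, w)] += count
    return rows


def build_cbf(multiset: Counter, n: int, k: int) -> list[int]:
    counters = [0] * n
    for item, count in multiset.items():
        for i in range(k):  # all hash functions share one vector
            counters[position(item, i, n)] += count
    return counters


def cms_query(rows: list[list[int]], item: str) -> int:
    # Count-Min: the minimum over all rows is the estimated count.
    return min(row[position(item, i, len(row))] for i, row in enumerate(rows))


plays = Counter({"song_a": 4, "song_b": 1, "song_c": 3})
cms = build_cms(plays, w=7, d=2)
row_sum = [sum(column) for column in zip(*cms)]  # add the rows position by position
```

With shared hash functions, `row_sum` equals `build_cbf(plays, n=7, k=2)`, and a CMS of depth 1 has the same single row as a CBF with one hash function.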

Comparison of two BFs. There are some papers dealing

with the comparison of two sets utilizing BFs [21] [22] [23]

[24]. In [21], the authors compare two BFs with a bit-wise AND. A large number of 1s in the result indicates similarity.

The authors of [22] calculate string similarity utilizing BFs by

creating n-grams, adding those into BFs and using the Dice

coefﬁcient (see Equation 3). The idea is that identical n-grams

will hash to identical positions in the bit array as long as

the same length and same hash functions are used. In [23],

the authors utilize BFs to estimate path similarity for paths

in computer networks. They deﬁne a Bloom Distance, which

is the logical AND of both BFs, followed by counting the

number of 1s in the results and dividing by the length of a

BF. The authors of [24] use cosine similarity on two BFs to

determine similarity. In [11], the authors iteratively compare

two BFs. They start with a small BF with a high false positive

rate and increase its size in a second round of comparisons if

the similarity value of the ﬁrst round was above a pre-deﬁned

threshold.

IV. METRICS FOR COMPARING CBFS AND CMSS

Because we are dealing with multisets in our scenario, utilizing a BF, and thus leaving out the cardinality of each element in the multiset, would no longer reflect the musical taste of the user. It is the cardinality of the element (the "play count") that indicates the number of times a specific song was played back.

In order to compare multisets, one can apply metrics similar to those suggested for the BF, i.e., cosine similarity and Dice coefficient. Both cosine similarity (in the case of non-negative values) and Dice coefficient yield a value between 0 and 1, where 0 indicates no similarity and 1 indicates sameness. The cosine similarity for two vectors $v_1$ and $v_2$ is defined as:

$$\mathrm{cosSim}(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \cdot \|v_2\|} \tag{2}$$

The numerator is the dot product of $v_1$ and $v_2$. The denominator is the product of the lengths of the two vectors: $\|x\| = \sqrt{x_1^2 + \dots + x_n^2}$. Given a multiset $X$, we can interpret the cardinalities of the elements in the multiset as elements of a vector. Thus, for two multisets $X, Y$, we will just write $\mathrm{cosSim}(X, Y)$, assuming an appropriate vector representation of the cardinalities of the elements in $X$ and $Y$.

The Dice coefficient of two sets $A$ and $B$ is given by:

$$\mathrm{Dice}(A, B) = \frac{2 \cdot |A \cap B|}{|A| + |B|} \tag{3}$$

In order to obtain the Dice coefficient for two multisets $X$ and $Y$, we can also use Equation 3 by employing the cardinality and the intersection of multisets. The cardinality gives the sum of all occurrences of all elements in the multiset. The intersection of two multisets $X$ and $Y$ can be determined by the minimum function applied for each element $i$: If $i \in^{n} X$ (denoting that $X$ has exactly $n$ instances of $i$) and $i \in^{m} Y$, then the following holds for each element:

$$i \in^{\min(n,m)} (X \cap Y) \tag{4}$$
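Both ground-truth metrics can be computed directly on multisets, for instance with `collections.Counter`, whose `&` operator takes the element-wise minimum as in Equation 4. The sketch below is illustrative; returning 0.0 for empty inputs is a convention assumed here, not one stated in the paper.

```python
import math
from collections import Counter


def dice(x: Counter, y: Counter) -> float:
    """Multiset Dice coefficient (Equation 3 with multiset intersection)."""
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0  # assumed convention for two empty multisets
    return 2 * sum((x & y).values()) / total


def cos_sim(x: Counter, y: Counter) -> float:
    """Cosine similarity of the count vectors (Equation 2)."""
    dot = sum(x[e] * y[e] for e in x.keys() & y.keys())
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0  # assumed convention for zero vectors


x = Counter({"a": 3, "b": 1})
y = Counter({"a": 1, "c": 2})
```

For this example, the intersection contains one instance of `a`, so the Dice coefficient is $2 \cdot 1 / (4 + 3) = 2/7$, and the cosine similarity is $3 / (\sqrt{10} \cdot \sqrt{5})$.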

To the best of our knowledge, there is no research specif-

ically about the comparison of two CBFs or two CMSs. A

prerequisite for the pairwise comparison is that the two data

structures to be compared are of the same length and use the

same hash functions; i.e., the same elements will always hash

to the same positions in the data structure. Based on these prerequisites, we will now transfer the idea of both cosine similarity and Dice coefficient to CBFs as well as to CMSs, yielding metrics for these data structures.

A. Cosine Similarities for CBFs and CMSs

As the data structure of a CBF is a vector, we can immediately use Equation 2 for defining the cosine similarity $\mathrm{cosSim}(P, Q)$ of two CBFs $P, Q$.

For the cosine similarity of CMSs, we view each CMS as a collection of $d$ vectors (one vector per row), cf. Figure 3, and propose the following.⁴

Definition 1 (CMS-cosSim). Let $R$ and $S$ be two CMSs with the same dimensions $d \times w$ and utilizing the same hash functions. The CMS cosine similarity of $R$ and $S$ is given by:

$$\text{CMS-cosSim}(R, S) = \frac{1}{d} \cdot \sum_{i=1}^{d} \mathrm{cosSim}(\vec{r_i}, \vec{s_i}) \tag{5}$$

where $\vec{r_i} = (r_{i1}, \dots, r_{iw})$ and $\vec{s_i} = (s_{i1}, \dots, s_{iw})$.

Thus, the CMS cosine similarity of two CMSs averages the

cosine similarity of all corresponding rows.
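Definition 1 can be sketched in a few lines, representing a CMS as a list of rows. The function names are hypothetical, and treating an all-zero row as having similarity 0.0 is an assumption made here to avoid division by zero.

```python
import math


def cos_sim(u: list[int], v: list[int]) -> float:
    """Cosine similarity of two equally long vectors (Equation 2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumed convention for zero vectors


def cms_cos_sim(r: list[list[int]], s: list[list[int]]) -> float:
    """Average the cosine similarity over corresponding rows (Equation 5)."""
    if len(r) != len(s) or any(len(ri) != len(si) for ri, si in zip(r, s)):
        raise ValueError("both CMSs need the same dimensions")
    return sum(cos_sim(ri, si) for ri, si in zip(r, s)) / len(r)


# Two toy CMSs with d = 2 rows and w = 4 columns.
r = [[1, 0, 3, 0], [0, 2, 0, 2]]
s = [[1, 0, 3, 0], [2, 0, 2, 0]]
similarity = cms_cos_sim(r, s)
```

In this toy example the first rows are identical and the second rows are orthogonal, so the average is approximately 0.5.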

B. Dice Coefficients for CBFs and CMSs

The bitwise operations suggested for the comparisons of two

BFs do not work with CBF and CMS as we have counters –

instead of binary values – at each position, cf. Figures 2 and 3. Therefore, we transfer the idea of the Dice coefficient for

multisets to CBFs as follows:

Definition 2 (CBF-Dice coefficient). Let $P$ and $Q$ be two CBFs with length $n$ and utilizing the same hash functions. The CBF Dice coefficient of $P$ and $Q$ is given by:

$$\text{CBF-Dice}(P, Q) = \frac{2 \cdot \sum_{i=1}^{n} \min(p_i, q_i)}{\sum_{i=1}^{n} (p_i + q_i)} \tag{6}$$

Note that in order to approximate the numerator of the multiset Dice coefficient, the CBF Dice coefficient in Equation 6 applies the minimum function at each position, and for the denominator, the sum of all counters of both CBFs is used.

Extending the Dice coefﬁcient to CMSs reuses the dice

coefﬁcient for CBFs.

Definition 3 (CMS-Dice coefficient). Let $R$ and $S$ be two CMSs with the same dimensions $d \times w$ and utilizing the same hash functions. The CMS Dice coefficient of $R$ and $S$ is given by:

$$\text{CMS-Dice}(R, S) = \frac{1}{d} \cdot \sum_{i=1}^{d} \frac{2 \cdot \sum_{j=1}^{w} \min(r_{ij}, s_{ij})}{\sum_{j=1}^{w} (r_{ij} + s_{ij})} \tag{7}$$

Hence, the Dice coefﬁcient of two CMSs utilizes the average

of the CBF Dice coefﬁcient of each pair of corresponding

rows.
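Equations 6 and 7 translate into code almost literally. The sketch below uses plain lists of counters; the function names are hypothetical, and returning 0.0 for two empty structures is an assumed convention.

```python
def cbf_dice(p: list[int], q: list[int]) -> float:
    """CBF Dice coefficient of two counter vectors (Equation 6)."""
    if len(p) != len(q):
        raise ValueError("both CBFs need the same length")
    total = sum(p) + sum(q)
    if total == 0:
        return 0.0  # assumed convention for two empty CBFs
    return 2 * sum(min(a, b) for a, b in zip(p, q)) / total


def cms_dice(r: list[list[int]], s: list[list[int]]) -> float:
    """Average the CBF Dice coefficient over corresponding rows (Equation 7)."""
    if len(r) != len(s):
        raise ValueError("both CMSs need the same depth")
    return sum(cbf_dice(ri, si) for ri, si in zip(r, s)) / len(r)


# Two toy CBFs of length n = 7.
p = [3, 3, 2, 0, 2, 0, 0]
q = [1, 3, 0, 0, 4, 0, 0]
score = cbf_dice(p, q)
```

Here the position-wise minima sum to 6 and all counters sum to 18, so the score is $2 \cdot 6 / 18 = 2/3$.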

⁴ If $P$ (resp. $R$) is a CBF (resp. CMS), its positions will be denoted by $p_i$ (resp. $r_{ij}$).

C. Comparing Multisets via CBFs and CMSs

Given two multisets $X$ and $Y$, we can now estimate their similarity via CBFs or via CMSs. Let $X^{cbf}$ and $Y^{cbf}$ be CBFs (of the same length and using the same hash functions) and let $X^{cms}$ and $Y^{cms}$ be CMSs (of the same dimensions and using the same hash functions) for $X$ and $Y$. In the following evaluation, we show how we can use

$$\text{CBF-Dice}(X^{cbf}, Y^{cbf}) \tag{8}$$

and

$$\text{CMS-Dice}(X^{cms}, Y^{cms}) \tag{9}$$

to approximate

$$\mathrm{Dice}(X, Y). \tag{10}$$

Likewise, we investigate the approximation of

$$\mathrm{cosSim}(X, Y) \tag{11}$$

by

$$\mathrm{cosSim}(X^{cbf}, Y^{cbf}) \tag{12}$$

and

$$\text{CMS-cosSim}(X^{cms}, Y^{cms}). \tag{13}$$

Using any of the approximations given in (8), (9), (12), or

(13) will typically yield a fast and space efﬁcient comparison

of the multisets Xand Y. Our evaluation on both synthetic

and on real data shows that good approximations may already

be achieved when using small CBFs or CMSs.
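The overall comparison can be sketched end to end: two multisets are encoded as one-hash CBFs of the same length, and the CBF Dice coefficient of the encodings is compared with the exact multiset Dice coefficient. The user data, the SHA-256 based hash function, and all names are illustrative assumptions rather than the setup used in the evaluation.

```python
import hashlib
from collections import Counter


def position(item: str, n: int) -> int:
    """Single hash function (k = 1), mapped to [0, n); SHA-256 is an assumption."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def to_cbf(multiset: Counter, n: int) -> list[int]:
    counters = [0] * n
    for item, count in multiset.items():
        counters[position(item, n)] += count
    return counters


def dice(x: Counter, y: Counter) -> float:
    total = sum(x.values()) + sum(y.values())
    return 2 * sum((x & y).values()) / total if total else 0.0


def cbf_dice(p: list[int], q: list[int]) -> float:
    total = sum(p) + sum(q)
    return 2 * sum(min(a, b) for a, b in zip(p, q)) / total if total else 0.0


# Hypothetical play counts of two users.
alice = Counter({"song_a": 5, "song_b": 2, "song_c": 1})
bob = Counter({"song_a": 3, "song_c": 4, "song_d": 2})

exact = dice(alice, bob)  # ground truth on the clear-text multisets
estimate = cbf_dice(to_cbf(alice, 128), to_cbf(bob, 128))  # only CBFs are exchanged
```

In this sketch only the counter vector would have to be transmitted. Collisions can merge counters of different songs, which can raise the estimate but cannot lower it below the exact value.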

V. EX PE RI ME NTAL RE SU LTS

For the evaluation, we use both a synthetic data set and real user music listening histories. The synthetic data set

$$SD = \{A_r, A_0, \dots, A_{1000}\}$$

consists of 1,002 multisets of strings. For the strings contained in each of those multisets, we use random ASCII strings that are 10 characters long. As the strings are entered into the hash functions of CBF and CMS, we could have picked any other random string to achieve the same effects. Each multiset has 66.9 unique entries on average. Given a multiset of random strings $A_r$, the other multisets $A_i$ are chosen such that comparing $A_r$ to the other multisets yields Dice coefficients of 0.000 to 1.000 in increments of 0.001, i.e.,

$$\mathrm{Dice}(A_r, A_i) = i \cdot 0.001$$

For the real data RD, we used the Taste Profile subset⁵ of the Million Song Dataset [25]. In order to have appropriate

data for comparisons, we chose a subset of active users who

each listened to at least 50 distinct songs. The subset contains

1,865 distinct users, 14,867 distinct song titles, and 122,389

recorded plays. In order to be able to visualize the results, we

chose a subset of these users that yields a range of different

similarity values. To enter data into the data structures, we

built a unique string for each song. RD consists of roughly

4,000 multiset comparisons. On average, each multiset has 63.8 unique entries.

⁵ https://labrosa.ee.columbia.edu/millionsong/tasteprofile

A. Synthetic Data SD / Comparing CMSs

For evaluating CMS-Dice on SD, we start with a CMS encoding of SD with $w = 400$ columns and $d = 10$ rows. For each $A_i \in SD$, let $A_i^{cms}$ be the corresponding CMS for $A_i$. Figure 4 illustrates the comparison of the Dice coefficient of the multisets as ground truth, $\mathrm{Dice}(A_r, A_i)$, with the CMS-Dice of the CMS representation, $\text{CMS-Dice}(A_r^{cms}, A_i^{cms})$. The Dice coefficient is plotted in red. The x-axis indicates the multiset pair combination that is compared, sorted by Dice coefficient. The y-axis gives the similarity score. The blue dots represent the CMS-Dice similarity score.

Fig. 4. Similarity scores for ground truth $\mathrm{Dice}(A_r, A_i)$ in red and estimation $\text{CMS-Dice}(A_r^{cms}, A_i^{cms})$ in blue using a CMS with $w = 400$ columns and $d = 10$ rows, using synthetic data set SD. (Axes: comparison pair, 0 to 1000; similarity score, 0.0 to 1.0.)

In Figure 5, we give the same plot for cosine similarities. We compare the cosine similarity of the multisets as ground truth, $\mathrm{cosSim}(A_r, A_i)$, with the cosine similarity of the CMS representation, $\text{CMS-cosSim}(A_r^{cms}, A_i^{cms})$. The first observation we make is that the Dice and cosine measurements yield almost identical results, which can be seen by the red lines in Figures 4 and 5 having almost identical slopes. Therefore, for the rest of this paper, we focus on the Dice coefficient. Another observation we make is that the similarity estimation by CMS-Dice is always slightly higher than or equal to the Dice coefficient ground truth, so the similarity between two multisets is always correctly estimated or overestimated due to collisions, never underestimated.

Typically, when using a CMS, the user wants to perform queries and get accurate results. The values for $w$ and $d$ are chosen accordingly. In our case, we just want to perform a similarity estimation without querying for specific entries. We investigate what influence the number of columns $w$ and the number of rows $d$ have on the estimation of similarity. In order to do so, we plot the same comparisons of the synthetic data for different values of $w$ and $d$. For each combination, we calculate the root mean square error (RMSE) of the similarity estimation by CMS-Dice from the Dice coefficient of the ground truth. The RMSE quantifies to what extent the similarity estimation differs from the ground truth similarity score. The lower the RMSE, the better the approximation of the Dice coefficient. Based on 100,100 comparisons, we calculate the RMSE for different combinations of $w$ (x-axis) and $d$ (y-axis) values. The result is given in Figure 6. With an increasing number of columns, the RMSE decreases. The number of rows does not significantly influence the RMSE: increasing the number of rows does not reduce the RMSE of the similarity estimation.

Fig. 5. Similarity scores for ground truth $\mathrm{cosSim}(A_r, A_i)$ in red and estimation $\text{CMS-cosSim}(A_r^{cms}, A_i^{cms})$ in blue using a CMS with $w = 400$ columns and $d = 10$ rows, using synthetic data set SD. (Axes: comparison pair, 0 to 1000; similarity score, 0.0 to 1.0.)
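For reference, the RMSE between a list of estimated similarity scores and the corresponding ground truth scores can be computed as in the following sketch. The sample values are made up for illustration and are not results from the evaluation.

```python
import math


def rmse(estimates: list[float], truths: list[float]) -> float:
    """Root mean square error between estimated and ground truth scores."""
    if len(estimates) != len(truths) or not truths:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((e - t) ** 2 for e, t in zip(estimates, truths))
    return math.sqrt(squared / len(truths))


# Made-up example: one of three estimates is too high by 0.1.
truths = [0.0, 0.5, 1.0]
estimates = [0.1, 0.5, 1.0]
error = rmse(estimates, truths)
```

An RMSE of 0 would mean that every estimate matches its ground truth value exactly.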

B. Synthetic Data SD / Comparing CBFs

Visualizing the RMSEs for similarity estimation by CBF-Dice with different lengths $n$ and numbers of hash functions $k$, we get Figure 7. Increasing the length of the CBF reduces the average error of CBF-Dice, while increasing the number of hash functions increases the error.

C. Real Data RD

Using the real data set RD, our findings using synthetic data SD are confirmed. Figure 8 shows the RMSEs for CMS-Dice, and Figure 9 for CBF-Dice. We achieve the highest accuracy and simultaneously the lowest memory size by using a CMS with one row, which is a CBF with one hash function (cf. Equation 1). In Figure 10, we visualize the estimation error with CBF-Dice (which is equal to CMS-Dice in this case) utilizing this data structure with a length of 400. We can see the error of each similarity estimation and can see that we never underestimate the similarity. Looking at the first 3,000 comparison pairs, the values for the ground truth similarity scores are very low. The values for the similarity estimation by CBF-Dice range roughly from 0 to 0.2. For the remaining comparisons, we observe that the higher the ground truth similarity score is, the lower the error range of the CBF-Dice estimation is.

Fig. 6. RMSEs of similarity estimation CMS-Dice for CMSs of different sizes, using synthetic data set SD and Dice coefficient as ground truth. (Axes: number of columns $w$, 100 to 1000; number of rows $d$, 1 to 10; RMSE.)

Fig. 7. RMSEs of similarity estimation CBF-Dice for CBFs of different sizes, using synthetic data set SD and Dice coefficient as ground truth. (Axes: length $n$, 100 to 1000; number of hash functions $k$, 1 to 10; RMSE.)

Fig. 8. RMSEs of similarity estimation CMS-Dice for CMSs of different sizes, using real data set RD and Dice coefficient as ground truth. (Axes: number of columns $w$, 100 to 1000; number of rows $d$, 1 to 10; RMSE.)

Fig. 9. RMSEs of similarity estimation CBF-Dice for CBFs of different sizes, using real data set RD and Dice coefficient as ground truth. (Axes: length $n$, 100 to 1000; number of hash functions $k$, 1 to 10; RMSE.)

Fig. 10. Similarity scores for ground truth Dice in red and estimation CBF-Dice in blue using a CBF/CMS with length 400 and 1 hash function, using real data set RD. (Axes: comparison pair, 0 to 4000; similarity score, 0.0 to 1.0.)

VI. DISCUSSION

In order to discuss the experimental results, we start by

analyzing the regular BF. Consider a regular BF with one hash

function, utilized for comparing sets. If two sets are equal,

the BFs should be equal and even the smallest length yields

the correct estimation. Regarding memory size (length of the BF), comparing two disjoint sets is the worst-case scenario: estimating the similarity of two disjoint sets by performing AND on the two BFs should yield 0 for every position. There is one factor that introduces a deviation from 0: the number

of unique inputs for a given length of the BF. Because of the

limited length of the BF, several elements are hashed to the

same position in the bit vector, even if there is no hash collision

(see description in Section III). By increasing the length of the

BF, the probability for such collisions is reduced. The more

unique elements are entered into the BF, the more positions

are set to 1. Thus, the more unique elements are in at least

one of the sets, the longer the bit vector should be if a small

error in estimating the similarity is desired.

When comparing two disjoint sets with cardinalities $g$ and $h$, the BFs have to have at least a length of $g + h$ in order to theoretically be able to correctly estimate a similarity of 0. Now imagine increasing the number of hash functions. It increases the error: more bits are set to 1, which creates a higher similarity estimation. In the following, we show that these conclusions are also true for the CMS and CBF.
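The effect of the length and the number of hash functions can be illustrated with a standard back-of-the-envelope model that is not part of the paper: assuming ideal, uniformly distributed and independent hash functions, a given bit is 1 with probability $1 - (1 - 1/n)^{k \cdot g}$ after inserting $g$ elements. For two disjoint sets, the expected fraction of positions that are 1 in both BFs is the product of the two fill probabilities. The function names below are hypothetical.

```python
def fill_probability(n: int, inserted: int, k: int) -> float:
    """Probability that a given bit is 1 after `inserted` elements with k hashes each."""
    return 1.0 - (1.0 - 1.0 / n) ** (k * inserted)


def expected_overlap(n: int, g: int, h: int, k: int) -> float:
    """Expected fraction of positions set in both BFs of two disjoint sets.

    Assumes ideal uniform hashing; the true similarity of disjoint sets is 0,
    so any overlap is pure estimation error.
    """
    return fill_probability(n, g, k) * fill_probability(n, h, k)


# Two disjoint sets with 64 elements each.
one_hash = expected_overlap(n=128, g=64, h=64, k=1)
two_hashes = expected_overlap(n=128, g=64, h=64, k=2)
longer = expected_overlap(n=1024, g=64, h=64, k=1)
```

Under this model, adding hash functions increases the spurious overlap, while increasing the length reduces it, which matches the argument above.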

As described in Section III, when the goal is to query a

CBF or CMS for cardinality of an element of a multiset,

the user proﬁts from utilizing multiple hash functions. In our

scenario of similarity estimation, we do not need to query for

speciﬁc items and do not proﬁt from multiple hash functions

in the same way. As described above for the BF, the opposite

is true for the CBF: both Figure 7 and 9 indicate the trend

that the more hash functions we use, the worse the error

is. This is because of the collisions: multiple elements are mapped to the same positions in the counter vector. The more hash

functions we use, the more collisions there are. This can be

compensated by increasing the length of the CBF, which would

just unnecessarily increase the needed memory size.

Regarding the CMS, we increase the number of hash

functions by increasing the number of rows (see Figures 6

and 8). However, we do not see an increase in error when

using more hash functions. This is because each hash function

corresponds to one row. The probability of collisions is the

same in each row and the average of the row-wise similarities

calculated by CMS-cosSim and CMS-Dice contains this error.

Considering memory size and computation time, we should

use just one single row.

The best and worst case for similarity estimation are the

same as described for the BF: the higher the real similarity,

the lower is the RMSE. The same elements deﬁnitely hash

to the same positions. Only those elements not present in the

other multiset introduce an error in the similarity estimation

through collisions. Thus, the more dissimilar the multisets are,

the larger the potential error. This can best be seen in Figure

10. Note how the spread of blue dots (similarity estimations by

CBF-Dice) spans a larger part of the y-axis (similarity score)

for lower ground truth similarity scores (red dots). This means

that for lower similarities, there are higher errors. All errors

are produced by collisions and lead to an overestimation of

similarity.

Compared to the BF comparison, the CBF and CMS add one factor to consider: the cardinality of each element, i.e., how often it occurs in the multiset. For the regular BF, each collision has the same effect on the error. Using a CBF or CMS, the influence a collision has on the error of the similarity estimation depends on how the data of the multiset is distributed. If two users listen to two different songs very frequently and those two songs are mapped to the same position in the data structure, then the error in the similarity estimation can be significant. The error is less significant if the collision occurs for two different songs the two users listened to less frequently.
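
A small worked example in Python makes the effect of the counts concrete. It assumes the multiset form of the Dice coefficient and uses hand-written counter arrays with hypothetical play counts; the function collision_error is an illustrative helper, not part of our evaluation code.

```python
def cbf_dice(a: list, b: list) -> float:
    """Dice coefficient of two counter arrays (multiset form, an assumption)."""
    total = sum(a) + sum(b)
    return 2 * sum(min(x, y) for x, y in zip(a, b)) / total if total else 0.0


def collision_error(shared: int, only_a: int, only_b: int) -> float:
    """Error added when the two non-shared songs land on one position."""
    # Positions: [shared song, song only user A has, song only user B has].
    separate = cbf_dice([shared, only_a, 0], [shared, 0, only_b])
    # Same data, but both non-shared songs collide at position 1.
    collided = cbf_dice([shared, only_a, 0], [shared, only_b, 0])
    return collided - separate


# Both users played one shared song 10 times (hypothetical counts).
heavy = collision_error(shared=10, only_a=50, only_b=50)
light = collision_error(shared=10, only_a=1, only_b=1)
print(f"collision of two frequently played songs: error {heavy:.3f}")
print(f"collision of two rarely played songs:     error {light:.3f}")
```

With these counts, the collision of the two frequently played songs raises the estimate from about 0.17 to 1.0, while the collision of the two rarely played songs raises it from about 0.91 to 1.0.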

We conclude that a general approach for estimating the similarity of two multisets in proximity-based mobile applications can be:

• use one-hash CBF / one-row CMS as a data structure

• estimate the average number of unique input elements

• define an appropriate threshold for the given scenario

We showed that the one-hash CBF gives the best estimation while having the smallest memory size. Our discussion showed that the average number of unique input elements is the relevant factor for how good the estimation is. Based on this number, we can pick the length of the CBF. We showed that a length of twice the average number of unique inputs is necessary to theoretically still be able to estimate with full accuracy in the worst-case scenario of disjoint multisets. Lastly, after performing the similarity estimation, one should have a threshold to be able to tell whether the result should be regarded as significant.
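
The sizing rule and the threshold check can be sketched in a few lines of Python. The function names are hypothetical, rounding up to the next integer is an assumption, and treating a score equal to the threshold as significant is likewise an assumption.

```python
import math
from collections import Counter


def average_unique_inputs(histories: list) -> float:
    """Mean number of distinct elements over a list of multisets."""
    return sum(len(h) for h in histories) / len(histories)


def cbf_length(avg_unique_inputs: float) -> int:
    """Twice the average number of unique inputs, rounded up (assumption)."""
    return math.ceil(2 * avg_unique_inputs)


def is_significant(similarity: float, threshold: float) -> bool:
    """Treat scores at or above the scenario-specific threshold as relevant."""
    return similarity >= threshold


# Hypothetical listening histories, for illustration only.
histories = [Counter({"a": 3, "b": 1}), Counter({"c": 7})]
avg = average_unique_inputs(histories)
print(avg, cbf_length(avg))
```

For an average of 63.8 unique elements, cbf_length returns 128.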

Taking the music listening histories in our scenario, we have an average unique input of 63.8 elements and regard similarities above 0.6 to be relevant. Picking a size of twice the average unique input gives a length of the data structure of 128. We plot the ground truth (red) and the CBF-Dice similarity estimation (blue) in Figure 11. The green area marks similarity scores above 0.6. We observe some false positives: the blue dots in the green area that correspond to red dots below the green area. The left-most blue dot in the green area indicates the largest error. Here, we estimate a significant similarity of ≥ 0.6 while the ground truth value is about 0.5. If we need more accurate results, we choose a larger length, for example as in Figure 10. A positive side effect of the larger errors for low values is that they provide some privacy through lack of accuracy: an estimated similarity of 0.4 corresponds to a ground truth value between 0 and about 0.4, so we cannot make an accurate assumption about the actual similarity.
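
The notion of a false positive used here can be stated as a short Python sketch. The function name and the example pairs are hypothetical; only the first pair loosely mirrors the largest error described above (ground truth about 0.5, estimate at or above 0.6).

```python
def false_positives(pairs: list, threshold: float) -> list:
    """Return (truth, estimate) pairs whose estimate reaches the threshold
    although the ground truth similarity lies below it."""
    return [(t, e) for t, e in pairs if e >= threshold > t]


# (ground truth, estimate) pairs; hypothetical values for illustration.
pairs = [(0.50, 0.62), (0.70, 0.71), (0.20, 0.40), (0.61, 0.65)]
print(false_positives(pairs, threshold=0.6))
```

If the estimate never falls below the ground truth, as argued above for errors caused by collisions, no pair with a ground truth above the threshold can be missed; only false positives remain as a source of misclassification.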

Fig. 11. Similarity scores for ground truth Dice (red) and estimation CBF-Dice (blue) using one-hash CBF (length 128), using real data set RD. The green area indicates similarity scores above the desired threshold of 0.6. Axes: comparison pair (x-axis, 0 to 4000) and similarity score (y-axis, 0.0 to 1.0).

VII. CONCLUSION

In this paper, we approached the problem of multiset

similarity estimation in the scenario of proximity-based MSNs.

We developed the comparison metrics CBF-Dice and CMS-

Dice for the similarity estimation of two CBFs and two

CMSs. Applying these metrics, we can approximate the Dice

coefﬁcient when comparing two multisets. We evaluated our

approach with both synthetic data and real music listening

history data.

Our results show that the more hash functions we utilize in a data structure, the higher the error in the estimation, and the larger the data structure, the smaller the error. We achieve the lowest error when utilizing a one-hash CBF / one-row CMS. Here, we minimize the number of collisions by using only one hash function. The collisions are the source of the error in the similarity estimation. In general, the higher the real similarity, the better the estimation.

The one-hash CBF is appropriate for the given scenario of similarity estimation between two users, requiring only a single data exchange between two smartphones. We described a general approach for choosing an appropriate size of the data structure by estimating the average number of unique input elements, as well as defining a threshold for the similarity score. Using the real user music listening histories with a mean of about 64 unique entries, we showed that a one-hash CBF with length 128 (twice the average number of unique inputs) suffices to accurately estimate the similarity between two multisets.

While we presented the scenario of two strangers meeting

and quickly determining the similarity of their musical tastes,

our approach can be applied in a variety of other scenarios. In

general, any two systems that log any events can be compared

with our approach. Utilizing CBF-Dice, one can perform fast

and space-efﬁcient similarity estimations of the two systems

in terms of the frequencies of the logged events.

For future work in the proximity-based MSN scenario,

potential attack scenarios like malicious users should be ad-

dressed. Furthermore, an implementation for mobile devices

can help evaluate the performance of our proposed approach.

For the scenario of stimulating interaction between strangers,

additional features besides musical taste should be considered,

e.g., visited locations.

ACKNOWLEDGMENT

This work has received funding from project DYNAMIC (http://www.dynamic-project.de, grant No 01IS12056), which is funded as part of the Software Campus initiative by the German Federal Ministry of Education and Research (BMBF). We are grateful for the support provided by Niklas Lensing, Bianca Lüders, Peter Ruppel, Boris Lorbeer, Sandro Rodriguez Garzon, Martin Westerkamp, Kai Grunert, Tanja Deutsch, Bernd Louis, and Axel Küpper.

REFERENCES

[1] F. Beierle, V. T. Tran, M. Allemand, P. Neff, W. Schlee, T. Probst, R. Pryss, and J. Zimmermann, “Context Data Categories and Privacy Model for Mobile Data Collection Apps,” Procedia Computer Science, 2018 (to appear).

[2] F. Beierle, V. T. Tran, M. Allemand, P. Neff, W. Schlee, T. Probst, R. Pryss, and J. Zimmermann, “TYDR - Track Your Daily Routine. Android App for Tracking Smartphone Sensor and Usage Data,” in MOBILESoft ’18: 5th IEEE/ACM International Conference on Mobile Software Engineering and Systems. ACM, 2018 (to appear).

[3] M. Falch, A. Henten, R. Tadayoni, and I. Windekilde, “Business models

in social networking,” in CMI Int. Conf. on Social Networking and

Communities, 2009.

[4] F. Beierle, K. Grunert, S. Göndör, and V. Schlüter, “Towards Psychometrics-based Friend Recommendations in Social Networking Services,” in 2017 IEEE 6th International Conference on AI & Mobile Services (AIMS 2017). IEEE, 2017, pp. 105–108.

[5] F. Beierle, S. Göndör, and A. Küpper, “Towards a Three-tiered Social Graph in Decentralized Online Social Networks,” in Proc. 7th International Workshop on Hot Topics in Planet-Scale mObile Computing and Online Social neTworking (HotPOST). ACM, Jun. 2015, pp. 1–6.

[6] A. Smith, “U.S. Smartphone Use in 2015,” http://www.pewinternet.org/2015/04/01/us-smartphone-use-in-2015/, Accessed 2018-02-15.

[7] R. Farahbakhsh, X. Han, A. Cuevas, and N. Crespi, “Analysis of publicly

disclosed information in Facebook proﬁles,” in Proc. 2013 IEEE/ACM

International Conference on Advances in Social Networks Analysis and

Mining (ASONAM). ACM, Aug. 2013, pp. 699–705.

[8] A. Beach, M. Gartrell, S. Akkala, J. Elston, J. Kelley, K. Nishimoto,

B. Ray, S. Razgulin, K. Sundaresan, B. Surendar, M. Terada, and R. Han,

“WhozThat? Evolving an Ecosystem for Context-Aware Mobile Social

Networks,” IEEE Network, vol. 22, no. 4, pp. 50–55, 2008.

[9] N. Eagle and A. Pentland, “Social Serendipity: Mobilizing social soft-

ware,” Pervasive Computing, IEEE, vol. 4, no. 2, pp. 28–34, 2005.

[10] A.-K. Pietiläinen, E. Oliver, J. LeBrun, G. Varghese, and C. Diot, “MobiClique: Middleware for Mobile Social Networking,” in Proc. 2nd ACM Workshop on Online Social Networks (WOSN). ACM, 2009, pp. 49–54.

[11] Z. Yang, B. Zhang, J. Dai, A. Champion, D. Xuan, and D. Li, “E-

SmallTalker: A Distributed Mobile System for Social Networking in

Physical Proximity,” in 2010 IEEE 30th International Conference on

Distributed Computing Systems (ICDCS), Jun. 2010, pp. 468–477.

[12] J. Teng, B. Zhang, X. Li, X. Bai, and D. Xuan, “E-Shadow: Lubri-

cating Social Interaction Using Mobile Phones,” IEEE Transactions on

Computers, vol. 63, no. 6, pp. 1422–1433, Jun. 2014.

[13] A. Mei, G. Morabito, P. Santi, and J. Stefa, “Social-Aware Stateless

Routing in Pocket Switched Networks,” IEEE Transactions on Parallel

and Distributed Systems, vol. 26, no. 1, pp. 252–261, Jan. 2015.

[14] C. Dong, L. Chen, and Z. Wen, “When Private Set Intersection Meets

Big Data: An Efﬁcient and Scalable Protocol,” in Proc. 2013 ACM

SIGSAC Conference on Computer & Communications Security (CCS).

ACM, 2013, pp. 789–800.

[15] F. Kerschbaum, “Outsourced Private Set Intersection Using Homo-

morphic Encryption,” in Proc. 7th ACM Symposium on Information,

Computer and Comm. Security (ASIACCS). ACM, 2012, pp. 85–86.

[16] J. Tillmanns, “Privately Computing Set-Union and Set-Intersection Car-

dinality via Bloom Filters,” in 20th Australasian Conf. on Inf. Security

and Privacy (ACISP), vol. 9144. Springer, 2015, pp. 413–430.

[17] F. Beierle, K. Grunert, S. Göndör, and A. Küpper, “Privacy-aware Social Music Playlist Generation,” in Proc. 2016 IEEE International Conference on Communications (ICC). IEEE, May 2016, pp. 5650–5656.

[18] B. H. Bloom, “Space/Time Trade-offs in Hash Coding with Allowable

Errors,” Commun. ACM, vol. 13, no. 7, pp. 422–426, Jul. 1970.

[19] L. Fan, P. Cao, J. Almeida, and A. Z. Broder, “Summary Cache: A Scal-

able Wide-Area Web Cache Sharing Protocol,” IEEE/ACM Transactions

on Networking, vol. 8, no. 3, pp. 281–293, Jun. 2000.

[20] G. Cormode and S. Muthukrishnan, “An Improved Data Stream Sum-

mary: The Count-Min Sketch and its Applications,” in LATIN 2004:

Theoretical Informatics. Springer, 2004, pp. 29–38.

[21] N. Jain, M. Dahlin, and R. Tewari, “Using Bloom Filters to Refine Web Search Results,” in WebDB, 2005, pp. 25–30.

[22] R. Schnell, T. Bachteler, and J. Reiher, “Privacy-preserving record

linkage using Bloom ﬁlters,” BMC Medical Informatics and Decision

Making, vol. 9, no. 1, p. 41, Aug. 2009.

[23] B. Donnet, B. Gueye, and M. A. Kaafar, “Path similarity evaluation

using Bloom ﬁlters,” Computer Networks, vol. 56, no. 2, pp. 858–869,

2012.

[24] M. Alaggan, S. Gambs, and A.-M. Kermarrec, “BLIP: Non-interactive

Differentially-Private Similarity Computation on Bloom ﬁlters,” in Sta-

bilization, Safety, and Security of Distributed Systems, ser. LNCS, A. W.

Richa and C. Scheideler, Eds. Springer, 2012, no. 7596, pp. 202–216.

[25] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere, “The million

song dataset,” in Proc. 12th International Society for Music Information

Retrieval Conference (ISMIR), 2011.