You Are What You Like! Information Leakage Through Users' Interests
Abdelberi Chaabane, Gergely Acs, Mohamed Ali Kaafar
INRIA France
{chaabane, gergely.acs, kaafar}
Abstract

Suppose that a Facebook user, whose age is hidden or missing, likes Britney Spears. Can you guess his/her age? Knowing that most Britney fans are teenagers, it is fairly easy for humans to answer this question. Interests (or "likes") of users are one of the most widely available pieces of on-line information. In this paper, we show how these seemingly harmless interests (e.g., music interests) can leak privacy-sensitive information about users. In particular, we infer their undisclosed (private) attributes using the public attributes of other users sharing similar interests. In order to compare user-defined interest names, we extract their semantics using an ontologized version of Wikipedia and measure their similarity by applying a statistical learning method. Besides self-declared interests in music, our technique does not rely on any further information about users such as friend relationships or group belongings. Our experiments, based on more than 104K public profiles collected from Facebook and more than 2000 private profiles provided by volunteers, show that our inference technique efficiently predicts attributes that are very often hidden by users. To the best of our knowledge, this is the first time that user interests are used for profiling, and more generally, that semantics-driven inference of private data is addressed.
1. Introduction
Among the vast amount of personal information, user interests or likes (using the terminology of Facebook) are one of the most widely available pieces of public information on On-line Social Networks (OSNs). Our measurements show that 57% of the roughly half a million Facebook user profiles that we collected publicly reveal at least one interest amongst different categories. This wealth of information shows that the majority of users consider this information harmless to their privacy, as they do not see any correlation between their interests and their private data. Nonetheless, interests, if augmented with semantic knowledge, may leak information about their owner and thus lead to a privacy breach. For example, consider an unknown Facebook user who has an interest "Eenie Meenie". In addition, there are many female teenager users who have interests such as "My World 2.0" and "Justin Bieber". It is easy to predict that the unknown user is probably also a female teenager: "Eenie Meenie" is a Justin Bieber song, and most Facebook users who have these interests are female teenagers. This example illustrates the two main components of our approach: (1) deriving semantic correlation between words (e.g., "My World 2.0", "Eenie Meenie", and "Justin Bieber") in order to link users sharing similar interests, and (2) deriving statistics about these users (e.g., Justin Bieber fans) by analyzing their public Facebook profiles. To the best of our knowledge, the possibility of this information leakage and the automation of such inference have never been considered so far. We believe that this lack of exploitation is due to several challenges in extracting useful information from interest names and descriptions.
First, many interests are ambiguous. In fact, they are short sentences (or even one word) that deal with a concept. Without a semantic definition of this concept, the interest is equivocal. For example, if a user includes "My World 2.0" in her Art/Entertainment interests, one can imply that this user is likely to be interested in pop as a genre of music. Without knowledge of what "My World 2.0" is, the information carried by such an interest remains hidden, and hence unexploitable.
Second, drawing a semantic link between different interests is difficult. For example, if a user includes "My World 2.0" in her public profile and another user chooses the interest "I love Justin Bieber", then clearly, these two users are among the Justin Bieber fans. However, at a large scale, automating interest linkage may not be possible without semantic knowledge.
Finally, interests are user-generated, and as such, very heterogeneous items, as opposed to marketers' classified items (e.g., in Amazon, IMDb, etc.). This is due to the fact that OSNs do not have any control over how the descriptions and titles of interests are constructed. As a result, interest descriptions as provided by users are often incorrect, misleading or altogether missing. It is therefore very difficult to extract useful information from interests and to classify them from the user-generated descriptions. In particular, interests that are harvested from user profiles are different in nature, ranging from official homepage links or ad-hoc created groups to instantaneous user input. In addition, interest descriptions, as shown on public profiles, either have coarse granularity (i.e., high-level descriptions of classes of interests such as "Music", "Movies", "Books", etc.), or they are too fine-grained to be exploited (e.g., referring to the name of a singer/music band, or to the title of a recent movie, etc.). Finding a source of knowledge encompassing this huge variety of concepts is challenging.
Therefore, linking users sharing semantically related interests is the pivot of our approach. The main goal of our work is to show how seemingly harmless information such as interests, if augmented with semantic knowledge, can leak private information. As a demonstration, we will show that, solely based on what users reveal as their music interests, we can successfully infer hidden information with more than 70% of correct guesses for some attributes in Facebook. Furthermore, as opposed to previous works [18, 27, 22], our technique does not need further information about users such as friend relationships or group belongings.
Technical Roadmap
Our objective is to find interest similarities between users, even though these similarities might not be clearly observable from their interests. We extract semantic links between their seemingly unrelated interest names using the Latent Dirichlet Allocation (LDA) generative model [6]. The idea behind LDA is to learn the underlying (semantic) relationships between different interests, and to classify them into "unobserved" groups (called Interest Topics). The output of LDA is the probability that an interest name belongs to each of these topics.
To identify latent (semantic) relations between different interests, LDA needs a broader semantic description of each interest than simply its short name. For instance, LDA cannot reveal semantic relations between the interests "Eenie Meenie" and "My World 2.0" using only these names, unless they are augmented with some text describing their semantics. Informally, we create a document about "Eenie Meenie" and another about "My World 2.0" that contain their semantic descriptions, and then let LDA identify the common topics of these documents. These documents are called Interest Descriptions. In order to draw semantic knowledge from the vast corpus of users' interests, we leverage the ontologized version of Wikipedia. An interest description, according to our Wikipedia usage, consists of the parent categories of the article that most likely describes the interest. These represent broader topics organizing this interest. For instance, there is a single Wikipedia article about "Eenie Meenie" which belongs to the category "Justin Bieber songs" (among others). In addition, there is another article about "My World 2.0" that belongs to the category "Justin Bieber albums". Therefore, the descriptions of the interests "Eenie Meenie" and "My World 2.0" will contain "Justin Bieber songs" and "Justin Bieber albums", respectively, and LDA can create a topic representing Justin Bieber which connects the two interests. An interesting feature of this method is the ability to enrich the user's interests and thus draw a broader picture of the semantics behind the interests of the user. We used two sets of 104K public Facebook profiles and 2000 private profiles to derive the topics of all the collected interests.
Knowing each user's interests and the probabilities that these interests belong to the identified topics, we compute the likelihood that users are interested in these topics. Our intuition is that users who are interested in roughly the same topics with "similar" likelihood (called interest neighbors) also have similar personal profile data. Hence, to infer a specific user's hidden attribute in his profile, we identify his interest neighbors who publicly reveal this attribute in their profile. Then, we guess the hidden value from the neighbors' (public) attribute values.
We postulate and verify that interest-based similarities between users, and in particular their music preferences, are a good predictor of hidden information. As long as users reveal their music interests, we show that sensitive attributes such as Gender, Age, Relationship status and Country-level location can be inferred with high accuracy.
Organization We describe our attacker model in Section 2. Section 3 presents related work and shows the main differences between our approach and previous works. Our algorithm is detailed in Section 4 and both of our datasets are described in Section 5. Section 6 is devoted to presenting our inference results. We discuss some limitations of our approach and present future work in Section 7, and finally we conclude.
2. Attacker Model
Before defining our attacker model, we describe user
profiles as implemented by Facebook.
Facebook implements a user profile as a collection of personal data called attributes, which describe the user. These attributes can be binary, such as Gender, or multi-valued, such as Age. The availability of these attributes obeys a set of privacy-setting rules. Depending on these privacy settings, which are set by the user, information can be revealed exclusively to the social links established on the OSN (e.g., friends or friends of friends in Facebook) or publicly (i.e., to everyone). In this paper, we demonstrate the information leakage through users' interests by inferring the private attributes of a user. We consider two binary attributes (Gender: male/female and Relationship status: married/single) and two multi-valued attributes (Country-level location and Age).
As opposed to previous works [18, 27], we consider an
attacker that only has access to self-declared, publicly avail-
able music interests. Hence, the attacker can be anyone who
can collect the Facebook public profile of a targeted user. In
fact, earlier attacks considered a dataset crawled from a spe-
cific community such as a university. Hence, the crawler
being part of this community had access to attributes that
are only visible to friends which impacts data availability.
Indeed, as we will show in Section 5, data availability differs depending on whether we deal with public data (data disclosed to anyone) or private data (data disclosed to friends only).
Thus our attacker is more general compared to [18, 27],
since it relies only on public information.
This characteristic allows us to consider a broader range of attackers. For example, our technique can be used for the purpose of profiling to deliver targeted ads. Advertisers could automatically build user profiles with high accuracy and minimum effort, with or without the consent of the users. Spammers could gather information across the web to send extremely targeted spam (e.g., by including specific information related to the location or age of the targeted user).
3. Related Work
Most papers have considered two main privacy problems
in OSNs: inferring private attributes and de-anonymizing
users. Most of these works used the information of friend-
ships or group belongings in order to achieve these goals.
By contrast, our approach only relies on users’ interests.
In particular, instead of using link-based classification algorithms [11] and/or mixing multiple user attributes to improve inference accuracy, we provide a new approach based on semantic knowledge in order to demonstrate information leakage through user interests. Moreover, all previous works relied on private datasets (e.g., the dataset of a private community such as a university), and hence assumed a different attacker model than ours (see Section 2 for details). We also leverage knowledge from the area of Personalizing Retrieval in order to link users sharing similar interests.
Private Attribute Inference Zheleva and Getoor [27] were the first to study the impact of friends' attributes on the privacy of a user. They tried to infer private user attributes based on the groups the users belong to. (Facebook has recently added a feature to split friends into sublists in order to make some attributes accessible to a chosen subset of friends.) For that purpose, they compared the inference accuracy of different link-based classification algorithms. Although their approach provides good results for some OSNs such as Flickr, they admit that it is not suitable for Facebook, especially with multi-valued attributes such as political views. Moreover, they made the assumption that at least 50% of a user's friends reveal the private attribute. However, our experiments show that this is not realistic in our attacker model, since users tend to (massively) hide their attributes from public access (see Section 5). For instance, only 18% of users on Facebook disclose their relationship status and less than 2% disclose their birth date.
In [13], the authors built a Bayesian network from links extracted from a social network. Although they crawled a real OSN (LiveJournal), they used hypothetical attributes to analyze their learning algorithm. A further step was taken by [18], who proposed a modified Naive Bayes classifier that infers political affiliation (i.e., a binary value: liberal or conservative) based on user attributes, user links, or both. Besides having a different attacker model, we do not use the combination of multiple attributes to infer the missing one (i.e., we only use music interests).

Rather than relying on self-declared or existing graphs, Mislove et al. [22] built "virtual" communities based on a metric called Normalized Conductance. However, community-based inference is data dependent, because the detected community may not correlate with the attribute to be inferred. Indeed, [25] provided an in-depth study of community detection algorithms for social networks. After comparing the results of 100 different social graphs (provided by Facebook), they concluded that a common attribute of a community is a good predictor only in certain social graphs (e.g., according to [25], the communities in the MIT male network are dominated by residence, but this is not the case for female networks).
De-anonymizing Users In [5], the authors considered an anonymized network composed of nodes (users) and edges (social links), where the attacker aims to identify a "targeted" user. Another problem was considered by [26], where a targeted user visiting a hostile website was de-anonymized using his group belongings (stolen from his web history). The main idea behind both attacks is that the group membership of a user is in general sufficient to identify him/her. De-anonymization of users, as considered by these works, is an orthogonal privacy risk to attribute inference.
Personalizing Retrieval Our work shares techniques with the area of personalizing retrieval, where the goal is to build personalized services for users. These can be derived from the user's "taste" or by interpreting his social interactions. This is an active research domain, and a broad range of problems have been solved and applied in e-commerce, recommendation, collaborative filtering and the like. This knowledge extraction entails the analysis of a large text corpus from which one can derive a statistical model that explains latent interactions between the documents. Latent semantic analysis techniques provide an efficient way to extract underlying topics and cluster documents [14, 9]. Latent Dirichlet Allocation (LDA) [6] has been extended by Zhang et al. [7] to identify communities in the Orkut social network. The model was successfully used to recommend new groups to users. In addition, Zheleva et al. [28] used an adapted LDA model to derive music taste from the listening activities of users in order to identify songs related to a specific taste and the listeners who share the same taste.

Similarly to these works, we also use LDA to capture the interest topics of users, but instead of recommending content, our goal is to link users sharing semantically related interests to demonstrate information leakage.
4. From Interest Names to Attribute Inference
4.1. Overview
While a human can easily capture the semantics behind different interest names (titles or short descriptions), this task cannot be easily automated. In this section, we present how we can extract meaningful knowledge from users' interests and then classify them for the purpose of attribute inference. Our technique consists of four main steps, as illustrated by Figure 1:
1. Creating Interest Descriptions: interest descriptions are the user-specified interest names augmented with semantically related words which are mined from the Wikipedia ontology.
2. Extracting semantic correlation between interest descriptions using Latent Dirichlet Allocation (LDA). The output represents a set of topics containing semantically related concepts.
3. Computing Interest Feature Vectors (IFV): based on the discovered topics, LDA also computes the probability that an interest I belongs to Topic i, for all I and i (Step 3a). Then, we derive the IFV of each user (Step 3b), which quantifies the interest of a user in each topic.
4. Computing the neighbors of each user in the feature space (i.e., the users whose IFVs are similar in the feature space) to discover similar users, and exploiting this neighborhood to infer hidden attributes.
4.2. Step 1: Augmenting Interests
Interest names (shortly, interests) extracted from user profiles can be single words, phrases, or even complex sentences. These text fragments are usually insufficient to characterize the interest topics of the user. Indeed, most statistical learning methods, such as LDA, need a deeper description of a given document (i.e., interest) in order to identify the semantic correlations inside a text corpus (i.e., a set of interests). Moreover, the diversity and heterogeneity of these interests make their description a difficult task. For instance, two different interests such as "AC/DC" and "I love Angus Young" refer to the same band. However, these strings on their own provide insufficient information to reveal this semantic correlation. To augment interest names with further content that helps LDA to identify their common topics, we use an ontology, which provides structured knowledge about any unstructured fragment of text (i.e., interest names).
4.2.1 Wikipedia as an Ontology
Although there are several available ontologies [10, 3], we use the ontologized version of Wikipedia, the most up-to-date and largest reference of human knowledge in the world. Wikipedia represents a huge, constantly evolving collection of manually defined concepts and semantic relations, which are sufficient to cover most interest names. Moreover, Wikipedia is multilingual, which allows the augmentation of non-English interest names. We used the Wikipedia Miner Toolkit [21] to create the ontologized version of Wikipedia from a dump made in January 2011.
Wikipedia includes articles and categories. Each article describes a single concept or topic, and almost all of Wikipedia's articles are organized within one or more categories, which can be mined for broader (more general) semantic meaning. AC/DC, for example, belongs to the categories Australian hard rock musical groups, Hard rock musical groups, Blues-rock groups, etc. All of Wikipedia's categories descend from a single root called Fundamental. The distance between a particular category and this root measures the category's generality or specificity. For instance, AC/DC is at depth 5, while its parent categories are at depth 4.
All articles contain various hyperlinks pointing to further (semantically related) articles. For example, the article about Angus Young contains links to the articles AC/DC, musician, duckwalk, etc. The anchor texts used within these links have particular importance, as they can help with disambiguation and eventually with identifying the article most related to a given search term: e.g., if the majority of the "duckwalk" links (i.e., the links whose anchor texts contain the string "duckwalk") point to Chuck Berry and only a few of them to the bird Duck, then with high probability the search term "duckwalk" refers to Chuck Berry (a dancing style performed by Chuck Berry). Indeed, the toolkit uses this approach to search for the article most related to a search string: first, the anchor texts of the links made to an article are used to index all articles. Then, the article which has the most links containing the search term as the anchor text is defined to be the most related article.

Interests (input of Step 1):
  I1: Michael Jackson    I2: Lady Gaga    I3: Lil Wayne    I4: Bob Marley
  I5: Sai Sai Kham Leng  I6: Fadl Shaker  I7: Charlene Choi

Augmented interests, i.e., interest descriptions (Step 1, via Wikipedia):
  I1: Michael Jackson; american pop singers; american choreographers; american dance musicians; the jackson 5 members; mtv video music awards winners; ...
  I2: Lady Gaga; american pop singers; american singer-songwriters; bisexual musicians; musicians from new york; english-language singers; ...
  I4: Bob Marley; jamaican songwriters; jamaican male singers; rastafarian music; english-language singers; ...
  I5: Sai Sai Kham Leng; burmese musicians; burmese singers; burmese singer-songwriters; ...
  I6: Fadl Shaker; lebanese singers; arabic-language singers; arab musicians; ...

Extracted interest topics (Step 2):
  Topic 1: american blues singers; american soul singers; american male singers; african american singers; english-language singers; american pop singers; musicians from indiana; musicians from philadelphia; ...
  Topic 2: arabic-language singers; lebanese singers; rotana artists; arab musicians; israeli jews; egyptian singers; algerian musicians; rai musicians; ...
  Topic 3: freestyle rappers; hip hop singers; g-funk; crips; hip hop djs; hip hop musicians; electro-hop musicians; african american songwriters; ...

Topic belongings of interests (Step 3a):
  Interest  Topic 1  Topic 2  Topic 3  Topic 4  ...
  I1        0.8      0        0.4      0.5
  I2        0.8      0        0.3      0
  I4        0        0        0.1      0.7
  I6        0        0.9      0        0

User interests:
  User1: I1
  User2: I1, I2
  User3: I6
  User4: I1, I2, I6

Interest Feature Vectors (IFV, Step 3b):
  User   Topic 1  Topic 2  Topic 3  Topic 4  ...
  User1  0.8      0        0.4      0.5
  User2  0.96     0        0.58     0.5
  User3  0        0.9      0        0
  User4  0.96     0.9      0.58     0.5

Figure 1: Computing interest feature vectors. First, we extract interest names and augment them using Wikipedia (Step 1). Then, we compute correlations between augmented interests and generate topics (Step 2) using LDA. Finally, we compute the IFV of each user (Step 3).
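The anchor-text disambiguation heuristic described above can be sketched as follows. The anchor index here is a hand-made toy standing in for the toolkit's actual index, and `most_related_article` is an illustrative name, not the toolkit's API:

```python
from collections import Counter

# Toy index: anchor text -> targets of the links that use this anchor.
# Invented for illustration; the real index is built by the Wikipedia
# Miner Toolkit from all inter-article links.
ANCHOR_INDEX = {
    "duckwalk": ["Chuck Berry", "Chuck Berry", "Chuck Berry", "Duck"],
}

def most_related_article(term):
    """Return the article most often targeted by links whose anchor
    text matches the search term (majority of anchors wins)."""
    targets = ANCHOR_INDEX.get(term.lower())
    if not targets:
        return None
    return Counter(targets).most_common(1)[0][0]

print(most_related_article("duckwalk"))  # -> Chuck Berry
```

Since three of the four "duckwalk" anchors point to Chuck Berry, the search term resolves to that article rather than to the bird.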
4.2.2 Interest Description
The description of an interest is the collection of the par-
ent categories of its most related Wikipedia article (more
precisely, the collection of the name of these categories). To
create such descriptions, we first searched for the Wikipedia
article that is most related to a given interest name using the
toolkit. The search vocabulary is extensive (5 million or
more terms and phrases), and encodes both synonymy and
polysemy. The search returns an article or set of articles that
could refer to the given intere st. If a list is returned, we se-
lect the article that is most likely related to the interest name
as described above. Afterwards, we gather all the parent
categories of the most related article which constitute the
description of the interest. For example, in Figure 1, User3
has interest “Fadl Shaker”. Searching for “Fadl Shaker” in
Wikipedia, we obtain a single article which has parent cat-
egories Arab musicians”, Arabic-language singers and
“Lebanese male singers”. These strings altogether (with
“Fadl Shaker”) give the description of this interest.
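A minimal sketch of this description-building step, with a hypothetical category table standing in for the ontologized-Wikipedia lookup (the function name and the dict are illustrative, not the paper's code):

```python
# Parent categories of the most related article; a real implementation
# would query the ontologized Wikipedia instead of this toy dict.
PARENT_CATEGORIES = {
    "Fadl Shaker": ["Arab musicians", "Arabic-language singers",
                    "Lebanese male singers"],
}

def interest_description(interest_name):
    """Description = the interest name plus the parent categories of
    its most related Wikipedia article (name only if unmatched)."""
    categories = PARENT_CATEGORIES.get(interest_name, [])
    return [interest_name] + categories

print(interest_description("Fadl Shaker"))
# -> ['Fadl Shaker', 'Arab musicians', 'Arabic-language singers',
#     'Lebanese male singers']
```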
4.3. Step 2: Extracting Semantic Correlation
To identify semantic correlations between interest descriptions, we use Latent Dirichlet Allocation (LDA) [6]. LDA captures statistical properties of text documents in a discrete dataset and represents each document in terms of the underlying topics. More specifically, given a text corpus consisting of N documents (i.e., N interest descriptions), each document is modeled as a mixture of latent topics (interest topics). A topic represents a cluster of words that tend to co-occur with high probability within the topic. For example, in Figure 1, "American soul singers" and "American blues singers" often co-occur and thus belong to the same topic (Topic 1). However, we do not expect to find "Arab musicians" in the same context, and thus it belongs to another topic (Topic 2). Note that the topics are created by LDA and they are not named. By characterizing the statistical relations among words and documents, LDA can estimate the probability that a given document is about a given topic, where the number of all topics is denoted by k and is a parameter of the LDA model.
More precisely, LDA models our collection of interest descriptions as follows. The topics of an interest description are described by a discrete (i.e., categorical) random variable M(φ) with parameter φ, which is in turn drawn from a Dirichlet distribution D(α) for each description, where both φ and α are parameter vectors of size k. In addition, each topic z out of the k has a discrete distribution M(β_z) over the whole vocabulary. The generative process for each interest description has the following steps:

1. Sample φ from D(α).
2. For each word w_n of the description:
   (a) Sample a topic z_n from M(φ).
   (b) Sample a word w_n from M(β_{z_n}).

Note that α and B = {β_1, ..., β_k} are corpus-level parameters, while φ is a document-level parameter (i.e., it is sampled once for each interest description). Given the parameters α and B, the joint probability distribution of an interest topic mixture φ, a set of words W, and a set of k topics Z for a description is

  p(φ, Z, W | α, B) = p(φ | α) ∏_n p(z_n | φ) p(w_n | z_n, B)   (1)

where the product runs over the words of the description. The observable variable is W (i.e., the set of words in the interest descriptions), while α, B, and φ are latent variables. Equation (1) describes a parametric empirical Bayes model, where we can estimate the parameters using Bayesian inference. In this work, we used collapsed Gibbs sampling [19] to recover the posterior marginal distribution of φ for each interest description. Recall that φ is a vector, i.e., φ_i is the probability that the interest description belongs to Topic i.
4.4. Step 3: Interest Feature Vector (IFV) Extraction
The probability that a user is interested in Topic i is the probability that at least one of his interest descriptions belongs to Topic i. Let V denote a user's interest feature vector, I the set of his interest descriptions, and φ_{I,i} the probability that interest description I belongs to Topic i. Then, for all 1 ≤ i ≤ k,

  V_i = 1 − ∏_{I ∈ I} (1 − φ_{I,i})

is the probability that the user is interested in Topic i. For instance, in Figure 1, User4 has interests "Lady Gaga", "Michael Jackson", and "Fadl Shaker". The probability that User4 is interested in Topic 1 (American singers) is the probability that at least one of these interests belongs to Topic 1. This equals 1 − ((1 − 0.8)(1 − 0.8)) = 0.96.
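The IFV computation can be sketched directly from this formula; the per-interest topic probabilities below are the rows of the Figure 1 example (the function name is illustrative):

```python
def interest_feature_vector(topic_probs_per_interest, k):
    """V_i = 1 - prod_I (1 - phi_{I,i}): the probability that at least
    one of the user's interests belongs to Topic i."""
    v = []
    for i in range(k):
        p_none = 1.0  # probability that no interest belongs to Topic i
        for phi in topic_probs_per_interest:
            p_none *= 1.0 - phi[i]
        v.append(round(1.0 - p_none, 2))
    return v

# User4 likes I1, I2 and I6 (topic belongings from Figure 1).
I1 = [0.8, 0.0, 0.4, 0.5]
I2 = [0.8, 0.0, 0.3, 0.0]
I6 = [0.0, 0.9, 0.0, 0.0]
print(interest_feature_vector([I1, I2, I6], k=4))  # -> [0.96, 0.9, 0.58, 0.5]
```

This reproduces User4's row of the IFV table: Topic 1 gives 1 − (0.2 · 0.2) = 0.96, and Topic 3 gives 1 − (0.6 · 0.7) = 0.58.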
4.5. Step 4: Inference
4.5.1 Neighbors Computation
Observe that an IFV uniquely defines the interests of an individual in a k-dimensional feature space. By defining an appropriate distance measure in this space, we can quantify the similarity between the interests of any two users. This allows the identification of users who share similar interests and likely have correlated profile data, which can be used to infer their hidden profile data.
We use a chi-squared distance metric. In particular, the distance d_{χ²} between two IFVs V and W is

  d_{χ²}(V, W) = ∑_{i=1}^{k} (V_i − W_i)² / (V_i + W_i)

In [23], the authors showed that the chi-squared distance gives better results than other metrics when dealing with vectors of probabilities. Indeed, we conducted several tests with other distance metrics (Euclidean, Manhattan and Kullback-Leibler), and the results show that the chi-squared distance outperforms all of them.
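A sketch of this metric, with the common convention (an assumption on our part, not stated in the text) that a term with V_i + W_i = 0 contributes nothing, since both users then ignore that topic:

```python
def chi_squared_distance(v, w):
    """d(V, W) = sum_i (V_i - W_i)^2 / (V_i + W_i); zero-sum terms
    are skipped to avoid division by zero."""
    d = 0.0
    for vi, wi in zip(v, w):
        s = vi + wi
        if s > 0:
            d += (vi - wi) ** 2 / s
    return d

# Identical IFVs are at distance 0; disjoint topic interests are far apart.
print(chi_squared_distance([0.8, 0.0], [0.8, 0.0]))           # -> 0.0
print(round(chi_squared_distance([0.8, 0.0], [0.0, 0.9]), 2)) # -> 1.7
```

Relative to a plain Euclidean distance, the per-topic normalization by V_i + W_i damps the influence of topics in which both users are strongly interested.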
Using the above metric, we can compute the ℓ nearest neighbors of a user u (i.e., the users who are the closest to u in the interest feature space). A naive approach is to compute all M²/2 pairwise distances, where M is the number of all users, and then to find the ℓ closest ones for each user. However, this becomes impractical for large values of M and k. A more efficient approach uses a k-d tree: the tree can be constructed efficiently (with complexity O(M log M)), then saved and used afterwards to compute the closest neighbors of any user with a worst-case complexity of O(k · M^{1−1/k}).
4.5.2 Inference
We can infer a user's hidden profile attribute x from those of its ℓ nearest neighbors: first, we select the ℓ nearest neighbors out of all users whose attribute x is defined and public. Then, we do majority voting on the hidden value (i.e., we select the attribute value which the most users out of the ℓ nearest neighbors have). If more than one attribute value has the maximal number of votes, we randomly choose one.
For instance, suppose that we want to infer User4's country-level location in Figure 1, and User4 has 5 nearest neighbors (who publish their locations) because all of them are interested in Topic 2 with high probability (e.g., they like "Fadl Shaker"). If 3 out of these 5 are from Egypt and the others are from Lebanon, then our guess for User4's location is Egypt.
Although there are multiple techniques besides majority
voting to derive the hidden attribute value, we will show
in Section 6.2 that, surprisingly, even this simple technique
results in remarkable inference accuracy.
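The neighbor selection and majority vote can be sketched together as follows. The chi-squared distance is inlined for self-containment, the neighbor profiles are invented toy data (not from the paper's datasets), and unlike the paper, this sketch breaks voting ties by insertion order rather than randomly:

```python
from collections import Counter

def chi2(v, w):
    # Chi-squared distance between two IFVs; zero-sum terms skipped.
    return sum((a - b) ** 2 / (a + b) for a, b in zip(v, w) if a + b > 0)

def infer_attribute(target_ifv, users, attr, n_neighbors=5):
    """Majority vote of `attr` over the n_neighbors users closest to
    target_ifv, restricted to users who publish that attribute."""
    candidates = [(chi2(target_ifv, u["ifv"]), u[attr])
                  for u in users if u.get(attr) is not None]
    candidates.sort(key=lambda pair: pair[0])  # nearest first
    votes = Counter(value for _, value in candidates[:n_neighbors])
    return votes.most_common(1)[0][0]

# Toy neighbors of User4: all strongly interested in "Topic 2".
neighbors = [
    {"ifv": [0.00, 0.90], "location": "Egypt"},
    {"ifv": [0.10, 0.80], "location": "Egypt"},
    {"ifv": [0.00, 0.85], "location": "Egypt"},
    {"ifv": [0.05, 0.90], "location": "Lebanon"},
    {"ifv": [0.00, 0.95], "location": "Lebanon"},
]
print(infer_attribute([0.0, 0.9], neighbors, "location"))  # -> Egypt
```

With 3 Egypt votes against 2 Lebanon votes among the 5 nearest neighbors, the guess is Egypt, matching the worked example above.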
5. Dataset Description
For the purpose of our study, we collected two profile
datasets from Facebook. The first is composed of Facebook
profiles that we crawled and which we accessed as “every-
one” (see Section 5.1). The second is a set of more than
4000 private profiles that we collected from volunteers us-
ing a Facebook application (see Section 5.2). Next, we de-
scribe our methodology used to collect these datasets. We
also present the technical challenges that we encountered
while crawling Facebook. Finally, we describe the char ac-
teristics of our datasets.
5.1. Crawling Public Facebook Profiles
Crawling a social network is challenging due to several reasons. One main concern is to avoid sampling biases. A well-known way to avoid sampling bias is the so-called True Uniform Sampling (UNI) of user identifiers (IDs). UNI consists in generating a random 32-bit ID and then crawling the corresponding user profile in Facebook. This technique has a major drawback in practice: most of the generated IDs are likely to be unassigned, and thus not associated with any profile (only 16% of the 32-bit space is used). Hence, the crawler would quickly become very resource-consuming because a large number of requests would be unsuccessful. In our case, inspired by the conclusions in [15], and avoiding sampling bias that might be introduced by different social graph crawls (e.g., Breadth-First Search), we follow a simple, yet efficient two-step crawling methodology as an alternative to UNI.
First, we randomly crawled a large fraction of the Facebook Public Directory: a set of 100 million (and 120 thousand) URLs of searchable Facebook profiles was collected (without profile data). This technique allows us to avoid the random generation of user identifiers while uniformly (independently from the social graph properties) collecting existing user identifiers.
Second, from this list of candidate URLs of profiles, we crawled a set of randomly selected 494,392 profiles out of the 100 million. The crawled dataset is called RawProfiles.
Finally, the entire RawProfiles dataset was sanitized to fit our validation purposes. Two restrictions were considered: (1) non-Latin-written profiles were filtered out from the dataset, and (2) only profiles with at least one music interest with a corresponding Wikipedia description were kept. As a result, we obtained a set of 104,401 profiles. This dataset, called PubProfiles, is then used as an input of our inference algorithm (see details in Section 4.4).
Technical challenges
As noted above, we crawled pro files to collect public in-
formation that are available to everyone. However, Face-
book, as most OSNs operators do, protects this data from
exhaustive crawling by implementing a plethora of anti-
crawler techniques. For instance, it implements a request rate limit that, if exceeded, generates a CAPTCHA to be solved. To bypass this restriction, and to be cautious not to DoS the system, we set a very slow request frequency (1 per minute). In addition, we distributed our crawler on 6 different machines that were geographically spread. It is also worth noting that one of the trickiest countermeasures that Facebook implements to prevent easy crawling is the rendering of the web page. In particular, rather than
sending a simple HTML page to the client browser, Face-
book embeds HTML inside JavaScript, thus, the received
page is not a valid HTML page but a JavaScript code that
has to be interpreted. Unfortunately, most publicly available crawling libraries do not interpret JavaScript. Thus, we developed our own lightweight web browser, based on the Qt port of WebKit [1], which is capable of interpreting JavaScript. This allows our crawler to be served with easy-to-parse HTML pages.
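As an illustration of the throttling just described, a minimal rate limiter might look as follows (a sketch only; the actual crawler's scheduling and its distribution across machines are not specified beyond the one-request-per-minute figure):

```python
import time

class RateLimiter:
    """Allow at most one request per `period` seconds; the crawl used a
    very conservative period (one request per minute per machine)."""
    def __init__(self, period=60.0):
        self.period = period
        self.last = None

    def wait(self):
        """Block until the next request is allowed, then record the time."""
        now = time.monotonic()
        if self.last is not None and now - self.last < self.period:
            time.sleep(self.period - (now - self.last))
        self.last = time.monotonic()
```

Each crawler machine would call `wait()` before every HTTP request.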
5.2. A Facebook Application to Collect Private Attributes
We developed a Facebook application to gather private attributes from users. The application was distributed to many of our colleagues and friends on Facebook, and was
surprisingly used by more users than expected. Users vol-
unteered to install the application, and hence, their private
information was collected by our tool. We collected private attributes from 4012 profiles, out of which 2458 profiles have at least one music interest. These anonymized private profiles, collected from April 6 to April 20, 2011, represent our private dataset (called VolunteerProfiles).
The usage of this dataset is motivated by our need to
understand how data availability varies between public and private datasets, and to verify whether it impacts the results of our algorithm.
5.3. Ethical and Legal Considerations
In order to comply with legal and ethical aspects in
crawling online social networks data, we were cautious not
to inadvertently DoS the Facebook infrastructure (as men-
tioned in Section 5.1). Also cautionary measures were
taken to prevent our crawler from requesting off-limit in-
formation. In other words, our crawler is compliant with
the Robots Exclusion Protocol [2]. Even though we ac-
cessed publicly available information, we anonymized the
collected data by removing user names and all information that was irrelevant to our study.
The Facebook application needed more sanitization to ensure users' anonymity. The reader might refer to the ‘Disclosure and Privacy Policy’ of the application for more information.
Figure 2: Left: Complementary Cumulative Distribution Function of Music Interests. Right: Cumulative Distribution Func-
tion (CDF) of Country-level Locations (retrieved from the CurrentCity attribute)
5.4. Dataset Description
In the following, we provide statistics that describe the datasets used in this study. First, Table 1 summarizes the statistics about the availability of attributes in the three datasets (i.e., in RawProfiles, PubProfiles and VolunteerProfiles).
Attributes Raw(%) Pub(%) Volunteer(%)
Gender 79 84 96
Interests 57 100 62
Current City 23 29 48
Looking For 22 34 -
Home Town 22 31 48
Relationship 17 24 43
Interested In 16 26 -
Birth date 6 11 72
Religion 1 2 0
Table 1: The availability of attributes in our datasets.
We observe that Gender is the most common attribute that users publicly reveal. However, three attributes that we want to infer are largely kept private. The age is the information that users conceal the most (89% are undisclosed in PubProfiles). Comparing the availability of the attributes in PubProfiles and VolunteerProfiles is enlightening. We can clearly note that users tend to hide their attribute values from public access even though these attributes are frequently provided (in their private profiles).
For instance, the birth date is provided in more than 72% of VolunteerProfiles, whereas it is rarely available in PubProfiles (only 1.62% of users provide their full birth date). The current city is publicly revealed in almost 30% of the cases, whereas half of all volunteers provided this data in their private profiles. Recall that the attributes we are interested in are either binary (Gender, Relationship) or multi-valued (Age, Country-level location). Finally, note that, as shown in Table 1, the public availability of attributes in PubProfiles and in RawProfiles is roughly similar.
Also note that the availability of interests slightly changes from RawProfiles (57%) to VolunteerProfiles (62%), yet remains relatively abundant. This behavior has at least two possible explanations: (1) by default, Facebook sets Interest to be a public attribute; (2) users are more willing to reveal their interests compared to other attributes. Figure 2 (left) plots the complementary cumulative distribution of the number of music interests publicly revealed by users in the three datasets. Note that more than 30% of RawProfiles profiles reveal at least one music interest. Private profiles show a higher ratio of more than 75%.
Figure 2 (right) plots the cumulative distribution of the country-level locations of users in our datasets. The three curves show that a single country is over-represented, and that a large fraction of users' locations is represented by only a few countries. Independently of the dataset, 40% of users come from a single country, and the top 10 countries represent more than 78% of users. The gentler slope of the curves above 10 countries indicates that the other countries are more widely spread across the remaining profiles. Notably, the number of countries appearing in VolunteerProfiles shows that the distribution does not cover all countries in the world. In particular, our volunteers come from fewer than 35 different countries. Nevertheless, we believe that VolunteerProfiles still fits our purpose because the over-representation shape of the location distribution is kept, as illustrated by the Facebook statistics [4] in general (more than 50% of users come from only 9 countries). Motivated by this over-representation in our datasets, we validate our inference technique in Section 6.2 on users that come from the top 10 countries (following the Facebook statistics).
              Overall marginal distribution (OMD)      Inference accuracy on VolunteerProfiles
              PubProfiles     Facebook statistics      PubProfiles OMD    Facebook statistics OMD
Gender        62% (Female)    51% (Male)               39.3%              60.7%
Relationship  55% (Single)    Unknown                  36.7%              50%
Age           50% (18-25)     26.1% (26-34)            33.9%              57.9%
Country       52% (U.S.)      23% (U.S.)               2.3%               2.3%
Table 2: Baseline inference using different marginal distributions. Inference of VolunteerProfiles based on the Facebook OMD is better than with the PubProfiles OMD.
6. Experimentation Results and Validation
In the following, we validate our interest-based inference technique using both VolunteerProfiles and PubProfiles. We evaluate the correctness of our algorithm in terms of inference accuracy, i.e., the ratio of the number of successful inferences to the total number of trials. An inference is successful if the inferred attribute equals the real value. In particular, for both the PubProfiles and VolunteerProfiles datasets and for each attribute to be inferred, we select users that provide the attribute and then compute the inference accuracy: we hide each user's attribute, compute the nearest neighbors of the user, do a majority voting as described in Section 4.5.1, and then verify whether the inference yields the real attribute.
Before discussing our validation results, we introduce in the following a maximum likelihood-based inference technique that we consider as a baseline with which we compare our method.
6.1. Baseline Inference Technique
Without having access to any friendship and/or community graph, an adversary can rely on the marginal distributions of the attribute values. In particular, the probability of value val of a hidden attribute x in any user's profile u can be estimated as the fraction of users who have this attribute value in dataset U:

P(u.x = val | U) = |{v ∈ U | v.x = val}| / |U|

Then, a simple approach to infer an attribute is to guess its most likely value for all users (i.e., the value val for which P(u.x = val | U) is maximal).
To compute P(u.x = val | U), an adversary can crawl a set of users and then derive the Overall Marginal Distribution (OMD) of an attribute x from the crawled dataset (more precisely, U is the subset of all crawled users who published that attribute). However, this OMD is derived from public attributes (i.e., U contains only publicly revealed attributes), and hence may deviate from the real OMD, which includes both publicly revealed and undisclosed attributes.
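The OMD estimation and the maximum-likelihood guess can be sketched as follows (the `users` list of per-user attribute dictionaries is a hypothetical stand-in for the crawled dataset U):

```python
from collections import Counter

def marginal_distribution(users, attr):
    """Estimate P(u.x = val | U) over the users who publish `attr`."""
    vals = [u[attr] for u in users if u.get(attr) is not None]
    return {v: c / len(vals) for v, c in Counter(vals).items()}

def baseline_guess(users, attr):
    """Maximum-likelihood baseline: always guess the most probable value."""
    omd = marginal_distribution(users, attr)
    return max(omd, key=omd.get)
```

Note that users who hide the attribute contribute nothing to the estimate, which is exactly the availability bias discussed below.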
To illustrate the difference, consider Table 2, which compares the real OMD of the four attributes to be inferred, as provided by Facebook statistics (composed of both private and public attributes [4]), with the OMD derived from our public dataset PubProfiles. The two distributions suggest different predominant values, which highly impacts the inference accuracy when the guess is based on the most likely attribute value. For instance, PubProfiles conveys that the majority of Facebook users are female, which contradicts Facebook statistics (with a significant difference of 11%). Similarly, the age of most users according to PubProfiles is between 18 and 25 years old, while the predominant category of ages, according to Facebook, is 26-34.
In fact, all public datasets (e.g., PubProfiles) are biased towards the availability of attributes (not to be confused with the sampling bias discussed in Section 5.1). Recall that, as shown in Table 1, some attributes (in particular Age, Relationship status and Country) are publicly available for only a small fraction of users (see the PubProfiles column). Put simply, the difference between the two OMDs is mainly due to the mixture of private and public attributes in the Facebook statistics and the absence of private attributes in PubProfiles. Whether revealing attributes is driven by sociological reasons or others is beyond the scope of this paper.
To illustrate how the bias towards attribute availability impacts inference accuracy, we conduct two experiments. First, we infer the attributes in VolunteerProfiles using the OMD derived from PubProfiles. In the second experiment, we infer the same attributes using the OMD computed from Facebook statistics. As shown in Table 2, the second approach always performs better: using the Facebook statistics, we obtain an inference accuracy gain of 21% for the gender and 25% for the age. Since Facebook does not provide statistics about the relationship status of its users, we used random guessing instead (i.e., we randomly chose between single and married for each user). Surprisingly, even random guessing outperforms the maximum likelihood-based approach using the PubProfiles OMD. Therefore, we conclude that the maximum likelihood-based inference performs better when we use the OMD derived from Facebook statistics. Accordingly, we also used this OMD in our baseline inference technique in our performance evaluation.
Finally, note that previous works [27, 18] computed the inference accuracy using private data (i.e., their datasets contain attributes that can only be seen by community members). Hence, these results are obtained with a different attacker model, and the assumption that 50% of all attributes are accessible, as suggested in [27], is unrealistic in our model.
6.2. Experiments
In order to validate our interest-based inference technique, we follow two approaches. First, for each attribute, we randomly sample users from PubProfiles such that the sampled dataset has the same OMD as the real Facebook dataset [4]. Then, we measure the inference accuracy on this sampled dataset. Second, we test our technique on the VolunteerProfiles dataset, where both private and public attributes are known. Since we know the attribute values in the collected profiles, we can check whether the inference is successful or not. In particular, we infer four attributes in both approaches: Gender, Relationship status, Age, and the Country of current location. We run experiments to infer an attribute a in PubProfiles as follows:
1. From all users that provide a in PubProfiles, we randomly sample a set of users (denoted by S) following the OMD of Facebook. The size of S for each attribute is tailored by (i) the Facebook OMD and (ii) the number of available samples in PubProfiles. Table 3 shows the size of S.
2. For this sampled set, we compute the inference accuracy as described in Section 6.2.
3. We repeat Steps 1 and 2 fifteen times and compute the average of all inference accuracy values (Monte Carlo method).
For VolunteerProfiles we proceed as for PubProfiles, but since the attributes are a mix of public and private attributes, there is no need to do sampling, and we skip Step 1.
Attribute Size of S
Gender 1000
Relationship 400
Country 1000
Age 105
Table 3: Size of S
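The evaluation loop can be sketched as follows (a simplification: `predict` stands in for the full nearest-neighbor inference, and the OMD-matching sampling of Step 1 is reduced to uniform sampling):

```python
import random

def monte_carlo_accuracy(users, attr, predict, rounds=15, size=1000):
    """Average inference accuracy over repeated random samples of users
    who reveal `attr`; each sampled user's value is hidden and predicted."""
    pool = [u for u in users if u.get(attr) is not None]
    accs = []
    for _ in range(rounds):
        sample = random.sample(pool, min(size, len(pool)))
        hits = sum(1 for u in sample if predict(u, attr) == u[attr])
        accs.append(hits / len(sample))
    return sum(accs) / len(accs)
```

Averaging over repeated samples is what makes the reported accuracies (and the estimated parameters below) independent from any single draw.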
Parameter Estimation Recall from Section 4.5.1 that our algorithm is based on majority voting. Hence, estimating the number of neighbors that provides the best inference accuracy for each attribute is essential. Figure 3 depicts the inference accuracy as a function of the number of neighbors. This figure clearly shows that each attribute has a specific number of neighbors that results in the best inference accuracy. Note that, as discussed at the beginning of this section, we rely on repeated random sampling to compute the results, and hence, the computed parameters are independent from the input data. Age inference requires two neighbors. This can be explained by the limited number of users that disclose their age, which causes the IFV space to be very sparse: the more neighbors we consider, the more likely it is that these neighbors are far away and have different attribute values. For other attributes, the optimal number of neighbors is between 3 and 5. We also tested different IFV sizes (i.e., k, the number of topics). Notably, the best results were achieved with k = 100. In the sequel, we will use these estimated numbers of neighbors, as well as k = 100, which yield the best inference accuracy.

Attribute      Baseline   Random guess   IFV Inference
Gender         51%        50%            69%
Relationship   50%        50%            71%
Country        41%        10%            60%
Age            26%        16.6%          49%
Table 4: Inference Accuracy of PubProfiles
Figure 3: Correlation between Number of Neighbors and
Inference accuracy
Table 4 provides a summary of the results for PubProfiles. The information leakage can be estimated to about 20% in comparison with the baseline inference. Surprisingly, the amount of leaked information is independent from the inferred attribute, since the gain is about 20% for all of them. These results show that music interest is a good predictor of all attributes.
Male Female
Male 53% 47%
Female 14% 86%
Table 5: Confusion Matrix of Gender
Gender Inference Table 4 shows that the gender can be inferred with high accuracy even if only one music interest is known in PubProfiles. Our algorithm performs 18% better than the baseline. Recall that the baseline guesses male for all users (Table 2). To compare the inference accuracy for both males and females, we computed the confusion matrix in Table 5. Surprisingly, female inference is highly accurate (86%) with a low false negative rate (14%). However, this is not the case for male inference. This behavior can be explained by the number of female profiles in our dataset. In fact, females represent 61.41% of all collected profiles (with a publicly revealed gender attribute) and they subscribed to 421,685 music interests. Males, however, share only 273,714 music interests, which is 35% less than women. Hence, our technique is more capable of predicting females, since the amount of their disclosed (music) interest information is larger compared to males. This also confirms that the amount of disclosed interest information is correlated with inference accuracy.
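The confusion matrices reported in this section are row-normalized counts over (true value, predicted value) pairs; a minimal sketch:

```python
def confusion_matrix(pairs, labels):
    """Row-normalized confusion matrix from (true, predicted) pairs:
    rows are true labels, columns are predicted labels."""
    counts = {t: {p: 0 for p in labels} for t in labels}
    for true, pred in pairs:
        counts[true][pred] += 1
    matrix = {}
    for t in labels:
        total = sum(counts[t].values())
        matrix[t] = {p: counts[t][p] / total if total else 0.0 for p in labels}
    return matrix
```

Each row gives, for one true label, the fraction of users predicted as each label.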
Single Married
Single 78% 22%
Married 36% 64%
Table 6: Confusion Matrix of Relationship
Relationship Inference Inferring the relationship status (married/single) is challenging, since less than 17% of crawled users disclose this attribute, showing that it is highly sensitive. Recall that, as there are no publicly available statistics about the distribution of this attribute, we use random guessing as the baseline (having an accuracy of 50%). Our algorithm performs well, with 71% of correct inferences for all users in PubProfiles. As previously, we investigate how well music interests predict both single and married users by computing the confusion matrix (Table 6). We notice that single users are more distinguishable, based on their IFV, than married ones. The explanation is that single users share more interests than married ones: a married user has only 5.79 interests on average. As in the case of gender, this confirms that the amount of disclosed interest information is correlated with inference accuracy.
Country of Location Inference As described in Section 5, we are interested in inferring the users' location within the top 10 countries in Facebook. Our approach can easily be extended to all countries; however, as shown by Figure 2, more than 80% of users in PubProfiles belong to 10 countries, and these countries represent more than 55% of all Facebook users according to [4]. As the number of users belonging to the top 10 countries is very limited in the VolunteerProfiles dataset, we perform this evaluation on PubProfiles. Table 4 shows that our algorithm has an accuracy of 60% on PubProfiles, a 19% increase compared to the baseline (recall that, following Table 2, the baseline gives U.S. as a guess for all users). Figure 4 draws the confusion matrix and gives more insight about the inference accuracy. In fact, countries with a specific (regional) music have better accuracy than others. In particular, the U.S. has more than 94% of correct inference, the Philippines 80%, India 62%, Indonesia 58% and Greece 42%. This highlights the essence of our algorithm, where semantically correlated music interests are grouped together and hence allow us to extract users interested in the same topics (e.g., Philippine music). Without extracting the semantics of each artist or band, this is not possible. As for Gender and Relationship, the number of collected profiles can also explain the incapacity of the system to correctly infer certain countries such as Italy, Mexico or France. In particular, as shown in Table 7, the number of users belonging to these countries is very small. Hence, their interests may be insufficient to compute a representative IFV, which yields poor accuracy.
13-17 18-24 25-34 35+
13-17 58.33% 30% 11.6% 0%
18-24 17% 67% 3.4% 1.3%
25-34 15.38% 46.15% 38.4% 0%
35+ 0% 100% 0% 0%
Table 8: Confusion Matrix of Age Inference
Age Inference Finally, we are interested in inferring the age of users. To do so, we created age categories, which are depicted in Table 8. Recall that the baseline technique always predicts the category of 26 to 34 years for all users. Table 4 shows that our algorithm performs 23% better than the baseline attack. Note that our technique gives good results despite the fact that only 3% of all users provide their age (3133 users in total) in PubProfiles. We investigate how music interests are correlated with the age bin by computing the confusion matrix in Table 8. We find that, as expected, most errors come from falsely putting users into neighboring bins. For instance, our method puts 30% of 13-17 year-old users into the bin of 18-24 years. However, note that fewer bins (such as teenager, adult and senior) would yield better accuracy, and would be sufficient for many applications (e.g., targeted ads). Observe that we have an error of 100% for the last bin. This is due to the small number of users (3 in PubProfiles) who belong to this bin (we cannot extract useful information and build a representative IFV for such a small number of users).
We removed Brazil from Figure 4 since both of its entries (2) were wrongly inferred; this is caused by the small number of Brazilians in our dataset. We originally created six age categories, but since PubProfiles has only a few users in the last 3 bins, we merged them together; for VolunteerProfiles we keep six bins.
Figure 4: Confusion Matrix of Country Inference
Country   % of users
US        71.9%
PH        7.80%
IN        6.21%
ID        5.08%
GB        3.62%
GR        2.32%
FR        2.12%
MX        0.41%
IT        0.40%
BR        0.01%
Table 7: Top 10 countries distribution
6.2.1 VolunteerProfiles Inference
Attribute Baseline Random guess IFV Inference
Gender 51% 50% 72.5%
Relationship 50% 50% 70.5%
Age 26% 16.6% 42%
Table 9: Inference Accuracy for VolunteerProfiles
As a second step to validate our IFV technique, we perform inference on VolunteerProfiles. Table 9 shows that our algorithm also performs well on this dataset. Notice that Age inference is slightly worse than in PubProfiles. Recall from Section 6.2 that we had only a few users in the last three bins in PubProfiles, and hence, we merged these bins. However, in VolunteerProfiles, we have enough users and can use 6 different age categories. This explains the small difference in inference accuracy between VolunteerProfiles and PubProfiles. Regarding other attributes, the accuracy is slightly worse for Relationship (-0.5%) and better for Gender (+3.5%). This small variation in inference accuracy between PubProfiles and VolunteerProfiles demonstrates that our technique also performs well on users with private attributes: in PubProfiles, we could compute the inference accuracy only on users who published their attribute values, while VolunteerProfiles also contains users hiding their attributes.
7. Discussion
Topic modeling We used LDA for semantic extraction. Another alternative is Latent Semantic Analysis (LSA) [17]. As opposed to LDA, LSA is not a generative model: it consists in extracting a spatial representation for words from a multi-document corpus by applying singular value decomposition. However, as pointed out in [12], spatial representations are inadequate for capturing the structure of semantic association; LSA assumes symmetric similarity between words, which is not the case for the vast majority of associations. One classical example given in [12] involves China and North Korea: Griffiths et al. noticed that, generally speaking, people have the intuition that North Korea is more similar to China than China is to North Korea. This problem is resolved in LDA, where P(occurrence of word1 | occurrence of word2) ≠ P(occurrence of word2 | occurrence of word1). In addition, [12] showed that LDA outperforms LSA in terms of drawing semantic correlations between words.
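The asymmetry can be illustrated with simple document co-occurrence estimates (a toy sketch; the corpus below is hypothetical):

```python
def cond_prob(docs, w1, w2):
    """Estimate P(w1 occurs in a document | w2 occurs in it).
    Unlike LSA's symmetric distances, this measure is directional."""
    with_w2 = [d for d in docs if w2 in d]
    if not with_w2:
        return 0.0
    return sum(1 for d in with_w2 if w1 in d) / len(with_w2)
```

For instance, if "north korea" only ever co-occurs with "china" while "china" also appears alone, then P(china | north korea) is high while P(north korea | china) is low.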
Collaborative Filtering Our algorithm is based on discovering latent correlations between user interests in order to cluster users. An alternative approach could be to employ model-based collaborative filtering (MBCF), which avoids using semantic knowledge. In MBCF, each user is represented by his interest vector. The size of this vector equals the number of all defined interest names, and its coordinates are defined as follows: a coordinate is 1 if the user has the corresponding interest name, otherwise it is 0. Since interest names are user-generated, the universe of all such names, and hence the vector size, can be huge. This negatively impacts the performance.
In particular, collaborative filtering suffers from a “cold-start” effect [24], which means that the system cannot draw correct inferences for users who have insufficient information (i.e., a small number of interests). Recall from Section 5 that 70% of users in PubProfiles have less than 5 interests, and it is more than 85% in RawProfiles. Hence, the sparseness of users' interest vectors is very high (the average density is 0.000025). Moreover, [8] has studied the effect of cold-start in recommendation systems (both item-based and collaborative-based) on real datasets gathered from two IP-TV providers. Their results show that such systems perform poorly when the density is low (about 0.0005), with a recall between 5% and 10%. Additionally, the number of new users and interests is ever growing (on average, 20 million new users joined Facebook each month in the first half of 2011 [4]). This tremendous number of new users and interests keeps the system in a constant cold-start state. In addition, in MBCF users typically evaluate items using a multi-valued metric (e.g., an item is ranked between 1 and 5), which must be at least binary (e.g., like/dislike), whereas in our case, only “likes” (interests) are provided. In fact, the lack of an interest I in a user profile does not mean that the user is not interested in I; he may simply not have discovered I yet. In such scenarios, when users only declare their interests but not their disinterests, MBCF techniques (e.g., SVD [20]) are less accurate than the nearest neighbor-like approach that we employed [16].
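For illustration, the binary interest vectors used by MBCF and their density can be sketched as follows (the `profiles` structure is hypothetical):

```python
def interest_matrix(profiles):
    """Build the binary user-by-interest matrix used by MBCF: one column
    per distinct interest name, 1 iff the user declared that interest."""
    names = sorted({i for p in profiles for i in p["interests"]})
    col = {n: j for j, n in enumerate(names)}
    rows = []
    for p in profiles:
        row = [0] * len(names)
        for i in p["interests"]:
            row[col[i]] = 1
        rows.append(row)
    return names, rows

def density(rows):
    """Fraction of coordinates equal to one in the matrix."""
    return sum(map(sum, rows)) / (len(rows) * len(rows[0]))
```

With user-generated interest names the number of columns grows into the millions, which is why the measured density is as low as 0.000025.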
OSN independence One interesting feature of our tech-
nique is that it is OSN independent. In particular, it does
not rely on any social graph, and the input data (i.e. interest
names) can be collected from any other source of informa-
tion (e.g., deezer, lastfm, or any other potential sources).
No need for frequent model updates (stability) One
may argue that our LDA model needs frequent updates since
user interests are ever-growing. Nevertheless, recall from
Section 4.2 that our technique uses the parent topics of the
user interests (according to Wikipedia) to augment the se-
mantics knowledge of each interest. There are substantially
fewer higher-level parent categories than leaf categories in
the Wikipedia hierarchy, and they change less frequently.
Thus, there is no need to update the LDA model, unless the
considered interest introduces a new parent category in the
running model. Hence, our approach is more stable than
MBCF; once the IFV vector is extracted and similarity is
computed, we can readily make inference without having to
retrain the system.
The density of this vector is the number of coordinates equal to one divided by the vector size.
Targeted advertising and spam Using our technique, advertisers could automatically build online profiles with high accuracy and minimum effort, without the consent of
users. Spammers could gather information across the web
to send targeted spam. For example, by matching a users
Facebook profile and his email address, the spammer could
send him a message containing ads that are tailored to his
inferred geo-localization, age, or marital status.
Addressing possible limitations First, we only tested our approach on profiles that provide music interests. Even with this limitation, our results show the effectiveness of our technique in inferring undisclosed attributes. In addition, we only based our method on user interests and did not combine it with any other available attributes (e.g., gender or relationship) to improve inference accuracy. We must emphasize that our main goal was to show information leakage through user interests rather than to develop a highly accurate inference algorithm. Considering other interests (e.g., movies, books, etc.) and/or combining them with different available attributes is a potential extension of our scheme, which is left for future work. Second, we demonstrated our approach using the English version of Wikipedia. However, our approach is not restricted to English, since Wikipedia is also available in other languages. Finally, we encountered a few examples that denote a non-interest (or “dislike”). In particular, we observed interests that semantically express a dislike for a group or an ideology. For instance, an interest can be created with the title “I hate Michael Jackson”. Our semantics-driven classification will falsely identify the users having this interest as “Michael Jackson” fans. However, as only a minority of users are expected to use such a strategy to express their non-interest, this has a small impact on our approach. Additionally, one might integrate applied linguistics techniques to handle such peculiar cases and filter out dislikes.
8. Conclusion
This paper presents a semantics-driven inference technique to predict private user attributes. Using only music interests, which are often disclosed by users, we extracted unobservable interest topics by analyzing the corpus of interests, semantically augmented using Wikipedia, and derived a probabilistic model to compute the belonging of users to each of these topics. We estimated similarities between users and showed how our model can be used to predict hidden information. Therefore, on-line services, and in particular OSNs, should raise the bar of privacy protections by setting a restrictive by-default behavior, and explicitly hide most user information.
[1] Qt port of WebKit: an open source web browser engine.
[2] Robots Exclusion Protocol.
[3] OpenCyc. 2006.
[4] Facebook Statistics. http://gold.insidenetwork.
[5] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou r3579x?: Anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International Conference on World Wide Web, WWW '07, pages 181–190, New York, NY, USA, 2007. ACM.
[6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[7] W.-Y. Chen, J.-C. Chu, J. Luan, H. Bai, Y. Wang, and E. Y. Chang. Collaborative filtering for Orkut communities: discovery of user latent behavior. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 681–690, New York, NY, USA, 2009. ACM.
[8] P. Cremonesi and R. Turrin. Analysis of cold-start recommendations in IPTV systems. In RecSys '09: Proceedings of the Third ACM Conference on Recommender Systems, pages 233–236, New York, NY, USA, 2009. ACM.
[9] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[10] C. Fellbaum, editor. WordNet: An Electronic Lexical Database.
[11] L. Getoor and C. P. Diehl. Link mining: a survey. SIGKDD Explor. Newsl., 7, December 2005.
[12] T. L. Griffiths, J. B. Tenenbaum, and M. Steyvers. Topics in semantic representation. Psychological Review, 114, 2007.
[13] J. He, W. W. Chu, and Z. (Victor) Liu. Inferring privacy information from social networks. In IEEE International Conference on Intelligence and Security Informatics, 2006.
[14] T. Hofmann. Probabilistic Latent Semantic Analysis. In Proceedings of Uncertainty in Artificial Intelligence, UAI, 1999.
[15] M. Kurant, M. Gjoka, C. T. Butts, and A. Markopoulou. Walking on a Graph with a Magnifying Glass. In Proceedings of ACM SIGMETRICS '11, San Jose, CA, June 2011.
[16] S. Lai, L. Xiang, R. Diao, Y. Liu, H. Gu, L. Xu, H. Li, D. Wang, K. Liu, J. Zhao, and C. Pan. Hybrid recommendation models for binary user preference prediction problem. In KDD Cup, 2011.
[17] T. K. Landauer and S. T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 1997.
[18] J. Lindamood and M. Kantarcioglu. Inferring Private Information Using Social Network Data. Technical report, University of Texas at Dallas, 2008.
[19] Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun. PLDA+: Parallel latent dirichlet allocation with data placement and pipeline processing. ACM Transactions on Intelligent Systems and Technology, special issue on Large Scale Machine Learning, 2011.
[20] D. Billsus and M. Pazzani. Learning collaborative information filters.
[21] D. Milne. An open-source toolkit for mining Wikipedia. In Proc. New Zealand Computer Science Research Student Conference.
[22] A. Mislove, B. Viswanath, K. P. Gummadi, and P. Druschel. You are who you know: Inferring user profiles in online social networks.
[23] J. Puzicha, T. Hofmann, and J. Buhmann. Non-parametric
similarity measures for unsupervised texture segmentation
and image retrieval. In Computer Vision and Pattern Recog-
nition, 1997. Proceedings., 1997 IEEE Computer Society
Conference on,pages267–272,jun1997.
[24] A. I. Schein, A. P opescul, L. H., R. Popescul, L. H. Ungar,
and D. M. Pennock. Methods and metrics for cold-start rec-
ommendations. In In Proceedings of the 25th Annual Interna-
tional ACM SIGIR Conference on Research and Development
in Information Retrieval,pages253260.ACMPress,2002.
[25] A. L. Traud, P. J. Mucha, and M. A. Porter. Social Structure
of Facebook Networks. 2011.
[26] G. Wondracek, T. Holz, E . Kirda, and C. Kruegel. A prac-
tical attack to de-anonymize social network users. In 31st
IEEE Symposium on Security and Privacy, Oakland, Califor-
nia, USA,2010.
[27] E. Zheleva and L. Getoor. To join or not to join: the illusion
of privac y in social networks with mixed public and private
user profiles. In In WWW,2009.
[28] E. Zheleva, J. Guiver, E. M. Rodrigues, and N. Milic-
Frayling. Statistical models of music-listening sessions in
social media. In In WWW,2010.