Identifying Users Across Social Tagging Systems
Tereza Iofciu¹, Peter Fankhauser¹, Fabian Abel², Kerstin Bischoff¹
¹ L3S Research Center, Leibniz University Hannover
{iofciu,fankhauser,bischoff}@L3S.de
² Web Information Systems, TU Delft
f.abel@tudelft.nl
Abstract
How much do tagging activities tell about a user? Is it
possible to identify people in Delicious based on the tags
they use in Flickr? In this paper we study these questions and
investigate whether users can be identified across social
tagging systems. We combine two kinds of information: their
user ids and their tags. We introduce and compare a variety
of approaches to measure the distance between user profiles
for identification. With the best performing combination we
achieve, depending on the actual settings, accuracies
between 60% and 80%, which demonstrates that the traces of
Web 2.0 users can reveal a great deal about their identity.
Introduction
Today, people have online accounts on diverse Web portals
where they leave plenty of multifaceted profile data. Users
share their pictures, videos, or bookmarks at platforms such
as Flickr, YouTube, and Delicious and annotate these re-
sources using tags to facilitate retrieval of the resources, to
express their opinion regarding some resource or merely to
present themselves (cf. Marlow et al. 2006).
Aggregating profiles from different systems reveals more
information about users and is beneficial for personalization
and cross-domain recommendations – particularly for solv-
ing cold-start problems where systems suffer from sparse
user profiles (Abel et al. 2010). However, for privacy rea-
sons, people may not want their different online accounts to
be connectable. Indeed, the interlinkage of profile
information may be risky. Recently, PleaseRobMe¹ set an
intimidating example and attracted public attention as they
exploited foursquare² to detect the current location of Twitter
users and identify, given the linkage to the address of these
users, houses and apartments that were easy to burgle as the
inhabitants were traveling at the time.
(Un)fortunately, automatically connecting the different
Social Web identities of the users is difficult because they
might (possibly on purpose) use varying usernames or have
unequal profiles (e.g. fields such as homepage, birthday,
etc.) on the different systems. Yet, the feasibility of
exploiting individual tagging practices to identify a user and
link her Social Web accounts has not been studied in detail.
¹ http://pleaserobme.com
² http://foursquare.com
Copyright © 2011, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
In this paper, we close this gap and study the following
research question: is it possible to identify users across sys-
tems based on their (tag-based) profiles? We analyze profiles
of users from three collaborative tagging systems: Flickr,
Delicious and StumbleUpon. While the latter two systems
are for organizing public Web resources, Flickr is mainly for
sharing personal pictures with friends and people rarely tag
other people’s photos.
Research Challenge User profiles can be constructed
based on implicit and explicit user feedback. With explicit
feedback, we refer to the data the user herself provides to the
system directly, e.g. during the registration process. Usually
such explicit data is structured as attribute-value pairs. In
our approaches we experiment with the usernames as ex-
plicit profile information. With implicit feedback, we refer
to the users’ tagging activities within the folksonomy sys-
tems, i.e. the set of tag assignments performed by the user.
User Identification Challenge. Given $u_a$, the tag-based
profile and/or username of user $X$ in system $A$, and $U_B$,
the set of profiles from system $B$, the challenge of the user
identification strategies is to rank the profiles from system
$B$ so that $u_b \in U_B$, the profile of $X$ in system $B$,
appears at the very top of the ranking.
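The ranking task stated above can be sketched in a few lines. This is only an illustration under stated assumptions: the function `rank_profiles`, the toy scoring function `shared_tags` and the profile ids `b1` to `b3` are hypothetical and not part of the paper; any of the matching scores discussed later could be plugged in as `score`.

```python
from typing import Callable, Dict, List, Set, Tuple


def rank_profiles(
    query: Set[str],
    candidates: Dict[str, Set[str]],
    score: Callable[[Set[str], Set[str]], float],
) -> List[Tuple[str, float]]:
    """Rank the candidate profiles of system B by descending match score."""
    scored = [(cid, score(query, profile)) for cid, profile in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def shared_tags(a: Set[str], b: Set[str]) -> float:
    """Toy score (an assumption for illustration): number of shared tags."""
    return float(len(a & b))


# u_a: profile of user X in system A; U_B: all profiles of system B.
u_a = {"nyc", "travel", "photo"}
U_B = {
    "b1": {"web", "tools"},
    "b2": {"nyc", "photo", "art"},
    "b3": {"travel"},
}
ranking = rank_profiles(u_a, U_B, shared_tags)
```

Identification succeeds for this user if the profile that really belongs to X is the first entry of `ranking`.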
Contributions Our main contributions can be summarized as
follows:
• We propose different strategies that allow for the
  identification of users across systems.
• For tag-based profile mapping, we introduce a symmetric
  variant of BM25 using site specific statistics and compare
  it against measures like TF, TFIDF and conventional BM25.
  The results show that it is important to account for the
  specifics of a site.
• We evaluate the different matching approaches in experiments
  with public profiles from three different social tagging
  networks, Flickr, Delicious and StumbleUpon. We show that
  combining implicit and explicit profiles reaches an accuracy
  of over 60%.
• Furthermore, we show that by aggregating the users’ profiles
  from different sources, we can identify users with an
  accuracy of almost 80%.
Related Work
The issue of identifying users via their interaction over the
web has been recently addressed in various application sce-
narios, such as personalization. Recent research mainly an-
alyzed whether explicit profile information is sufficient to
identify users across system boundaries. For example,
Carmagnola and Cena (2009) introduce an approach that bases
heuristics on profile attributes such as username, name,
location or email address of a user. Vosecky, Hong, and Shen
(2009) examine explicit user profile information from two
similar social networking services to find which profile
fields are most suitable for cross-system user identification.
Zafarani and Liu (2009) connect user accounts across
12 communities by exploiting explicit profile information.
Szomszor, Cantador, and Alani (2008) focus on implicit
tagging information. However, they do not aim to identify
users across tagging platforms, but their goal is to align tag-
based profiles users have in Flickr and Delicious and assume
that linkage between different accounts of the same user is
given. Nevertheless, the authors propose an approach for
correlating tag clouds, which are filtered (to eliminate
misspellings) and semantically enriched (e.g., via WordNet
synonyms). We also implemented their approach, but for
our task it led to a significant decrease in performance.
More generally, the problem of identifying users can be
regarded as an instance of duplicate detection – also known
as record linkage or entity resolution – a long standing
problem in computer science. For a comprehensive overview of
the topic see Elmagarmid, Ipeirotis, and Verykios (2007). In
a sense the experiments described in this paper return to the
very root application domain of duplicate detection – iden-
tifying individuals – though under quite different circum-
stances. By tagging and other forms of interactions, Web 2.0
users provide a rich but fairly noisy trace, which as we will
show can be readily exploited for identifying them.
Matching Users across Sites
Matching Users based on their Tags
For identifying users across social systems based on their
tagging behavior, we experiment with standard techniques
like TF, TFIDF and BM25 and compare them against a new
symmetric variant of BM25 using site specific statistics.
Baselines One of the most straightforward approaches to
match tag user profiles exploits tag frequencies. We evaluate
this approach as one baseline (TF). However, this approach
does not take into account the specificity of tags. Tags used
by many users such as “web” contribute much less evidence
for a match than more specific tags such as “NYC”. To take
into account tag specificity, frequency is typically combined
with the inverse document frequency of a tag. Since together
with the vector space model it is a standard method in Infor-
mation Retrieval, we evaluate this approach as another base-
line (TFIDF).
Each user profile $u$ is modeled as a vector, where each
dimension contains the TF or TFIDF value of a tag $t \in T$.
The matching score of two profiles $u_1$ and $u_2$ is then
determined by the cosine similarity between their weighted
vectors.
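The two baselines can be illustrated with a small sketch. The IDF form `log(N / df)` and the toy profiles are assumptions made for this example only; the text does not state which IDF variant was used. The example shows the effect described above: under plain TF the frequent tag "web" dominates the score, while TFIDF lets the more specific tag "nyc" decide.

```python
import math
from collections import Counter
from typing import Dict


def tfidf_vector(tags: Counter, doc_freq: Dict[str, int], n_profiles: int) -> Dict[str, float]:
    """Weight each tag count by log(N / df). The IDF form is an assumption."""
    return {t: c * math.log(n_profiles / doc_freq[t]) for t, c in tags.items()}


def cosine(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v2[t] for t, w in v1.items() if t in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


# Toy profiles: tag -> number of assignments.
u1 = Counter({"web": 5, "nyc": 2})
u2 = Counter({"web": 1, "nyc": 3})
u3 = Counter({"web": 4, "tools": 2})
profiles = [u1, u2, u3]

# Document frequency: number of profiles that contain each tag.
doc_freq = Counter(tag for p in profiles for tag in p)
n = len(profiles)

tf_u1_u2 = cosine(u1, u2)
tf_u1_u3 = cosine(u1, u3)
tfidf_u1_u2 = cosine(tfidf_vector(u1, doc_freq, n), tfidf_vector(u2, doc_freq, n))
tfidf_u1_u3 = cosine(tfidf_vector(u1, doc_freq, n), tfidf_vector(u3, doc_freq, n))
```

With TF, `u3` looks closer to `u1` than `u2` does, purely because both use "web" heavily; with TFIDF the ubiquitous tag gets zero weight and `u2` becomes the better match.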
BM25 A well known weakness of the TFIDF weighting
scheme is that raw term frequency is not a very good
indicator of the relevance of a term. A document that contains
a term 20 times is not 20 times as relevant as one that
contains it just once. Similarly, a tag that a user assigns
20 times is not 20 times as relevant as a tag assigned just
once. Okapi BM25 (Spärck Jones, Walker, and Robertson 2000)
addresses this weakness by tempering term frequency such
that it quickly saturates at a maximum value.
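The saturation effect can be made concrete with the standard BM25 term-frequency component. The sketch below assumes a profile of exactly average length, so that the length normalization term drops out, and uses k1 = 3.75, the value reported in the evaluation; the function name `saturating_tf` is hypothetical.

```python
def saturating_tf(count: float, k1: float = 3.75) -> float:
    """BM25 term-frequency component for a profile of average length.

    Assumption: |u| equals the average profile length, so the
    length normalization factor is 1 and only k1 remains.
    The value approaches k1 + 1 as count grows.
    """
    return count * (k1 + 1.0) / (count + k1)


once = saturating_tf(1)        # 1.0
twenty = saturating_tf(20)     # 4.0
limit = 3.75 + 1.0             # upper bound of the component
```

A tag assigned 20 times therefore counts 4 times as much as a tag assigned once, not 20 times, and no tag can count more than 4.75 times as much.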
Site specific IDF and BM25 Tagging behavior is highly
influenced by a site’s domain and design choices. People tag
music items, images or Web resources differently (Bischoff
et al. 2008). For example, in our experimental dataset “tools”
is used for more than half of the resources in Delicious, but
only a few times in Flickr; conversely, “arts” is used very
often in Flickr, and rarely in Delicious. Hence, “tools” is a
very discriminative tag when matching against Flickr, while
“arts” discriminates well against profiles in Delicious. As a
consequence, the document frequency of particular tags may
differ substantially among the sites. Profile lengths may
differ just as much, for the same reason: they depend on
system design choices such as tagging rights, object type and
ownership (Marlow et al. 2006). Thus, we suggest using BM25
together with a site specific IDF and site specific average
profile lengths:
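$$w(u, t, s) = \mathrm{TF}(u, t, s) \cdot \sqrt{\mathrm{IDF}(t, s)}$$

$$\mathrm{TF}(u, t, s) = \frac{c(t, u)\,(k_1 + 1)}{c(t, u) + k_1 \left(1 - b + b\,\frac{|u|}{\mathrm{avgU}(s)}\right)}$$

$$\mathrm{IDF}(t, s) = \max\left(0,\; \log \frac{N(s) - n(t, s) + c}{n(t, s) + (1 - c)}\right) \qquad (1)$$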
Thereby, TF takes into account the site specific average
profile length $\mathrm{avgU}(s)$ and IDF takes into account
the site specific document frequency $n(t, s)$ of a tag. As
shown below, this approach leads to significantly improved
matching.
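The following sketch shows one way the site specific weighting might be computed. Several points are assumptions rather than statements from the paper: the weight is taken as TF times the square root of IDF, the profile length |u| is taken as the total number of tag assignments, N(s) as the number of profiles on site s, and the matching score as the sum over shared tags of the product of the two site specific weights. The class `SiteStats` and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


class SiteStats:
    """Per-site statistics: N(s), avgU(s) and n(t, s)."""

    def __init__(self, profiles: Iterable[Counter]):
        profiles = list(profiles)
        self.n_profiles = len(profiles)  # N(s), assumed: number of profiles
        # Assumption: |u| is the total number of tag assignments of a user.
        self.avg_len = sum(sum(p.values()) for p in profiles) / self.n_profiles
        # n(t, s): number of profiles on the site that contain tag t.
        self.doc_freq = Counter(tag for p in profiles for tag in p)


def idf(tag: str, site: SiteStats, c: float = 1.0) -> float:
    n = site.doc_freq[tag]
    numerator = site.n_profiles - n + c
    denominator = n + (1.0 - c)
    if numerator <= 0 or denominator <= 0:  # guard against log(0) and x/0
        return 0.0
    return max(0.0, math.log(numerator / denominator))


def weight(profile: Counter, tag: str, site: SiteStats,
           k1: float = 3.75, b: float = 1.0, c: float = 1.0) -> float:
    count = profile[tag]
    if count == 0:
        return 0.0
    length = sum(profile.values())
    tf = count * (k1 + 1) / (count + k1 * (1 - b + b * length / site.avg_len))
    return tf * math.sqrt(idf(tag, site, c))


def match_score(p1: Counter, site1: SiteStats, p2: Counter, site2: SiteStats) -> float:
    """Assumed scoring: sum over shared tags of the two site specific weights."""
    shared = p1.keys() & p2.keys()
    return sum(weight(p1, t, site1) * weight(p2, t, site2) for t in shared)


flickr = [Counter({"arts": 5, "nyc": 2}), Counter({"arts": 3, "beach": 1}),
          Counter({"arts": 4, "tools": 1})]
delicious = [Counter({"tools": 6, "nyc": 1}), Counter({"tools": 4, "python": 2}),
             Counter({"tools": 2, "arts": 1})]
f_stats, d_stats = SiteStats(flickr), SiteStats(delicious)

# Score the first Flickr profile against every Delicious profile.
scores = [match_score(flickr[0], f_stats, d, d_stats) for d in delicious]
```

In this toy data "arts" occurs in every Flickr profile, so its Flickr IDF is clipped to 0 and it contributes nothing, while the rare tag "nyc" decides the match. Parameter defaults follow the values reported in the evaluation (k1 = 3.75, b = 1, c = 1).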
Matching Users based on Explicit Usernames
Often, the username is the only explicit and publicly avail-
able user attribute common to various tagging systems. A
straightforward approach for identifying users across sys-
tems is to analyze their usernames (Zafarani and Liu 2009).
We apply the following string similarity metrics: exact
match, Jaccard similarity at character level, Levenshtein
similarity (based on the minimum number of editing operations
required to transform one string into another), Smith-Waterman
similarity (the cost of aligning two strings by comparing
segments of all possible lengths between them) and Longest
Common Subsequence (LCS), a variation of Levenshtein distance
that allows only insertion and deletion.
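Three of the listed username metrics are easy to sketch with the standard library alone. The normalizations to the range 0 to 1 used here are assumptions, since the text does not specify them, and Smith-Waterman is left out for brevity.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization of the insert/delete-only edit distance,
    which equals len(a) + len(b) - 2 * LCS."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2.0 * lcs_length(a, b) / total


def jaccard_chars(a: str, b: str) -> float:
    """Jaccard similarity of the two character sets."""
    sa, sb = set(a), set(b)
    return 1.0 if not (sa or sb) else len(sa & sb) / len(sa | sb)
```

For example, `levenshtein_similarity("john.doe", "johndoe")` is 0.875 under this normalization, because a single deletion separates the two names and the longer one has eight characters.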
Matching Users based on Combined Profiles
Now we present approaches to merge different sources of
user information, first by combining implicit and explicit
profile information and, second, by aggregating profiles
from two systems to map against a third system.
Combining Username and Tags In order to combine the
different types of profiles, tag- and username-based, we use
a mixture model:
$$w(u_1, u_2) = \lambda\, w_t(u_1, u_2) + (1 - \lambda)\, w_u(u_1, u_2) \qquad (2)$$
where $w_t(u_1, u_2)$ is the normalized score obtained from
the tags the users assigned, as presented above, and
$w_u(u_1, u_2)$ is the string similarity of the usernames of
the two users. As the BM25 scores are not normalized between
0 and 1, we scale them to the same range as the username
similarity scores by dividing them by the maximum score of
all compared user tag profiles between two systems. Thereby
the choice of $\lambda$ indeed reflects the relative
importance of the two scores used for matching.
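A minimal sketch of the mixture, assuming the tag scores for all compared profile pairs are available in a dictionary keyed by pair. The value 0.5 for lambda and all scores below are made up for illustration; the text does not report the lambda that was used.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def normalize_by_max(scores: Dict[Pair, float]) -> Dict[Pair, float]:
    """Scale scores to [0, 1] by dividing by the maximum over all pairs."""
    top = max(scores.values(), default=0.0)
    if top <= 0.0:
        return {pair: 0.0 for pair in scores}
    return {pair: s / top for pair, s in scores.items()}


def mixture(tag_score: float, name_score: float, lam: float = 0.5) -> float:
    """Weighted combination of the tag score and the username score."""
    return lam * tag_score + (1.0 - lam) * name_score


# Hypothetical raw BM25 scores and username similarities for two pairs.
raw_tag_scores = {("a1", "b1"): 12.0, ("a1", "b2"): 3.0}
name_scores = {("a1", "b1"): 0.4, ("a1", "b2"): 0.9}

tag_scores = normalize_by_max(raw_tag_scores)
combined = {pair: mixture(tag_scores[pair], name_scores[pair]) for pair in tag_scores}
```

Without the normalization step the unbounded BM25 scores would swamp the username similarity regardless of lambda, which is why the scaling matters for interpreting lambda as a relative importance.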
Aggregated Profiles Our evaluation will show that match-
ing accuracy via tags depends heavily on the number of tags
given by a user. Hence, we create an aggregated tag-based
user profile by considering all the tags a user has assigned in
two systems, on which the user is already identified (e.g. via
explicit links on her profile page). To this end, we accumulate
the tag frequencies of the corresponding profiles and then
apply the same comparison approaches as presented above to
match the aggregated profile against a third system. For
creating an aggregated username-based profile from two
systems, we consider both usernames as matching candidates.
When calculating the distance between the aggregated username
profile and a third profile, we select the best matching
username pair.
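Profile aggregation as described above might look as follows. The usernames are invented, and `difflib.SequenceMatcher` from the Python standard library serves only as a stand-in string similarity, not as one of the metrics evaluated in the paper.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Callable, Iterable


def aggregate_tags(profile_a: Counter, profile_b: Counter) -> Counter:
    """Accumulate the tag frequencies of two profiles of the same user."""
    return profile_a + profile_b


def aggregated_username_score(
    known_names: Iterable[str],
    candidate: str,
    similarity: Callable[[str, str], float],
) -> float:
    """Score of the best matching (known username, candidate) pair."""
    return max(similarity(name, candidate) for name in known_names)


def ratio(a: str, b: str) -> float:
    """Stand-in similarity (an assumption), normalized to [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


flickr_tags = Counter({"nyc": 2, "arts": 5})
delicious_tags = Counter({"nyc": 1, "tools": 6})
merged = aggregate_tags(flickr_tags, delicious_tags)

# Hypothetical usernames of one person on two systems and a third-system candidate.
best = aggregated_username_score(["tereza_i", "tiofciu"], "tiofciu", ratio)
```

The merged tag profile can then be scored against a third system with any of the tag-based measures, exactly like a single-site profile.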
Evaluation
Method and Metrics
For each user profile, the user identification algorithms have
to find the corresponding profile that refers to the same user
in another system. Each algorithm is tested in different
settings, which are given by the different service
constellations. For example, given the Flickr profile of user
u, the algorithm has to rank Delicious profiles so that u’s
Delicious profile appears at the very top of the ranking.
Dataset To investigate the questions above, we crawled
public profiles of 421,188 distinct users via the Social Graph
API³, which makes information about connections between
different accounts of a user available. However, only
a few users linked the profiles they have at social tagging
platforms. Among these users, 1467 people had a Flickr and
Delicious profile (FD dataset) and only 321 users had a tag-
based profile at all three systems, i.e. Flickr and Delicious
and StumbleUpon (FDS dataset).
A remarkable feature of the dataset is that only a few tags
occur in more than one service: less than 20% of the distinct
tags were used in more than one system. For each user and
each pair of services we compute the overlap as the number
of distinct tags that occur in both profiles. The Delicious and
StumbleUpon profiles have the biggest overlap. However,
the overlap is rather small: for more than 50% of the users
the overlap of their Delicious and StumbleUpon profiles is
less than 20% and there exist only 6 users for whom the
overlap is slightly larger than 50%. It is interesting that the
overlap is so small, as the same type of resources is tagged
in Delicious and StumbleUpon; probably the tools are
used for separate tasks. Flickr and StumbleUpon profiles
offer the least overlap: for more than 40% of the users the
overlap is 0%. The small overlaps of individual profiles
indicate that user identification based on tagging is not
trivial.
³ http://code.google.com/apis/socialgraph/
Metrics To measure the quality of the user profile rankings
we use MRR and S@k. MRR (Mean Reciprocal Rank) is the mean,
over all users, of the reciprocal of the rank at which the
correct profile occurs. The Success at rank k (S@k) stands for
the mean probability that the correct profile occurs within
the top k of the ranked results. In case of tied scores
between the correct user profile pair and some other pairs, we
penalized both metrics by dividing them by the number of tied
scores.
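Both metrics, including the tie penalty, can be sketched as below. How ties are counted is an interpretation: here the correct profile receives the best rank within its group of equal scores, and its contribution is divided by the size of that group (including itself). All function names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_and_ties(scores: Dict[str, float], correct: str) -> Tuple[int, int]:
    """Rank of the correct profile and the number of profiles sharing its score."""
    target = scores[correct]
    rank = 1 + sum(1 for s in scores.values() if s > target)
    ties = sum(1 for s in scores.values() if s == target)  # includes the correct one
    return rank, ties


def mrr(outcomes: List[Tuple[int, int]]) -> float:
    """Mean reciprocal rank with each term divided by its tie count."""
    return sum((1.0 / rank) / ties for rank, ties in outcomes) / len(outcomes)


def success_at_k(outcomes: List[Tuple[int, int]], k: int) -> float:
    """Share of users whose correct profile is in the top k, tie-penalized."""
    hits = ((1.0 if rank <= k else 0.0) / ties for rank, ties in outcomes)
    return sum(hits) / len(outcomes)


# One (rank, ties) outcome per user: two clean hits at ranks 1 and 2,
# one hit at rank 1 shared with another profile, one miss at rank 5.
outcomes = [(1, 1), (2, 1), (1, 2), (5, 1)]
example = rank_and_ties({"b1": 0.2, "b2": 0.9, "b3": 0.9}, correct="b3")
```

For these four users the tie-penalized MRR is 0.55, S@1 is 0.375 and S@3 is 0.625; the tied user contributes only half a hit.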
Results
Matching Users based on Tags Table 1 compares the
various approaches for identifying users. Regarding tag-
based profiles (profile type: tag), BM25 clearly outperforms
TFIDF, and BM25 with site specific IDF also clearly outper-
forms BM25 with global IDF. This suggests that accounting
for site and domain specific characteristics in tag weighting
is promising. All methods yield substantially better results
than the baseline approach using TF with cosine similarity;
BM25 with site specific IDF improves on its S@1 by as much as
2.5 times. When operating on a larger set of users (cf. FD
column), the chance of a mismatch increases for all
approaches. However, the relative ordering is consistent. BM25
with site specific IDF ($k_1 = 3.75$, $b = 1$, and $c = 1$)
outperforms all other approaches and, looking at MRR, it is
less influenced by the higher number of users. Evidently,
there exists a strong correlation between profile size and
matching accuracy (0.93).
Matching Users based on Username Regarding usernames
(Table 1, username), Levenshtein and the Longest
Common Subsequence (LCS) based distance perform fairly
similarly and outperform both Jaccard and Smith-Waterman
distances as well as the ExactMatch baseline. Success rates
(S@k) increase only slightly with increasing k, whereas they
increase fairly substantially with k for matching
based on users’ tags. This is to be expected: usernames
tend to be far more distinctive than the tags assigned by
users. For the best metric (LCS), string similarity works well
for approximately 55% of the users but fails for the other 45%.
Matching Users based on Tags and Username When
combining the best performing measures for the two types
of profiles (Table 1, combined), i.e. BM25 with site
specific IDF (tag-based) and LCS (username-based), S@1 on the
FDS dataset improves by 35 percentage points compared to the
approaches that exploit just the tag-based profiles and by 8.9
percentage points compared to the username-based approaches.
Furthermore, we analyzed how the user identification
strategies perform for the different service settings: all
approaches work best when comparing profiles from StumbleUpon
and Delicious while they are less successful when Flickr is
involved. Regarding matching based on tag-based profiles, this
result is to be expected considering that the type of
resources differs between these systems. However, a remarkable
observation is that many users also tend to use similar
usernames on StumbleUpon and Delicious, as the success of the
username-based approach (LCS) is higher than 70%, whereas it
is less successful (S@1 < 50%) for Flickr profiles.

Table 1: Results based on user tags, username and mixture for Flickr, Delicious and StumbleUpon (FDS dataset); for FDS
aggregated profiles; and for Flickr and Delicious (FD dataset). All improvements are significant (p < 0.05, 2-tailed t-test).

                               FDS                         FDS-aggregation             FD
Profile type Strategy          MRR    S@1    S@3    S@10   MRR    S@1    S@3    S@10   MRR    S@1    S@3    S@10
tags         TF                0.181  0.126  0.180  0.278  -      -      -      -      0.108  0.070  0.110  0.178
             TFIDF             0.267  0.207  0.277  0.380  0.335  0.259  0.356  0.470  0.184  0.124  0.197  0.302
             BM25              0.301  0.242  0.317  0.405  0.391  0.326  0.414  0.505  0.259  0.204  0.274  0.370
             BM25 specific IDF 0.345  0.291  0.360  0.443  0.453  0.393  0.474  0.560  0.343  0.250  0.330  0.428
username     ExactMatch        0.387  0.372  0.375  0.375  0.555  0.542  0.547  0.547  0.555  0.542  0.547  0.547
             Jaccard           0.535  0.501  0.536  0.577  0.684  0.654  0.686  0.717  0.684  0.654  0.686  0.717
             SmithWaterman     0.462  0.357  0.475  0.607  0.437  0.217  0.476  0.747  0.437  0.217  0.476  0.747
             Levenshtein       0.574  0.552  0.572  0.591  0.721  0.701  0.722  0.735  0.571  0.459  0.496  0.524
             LCS               0.582  0.552  0.586  0.600  0.727  0.701  0.731  0.746  0.564  0.452  0.502  0.535
combined     Mixture           0.677  0.641  0.697  0.728  0.816  0.792  0.832  0.855  0.632  0.543  0.590  0.624
Using aggregated profiles Table 1 (FDS-aggregation)
also compares the various approaches for matching aggregated
profiles, i.e. the union of two individual profiles for
a given user. We compare each aggregated profile with
the remaining profile (e.g. Flickr-Delicious to StumbleUpon,
Delicious-StumbleUpon to Flickr, etc.) and vice versa.
Regarding aggregated tag profiles, BM25 with site
specific IDF again leads to a significant improvement over
TFIDF and BM25 with global IDF. In summary, knowing more
(tags) about a user clearly improves user identification
performance. For example, S@1 improves from 0.291 to 0.393 for
BM25 with site specific IDF.
Correspondingly, aggregated username-based profiles also
improve performance. For example, S@1 increases from 0.552 to
0.701 for the Levenshtein and LCS measures (Table 1, username,
FDS-aggregation). Finally, for the mixture approach that
combines the best tag- and username-based user identification
strategies, we also see that having more user information
significantly increases the precision of user identification.
For the aggregated profiles, the mixture of the tag-based BM25
approach using site specific IDF and LCS for measuring
similarity of usernames leads to the best performance of 0.816
and 0.792 regarding MRR and S@1 respectively. Moreover, it is
interesting to see that for all settings where Flickr profiles
are unified with Delicious or StumbleUpon profiles, the
combination of tag- and username-based strategies achieves a
success (S@1) of nearly 90%.
Conclusions and Future Work
In this paper we investigate whether users can be identified
across Web platforms by analyzing their tagging practices.
To this end, we examine user profiles from three different
social tagging services: Flickr, Delicious and StumbleUpon.
We exploit implicit feedback (tagging behavior) as well as
lightweight explicit profile information (usernames) to con-
struct user profiles for identifying the users. In summary,
we conclude that (1) it is possible to identify users across
systems based on their tagging behavior even though the
tagging behavior varies considerably between the analyzed
systems, (2) for the user identification based on tag profiles
our new approach of BM25 in combination with site spe-
cific IDF outperformed the other approaches significantly
and (3) knowing more about the user (profile aggregation)
and combining tag- and username-based approaches further
improves the performance significantly to an accuracy of al-
most 80% and nearly 90% for specific settings.
While our user identification strategies can support cross-
system personalization, they raise privacy concerns. For fu-
ture work, we plan to study such privacy aspects in more
detail. We will also investigate whether the consideration of
network structure (such as friend links) in combination with
tag-based profile features impacts user identification.
Acknowledgments The work was partially funded by the
NTH (Niedersächsische Technische Hochschule) School for
IT Ecosystems as well as the Crokodil project funded by the
German Federal Ministry of Education and Research and the
European Social Fund of the European Union (ESF).
References
Abel, F.; Henze, N.; Herder, E.; and Krause, D. 2010. Interweaving
public user profiles on the web. In Proc. UMAP, 16–27.
Bischoff, K.; Firan, C. S.; Nejdl, W.; and Paiu, R. 2008. Can all
tags be used for search? In Proc. CIKM 2008, 193–202.
Carmagnola, F., and Cena, F. 2009. User identification for cross-
system personalisation. Information Sciences: an International
Journal 179(1-2):16–32.
Elmagarmid, A. K.; Ipeirotis, P. G.; and Verykios, V. S. 2007.
Duplicate record detection: A survey. Knowledge and Data Engi-
neering, IEEE Transactions on 19(1):1–16.
Marlow, C.; Naaman, M.; Boyd, D.; and Davis, M. 2006. HT06,
tagging paper, taxonomy, flickr, academic article, to read. In Proc.
Hypertext 2006, 31–40. ACM.
Spärck Jones, K.; Walker, S.; and Robertson, S. E. 2000. A
probabilistic model of information retrieval: development and
comparative experiments. Parts 1 and 2. Information Processing
and Management 36:779–840.
Szomszor, M.; Cantador, I.; and Alani, H. 2008. Correlating user
profiles from multiple folksonomies. In Proc. Hypertext, 33–42.
Vosecky, J.; Hong, D.; and Shen, V. Y. 2009. User identification
across multiple social networks. In Int. Conference on Networked
Digital Technologies (NDT ’09), 360–365.
Zafarani, R., and Liu, H. 2009. Connecting corresponding identi-
ties across communities. In Proc. ICWSM, 354–357.
... To identify crossover users, we searched in the Flare Systems database to determine whether some of the usernames found in the public forum also discussed in cybercrime forums over a similar timeframe. This cross-correlation method is based on the idea that users are likely to choose the same username in different forums, a phenomenon that was observed in previous studies [55][56][57][58][59][60]. ...
... More precisely, several studies show that individuals can efficiently be identified across online platforms through a simple username matching method [55][56][57][58][59][60]. On the other hand, there is a considerable strain of research aimed at developing more sophisticated methods to link individuals across online platforms based, for example, on the features of a user profile [61,62] or his/her/their writing style (a technique known as stylometry) [63,64]. ...
... This allowed us to find a lower bound estimate on the number of crossover users. The estimate is a lower bound because the method inevitably yields false positives (those we flagged as crossover users when they are not), but given users' tendency to choose the same usernames [55][56][57][58][59][60], it likely yields more false negatives (those we missed with the method) than such false positives. We also strengthened this inequality assumption by using strict username filters instead of fuzzy ones (such as considering "RoniTheJungleMaster" and "RoniThe-JungleMasteR" as the same user). ...
Article
Full-text available
Many activities related to cybercrime operations do not require much secrecy, such as developing websites or translating texts. This research provides indications that many users of a popular public internet marketing forum have connections to cybercrime. It does so by investigating the involvement in cybercrime of a population of users interested in internet marketing, both at a micro and macro scale. The research starts with a case study of three users confirmed to be involved in cybercrime and their use of the public forum. It provides a first glimpse that some business with cybercrime connections is being conducted in the clear. The study then pans out to investigate the forum population's ties with cybercrime by finding crossover users, that is, users from the public forum who also comment on cybercrime forums. The cybercrime forums on which they discuss are analyzed and the crossover users’ strength of participation is reported. Also, to assess if they represent a sub-group of the forum population, their posting behavior on the public forum is compared with that of non-crossover users. This blend of analyses shows that (i) a minimum of 7.2% of the public forum population are crossover users that have ties with cybercrime forums; (ii) their participation in cybercrime forums is limited; and (iii) their posting behavior is relatively indistinguishable from that of non-crossover users. This is the first study to formally quantify how users of an internet marketing public forum, a space for informal exchanges, have ties to cybercrime activities. We conclude that crossover users are a substantial part of the population in the public forum, and even though they have thus far been overlooked, their aggregate effect in the ecosystem must be considered. This study opens new research questions on cybercrime participation that should consider online spaces beyond their cybercrime branding.
... Most existing UIL approaches are supervised models, which need a large number of annotations to train a classifier or ranker to separate linked identity pairs from unlinked ones [7], [8], [9], [10], [11], [12], [13], [14], [15]. Considering the boundaries between different platforms, it is extremely expensive and time-consuming to manually collect sufficient annotations. ...
... Existing user identity linkage approaches can be roughly categorized into supervised, semi-supervised and unsupervised methods. Most existing methods are supervised, which view the studied task as a ranking or classification problem to locate the candidates (identity pairs) with the highest linkage probabilities [8], [7], [9], [10], [11], [12], [13], [14], [15]. Man et al. [14] keep major structural regularities of networks by leveraging the observed anchor links as supervised information. ...
Preprint
Full-text available
User identity linkage, which aims to link identities of a natural person across different social platforms, has attracted increasing research interest recently. Existing approaches usually first embed the identities as deterministic vectors in a shared latent space, and then learn a classifier based on the available annotations. However, the formation and characteristics of real-world social platforms are full of uncertainties, which makes these deterministic embedding based methods sub-optimal. In addition, it is intractable to collect sufficient linkage annotations due to the tremendous gaps between different platforms. Semi-supervised models utilize the unlabeled data to help capture the intrinsic data distribution, which are more promising in practical usage. However, the existing semi-supervised linkage methods heavily rely on the heuristically defined similarity measurements to incorporate the innate closeness between labeled and unlabeled samples. Such manually designed assumptions may not be consistent with the actual linkage signals and further introduce the noises. To address the mentioned limitations, in this paper we propose a novel Noise-aware Semi-supervised Variational User Identity Linkage (NSVUIL) model. Specifically, we first propose a novel supervised linkage module to incorporate the available annotations. Each social identity is represented by a Gaussian distribution in the Wasserstein space to simultaneously preserve the fine-grained social profiles and model the uncertainty of identities. Then, a noise-aware self-learning module is designed to faithfully augment the few available annotations, which is capable of filtering noises from the pseudo-labels generated by the supervised module.
... Another metric, Mean Reciprocal Rank (MRR) [50], also known as Mean Average Precision (MAP) [19], evaluates the average performance of reciprocal rank. MRR is defined as follows: ...
Preprint
Network alignment task, which aims to identify corresponding nodes in different networks, is of great significance for many subsequent applications. Without the need for labeled anchor links, unsupervised alignment methods have been attracting more and more attention. However, the topological consistency assumptions defined by existing methods are generally low-order and less accurate because only the edge-indiscriminative topological pattern is considered, which is especially risky in an unsupervised setting. To reposition the focus of the alignment process from low-order to higher-order topological consistency, in this paper, we propose a fully unsupervised network alignment framework named HTC. The proposed higher-order topological consistency is formulated based on edge orbits, which is merged into the information aggregation process of a graph convolutional network so that the alignment consistencies are transformed into the similarity of node embeddings. Furthermore, the encoder is trained to be multi-orbit-aware and then be refined to identify more trusted anchor links. Node correspondence is comprehensively evaluated by integrating all different orders of consistency. {In addition to sound theoretical analysis, the superiority of the proposed method is also empirically demonstrated through extensive experimental evaluation. On three pairs of real-world datasets and two pairs of synthetic datasets, our HTC consistently outperforms a wide variety of unsupervised and supervised methods with the least or comparable time consumption. It also exhibits robustness to structural noise as a result of our multi-orbit-aware training mechanism.
... The authors of Refs. [32,33,34] tried to leveraging other types of profile information such as gender, address, and work experience to improve the performance of identifying. Instead of extracting features from the user profile, Goga et al. [35] focus on capture characteristics from user-generated contents. ...
Preprint
Interlayer link prediction aims at matching the same entities across different layers of the multiplex network. Existing studies attempt to predict more accurately, efficiently, or generically from the aspects of network structure, attribute characteristics, and their combination. Few of them analyze the effects of intralayer links. Namely, few works study the backbone structures which can effectively preserve the predictive accuracy while dealing with a smaller number of intralayer links. It can be used to investigate what types of intralayer links are most important for correct prediction. Are there any intralayer links whose presence leads to worse predictive performance than their absence, and how to attack the prediction algorithms at the minimum cost? To this end, two kinds of network structural perturbation methods are proposed. For the scenario where the structural information of the whole network is completely known, we offer a global perturbation strategy that gives different perturbation weights to different types of intralayer links and then selects a predetermined proportion of intralayer links to remove according to the weights. In contrast, if these information cannot be obtained at one time, we design a biased random walk procedure, local perturbation strategy, to execute perturbation. Four kinds of interlayer link prediction algorithms are carried out on different real-world and artificial perturbed multiplex networks. We find out that the intralayer links connected with small degree nodes have the most significant impact on the prediction accuracy. The intralayer links connected with large degree nodes may have side effects on the interlayer link prediction.
... The more effective solutions for user profile linking exploit the information and multimedia content that transits SNs. A framework for user profile linking based on the profile's attributes was proposed in [60], while in [61] the authors combined tags and user ID to match users' profiles across different social tagging systems. The solutions proposed in [62] and [63] match user profiles by using information about users' identities without compromising their privacy. ...
Article
In the last decade, Social Networks (SNs) have deeply changed many aspects of society, and one of the most widespread behaviours is the sharing of pictures. However, malicious users often exploit shared pictures to create fake profiles, leading to the growth of cybercrime. With this scenario in mind, authorship attribution and verification through image watermarking techniques are becoming more and more important. In this paper, we first investigate how thirteen of the most popular SNs treat uploaded pictures in order to identify a possible implementation of image watermarking techniques by the respective SNs. Second, we test the robustness of several image watermarking algorithms on these thirteen SNs. Finally, we verify whether a method based on the Photo-Response Non-Uniformity (PRNU) technique, which is usually used in digital forensics or image forgery detection, can be successfully used as a watermarking approach for authorship attribution and verification of pictures on SNs. The proposed method is sufficiently robust, in spite of the fact that pictures are often downgraded during the process of uploading to the SNs. Moreover, in comparison to conventional watermarking methods, the proposed method can successfully pass through different SNs, solving related problems such as profile linking and fake profile detection. The results of our analysis on a real dataset of 8400 pictures show that the proposed method is more effective than other watermarking techniques and can help to address serious questions about privacy and security on SNs. Moreover, the proposed method paves the way for the definition of multi-factor online authentication mechanisms based on robust digital features.
... Vosecky et al. [21] proposed a method to identify users based on web profile matching and further extended its effectiveness by incorporating the user's friend network. To investigate whether users can be identified across systems based on their tag-based profiles, an aggregate profile was constructed by combining usernames and user tags [22]. Following these studies, more abundant information was considered to link user accounts [23][24][25][26][27]. ...
Article
Sources of complementary information are connected when we link user accounts belonging to the same user across different platforms or devices. The expanded information promotes the development of a wide range of applications, such as cross-platform prediction, cross-platform recommendation, and advertisement. Due to the significance of user account linkage and the widespread popularization of GPS-enabled mobile devices, there is an increasing number of research studies on linking user accounts with spatio-temporal data across location-aware social networks. Unlike most existing studies in this domain, which focus only on effectiveness, we propose a novel framework entitled HFUL (A Hybrid Framework for User Account Linkage across Location-Aware Social Networks), where efficiency, effectiveness, scalability, robustness, and application of user account linkage are considered. Specifically, to improve efficiency, we develop a comprehensive index structure from the spatio-temporal perspective, and design novel pruning strategies to reduce the search space. To improve effectiveness, a kernel density estimation-based method is proposed to alleviate the data sparsity problem in measuring users’ similarities. Additionally, we investigate the application of HFUL in terms of user prediction, time prediction, and location prediction. Extensive experiments conducted on three real-world datasets demonstrate the superiority of HFUL in terms of effectiveness, efficiency, scalability, robustness, and application compared with state-of-the-art methods.
Article
Anchor link prediction across social networks plays an important role in multiple social network analysis. Traditional methods rely heavily on private user information or high-quality network topology information. These methods are not suitable for real-life analysis of multiple social networks. Deep learning methods based on graph embedding are restricted by the impact that users’ active privacy protection policies have on the graph structure. In this paper, we propose a novel method which neutralizes the impact of users’ evasion strategies. First, graph embedding with conditional estimation analysis is used to obtain a robust embedding vector space. Second, a cross-network feature space for supervised learning is constructed via the constraints of cross-network feature collisions. The combination of robustness enhancement and cross-network feature collision constraints eliminates the impact of evasion strategies. Extensive experiments on large-scale real-life social networks demonstrate that the proposed method significantly outperforms state-of-the-art methods in terms of precision, adaptability and robustness for scenarios with evasion strategies.
Article
Network Alignment (NA), which aims to find the nodes that represent the same entity (i.e., anchor nodes) across different networks, is a fundamental problem in much cross-network research. Recent advances in network embedding have inspired various promising approaches for addressing the NA task, and embedding-based NA technology has become the main research trend. Most embedding-based NA methods follow the consistency assumption explicitly or implicitly, where anchor nodes across different networks tend to have similar local structures/neighbors. However, through a detailed statistical analysis across networks, we observe that anchor nodes have high heterogeneity, i.e., they have different local structures across different networks. Hence, in this paper, we present a formal definition of the heterogeneity of anchor nodes and propose a network alignment framework that simultaneously considers the heterogeneity and consistency of anchor nodes. In our approach, we propose to use a variational autoencoder to learn node embeddings, and design a dual constraint mechanism (Laplacian regularization and a heterogeneity constraint) to balance consistency and heterogeneity, respectively, for network alignment across different networks. Finally, to verify the effectiveness of our proposed method, we conduct extensive experiments on several real-world datasets. Experimental results show that the proposed model achieves better performance than state-of-the-art methods.
Conference Paper
Today, more and more people have their virtual identities on the Web. It is common that people are users of more than one social network and also their friends may be registered on multiple web sites. A facility to aggregate our online friends into a single integrated environment would enable the user to keep up-to-date with their virtual contacts more easily, as well as to provide improved facility to search for people across different websites. In this paper, we propose a method to identify users based on profile matching. We use data from two popular social networks to study the similarity of profile definition. We evaluate the importance of fields in the web profile and develop a profile comparison tool. We demonstrate the effectiveness and efficiency of our tool in identifying and consolidating duplicated users on different websites.
Article
The paper combines a comprehensive account of the probabilistic model of retrieval with new systematic experiments on TREC Programme material. It presents the model from its foundations through its logical development to cover more aspects of retrieval data and a wider range of system functions. Each step in the argument is matched by comparative retrieval tests, to provide a single coherent account of a major line of research. The experiments demonstrate, for a large test collection, that the probabilistic model is effective and robust, and that it responds appropriately, with major improvements in performance, to key features of retrieval situations. Part 1 covers the foundations and the model development for document collection and relevance data, along with the test apparatus. Part 2 covers the further development and elaboration of the model, with extensive testing, and briefly considers other environment conditions and tasks, model training, concluding with comparisons with other approaches and an overall assessment. Data and results tables for both parts are given in Part 1. Key results are summarised in Part 2.
Conference Paper
One of the most interesting challenges in the area of social computing and social media analysis is the so-called community analysis. A well known barrier in cross-community (multiple website) analysis is the disconnectedness of these websites. In this paper, our aim is to provide evidence on the existence of a mapping among identities across multiple communities, providing a method for connecting these websites. Our studies have shown that simple, yet effective approaches, which leverage social media's collective patterns can be utilized to find such a mapping. The employed methods successfully reveal this mapping with 66% accuracy.
Conference Paper
In recent years, tagging systems have become increasingly popular. These systems enable users to add keywords (i.e., "tags") to Internet resources (e.g., web pages, images, videos) without relying on a controlled vocabulary. Tagging systems have the potential to improve search, spam detection, reputation systems, and personal organization while introducing new modalities of social communication and opportunities for data mining. This potential is largely due to the social structure that underlies many of the current systems. Despite the rapid expansion of applications that support tagging of resources, tagging systems are still not well studied or understood. In this paper, we provide a short description of the academic related work to date. We offer a model of tagging systems, specifically in the context of web-based systems, to help us illustrate the possible benefits of these tools. Since many such systems already exist, we provide a taxonomy of tagging systems to help inform their analysis and design, and thus enable researchers to frame and compare evidence for the sustainability of such systems. We also provide a simple taxonomy of incentives and contribution models to inform potential evaluative frameworks. While this work does not present comprehensive empirical results, we present a preliminary study of the photo-sharing and tagging system Flickr to demonstrate our model and explore some of the issues in one sample system. This analysis helps us outline and motivate possible future directions of research in tagging systems.
Conference Paper
As the popularity of the web increases, particularly the use of social networking sites and Web2.0 style sharing platforms, users are becoming increasingly connected, sharing more and more information, resources, and opinions. This vast array of information presents unique opportunities to harvest knowledge about user activities and interests through the exploitation of large-scale, complex systems. Communal tagging sites, and their respective folksonomies, are one example of such a complex system, providing huge amounts of information about users, spanning multiple domains of interest. However, the current Web infrastructure provides no mechanism for users to consolidate and exploit this information since it is spread over many disparate and unconnected resources. In this paper we compare user tag-clouds from multiple folksonomies to: (a) show how they tend to over-
Conference Paper
While browsing the Web, providing profile information in social networking services, or tagging pictures, users leave a plethora of traces. In this paper, we analyze the nature of these traces. We investigate how user data is distributed across different Web systems, and examine ways to aggregate user profile information. Our analyses focus on both explicitly provided profile information (name, homepage, etc.) and activity data (tags assigned to bookmarks or images). The experiments reveal significant benefits of interweaving profile information: more complete profiles, advanced FOAF/vCard profile generation, disclosure of new facets about users, higher level of self-information induced by the profiles, and higher precision for predicting tag-based profiles to solve the cold start problem.
Article
Currently, there is an increasing demand for user-adaptive systems for various purposes in many different domains. Typically, personalisation in information systems occurs separately within each system. The recent trends in user modeling rely on cross-system personalisation, i.e., the opportunity to share information across multiple information systems in order to improve user adaptation. Cooperation among systems in order to exchange user model knowledge is a complex task. This paper addresses a key challenge for cross-system personalisation which is often taken as a starting assumption, i.e., user identification. In this paper, we describe the conceptualization and implementation of a framework that provides a common base for user identification for cross-system personalisation among web-based user-adaptive systems. However, the framework can be easily adopted in different working environments and for different purposes. The framework represents a hybrid approach which draws parallels both from centralized and decentralized solutions for user modeling. To perform user identification, we propose to exploit a set of identification properties that are combined using an identification algorithm.
Conference Paper
Collaborative tagging has become an increasingly popular means for sharing and organizing Web resources, leading to a huge amount of user generated metadata. These tags represent quite a few different aspects of the resources they describe and it is not obvious whether and how these tags or subsets of them can be used for search. This paper is the first to present an in-depth study of tagging behavior for very different kinds of resources and systems - Web pages (Del.icio.us), music (Last.fm), and images (Flickr) - and compares the results with anchor text characteristics. We analyze and classify sample tags from these systems, to get an insight into what kinds of tags are used for different resources, and provide statistics on tag distributions in all three tagging environments. Since even relevant tags may not add new information to the search procedure, we also check overlap of tags with content, with metadata assigned by experts and from other sources. We discuss the potential of different kinds of tags for improving search, comparing them with user queries posted to search engines as well as through a user survey. The results are promising and provide more insight into both the use of different kinds of tags for improving search and possible extensions of tagging systems to support the creation of potentially search-relevant tags.
Elmagarmid, A. K.; Ipeirotis, P. G.; and Verykios, V. S. 2007. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering 19(1):1-16.