
Understanding co-evolution of social and content networks on Twitter

Authors:
  • GESIS - Leibniz Institute of the Social Sciences

Philipp Singer
Knowledge Management
Institute
Graz University of Technology
Graz, Austria
philipp.singer@tugraz.at
Claudia Wagner
DIGITAL Intelligent Information
Systems
JOANNEUM RESEARCH
Graz, Austria
claudia.wagner@joanneum.at
Markus Strohmaier
Knowledge Management
Institute and Know-Center
Graz University of Technology
Graz, Austria
markus.strohmaier@tugraz.at
ABSTRACT
Social media has become an integral part of today’s web and
allows users to share content and socialize. Understanding
the factors that influence how users evolve over time - for ex-
ample how their social network and their contents co-evolve -
is an issue of both theoretical and practical relevance. This
paper sets out to study the temporal co-evolution of con-
tent and social networks on Twitter and bi-directional in-
fluences between them by using multilevel time series re-
gression models. Our findings suggest that on Twitter so-
cial networks have a strong influence on content networks
over time, and that social network properties, such as users’
number of followers, strongly influence how active and in-
formative users are. While our investigations are limited to
one small dataset obtained from Twitter, our analysis opens
up a path towards more systematic studies of network co-
evolution on platforms such as Twitter or Facebook. Our
results are relevant for researchers and social media hosts
interested in understanding how content-related and social
activities of social media users evolve over time and which
factors impact their co-evolution.
Categories and Subject Descriptors
E.1 [Data Structures]: Graphs and networks;
J.4 [Computer Applications]: Social and behavioral sci-
ences—Sociology
General Terms
Experimentation, Human Factors, Measurement
Keywords
Microblog, Twitter, Influence Patterns, Semantic Analysis,
Time Series
1. INTRODUCTION
Social media applications such as blogs, message boards or
microblogs allow users to share content and socialize. Host-
ing such social media applications can however be costly,
and social media hosts need to ensure that their users re-
main active and their platform remains popular. Monitor-
ing and analyzing behavior of social media users and their
social and content co-evolution over time can provide valu-
able information on the factors which impact the activity
and popularity of such social media applications. Activity
and popularity are often measured by the growth of content
produced by users and/or the growth of its social network.
In recent work we have analyzed how the tagging behav-
ior of users influences the emergence of global tag semantics
[3]. However, as a research community we know little about
the factors that impact the activity and popularity of social
media applications and we know even less about how users’
content-related activities (e.g., their tweeting, retweeting or
hashtagging behavior) influence their social activities (i.e.,
their following behavior) and vice versa.
This paper sets out to explore factors that impact the co-
evolution of users’ content-related and social activities based
on a dataset consisting of randomly chosen users taken from
Twitter’s public timeline by using a multilevel time series
regression model. Unlike previous research, we focus on
measuring dynamic bi-directional influence between these
networks in order to identify which content-related factors
impact the evolution of social networks and vice versa. This
analysis enables us to tackle questions such as “Does growth
of a user’s followers increase the number of links or hashtags
they use per tweet?” or “Does an increase in users’ popularity
imply that their tweets will be retweeted more often on
average?”.
Our results reveal interesting insights into influence patterns
in content networks, social networks and between them. Our
observations and implications are relevant for researchers
interested in social network analysis, text mining and be-
havioral user studies, as well as for social media hosts who
need to understand the factors that influence the evolution
of users’ content-related and social activities on their plat-
forms.
2. METHODOLOGY
Since we aim to gain insights into the temporal evolution
of content networks and social networks, we apply time
series modeling [2] based on the work by Wang and Groth [6],
who provide a framework to measure the bi-directional
influence between social and content network properties. In
this work we apply an autoregressive model to our time series
data. An autoregressive model of order p, written AR(p),
estimates an observation as a weighted sum of the p preceding
observations and can therefore also be used to make
predictions. In this work we apply a simple model, which
estimates each variable independently and only includes
values from the last time unit (p = 1). The estimated
coefficients of the model indicate the influences between
variables over time.
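As a minimal illustration of the AR(1) idea described above, the following sketch estimates a single lag-1 coefficient by least squares for one zero-mean series. The function names `fit_ar1` and `predict_next` are hypothetical, and the estimator is an assumption chosen for brevity; the paper does not state how its coefficients were estimated.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x(t) = phi * x(t-1) + noise.

    Assumes the series is already centered (zero mean), so no
    intercept is fitted.
    """
    prev = series[:-1]   # observations at time t-1
    curr = series[1:]    # observations at time t
    denom = sum(x * x for x in prev)
    if denom == 0:
        raise ValueError("series has no variation to regress on")
    return sum(p * c for p, c in zip(prev, curr)) / denom


def predict_next(series, phi):
    """One-step-ahead prediction from the last observed value."""
    return phi * series[-1]


# Toy series that halves at every step, so phi should be 0.5.
toy_series = [1.0, 0.5, 0.25, 0.125]
phi = fit_ar1(toy_series)
next_value = predict_next(toy_series, phi)
```

A positive coefficient here means that a high value at one time point tends to be followed by a high value at the next, which is how the self-influence arrows in the influence network can be read.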
In regression analysis, variables often stem from different
levels, and so-called multilevel regression models are an
appropriate way to model such data. Our dataset has such a
hierarchical nested structure: the measurement occasion is
the basic unit, which is nested under an individual, the
cluster unit. For each day different properties are measured
repeatedly, but all of these values belong to different
individuals in our study. If we applied a simple
autoregressive model to our data, we would ignore the
differences between users and would only estimate the
so-called fixed effects. This is problematic because we
cannot assume that all cluster-specific influences are
included as covariates in the analysis [4]. The advantage of
multilevel regression models is that they add random effects
to the fixed effects and thereby also consider variation
among individuals. Therefore, we utilize a multilevel
autoregressive regression model, which is defined as follows:
\[
x_{i,p}^{(t)} = a_i^T x_p^{(t-1)} + \epsilon_i^{(t)} + b_{i,p}^T x_p^{(t-1)} + \epsilon_{i,p}^{(t)} \tag{1}
\]
In this equation $x_p^{(t)} = (x_{1,p}^{(t)}, \dots, x_{m,p}^{(t)})^T$
represents a vector which contains the variables for an
individual p at time t. Furthermore,
$a_i = (a_{i,1}, \dots, a_{i,m})^T$ represents the fixed effect
coefficients and $b_i = (b_{i,1}, \dots, b_{i,m})^T$ represents the
random effect coefficients, which are specific to each
individual p (hence $b_{i,p}$ in Equation 1). It is assumed
that $\epsilon_i^{(t)}$ and $\epsilon_{i,p}^{(t)}$ are Gaussian
noise terms for the fixed and random effects, respectively,
each with zero mean and variance $\sigma^2$. To compare
the fixed effects to each other, the variables in the random
effect regression equations need to be linearly transformed to
represent standardized values. How this is done and how the
model is finally applied to our data is described in Section 4.
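To make the split into fixed and random effects more concrete, the sketch below uses a deliberately simplified two-stage approximation for a single variable: it fits one lag-1 slope per user, takes the average slope as the fixed effect, and treats each user's deviation from that average as the random effect. This is an assumption made for illustration and not the estimator used in the paper, which is not specified in the text; the names `slope_through_origin` and `two_stage_effects` are hypothetical.

```python
from statistics import mean


def slope_through_origin(prev, curr):
    """Least-squares slope of curr on prev without an intercept."""
    denom = sum(x * x for x in prev)
    return sum(p * c for p, c in zip(prev, curr)) / denom


def two_stage_effects(series_by_user):
    """Approximate fixed and random lag-1 effects for one variable.

    series_by_user maps a user id to that user's standardized
    daily values. Returns (fixed, random_effects), where fixed is
    the average per-user slope and random_effects maps each user
    to the deviation of their own slope from that average.
    """
    slopes = {
        user: slope_through_origin(values[:-1], values[1:])
        for user, values in series_by_user.items()
    }
    fixed = mean(slopes.values())
    random_effects = {user: s - fixed for user, s in slopes.items()}
    return fixed, random_effects


# Two toy users whose values decay with slopes 0.6 and 0.4.
toy_data = {
    "u1": [1.0, 0.6, 0.36],
    "u2": [1.0, 0.4, 0.16],
}
fixed, random_effects = two_stage_effects(toy_data)
```

In this toy example the shared (fixed) slope is about 0.5, and the two users deviate from it by about +0.1 and -0.1. A proper multilevel fit would estimate both parts jointly instead of in two stages.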
3. DATASET
We chose Twitter as a platform for studying the co-evolution
of communication content and social networks, since it is
a popular micro-blogging service. We explore one random
dataset in this work, which was crawled within a time period
of 30 days. This random dataset consists of random users
from the public timeline who do not have anything special
in common.
To generate the random dataset, we randomly chose 1500
users from the public Twitter timeline, whom we used as seed
users. We used the public timeline method from the Twitter
API to sample users rather than using random user IDs
since the timeline method is biased towards active Twitter
users. To ensure that our random sample of seed users
consists of active, English-speaking Twitter users, we further
only kept users who mainly tweet in English and have at least
80 followers, 40 followees and 200 tweets. We also had to
remove users from our dataset who deleted or protected their
account during the 30 days of crawling. Hence, we ended up
with 1,188 seed users for whom we were able to crawl their
social network (i.e., their followers and followees) and their
tweets and retweets. To identify retweets we used the flag
provided by the official Twitter API, and to extract URLs
we used a regular expression. During a 30-day time period
(from 15 March 2011 to 14 April 2011) we polled the data
daily at about the same time.
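The seed-user filter and the URL extraction step described above could be sketched as follows. The `UserProfile` fields, the 0.5 cut-off used to interpret "mainly tweet in English", and the URL pattern are all assumptions made for illustration; the paper gives neither the language-detection method nor the regular expression it used.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: anything starting with http:// or https://
# up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class UserProfile:
    user_id: str
    followers: int
    followees: int
    tweets: int
    english_share: float  # assumed: fraction of tweets detected as English


def is_seed_candidate(user, min_english_share=0.5):
    """Apply the activity thresholds stated in the text."""
    return (
        user.english_share > min_english_share
        and user.followers >= 80
        and user.followees >= 40
        and user.tweets >= 200
    )


def extract_urls(text):
    """Return all URL-like substrings found in a tweet text."""
    return URL_PATTERN.findall(text)


active = UserProfile("a", followers=120, followees=60, tweets=500, english_share=0.9)
quiet = UserProfile("b", followers=30, followees=60, tweets=500, english_share=0.9)
kept = [u.user_id for u in (active, quiet) if is_seed_candidate(u)]
```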
4. EXPERIMENTAL SETUP
The goal of our experiments is to study the co-evolution of
social and content networks of Twitter users and influence
patterns between them. In order to achieve this we firstly
created a social and content network for each specific time
point.
Social network: The social network is a one-mode directed
network, where each vertex represents a user and the edges
between these vertices represent the directed follow-relations
between two users at a certain point in time. The con-
structed social network of seed users only reflects a sub-part
of a greater network. Therefore it makes no sense to cal-
culate and analyze specific network properties such as be-
tweenness centrality or clustering coefficient, because these
properties depend on the whole network and we only have
data available for a certain sub-network.
Content network: The content network at each point in
time is a two-mode network, which connects users and tweets
via authoring-relations. From these user-tweet networks one
can extract specific tweet features, such as hashtags, links or
retweet information, and build, for example, a user-hashtag
network. It would also be possible to create further types of
content networks, such as hashtag co-occurrence networks
(see [5] for further types), but we leave the investigation of
such network types open for future research.
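As a rough sketch of how a user-hashtag network can be derived from the user-tweet network described above, the following code counts, for each user, how often each hashtag appears in their tweets. The input format (a mapping from user to tweet texts), the hashtag pattern and the lower-casing of hashtags are assumptions for illustration only.

```python
import re
from collections import defaultdict

# Assumed hashtag syntax: '#' followed by word characters.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def user_hashtag_network(user_tweets):
    """Build weighted user-hashtag edges from a user-tweet mapping.

    user_tweets maps a user id to the list of tweet texts authored
    by that user. The result maps (user, hashtag) pairs to the
    number of times the user used that hashtag.
    """
    edges = defaultdict(int)
    for user, tweets in user_tweets.items():
        for text in tweets:
            for tag in HASHTAG_PATTERN.findall(text):
                edges[(user, tag.lower())] += 1
    return dict(edges)


toy_tweets = {
    "alice": ["#www2012 rocks", "at #WWW2012 #msm"],
    "bob": ["no tags today"],
}
network = user_hashtag_network(toy_tweets)
```

Users without any hashtags, such as "bob" in the toy input, simply contribute no edges.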
Overall, the social networks capture the social following
relations between users, whereas our content networks account
for the tweets users publish. Finally, we can connect both
networks via their user vertices, since we know which user in
the social network corresponds to which user in the content
network and vice versa.
A further step towards our final results is the normaliza-
tion of our available data. This is done by subtracting the
time-overall mean and dividing the result by the time-overall
standard deviation. The fixed effects can now be analyzed
as the effect of one standard deviation of change in the in-
dependent variable on the number of standard deviations
change in the dependent variable [6].
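The normalization step above is a standard z-score transformation. A minimal sketch, assuming the population standard deviation is used and that a constant series is mapped to zeros (neither detail is stated in the paper):

```python
from statistics import mean, pstdev


def standardize(values):
    """Subtract the overall mean and divide by the overall standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant series carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


followers_per_day = [100, 110, 120, 130]
z_scores = standardize(followers_per_day)
```

After this transformation every property has mean 0 and standard deviation 1, which is what makes coefficients of different properties comparable.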
[Figure 1: diagram of the influence network. Content network
nodes: RetweetedRatio, LinkRatio, RetweetRatio, HashtagRatio,
#Tweets. Social network nodes: #Followees, #Followers.]
Figure 1: Influence network between the content and social network of a randomly chosen set of Twitter
users. An arrow between two properties indicates that the value of one property at time t has a positive
or negative effect on the value of the other property at time t + 1. Red dashed arrows represent negative
effects and blue solid arrows represent positive effects. The thickness of the lines indicates the weight of the
influence relations. Only statistically significant influences are illustrated.

Based on the prepared data, the final model described in
Section 2 can be applied to identify potential influences
between social and content network properties over time.
Table 1 describes each social and content network property
used throughout our experiment. The properties are
calculated for a corresponding social or content network at
each time point t of the random Twitter dataset. The
dependent variable of the model is always a property at time
t, and the independent variables are all properties at time
t - 1, including the dependent variable at that time.
Including the dependent variable in that step allows us to
detect whether a variable’s previous value influences its
future value. Finally, the resulting statistically significant
coefficients show a relationship between an independent
variable at time t - 1 and a dependent variable at time t.
Positive coefficients indicate that a high value of a property
leads to an increase of another property, while negative
coefficients indicate that a high value of a property leads to
a decrease of another property. To reveal positive and
negative influence relations between properties within and
across different networks, we visualize them as a graphical
influence network.
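The pairing of independent variables at time t - 1 with a dependent variable at time t can be sketched as a small helper that turns one user's daily measurements into regression rows. The function name `lagged_rows` and the dictionary-based input format are hypothetical choices for this illustration.

```python
def lagged_rows(daily, target):
    """Pair all properties at day t-1 with the target property at day t.

    daily is a chronological list with one dict per day, mapping
    property names to (standardized) values. Each returned row is
    (predictors_at_previous_day, target_value_at_current_day).
    Note that the predictors include the target property itself,
    which is what allows self-influence to be detected.
    """
    return [
        (dict(daily[t - 1]), daily[t][target])
        for t in range(1, len(daily))
    ]


toy_days = [
    {"followers": 10, "tweets": 2},
    {"followers": 12, "tweets": 3},
    {"followers": 15, "tweets": 1},
]
rows = lagged_rows(toy_days, "tweets")
```

With 30 daily measurements per user, this yields 29 rows per user and dependent variable.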
5. RESULTS
Our results reveal interesting influence patterns between
social networks and content networks. The influence network
in Figure 1 shows the correlations detected in the multilevel
regression analysis via arrows that indicate influences
between a property at time t and another property at time
t + 1.
The influence network reveals significant influences of
social properties on content network properties. The strongest
positive effects can be observed between the number of
followers of a user and the content network properties - i.e.,
users’ number of followers positively influences their link
ratio, their retweeted ratio and their number of tweets. This
indicates that users start providing more tweets and also
more links in their tweets if their number of followers
increases. Not surprisingly, users’ tweets are also more likely
to get retweeted if their number of followers increases,
because more users are potentially reading their tweets.

Table 1: Social and content network properties

Network type  Property         Description
Social        #Followers       The number of followers a user v has at a
                               specific time point t.
Social        #Followees       The number of followees a user v has at a
                               specific time point t.
Content       #Tweets          The number of tweets a user v has authored
                               at a specific time point t.
Content       Hashtag ratio    The number of hashtags used by a user v at
                               a specific time point t, normalized by the
                               number of daily tweets authored by him/her.
Content       Retweet ratio    The number of retweets (originally authored
                               by other users) by a user v at a specific
                               time point t, normalized by the number of
                               tweets he/she published that day.
Content       Retweeted ratio  The number of tweets produced by a user v
                               at a specific time point t that were
                               retweeted by other users, normalized by the
                               number of tweets user v published that day.
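The content network properties defined in Table 1 can be computed per user and day roughly as follows. The `Tweet` record, its field names, the hashtag pattern and the choice to return 0.0 for days without tweets are assumptions made for this sketch; the paper does not say how days without tweets were handled.

```python
import re
from dataclasses import dataclass

HASHTAG = re.compile(r"#\w+")  # assumed hashtag syntax


@dataclass
class Tweet:
    text: str
    is_retweet: bool      # the user retweeted someone else's tweet
    was_retweeted: bool   # other users retweeted this tweet


def daily_content_properties(tweets):
    """Compute #Tweets and the ratio properties for one user and one day."""
    n = len(tweets)
    if n == 0:
        # Ratios are undefined without tweets; 0.0 is an assumption.
        return {"tweets": 0, "hashtag_ratio": 0.0,
                "retweet_ratio": 0.0, "retweeted_ratio": 0.0}
    hashtags = sum(len(HASHTAG.findall(t.text)) for t in tweets)
    return {
        "tweets": n,
        "hashtag_ratio": hashtags / n,
        "retweet_ratio": sum(t.is_retweet for t in tweets) / n,
        "retweeted_ratio": sum(t.was_retweeted for t in tweets) / n,
    }


toy_day = [
    Tweet("#a #b hello", is_retweet=False, was_retweeted=True),
    Tweet("plain retweet", is_retweet=True, was_retweeted=False),
]
props = daily_content_properties(toy_day)
```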
Further, figure 1 shows that the number of followees of the
social network has positive and negative influences on the
content network in our random dataset. While the positive
effects point to the link and hashtag ratio, the negative ef-
fects point to the number of tweets and the retweeted ratio.
This suggests that users who start following other users also
start using more hashtags and links. One possible expla-
nation for this is that users get influenced by the links and
hashtags used by the users they follow and might therefore
use them more often in their own tweets. The negative ef-
fect of the number of followees on the number of tweets and
the retweeted ratio suggests that users who start following
many other users start behaving more like passive readers
rather than active content providers.
Another observation of our experiment is that all properties
influence themselves positively, which indicates that users
who are active one day tend to be even more active the
next day. For example, users who attract new followers one
day tend to attract more new followers the day after.
6. CONCLUSIONS AND FUTURE WORK
The main contributions of this paper are the following: (i)
We applied multilevel time series regression models to one
selected Twitter dataset consisting of social and content net-
work data and (ii) we explored influence patterns between
social and content networks on Twitter. In our experiments
we studied how the properties of social and content networks
co-evolve over time. We showed that the adopted approach
allows answering interesting questions about how users’ be-
havior on Twitter evolves over time and the factors that
impact this evolution. While our results are limited to the
dataset used, our work illuminates a path towards study-
ing complex dynamics of network evolution on systems such
as Twitter. Our analyses may also help social media
hosts to promote certain features of the platform and steer
users and their behavior. For example, one can see from
our analysis that usage of content features, such as hashtags
and links, is highly influenced by social network properties
such as the number of followers of a user. Therefore, social
media hosts could try to encourage users to use more
content features by introducing new measures such as friend
recommender techniques, which might impact the social
network of users. However, further work is warranted to study
these ideas.
Overall, our findings on one small Twitter dataset suggest
that there are manifold sources of influence between social
and content network properties. Our results indicate that
users’ behavior and the co-evolution of content and social
networks on Twitter is driven by social factors rather than
content factors. Previous research by Anagnostopoulos et
al. [1] showed that content on Flickr is not strongly in-
fluenced by social factors. This may suggest that different
social media applications may be driven by different factors.
The experimental setup used in our work can be applied
to different datasets to study these questions in the future.
Nevertheless, further work is required to confirm or refute
these observations on other, larger datasets.
Our experiments suggest that the number of followers
powerfully influences properties of the content network. One
interpretation is that the number of followers is a
very important motivation for Twitter users to add more
content and use more content features like hashtags, URLs
or retweets. However, the number of users a user is
following can also have a negative influence on content
network properties, as one can see from Figure 1. Our results
suggest that an increase in a user’s followees (i.e., the
number of users he/she follows) implies that the user starts
tweeting less and that his/her tweets get retweeted less
frequently.
Further, our findings show that all properties influence
themselves positively. This does not mean that the values of
all properties always increase over time, but that they tend
to increase depending on how much they increased the day
before. For example, a Twitter user who started posting more
links on day t is likely to post even more links on day t + 1,
and a user who gains new followers on day t is likely to gain
even more new followers on day t + 1.
To summarize, our work highlights the existence of interest-
ing influence relationships between content and social net-
works on Twitter, and shows that multilevel time series re-
gression analysis can be used to reveal such relationships and
to study how they evolve over time. Based on the techniques
developed by Wang and Groth [6], our work investigated in-
fluence patterns in a new domain, i.e. on microblogging plat-
forms like Twitter. Our results are relevant for researchers
interested in social network analysis, text mining and be-
havioral user studies, as well as for community hosts who
need to understand the factors that influence the evolution
of their users in terms of their content-related and social
behavior.
Acknowledgments
This work is in part funded by the FWF Austrian Science
Fund Grant I677 and the Know-Center Graz. Claudia Wag-
ner is a recipient of a DOC-fForte fellowship of the Austrian
Academy of Science.
7. REFERENCES
[1] A. Anagnostopoulos, R. Kumar, and M. Mahdian.
Influence and correlation in social networks. In Y. Li,
B. Liu, and S. Sarawagi, editors, Proceedings of the 14th
ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, Las Vegas, Nevada, USA,
August 24-27, 2008, pages 7–15. ACM, 2008.
[2] G. Kitagawa. Introduction to Time Series Modeling
(Chapman & Hall/CRC Monographs on Statistics &
Applied Probability). Chapman and Hall/CRC, 2010.
[3] C. Körner, D. Benz, A. Hotho, M. Strohmaier, and
G. Stumme. Stop thinking, start tagging: tag semantics
emerge from collaborative verbosity. In Proceedings of
the 19th international conference on World wide web,
WWW ’10, pages 521–530, New York, NY, USA, 2010.
ACM.
[4] A. Skrondal and S. Rabe-Hesketh. Generalized Latent
Variable Modeling: Multilevel, Longitudinal, and
Structural Equation Models. Chapman and Hall/CRC,
2004.
[5] C. Wagner and M. Strohmaier. The wisdom in
tweetonomies: Acquiring latent conceptual structures
from social awareness streams. In Proc. of the Semantic
Search 2010 Workshop (SemSearch2010), April 2010.
[6] S. Wang and P. Groth. Measuring the dynamic
bi-directional influence between content and social
networks. In P. Patel-Schneider, Y. Pan, P. Hitzler,
P. Mika, L. Zhang, J. Pan, I. Horrocks, and B. Glimm,
editors, The Semantic Web – ISWC 2010, volume 6496
of Lecture Notes in Computer Science, pages 814–829.
Springer Berlin / Heidelberg, 2010.