
Ignorance isn’t Bliss: An Empirical Analysis of Attention Patterns in Online Communities

Abstract

Online community managers work towards building and managing communities around a given brand or topic. A risk imposed on such managers is that their community may die out and its utility diminish to users. Understanding what drives attention to content and the dynamics of discussions in a given community informs the community manager and/or host of the factors that are associated with attention, allowing them to detect a reduction in such factors. In this paper we gain insights into the idiosyncrasies that individual community forums exhibit in their attention patterns and how the factors that impact activity differ. We glean such insights through a two-stage approach that functions by (i) differentiating between seed posts - i.e. posts that solicit a reply - and non-seed posts - i.e. posts that did not get any replies, and (ii) predicting the level of attention that seed posts will generate. We explore the effectiveness of a range of features for predicting discussions and analyse their potential impact on discussion initiation and progress. Our findings show that the discussion behaviour of different communities exhibits interesting differences in terms of how attention is generated. Our results show, among other things, that the purpose of a community as well as the specificity of the topic of a community impact which factors drive the reply behaviour of a community. For example, communities around very specific topics require posts to fit the topical focus of the community in order to attract attention while communities around more general topics do not have this requirement. We also found that the factors which impact the start of discussions in communities often differ from the factors which impact the length of discussions.
How to cite:
Wagner, Claudia ; Rowe, Matthew; Strohmaier, Markus and Alani, Harith (2012). Ignorance isn’t bliss: an
empirical analysis of attention patterns in online communities. In: 4th IEEE International Conference on
Social Computing, 3-6 September 2012, Amsterdam, The Netherlands (forthcoming).
© 2012 The Authors
Version: Accepted Manuscript
Ignorance isn’t Bliss:
An Empirical Analysis of Attention Patterns in
Online Communities
Claudia Wagner, Matthew Rowe, Markus Strohmaier, and Harith Alani
Institute of Information and Communication Technologies, JOANNEUM RESEARCH, Graz, Austria
Email: claudia.wagner@joanneum.at
Knowledge Media Institute, The Open University, Milton Keynes, UK
Email: m.c.rowe@open.ac.uk, halani@open.ac.uk
Knowledge Management Institute and Know-Center, Graz University of Technology, Graz, Austria
Email: markus.strohmaier@tugraz.at
Index Terms: attention, online communities, discussion, popularity,
user generated content
I. INTRODUCTION
Social media applications such as blogs, video sharing
sites or message boards allow users to share various types
of content with a community of users. For the managers of
such communities, the investment of time and money means
that community utility is paramount. A reduction in activity
could be detrimental to the appearance of the community to
outside users, conveying an impression of a community that
is no longer active and therefore of little utility. The different
nature and intentions of online communities means that what
drives attention to content in one community may differ from
another. For example, what catches the attention of users in a
question-answering or a support-oriented community may not
have the same effect in conversation-driven or event-driven
communities. In this paper we use the number of replies that a
given post on a community message board yields as a measure
of its attention.
To explore how attention is generated in different communities,
our paper studies the following two research questions:
1) Which factors impact the attention level a post gets in
certain community forums?
2) How do these factors differ between individual commu-
nity forums?
Understanding what factors are associated with attention in
different communities could inform managers and hosts of
community forums with the know-how of what drives attention
and what catches the attention of users in their community.
Empowered with such information, managers could then detect
changes in such factors that could potentially impact commu-
nity activity and cause the utility of the community to alter.
We approach our research questions through an empirical
study of attention patterns in 20 randomly selected forums on
the Irish community message board Boards.ie.¹ Our study was
facilitated through a two-stage approach that (i) differentiates
between seed posts - i.e. thread starters on a community
message board that got at least one reply - and non-seed posts
- i.e. thread starters which did not get a single reply, and (ii)
predicts the level of attention that seed posts will generate - i.e.
the number of replies. Through the use of five distinct feature
sets, containing a total of 28 features and including user,
focus, content, community and post title features, we analysed
how attention is generated in different community forums.
We find interesting differences between these communities in
terms of what drives users to reply to thread starters initially
(through our seed post identification experiment) and what
factors are associated with the length of discussions (through
our seed post activity level prediction experiment). Our work
is relevant for researchers interested in the behavioural analysis of
communities, and for analysts and community managers who aim
to understand the factors that are associated with attention
within a community.
¹ http://www.boards.ie
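The seed/non-seed distinction and the reply count used as the attention measure can be sketched as follows. This is a minimal illustration under assumptions: the dictionary keys `thread_id` and `is_starter` are a hypothetical data layout, not the Boards.ie schema, and every non-starter post in a thread is counted as one reply to the thread starter.

```python
from collections import Counter

def label_thread_starters(posts):
    """Return {thread_id: (is_seed, reply_count)} for every thread starter.

    `posts` is a list of dicts with the hypothetical keys 'thread_id' and
    'is_starter'; each non-starter post counts as one reply to its thread.
    """
    replies = Counter(p["thread_id"] for p in posts if not p["is_starter"])
    labels = {}
    for p in posts:
        if p["is_starter"]:
            n = replies[p["thread_id"]]  # Counter yields 0 for unseen threads
            labels[p["thread_id"]] = (n > 0, n)
    return labels

posts = [
    {"thread_id": 1, "is_starter": True},
    {"thread_id": 1, "is_starter": False},
    {"thread_id": 1, "is_starter": False},
    {"thread_id": 2, "is_starter": True},
]
labels = label_thread_starters(posts)
```

In this toy input, thread 1 is a seed post with two replies and thread 2 is a non-seed post.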
The paper is structured as follows: Section 2 describes
related work on attention prediction on
different social web platforms. Section 3 describes the dataset
and Section 4 describes the features used in our analysis.
Section 5 presents our experiments on identifying seed posts
and anticipating their attention level in different communities.
Section 6 discusses our findings and relates them to previous
research. Section 7 concludes the paper with a summary of
the key findings gleaned from our experiments and plans for
future work.
II. RELATED WORK
Attention on social media platforms can be gauged through
assessing the number of replies that a piece of content or
user receives. Within this context [1] consider the problem
of reciprocity prediction and study this problem in a commu-
nication network extracted from Twitter. They essentially aim
to predict whether a user A will reply to a message of user B
by exploring various features which characterise user pairs
and show that features that approximate the relative status
of two nodes are good indicators of reciprocity. Our work
differs from [1], since we do not aim to predict who will
reply to a message, but consider the problem of identifying
posts which will start a discussion and predicting the length of
discussions. Further, we focus on exploring idiosyncrasies in
the reply behaviour of different communities, while the above
work studies communication networks on Twitter without dif-
ferentiating between individual sub-communities which may
use Twitter as a communication medium.
The work presented in [2] investigates factors that impact
whether Twitter users reply to messages and explores if Twitter
users selectively choose whom to reply to based on the topic
or, otherwise, if they reply to anyone about anything. Their
results suggest that the social aspect predominantly conditions
users’ interactions on Twitter. Work described in [3] considers
the task of predicting discussions on Twitter, and found that
certain features were associated with increased discussion
activity - i.e. the greater the broadcast spectrum of the user,
characterised by in-degree and list-degree levels, the greater
the discussion activity. Further, in our previous work [4] we
explored factors which may impact discussions on message
boards and showed, amongst others, that content features are
better indicators of seed posts than user features. Similar to
our previous work [4] we also aim to predict discussions on
message boards, but unlike past work, which aimed to identify
global attention patterns, we focus on exploring and contrast-
ing the discussion behaviour of individual communities.
Closely related to the problem of anticipating the reply-
behaviour of social media users is the problem of predicting
the popularity and virality of content. For example, the work
described in [5] considers the task of predicting the rank of
stories on Digg and found that the number of early comments
and their quality and characteristics are useful indicators.
Hong et al. [6] investigated the problem of predicting the
popularity of messages on Twitter measured by the num-
ber of future retweets. One of their findings was that the
likelihood that a portion of a user’s followers will retweet
a new message depends on how many followers the user
has and that messages which only attract a small audience
might be very different from the messages which receive
huge numbers of retweets. Similar work by [7] explored
the relation between the content properties of tweets and
the likelihood of the tweets being retweeted. By analysing
a logistic regression model’s coefficients, Naveed et al. [7]
found that the inclusion of a hyperlink and using terms of a
negative valence increased the likelihood of the tweet being
retweeted. The work of [8] explores the retweet behaviour of
Twitter users by modeling individual micro-cosm behaviour
rather than general macro-level processes. They present four
retweeting models (general model, content model, homophily
model, and recency model) and found that content based
propagation models were better at explaining the majority
of retweet behaviours in their data. Szabo et al. [9] studied
content popularity on Digg and YouTube. They demonstrated
that early access patterns of users can be used to forecast
the popularity of content and showed that different platforms
reveal different attention patterns. For example, while Digg
stories saturate fairly quickly (about a day) to their respective
reference popularities, YouTube videos keep attracting views
throughout their lifetimes. In [10] the authors present a mutual
dependency model to study the virality of hashtags in Twitter.
Although it is well known that sub-communities of users
can be identified on most social media applications, previous
research did not explore differences in the attention patterns
of such sub-communities. To the best of our knowledge,
our work is the first to focus on exploring idiosyncrasies
of communities’ attention patterns by comparing the reply
behaviour of different community forums. We also provide an
extended set of features to assess the effects that community
and focus features have on reply behaviour, something which
has not been explored previously.
III. DATASET: BOARDS.IE
In this work, we analysed data from an Irish community
message board, Boards.ie, which consists of 725 community
forums ranging from communities around specific computer
games or spiritual groups to communities around general
topics such as films or music. Since our goal is to uncover the
idiosyncrasies that individual community forums exhibit and
the deltas between them, we selected 20 forums at random.
Forum 374 - Weather: Community of users who have
special interest in weather. This forum allows users to
talk about the current, future and past weather all over
the world and share information - e.g. weather pictures.
Forum 10 - Work & Jobs: The community around this
forum consists of users who are looking for jobs, offering
jobs and/or seeking advice on work-related matters.
This means that the community has, on the one hand,
a support and advice offering purpose and, on the other
hand, is a marketplace for users who are in similar
situations.
Forum 221 - Spanish: Community of practice where
users share a common long-term goal - namely to learn,
improve or practice their Spanish.
Forum 343 - Golf: Community of users who are interested
in the sport of golf. In this forum users can discuss
anything related to golf.
Forum 646 - adverts.ie Support: A support oriented forum
for adverts.ie, which is a community based marketplace
where individuals can buy or sell items online.
Forum 235 - Rip Off Ireland: Support-oriented forum
which aims to help consumers in Ireland avoid being
ripped off with the current spate of Euro price hikes.
Forum 865 - Home Entertainment (HE) Video Players
& Recorders: Community of users formed around a
specific group of products namely HE Video Players and
Recorders. In this forum users seek advice and discuss
issues related to these products.
Forum 544 - Banking & Insurance & Pensions: Support
and advice oriented community of users who seek or
provide advice about banking, insurance and pensions.
Forum 876 - Construction & Planning: Forum where
users can discuss topics related to construction and plan-
ning.
Forum 267 - Astronomy & Space: Information and
content-sharing community of users who are interested
in astronomy and space.
Forum 669 - Google Earth : Forum where users talk about
Google Earth.
Forum 55 - Satellite: Information and content-sharing
community where users who are interested in satellite
television can discuss this topic.
Forum 858 - Economics: Community of users who have
a special interest or expertise in economics.
Forum 44 - CTYI: Community of users around the Centre
for the Talented Youth of Ireland (CTYI) which is a
youth programme for students between the ages of six
and sixteen of high academic ability in Ireland.
Forum 538 - Japanese RPG: Community of users playing
Japanese role-playing games.
Forum 227 - Television: Discussion about television re-
lated topics such as TV series.
Forum 607 - Music Production: Community of music
producers and/or people interested in music and music
production in general.
Forum 630 - Real-World Tournaments & Events: Fo-
rum where users talk about events and tournaments -
i.e. competitions involving a relatively large number of
competitors, all participating in a sport, game or event.
Forum 190 - North West: Forum around the North West
of Ireland, where users who live in the North West or plan
to visit the North West can discuss related questions.
Forum 625 - Greystones & Charlesland: Forum where
users talk about everything related to Charlesland and
Greystones which are both located about 25 kilometres
from Dublin city centre.
For our analysis we use all data published in one of these
20 forums in the year 2006. We use this year to enable
comparisons of attention patterns with our previous work [4]
over the same time period. Table I describes the properties of
the dataset.
TABLE I
DESCRIPTION OF THE BOARDS.IE DATASET.
Forum                            ID   Users  Posts  Threadstarter  Seeds
Work & Jobs                      10   2371   13964  1741           1435
Music Production                 607  308    2018   295            265
Golf                             343  394    3361   415            364
Astronomy & Space                267  247    782    141            97
Weather                          374  439    7598   233            209
HE Video Players & Recorders     865  134    294    61             52
Banking & Insurance & Pensions   544  956    3514   531            459
Google Earth                     669  117    584    37             32
Satellite                        55   1516   14704  1714           1620
Economics                        858  73     260    28             26
Espanol (Spanish)                221  21     86     31             21
Rip Off Ireland                  235  28     329    34             28
Construction & Planning          876  34     202    35
CTYI                             44   39     1505   42             39
Japanese RPG                     538  71     1157   75             71
adverts.ie Support               646  304    1227   216            172
Television                       227  2086   17442  1238           1139
North West                       190  376    4866   291            271
Greystones & Charlesland         625  396    4930   418            382
Real-World Tournaments & Events  630  640    18551  1475           1172
IV. FEATURE ENGINEERING
Understanding what factors drive reply behaviour in online
communities involves defining a collection of features and then
assessing which are important and which are not. With our
two-stage approach we can identify the features that impact upon
seeding a discussion - through our seed post identification
experiments - and how features are associated with seed posts
that generate the most attention.
For each thread starter post we computed the features
by taking a 6-month window, based on work by [4], [11],
prior to when the post was made. That means, we used all
the author’s past posts within that window to construct the
necessary features - i.e. constructing a social network for the
user features, assessing the forums in which the posts were
made for the focus features and inferring topic distributions
per user based on the content of posts he/she authored within the
previous 6 months. For the features that relied on topic models,
we first fit a Latent Dirichlet Allocation [12] model which we
use later for inferring users’ topic distributions. For training
the LDA model we aggregated all posts authored by one user
in 2005 into an artificial user document and chose the default
hyperparameters (α = 50/T, β = 0.01 and T = 50) which
we optimised during training by using Wallach’s fixed point
iteration method [13]. Based on the empirical findings of [14],
we decided to place an asymmetric Dirichlet prior over the
topic distributions and a symmetric prior over the distribution
of words. We used the trained model to infer the average topic
distributions (averaged over 10 independent runs of a Markov
chain) of a user at a certain point in time by using all posts
he/she authored within the last 6 months.
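The windowing step described above can be illustrated with a short sketch. It rests on assumptions: the 6-month window is approximated as 182 days, the `(timestamp, text)` tuple layout is hypothetical, and the thread starter itself is excluded from its own history.

```python
from datetime import datetime, timedelta

# Assumption: "6 months" is approximated as 182 days.
WINDOW = timedelta(days=182)

def posts_in_window(author_posts, post_time, window=WINDOW):
    """Select the author's posts made within `window` before `post_time`.

    `author_posts` is a list of (timestamp, text) tuples (hypothetical layout).
    The post published at `post_time` itself is excluded.
    """
    start = post_time - window
    return [(t, text) for t, text in author_posts if start <= t < post_time]

history = [
    (datetime(2005, 6, 1), "old post"),
    (datetime(2005, 12, 24), "recent post"),
    (datetime(2006, 3, 1), "the thread starter itself"),
]
selected = posts_in_window(history, datetime(2006, 3, 1))
```

Only the posts returned by such a filter would feed the social network, forum distribution and topic inference for that thread starter.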
We define five feature sets: user features, focus features,
content features, community features and title features, as
follows.
A. User Features
User features describe the author of a post via his/her past
behaviour, seeking to identify key behavioural attributes that
are associated with seed and non-seed posts. For example, a
post may only start a lengthy discussion if published by a
rather active user.
User Account Age: Measures the length of time (mea-
sured in days) that the user has been a member of the
community;
Post Count: Measures the number of posts that the user
has made.
Post Rate: Measures the number of posts made by the
user per day.
In-degree: For the author of each post, this feature
measures the number of incoming communication con-
nections to the user.
Out-degree: This feature measures the number of outgo-
ing communication connections from the user.
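A minimal sketch of the user features follows. Several details are assumptions rather than definitions from the text: the post rate is taken as the post count divided by the account age in days, a "communication connection" is taken as a distinct user linked by at least one reply, and the `(replier, replied_to)` pair layout is hypothetical.

```python
from datetime import date

def user_features(user, registered, today, post_dates, replies):
    """Compute the five user features for `user`.

    `replies` is a list of (replier, replied_to) pairs (hypothetical layout).
    Degrees count distinct users, which is an assumption of this sketch.
    """
    account_age = (today - registered).days
    post_count = len(post_dates)
    # Assumption: rate = posts per day of membership.
    post_rate = post_count / account_age if account_age > 0 else 0.0
    in_degree = len({src for src, dst in replies if dst == user and src != user})
    out_degree = len({dst for src, dst in replies if src == user and dst != user})
    return {
        "account_age": account_age,
        "post_count": post_count,
        "post_rate": post_rate,
        "in_degree": in_degree,
        "out_degree": out_degree,
    }

replies = [("bob", "alice"), ("carol", "alice"), ("bob", "alice"), ("alice", "bob")]
features = user_features(
    "alice", date(2005, 1, 1), date(2005, 1, 11), [date(2005, 1, 2)] * 5, replies
)
```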
B. Focus Features
Focus features measure the topical concentration of an author.
Our intuition is that by gauging the topical focus of a user we
will be able to capture his/her areas of interest or expertise.
For the first two features, we use the frequency distribution of
forums a user has published posts in to approximate his/her
interests or expertise, while for the last three features we learn
topics from a collection of posts and annotate users with topics
by using LDA.
Forum Entropy: Measures the forum focus of a user via
the entropy of a user’s forum distribution. Low forum
entropy would indicate high focus.
Forum Likelihood: Measures the likelihood that the user
will publish a post within a forum given the past forum
distribution of the user.
Topic Entropy: Measures the topical focus of a user via
the entropy of a user’s topic distributions inferred via the
posts he/she authored. Low topic entropy would indicate
high focus.
Topic Likelihood: Measures the likelihood that the user
will publish a post about certain topics given the past
topic distribution of the user’s posts. Therefore, we mea-
sure how well the user’s language model can explain a
given post by using the likelihood measures:
$$\mathrm{likelihood}(p) = \sum_{i=0}^{N_p} \ln P(w_i \mid \hat{\phi}, \hat{\theta}) \tag{1}$$
$N_p$ refers to the total number of words in the post, $\hat{\phi}$
refers to the word-topic matrix and $\hat{\theta}$ refers to the average
topic distribution of a user’s past posts. The higher the
likelihood for a given post, the better the post fits the
topics the user has previously written about.
Topic Distance: Measures the distance between the topics
of a post and the topics the user wrote about in the past.
We use the Jensen-Shannon (JS) divergence to measure
the distance between the user’s past topic distribution and
the post’s topic distribution. The JS divergence is defined
as follows:
$$D_{JS} = \frac{1}{2} D_{KL}(P \| A) + \frac{1}{2} D_{KL}(A \| P) \tag{2}$$
where $D_{KL}(P \| A)$ represents the Kullback-Leibler divergence
between the distributions P and A. The KL
divergence is calculated as follows:
$$D_{KL}(P \| A) = \sum_{i} P(i) \log \frac{P(i)}{A(i)} \tag{3}$$
The lower the JS divergence, the better the post fits the
topics the user has previously written about.
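The focus features can be sketched in a few lines. The code follows equations 1 to 3 as printed, so `topic_distance` averages the two directed KL divergences; the textbook Jensen-Shannon divergence instead compares each distribution with their midpoint. Further assumptions: natural logarithms throughout, $P(w \mid \hat{\phi}, \hat{\theta})$ computed as the topic mixture $\sum_t \hat{\theta}_t \hat{\phi}_{t,w}$, and a hypothetical layout in which `phi[t]` maps words to probabilities.

```python
import math
from collections import Counter

def entropy(dist):
    """Shannon entropy (in nats) of a list of probabilities."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def forum_distribution(forum_ids):
    """Relative frequency of each forum among a user's past posts."""
    counts = Counter(forum_ids)
    total = sum(counts.values())
    return {forum: c / total for forum, c in counts.items()}

def kl_divergence(p, a):
    """Equation 3; assumes a[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / ai) for pi, ai in zip(p, a) if pi > 0)

def topic_distance(p, a):
    """Equation 2 as printed: mean of the two directed KL divergences."""
    return 0.5 * kl_divergence(p, a) + 0.5 * kl_divergence(a, p)

def log_likelihood(words, phi, theta):
    """Equation 1, assuming P(w) = sum_t theta[t] * phi[t][w].

    Assumes every word has non-zero probability under the mixture,
    otherwise math.log raises ValueError.
    """
    total = 0.0
    for w in words:
        prob = sum(theta[t] * phi[t].get(w, 0.0) for t in range(len(theta)))
        total += math.log(prob)
    return total

forum_dist = forum_distribution(["golf", "golf", "weather", "golf"])
forum_entropy = entropy(list(forum_dist.values()))
phi = [{"golf": 0.5, "club": 0.5}, {"rain": 1.0}]  # toy word-topic matrix
score = log_likelihood(["golf", "club"], phi, [1.0, 0.0])
```

A user who posts in a single forum gets entropy 0, the strongest possible focus under this measure.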
C. Post Features
Post features describe the post itself and identify attributes
that the content of a post should contain in order to start
a discussion. For example, a post may only start a lengthy
discussion if its content is informative or if it was published
at a certain time in the day.
Post Length: Number of words in the post.
Complexity: Measures the cumulative entropy of terms
within the post, using the word-frequency distribution, to
gauge the concentration of language and its dispersion
across different terms.
Readability: Gunning fog index using average sentence
length (ASL) [15] and the percentage of complex words
(PCW): 0.4 (ASL + PCW). This feature gauges how
hard the post is to parse by humans.
Referral Count: Count of the number of hyperlinks within
the post.
Time in day: The number of minutes through the day from
midnight that the post was made. This feature is used to
identify key points within the day that are associated with
seed or non-seed posts.
Informativeness: The novelty of the post’s terms with
respect to other posts. We derive this measure using the
Term Frequency-Inverse Document Frequency (TF-IDF)
measure.
Polarity: Assesses the average polarity of the post using
SentiWordNet (http://sentiwordnet.isti.cnr.it/). Let $n$ denote the
number of unique terms in post $p$; the function $pos(t)$ returns
the positive weight of the term $t$ from the lexicon and $neg(t)$
returns the negative weight of the term. We therefore define the
polarity of $p$ as:
$$\frac{1}{n} \sum_{i=1}^{n} \left( pos(t_i) - neg(t_i) \right) \tag{4}$$
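Three of the post features can be illustrated as follows. This is a sketch under stated assumptions: complex words are approximated as words with three or more vowel groups (a crude syllable heuristic), sentences are split on `.`, `!` and `?`, and the tiny lexicon holds made-up `(positive, negative)` weights rather than real SentiWordNet values.

```python
import re
from datetime import datetime

def count_syllables(word):
    """Crude heuristic (assumption): one syllable per vowel group."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """0.4 * (ASL + PCW), with PCW as a percentage of all words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    asl = len(words) / len(sentences)
    complex_words = [w for w in words if count_syllables(w) >= 3]
    pcw = 100.0 * len(complex_words) / len(words)
    return 0.4 * (asl + pcw)

def polarity(terms, lexicon):
    """Equation 4: mean of pos(t) - neg(t) over the unique terms."""
    unique = set(terms)
    if not unique:
        return 0.0
    total = 0.0
    for t in unique:
        pos, neg = lexicon.get(t, (0.0, 0.0))  # unknown terms are neutral
        total += pos - neg
    return total / len(unique)

def time_in_day(timestamp):
    """Minutes elapsed since midnight."""
    return timestamp.hour * 60 + timestamp.minute

toy_lexicon = {"good": (0.75, 0.0), "bad": (0.0, 0.625)}  # made-up weights
fog = gunning_fog("The cat sat. It was happy.")
pol = polarity(["good", "bad", "film", "good"], toy_lexicon)
minutes = time_in_day(datetime(2006, 5, 1, 13, 45))
```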
D. Community Features
Community features describe relations between a post or
its author and the community with which the post is shared.
For example, members of a community might be more likely
to reply to a post which fits their areas of interest or they
might be likely to reply to someone who contributed a lot to
discussions in the past.
Topical Community Fit: Measures how well a post fits
the topical interests of a community by estimating how
well the post fits into the forum. We measure how well
the community’s language model can explain the post by
using the likelihood measure which is defined in equation
1, where $\hat{\theta}$ refers to the average topic distribution of posts
that were previously published in that forum. The higher
the likelihood of the post, the better the post fits the
topics of this community forum.
Topical Community Distance: Measures the distance be-
tween the topics of a post and the topics the community
discussed in the past. We use the Jensen-Shannon (JS) di-
vergence to measure the distance between a community’s
past topic distribution and a post’s topic distribution. The
JS divergence is defined in equation 2. The lower the JS
divergence, the better the post fits the topical interests
of the community.
Evolution score: Measures how many users of a given
community have replied to a user in the past, differing
from in-degree by being conditioned on the forum. Theories
of evolution [16] suggest a positive tendency for
user A replying to user B if A previously replied to B.
Therefore, we define the evolution score of a given user
$u_j$ as follows:
$$\mathrm{evolution}(u_j) = \sum_{i}^{U} \frac{U(u_{j,i}) + 1}{U} \tag{5}$$
where $U$ refers to the total number of users in a given
forum and $U(u_{j,i})$ refers to the number of users who
replied to user $u_j$ in the past.
Inequity score: Measures how many users of a given
community a user has replied to in the past, differing from
out-degree by being conditioned on the forum. Equity
Theory [17] suggests a positive tendency for user A
replying to user B if B previously replied more often to
A than A to B. Therefore, we define the inequity score
of a user $u_j$ as follows:
$$\mathrm{inequity}(u_j) = \sum_{i}^{U} \frac{|P(u_{i,j})_{reply}|}{|P(u_{j,i})_{reply} + 1|} \tag{6}$$
where $U$ refers to the total number of users in a given
forum, $P(u_{i,j})_{reply}$ refers to the probability that user
$u_i$ replies to user $u_j$ and $P(u_{j,i})_{reply}$ refers to the
probability that user $u_j$ replies to user $u_i$.
E. Title Features
Title features describe the title of a post itself and identify
attributes that the title should contain in order to start a
discussion. We decided to separate title features from post
features in order to be able to capture potential effects of
the user interface since the current Boards.ie user interface
encourages users to decide which post to read based on
the title. Therefore, our intuition is that in some community
forums, title features may have a greater influence on the
start of discussions as well as on the development of lengthy
discussions.
Title Length: Number of words in the title of the post.
Title Question-mark: Measures the absence or presence
of a question-mark in the title.
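Both title features are simple to compute. A minimal sketch, assuming whitespace tokenisation for the word count and a 0/1 encoding for the question-mark indicator:

```python
def title_features(title):
    """Return the title length in words and a 0/1 question-mark flag."""
    return {
        "title_length": len(title.split()),        # whitespace tokenisation
        "title_question_mark": int("?" in title),  # 1 if present, else 0
    }

example = title_features("Which golf clubs should I buy?")
```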
V. EXPERIMENTS
Understanding what drives attention in different forums and
their implicit communities enables us to reveal key differences
between those forums. To detect such deltas we apply our two-
stage prediction approach to (i) detect seed posts within each
forum and (ii) predict the level of activity that such seed posts
will generate. We begin by explaining our experimental setup
before going on to discuss our findings and observe how
the communities differ from one another in their discussion
dynamics.
A. Experimental Setup
For our experiments we took all the thread starter posts -
i.e. both seeds and non-seeds - published in each
of the 20 forums throughout the year 2006. For each thread
starter we constructed the features as described in the previous
section. We performed two experiments using our generated
datasets, each intended to explore the research questions: (i)
Which factors may impact the attention level a post gets in
certain community forums? and (ii) How do these factors differ
between individual community forums?
1) Seed Post Identification: The first experiment sought to
identify the factors that help differentiate between posts that
initiate discussions and posts that do not get any attention
in different communities. To this end, we performed seed
post identification through a binary classification task using
a logistic regression model. For each forum, we divided the
forum’s dataset into a training/testing split using an 80/20%
split, trained the logistic regression model using the former
split and applied it to the latter. We tested each of the five
feature sets in isolation - i.e. user, focus, post, community
and title - such that the model was trained using only those
features, and then tested all the features combined together.
To assess how well each model performed, we measured the
F1 score, which is the harmonic mean of precision and recall,
and the Matthews correlation coefficient (MCC), which is a
balanced measure of the quality of binary classification and
can be used even if the classes are of very different sizes.
The MCC measure returns a value between −1 and +1: a
coefficient of +1 represents a perfect prediction, 0 is no better
than random prediction and −1 indicates total disagreement
between prediction and observation. The F1 score is frequently
used by the Information Retrieval community, while the MCC
is widely used by the Machine Learning community and in
statistics where it is known as phi (φ) coefficient.
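Both evaluation measures can be computed directly from the confusion matrix of a binary classifier. The sketch below uses the standard definitions; the toy labels and the always-seed predictor are illustrative assumptions, not the baseline used in the experiments.

```python
import math

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 means seed."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

y_true = [1, 1, 1, 1, 0, 0]
y_model = [1, 1, 1, 0, 1, 0]
y_always_seed = [1] * len(y_true)  # hypothetical trivial predictor

tp, tn, fp, fn = confusion(y_true, y_model)
model_f1, model_mcc = f1_score(tp, fp, fn), mcc(tp, tn, fp, fn)
tp0, tn0, fp0, fn0 = confusion(y_true, y_always_seed)
trivial_f1, trivial_mcc = f1_score(tp0, fp0, fn0), mcc(tp0, tn0, fp0, fn0)
```

On this toy data the trivial predictor obtains a higher F1 than the model while its MCC is 0, which illustrates why reporting MCC alongside F1 is useful when seed posts heavily outnumber non-seed posts.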
The best performing model was then chosen based on the
F1 score and MCC value and the coefficients of the logistic
regression model were inspected to detect how the features
were associated with seed posts, thereby identifying the factors
which impact reply behaviour. To gain further insights into
which features contribute most to the classification model, we
also ranked the features of the best performing model by using
the Information Gain Ratio (IGR) as a ranking criterion.
2) Activity Level Prediction: For our second experiment,
we sought to identify the factors that were correlated with
lengthy discussions and how they differed between communi-
ties. To do this we performed seed post activity level prediction
through a linear regression model. We maintained the same
splits as in our previous experiment and filtered through the
seed posts in the 20% test split using the best performing
model in each community. We then trained a linear regression
model using the seed posts in the training split and predicted
a ranking for the identified seed posts in the test split based
on expected discussion volume. This allowed us to pick out
the key factors that were associated with generating the most
activity by concentrating our rank assessments on the top
portion of the posts. We trained the linear regression model
using each of the five feature sets in isolation and then used all
the features combined together. We chose the best performing
model based on its rank prediction accuracy and assessed the
statistically significant coefficients of the regression model for
the relation between increased attention and its features.
To evaluate our predicted rank, we used the Normalised
Discounted Cumulative Gain (nDCG) at varying rank positions,
looking at the performance of our predictions over
the top-$k$ documents where $k \in \{1, 5, 10, 20, 50, 100\}$, and
then averaging these values. nDCG is derived by dividing the
Discounted Cumulative Gain (DCG) of the predicted ranking
by the DCG of the actual ranking (the ideal DCG, iDCG). DCG is
well suited to our setting, given that we wish to predict the most popular
posts and then expand that selection to assess growing ranks,
as the measure penalises elements in the ranking that appear
lower down when in fact they should be higher up. Let
$rank_i$ be the actual position in the ranking at which seed post
$i$ should appear and $N$ be the number of items in the total
set of seed posts that are to be predicted. We then define
$rel_i = N - rank_i + 1$ and DCG, based on the definition from
[18], as:

$$\mathrm{DCG}_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(1 + i)} \qquad (7)$$
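As a minimal sketch of this evaluation, the functions below compute DCG as in Equation (7), normalise it by the DCG of the actual ranking, and average over the listed values of k. The function names and the toy ranking are illustrative assumptions, and ties between posts with the same number of replies are not handled.

```python
import math

def dcg_at_k(relevances, k):
    """DCG over the first k relevance scores, following Equation (7)."""
    return sum(rel / math.log2(1 + i) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(predicted_order, actual_rank, k):
    """predicted_order: post ids in predicted order, best first.
    actual_rank: dict mapping post id to its true rank (1 = most replies)."""
    n = len(actual_rank)
    rel = {post: n - rank + 1 for post, rank in actual_rank.items()}
    predicted = [rel[post] for post in predicted_order]
    ideal = sorted(rel.values(), reverse=True)
    return dcg_at_k(predicted, k) / dcg_at_k(ideal, k)

def averaged_ndcg(predicted_order, actual_rank, ks=(1, 5, 10, 20, 50, 100)):
    """Mean of nDCG@k over the rank positions used in the evaluation."""
    return sum(ndcg_at_k(predicted_order, actual_rank, k) for k in ks) / len(ks)

# Toy example (hypothetical): three seed posts with known true ranks
actual_rank = {"a": 1, "b": 2, "c": 3}
perfect = ["a", "b", "c"]
reversed_order = ["c", "b", "a"]
```

A perfect prediction gives an nDCG of 1 at every k, whereas placing the least active post first is penalised most heavily at k = 1.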
B. Results: Seed Post Identification
Comparing the F1 score and MCC values of different
forums in Table II reveals interesting differences between
communities and corroborates our hypothesis that the reply
behaviour of users in different communities is impacted by
different factors. Table II shows the 9 forums for which a
classifier trained with our features outperformed the baseline
classifier. We decided not to analyse the results from the other
11 forums, since our classifier only matched, and did not
outperform, the performance of the baseline. We assume that this
happens because most of these 11 forums are rather inactive,
such as forum 44 or 858 (i.e. only a few messages
were published in 2006 and therefore our classifier did
not have enough examples of seed and/or non-seed posts to learn
general attention patterns). Another potential explanation is
that the discussion behaviour of these communities is partly
random and/or driven by other, external factors which
we could not take into account in our study. For example,
the discussion behaviour of communities around specific
locations or regions (such as communities 190 and 625) might
be impacted by spatial properties of users, while
the discussion behaviour of the community around forum 227
(Television) seems to be mainly driven by external events (e.g.
the start of a new series).
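To see why a classifier can reach a high F1 score while only matching a baseline, consider the following sketch. It assumes, purely for illustration, a classifier that labels every post as a seed, and it uses hypothetical counts rather than data from Boards.ie. The MCC is set to 0 when its denominator is zero, which is a common convention.

```python
import math

def f1_and_mcc(tp, fp, fn, tn):
    """F1 score of the positive (seed) class and Matthews correlation coefficient."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

# Hypothetical forum: 60 seed posts, 40 non-seed posts, everything labelled "seed"
all_seed_f1, all_seed_mcc = f1_and_mcc(tp=60, fp=40, fn=0, tn=0)

# A classifier that separates the two classes perfectly, for comparison
perfect_f1, perfect_mcc = f1_and_mcc(tp=60, fp=0, fn=0, tn=40)
```

With these counts the trivial classifier obtains an F1 of about 0.75 but an MCC of 0, which is why the MCC is the more informative of the two values when judging whether a feature set adds anything over a baseline.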
Our results from the seed post identification experiment
show that for most of the 9 forums a classifier trained with a
combination of all features achieves the highest performance
boost. Only for the community around forum 267 (Astronomy
and Space) does a classifier trained with content features alone
perform best. This suggests that this community
is mainly content driven, since its main purpose is
to share information and content. Another exception is the
community of practice around forum 221 (Spanish), for which
a classifier trained with title features alone and a classifier
trained with user features alone each outperform a classifier trained
with all feature groups. This indicates that the features of those
two groups best capture the characteristics of seed and non-seed
posts in this community.
To gain further insights into the factors that impact atten-
tion in different communities we inspected the statistically
significant coefficients of the best performing feature group
learned by the logistic regression model. The coefficients can
be interpreted as the log-odds for the features. Therefore, a
positive coefficient denotes a higher probability of getting
replies for posts having this feature. In addition to interpreting
the statistically significant coefficients we also ranked the
features of the best performing feature group by using the
Information Gain Ratio (IGR) as a ranking criterion. The
higher the information gain of a feature the higher the average
purity of the subsets that it produces. A feature with a
maximum information gain ratio of 1 would enable perfect
separation between seed and non seed posts. Due to space
constraints we only discuss features with an IGR >= 0.1.
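The log-odds reading of the coefficients can be made concrete with a short sketch. It assumes the standard logistic regression form, in which a one-unit increase in a feature multiplies the odds of a reply by the exponential of its coefficient. The intercept and feature values used below are hypothetical.

```python
import math

def odds_ratio(coef):
    """Multiplicative change in the odds of a reply per one-unit feature increase."""
    return math.exp(coef)

def seed_probability(intercept, coefs, features):
    """Probability of a post being a seed post under a logistic regression model."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical single-feature model with intercept 0 and a positive coefficient
positive_coef = 0.454
p_without = seed_probability(0.0, [positive_coef], [0.0])
p_with = seed_probability(0.0, [positive_coef], [1.0])
```

A coefficient of 0.454 corresponds to an odds ratio of roughly 1.57, so under this toy model the reply probability rises from 0.5 to about 0.61 when the feature goes from 0 to 1. A negative coefficient yields an odds ratio below 1 and therefore lowers the probability.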
Our results suggest that in the community around forum 10
(Work & Jobs) which has a support and marketplace function,
longer posts (content length’s coef = 0.063 and p < 0.001)
which do not really contain new information (informativeness
coef = -0.028 and p < 0.001) and/or links (coef = -0.592
and p < 0.01) are far more likely to get replies. Further, posts
which contain question marks (coef = 0.454 and p < 0.01) in
their title are more likely to attract the attention of this support-
oriented community. Finally, since the topic of this community
is quite general, posts are not required to be topically similar
to other posts in the forum (community fit’s coef = -221.844
and p < 0.01) in order to attract attention.

TABLE II
F1 SCORE AND MATTHEWS CORRELATION COEFFICIENT (MCC) FOR DIFFERENT FORUMS WHEN PERFORMING SEED POST IDENTIFICATION. THE BEST
PERFORMING MODEL FOR EACH FORUM IS MARKED IN BOLD.

forumid User            Focus           Content         Community       Title           All
        MCC     F1      MCC     F1      MCC     F1      MCC     F1      MCC     F1      MCC     F1
10      0.0     0.75    0.0     0.75    0.071   0.76    0.0     0.75    0.0     0.75    0.1     0.766
607     0.332   0.839   0.0     0.802   0.0     0.802   0.0     0.802   0.0     0.802   0.359   0.857
343     0.0     0.769   0.0     0.769   0.093   0.782   0.0     0.769   0.0     0.769   0.148   0.789
267     0.078   0.609   -0.132  0.531   0.242   0.673   0.078   0.609   0.0     0.549   0.181   0.643
865     0.0     0.533   0.0     0.533   0.0     0.533   0.0     0.533   0.0     0.533   0.632   0.815
544     0.0     0.818   0.0     0.818   -0.052  0.809   0.0     0.818   0.0     0.818   0.109   0.828
55      0.0     0.913   0.0     0.913   0.0     0.913   0.0     0.913   0.0     0.913   0.144   0.918
221     0.447   0.625   -0.447  0.25    0.0     0.486   0.0     0.333   0.707   0.829   0.0     0.333
630     0.0     0.678   0.0     0.678   -0.044  0.675   0.0     0.678   0.0     0.678   0.109   0.686

Another support- and advice-oriented community is the community
around forum 343 (Golf). The topic of this community
is more specific than the topic of the previous community.
In this community the content of a post needs to be rather
complex (coef = 2.261 and p < 0.01) and should also not
contain links (coef = -0.586 and p < 0.05) in order to attract
attention. Further, posts which are topically distinct from what
the Golf community usually talks about (community distance
coef = -4.528 and p < 0.05) are less likely to get replies. This
indicates that within the community specialist terminology is
used and the divergence away from such vocabularies reduces
the likelihood of generating attention to a new post.
The community around forum 865 (HE Video Players &
Recorders) has an advice seeking and experience sharing
purpose but only for one specific group of products. For this
community forum none of the features’ coefficients is significant.
However, a classification model trained with all features
outperformed a random baseline classification model with an
MCC value of 0.632. By looking at the feature list ranked
by the IGR, we note that only one feature contributed to this
performance boost, namely the inequity score (IGR = 0.7).
The coefficient of the inequity score in the regression model
is negative (coef = -5.025), which indicates that a post is less
likely to get replies if it is authored by a user who replied
to many posts in this forum in the past but has not received many
replies themselves in this forum. One possible explanation is that
in support-oriented communities users who reply to many posts
are more likely to be experts. It is not surprising that posts of
such expert users are less likely to get replies, since fewer users
have enough expertise to answer or comment on the post of
an expert.
The main purpose of the community around forum 544
(Banking & Insurance & Pensions) is also for seeking advice
and sharing experiences and information. In this community
shorter posts (content length coef = -0.017 and p < 0.05)
authored by users who are new to the topic - or have not
published anything about the topic before (topic distance coef =
2.890 and p < 0.01) - are more likely to get replies. When
inspecting the IGR based feature ranking of the content group,
we find that only the complexity of content is a useful feature
for informing a classifier which has to differentiate between
seed and non-seed posts (IGR = 0.354). This indicates that short,
but complex posts which have been authored by newbies are
most likely to catch the attention of this community.
The main purpose of the community around forum 267
(Astronomy & Space) is to share information and content
and to engage in discussions. Long posts (coef = 0.083
and p < 0.05) which do not contain many novel terms
(informativeness coef = -0.029 and p < 0.05) but are positive
in their sentiment (polarity’s coef = 4.556 and p < 0.05)
are very likely to attract the attention of this community. The
content feature with the highest IGR is the number of links
per post (IGR = 0.1). Since the coefficient of the number of
links is positive in our regression model we can conclude that
a higher number of links indicates that the post is more likely
to get replies (coef = 0.157) in this forum. This suggests that
in this forum posts which are long, informative and re-use
the vocabulary of the community are more likely to attract
attention.
The main purpose of the topical community around forum 55 (Satellite)
is also to share information and content and to
engage in discussions. In this community posts authored by
users who have a high forum likelihood are less likely to get
replies (coef = -5.891 and p < 0.01). This suggests that
users who stimulate discussions in this community have to
focus their activity away from this forum. Further, posts which
are topically distant from the topics the community usually
talks about are again less likely to get replies (coef = -2.944
and p < 0.01). This pattern indicates that users who focus
their activity away from this community and then post a new
thread that is about topics which seem to be in the topical
interest area of the community are more likely to get replies.
The community around forum 221 (Spanish) is a community
of practice which means that the community members have
a common interest in a particular domain or area, and learn
from each other. This community is mainly impacted by user
and title factors; however, none of the features’ coefficients is
significant. Ranking the features by their IGR shows that
the most important feature for discriminating between posts
getting replies and posts not getting replies is the title length
(IGR = 0.558). Interestingly in this forum, posts with short
titles are more likely to get replies. The longer the title the
less likely a post is to get replies (title length’s coef = -0.326).
The second most important feature is the user account age
(IGR = 0.381). Users who have owned an account for
longer are more likely to get replies in this forum than users
who recently created their account. This suggests that in
communities where the members share a common long-term
goal and/or have a shared interest which is rather stable over
time, the duration of users’ community membership is a good
feature to predict if a post will become a seed post or not.
The community around forum 630 is a rather open and
diverse community of users who are interested in all kinds
of events and/or want to promote events. For forum 630
(Real-World Tournaments & Events) a classifier trained with
all features performed best. The only significant feature for
this forum is the community distance (coef = -1.185 and
p < 0.05). This indicates that posts which do not fit the topical
interests of this community are less likely to get replies.
C. Results: Activity Level Prediction
To explore which factors may affect the number of replies
a post gets, we first identified the feature groups which lead
to the best model for each community forum (see Table III)
and then analysed the statistically significant coefficients of
the best performing model from each community.
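The selection of the best feature group per forum can be reproduced directly from the averaged nDCG values reported in Table III. The sketch below simply takes the highest value in each row. The variable and function names are illustrative.

```python
FEATURE_SETS = ("User", "Focus", "Content", "Community", "Title", "All")

# Averaged nDCG@k values per forum, copied from Table III
NDCG_BY_FORUM = {
    10:  (0.599, 0.561, 0.452, 0.516, 0.418, 0.616),
    221: (0.887, 0.954, 0.863, 0.954, 0.88, 0.985),
    267: (0.63, 0.703, 0.773, 0.6, 0.75, 0.685),
    343: (0.558, 0.727, 0.612, 0.634, 0.572, 0.636),
    544: (0.5, 0.514, 0.607, 0.684, 0.461, 0.574),
    55:  (0.574, 0.42, 0.655, 0.671, 0.73, 0.692),
    607: (0.77, 0.632, 0.814, 0.48, 0.686, 0.842),
    630: (0.707, 0.459, 0.635, 0.547, 0.485, 0.762),
    865: (0.673, 0.612, 0.85, 0.643, 0.771, 0.796),
}

def best_feature_set(scores):
    """Name of the feature set with the highest averaged nDCG for one forum."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return FEATURE_SETS[best_index]

best = {forum: best_feature_set(scores) for forum, scores in NDCG_BY_FORUM.items()}
```

For instance, this yields the focus features for forum 343, the content features for forums 267 and 865, the community features for forum 544, the title features for forum 55, and the combination of all features for forums 10, 221, 607 and 630.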
Interestingly, our results suggest that the factors that impact
whether a discussion starts around a post tend to differ from
the factors that impact the length of a discussion. For example,
for the support- and advice-oriented community around forum
343 (Golf) content and community features contribute most
to the identification of seed posts, but focus features are
most important for predicting the activity level of discussions
around seed posts. This indicates that it is important that a
post’s content has certain characteristics (e.g. contains only
few links) and fits the topical interests of the community
in order to start a discussion, but afterwards it is important
that the author of a post has certain topical and/or forum
focus in order to stimulate a lengthy discussion in this forum.
In forum 865 (HE Video Players & Recorders) the seed
post identification works best when using features from all
feature groups, but for predicting the activity level a post
will produce a linear regression model trained with content
features works best. This indicates that posts which manage to
stimulate lengthy discussions in this forum share some content
characteristics. Also for the community around forum 544
(Banking & Insurance & Pensions) which also has an advice
seeking purpose a model using all feature groups performs best
in the seed post identification task. However, for predicting
the length of discussions which a seed post will generate a
model trained with community features only ranked the posts
most accurately according to their discussion length. This
suggests that in this forum it makes a difference who authored
a post and how this person relates to the community when
predicting the discussion length around a post. For the topical
community around forum 55 (Satellite) the main purpose
is to share information and content and to discuss satellite
television. Also in this community a model trained with all
feature groups performs best in the seed post identification
task. However for predicting the discussion length of seed
posts a regression model trained with title features only works
best. This indicates that in this community title features impact
if a post will stimulate a long discussion. Our results show that
seed posts with longer titles (coef = 0.03003 and p < 0.05) are
more likely to stimulate lengthy discussions.
For certain communities, such as the community around
forum 267 (Astronomy & Space) whose main purpose is to
share information and content, the same group of features,
namely content features, works best for identifying posts
around which a discussion will start and for predicting the
length of a discussion. This indicates that in this community
users’ discussion behaviour is mainly impacted by characteris-
tics of posts’ content and therefore content features alone are
sufficient to predict users’ reply behaviour. Other factors play
a minor role in this community.
For the community around forum 630 (Real-World Tournaments
& Events) and the community around forum 10 (Work
& Jobs) a model using all features performs best
in both tasks, the seed post identification and the activity
level prediction tasks. For the community around forum 10
(Work & Jobs) our results show that posts authored by users
who replied to many other users in the past (coef of users’
out-degree is 0.005 and p < 0.01) and have longer titles
(coef = 0.034 and p < 0.01) are more likely to stimulate
lengthy discussions than other posts. One potential explanation
is that posts with longer titles are more likely to attract the
attention of this community and that users in this community
are more likely to be involved in lengthy discussions with users
who have replied to them before. For the community around
forum 630 our results suggest that posts authored by users
with a high inequity score are more likely to lead to lengthy
discussions (coef = 0.0015 and p < 0.05). This suggests that
in this community rather active users who frequently reply to
other community members’ posts but do not get many replies
themselves are most likely to stimulate lengthy discussions. It
seems that users in this community are more likely to reply to
posts of other users who replied to their own posts in the past.
Also in this community posts with longer titles are slightly
more likely to stimulate lengthy discussions (coef = 0.04145
and p < 0.001). One potential explanation is that
posts with longer titles tend to catch the attention of more
users who then read the post and reply to it. However,
one needs to note that although the effect is statistically
significant the effect size is very small which indicates that
the dependent variable (discussion length) is expected to only
increase slightly when that independent variable (title length)
increases by one, holding all the other independent variables
constant.
Finally, in the community of practice around forum 221
(Spanish) no lengthy discussions happened within the selected
time period and therefore we could not analyse factors that
impact lengthy discussions.

TABLE III
AVERAGED NORMALISED DISCOUNTED CUMULATIVE GAIN nDCG@k
VALUES USING A LINEAR REGRESSION MODEL WITH DIFFERENT FEATURE
SETS. AN nDCG@k OF 1 INDICATES THAT THE PREDICTED RANKING OF
POSTS PERFECTLY MATCHES THEIR REAL RANKING. POSTS ARE RANKED
BY THE NUMBER OF REPLIES THEY GOT.

Forum   User      Focus     Content   Community Title     All
10      0.599     0.561     0.452     0.516     0.418     0.616
221     0.887     0.954     0.863     0.954     0.88      0.985
267     0.63      0.703     0.773     0.6       0.75      0.685
343     0.558     0.727     0.612     0.634     0.572     0.636
544     0.5       0.514     0.607     0.684     0.461     0.574
55      0.574     0.42      0.655     0.671     0.73      0.692
607     0.77      0.632     0.814     0.48      0.686     0.842
630     0.707     0.459     0.635     0.547     0.485     0.762
865     0.673     0.612     0.85      0.643     0.771     0.796

VI. DISCUSSION OF RESULTS
Our findings from the seed post identification experiment
demonstrate that different community forums exhibit interesting
differences in terms of how attention is generated and that
the same features which have a positive impact on the start
of discussions in one community can have a negative impact
in another community. For example, our results from the seed
post identification experiment suggest that a high number of
links in a post has a negative impact on the post getting replies
especially in communities having a supportive purpose (such
as community 343 and 10). However, in the community around
forum 267, which mainly has an information and content
sharing purpose, the contrary is the case. Posts which tend
to have many links are more likely to get replies in this
community forum. This example nicely shows that the purpose
of a community may influence how individual factors impact
the start of discussions in a community forum.
It is also interesting to note that for support oriented
forums (such as forums 865 and 544) users who seem to
be rather new to a topic (i.e. have not published posts before
which are topically similar to the content produced by this
community) are more likely to get replies. Further, we notice
that the importance of whether a post fits the topical focus
of a community or not is largely dependent on the subject
specificity of the community. In other words communities
around very specific topics (such as the community around
the sport Golf) require posts to match the topical focus of the
community in order to attract attention, while communities
around more general topics (such as the community around
topic Work and Jobs) do not have this requirement.
In our previous work [4] we learnt a general pattern for
generating attention on Boards.ie by performing seed post
identification using all data from 2006, not just a selection
of forums. The best performing model contained all features
(user, content and focus), and indicated that the inclusion of
hyperlinks was correlated with non-seed posts, while seed
posts were those that had a high forum likelihood - i.e. the user
had posted in the forum before and was therefore familiar with
the forum. The results from our current work have identified
the key differences between this general attention pattern and
the patterns that each community exhibits. For instance for
the 9 analysed forums, 7 perform best when using all features
- similar to our previous work - while for the 2 remaining
forums, one forum performs best when using content features
and another when using title features. Additionally we find
differences in the patterns: for forum 55 we find that the lower
the forum likelihood the greater the likelihood that the user
will generate attention, this being the converse of the general
pattern learnt previously [4]. For forums 10 and 343 we find
that an increased number of hyperlinks reduces the likelihood
of the post generating attention, agreeing with the general
attention pattern, while for forum 267 a greater number of
hyperlinks increases the likelihood of generating attention.
Our results from the activity level prediction experiment
show that the factors that impact whether a discussion starts
around a post tend to differ from the factors that impact
the length of this discussion. For example, in the community
around forum 10 (Work & Jobs) a post which has question
marks in the title is more likely to get a reply but in order
to stimulate lengthy discussions it is more important that the
title of a post has a certain length rather than that it contains
question marks.
It is also interesting to note that the title length is the
only feature which has a significant positive impact across
several communities on the number of replies a post gets. This
suggests that in some communities posts with longer titles are
more likely to stimulate lengthy discussions. We assume that
this happens because long titles may on the one hand attract
more users to read the posts and on the other hand long titles
may be correlated with high quality or substantivity of posts’
content. It is also likely to be an effect caused by the platform’s
interface, as users are presented with a list of threads in a given
community each of which is listed by its title. The first piece
of information, along with the username of the author, that
community members see is the title of the post.
We also found a shared attention pattern between the Golf
and Real-World Tournaments and Events communities, since
in these communities posts which are topically distant from
what these communities usually talk about are less likely
to stimulate lengthy discussions. Therefore we can conclude
that although most attention patterns which we identified in
our work are local and community-specific, cross-community
patterns also exist and can be identified with our approach.
Comparing these findings to our previous work [4] once
again reveals interesting differences between the general pat-
tern learnt across the entirety of Boards.ie for activity level
prediction and the per-forum patterns that we have found in
this paper. For instance in [4] the general pattern indicated
that lower forum entropy and informativeness together with
increased forum likelihood lead to lengthier discussions, while
for forum 343 we found an increase in forum entropy to be
associated with an increase in activity. For the other features
none were found to be significant.
VII. CONCLUSIONS, LIMITATIONS AND FUTURE WORK
In this paper, we have presented work that identifies at-
tention patterns in community forums and shows how such
patterns differ between communities. Our exploration was
facilitated through a two-stage approach that provided novel
features able to capture the community and focus information
pertaining to the creators of community content.
Our results show that the attention patterns of different
communities are impacted by different factors and therefore
suggest that these patterns may only be valid in a certain
context and that the existence of global, context-free attention
patterns is highly questionable. In our previous work [4] we
focussed on identifying global attention patterns and found
amongst others that posts including links are less likely to stimulate
discussions. In this work we show by analysing attention
patterns of individual communities that this global attention
pattern is only valid for certain forums. The global attention
patterns one learns heavily depend on the mixture and con-
stitution of the sample of communities which one analyses.
Therefore, we can conclude that ignorance isn’t bliss, since
understanding the idiosyncrasies of individual communities
seems to be crucial for predicting which posts will catch the
attention of a community and manage to stimulate lengthy
discussions in a forum.
We found for example that in support-oriented or advice
seeking communities posts which contain many links in their
content are less likely to get replies, while in information and
content sharing oriented communities a high number of links
may even have a positive impact and make posts more likely to
attract the attention of such a community. Further we observed
that in support-oriented communities especially posts authored
by newbies tend to be more likely to get replies. This suggests
that the purpose of a community impacts which factors drive
the reply behaviour of this community. Besides the purpose of a
community we also found that the specificity of the subject of
a community may impact which factors explain the discussion
behaviour of a community. Communities around very specific
topics require posts to fit the topical focus of the community
in order to attract attention while communities around more
general topics do not have this requirement. Finally we also
found that the factors which impact the start of discussions
in communities often differ from the factors which impact the
length of discussions.
Although our work is limited to a small number of commu-
nities on one message board platform, Boards.ie, it uncovers
an interesting problem: the problem of identifying the context
in which attention patterns may occur. In our work we use the
number of replies a post gets to assess how much attention
it attracts. However, we want to point out that the number of
replies is just a proxy metric and other metrics such as the
number of views could be used as well. Since these metrics
tend to be correlated we believe that using other proxy metrics
would lead to similar results.
Community managers and hosts invest time, effort and
money into providing a community which is useful and
attractive to its users. By understanding what factors influence
community attention patterns, we can provide actionable information
to community managers who are in desperate need
of systematic support in decision making and community
development. We hope that our research is a first step towards
analysing the context in which certain types of behavioural
patterns hold. Our future work will further investigate the
context of attention patterns in different communities by
clustering communities according to the factors which are best
for predicting which post will get the attention of a community.
ACKNOWLEDGMENT
Claudia Wagner is a recipient of a DOC-fForte fellowship
of the Austrian Academy of Science. The work of Matthew
Rowe and Harith Alani was supported by the EU-FP7 project
Robust (grant no. 257859).
REFERENCES
[1] J. Cheng, D. Romero, B. Meeder, and J. Kleinberg, “Predicting reciprocity
in social networks,” in The Third IEEE International Conference
on Social Computing (SocialCom2011), 2011.
[2] D. Sousa, L. Sarmento, and E. Mendes Rodrigues, “Characterization
of the twitter @replies network: are user ties social or topical?”
in Proceedings of the 2nd international workshop on Search
and mining user-generated contents, ser. SMUC ’10. New York,
NY, USA: ACM, 2010, pp. 63–70. [Online]. Available:
http://doi.acm.org/10.1145/1871985.1871996
[3] M. Rowe, S. Angeletou, and H. Alani, “Predicting discussions on the
social semantic web,” in Extended Semantic Web Conference, Heraklion,
Crete, 2011.
[4] ——, “Anticipating discussion activity on community forums,” in The
Third IEEE International Conference on Social Computing, 2011.
[5] H. Rangwala and S. Jamali, “Defining a Coparticipation Network
Using Comments on Digg,” IEEE Intelligent Systems, vol. 25,
no. 4, pp. 36–45, 2010. [Online]. Available:
http://dx.doi.org/10.1109/MIS.2010.98
[6] L. Hong, O. Dan, and B. D. Davison, “Predicting popular messages in
twitter,” in Proceedings of the 20th international conference companion
on World wide web, ser. WWW ’11. New York, NY, USA: ACM,
2011, pp. 57–58.
[7] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi, “Bad news travel
fast: A content-based analysis of interestingness on twitter,” in WebSci
’11: Proceedings of the 3rd International Conference on Web Science,
2011.
[8] S. A. Macskassy and M. Michelson, “Why do People Retweet? Anti-
Homophily Wins the Day!” in Proceedings of the Fifth International
Conference on Weblogs and Social Media. Menlo Park, CA, USA:
AAAI, 2011. [Online]. Available: http://www.aaai.org/ocs/index.php/
ICWSM/ICWSM11/paper/view/2790
[9] G. Szabo and B. A. Huberman, “Predicting the popularity of online
content,” Commun. ACM, vol. 53, no. 8, pp. 80–88, 2010.
[10] T.-A. Hoang and E.-P. Lim, “Virality and susceptibility in information
diffusions,” in ICWSM, 2012.
[11] J. Chan, C. Hayes, and E. Daly, “Decomposing Discussion Forums using
Common User Roles,” in Proceedings of the WebSci10: Extending the
Frontiers of Society On-Line, Apr. 2010.
[12] D. M. Blei, A. Ng, and M. Jordan, “Latent dirichlet allocation,” JMLR,
vol. 3, pp. 993–1022, 2003.
[13] H. M. Wallach, “Structured topic models for language,” Ph.D. disserta-
tion, University of Cambridge, 2008.
[14] H. M. Wallach, D. Mimno, and A. McCallum, “Rethinking LDA:
Why priors matter,” in Proceedings of NIPS, 2009. [Online]. Available:
http://books.nips.cc/papers/files/nips22/NIPS2009_0929.pdf
[15] R. Gunning, The Technique of Clear Writing. McGraw-Hill, 1952.
[16] B. McKelvey, “Quasi-natural organization science,” Organization Sci-
ence, vol. 8(4), 1997.
[17] J. Adams, “Inequity in social exchange,” Adv. Exp. Soc. Psychol., vol. 62,
pp. 335–343, 1965.
[18] C.-F. Hsu, E. Khabiri, and J. Caverlee, “Ranking Comments on the
Social Web,” in Computational Science and Engineering, 2009. CSE
’09. International Conference, vol. 4, August 2009.
... In particular, we analyze the tip popularity at posting time, while the above studies require early votes to perform predictions. Wagner et al. [2012] studied the patterns of user attention towards content shared within online communities, where attention was measured by the number of replies to a given post. One of their findings was that the purpose of a community may influence how individual factors affect the attention pattern of that community. ...
... A number of studies Lu et al., 2010;O'Mahony and Smyth, 2010;Wagner et al., 2012] have shown that the linguistic style can be a good indicator of the utility/helpfulness of the review or quality of other user generated contents [Agichtein et al., 2008;Chen et al., 2011;Dalip et al., 2011;Momeni et al., 2013]. As in Lu et al., 2010;Momeni et al., 2013], we have chosen to model syntactic features using the Part-Of-Speech (POS) tagging of the words in the tip's text. ...
... Informativeness (tip_informat) of a tip measures the novelty of the tip's terms with respect to other tips posted at the same venue. This metric was used in [Hsu et al., 2009;Wagner et al., 2012;Momeni et al., 2013] to predict quality or helpfulness of comments or reviews. We derive informativeness using the Term Frequency Inverse Document Frequency (TF-IDF) measure, where we sum over the TF-IDF values for all terms in a single tip p j (Equation 6.2): ...
Thesis
Full-text available
Since the popularization of Web 2.0, people have become increasingly engaged in expressing their opinions through reviews of products and services. Like any other type of user-generated content, online reviews come in various forms, sizes and qualities. Such quality variability is particularly prominent in textual reviews produced on mobile apps, often called micro-reviews or tips, due to their inherent conciseness. In such a content-abundant environment, being able to estimate the helpfulness of an online (micro-)review, and ultimately predict its future popularity among users as accurately and early as possible, can greatly benefit content filtering and recommendation methods, helping users find valuable reviews and providing quick feedback to business owners and future customers. In this context, we investigate how users exploit micro-reviews, focusing particularly on Foursquare tips, an increasingly popular type of review whose high degree of informality and brevity poses extra difficulties for the design of effective prediction methods. Using data collected from Foursquare, we also investigate how tip popularity, given by the number of likes the tip received from users, evolves over time and which factors impact this popularity evolution. Then, we explore how these factors can be combined to develop models to predict tip popularity at a given point in time in the future. We develop solutions to two different prediction tasks: predicting the popularity ranking of a set of tips and predicting the popularity level a particular tip will achieve. Our experimental results show that a multidimensional set of predictor variables, which considers features of both the user who posted the tip and the venue where it was posted, leads to more accurate results than using each set of features in isolation. Our models, when applied to Foursquare tips, are also more robust than state-of-the-art popularity prediction methods, as they can be applied to any tip, at or after posting time.
... Popularity and attention is "the state or condition of being liked, admired, or supported by many people." For postings in forums, Wagner et al. [2012a, 2012b] define attention as "the number of replies that a given post on a community message board yields as a measure of its attention," whereas Szabo and Huberman [2010] define it as "the number of votes (diggs) a story collected on Digg.com and the number of views a video received on YouTube.com." For postings on microblogging platforms, Hong et al. [2011] measure popularity as the number of retweets. ...
... Many approaches related to popularity and attention use a supervised learning method to classify content into popular (or seed) and nonpopular categories [Hong et al. 2011; Rowe et al. 2011; Wagner et al. 2012a, 2012b; Hsu et al. 2009]. Temporal and author-related features are shown to be important for the assessment and ranking of popular content. ...
... The way particular features are positively associated with the start of discussions in one community may differ in another community [Wagner et al. 2012b]. The factors that predict whether a discussion begins around a post may differ from the factors that affect how long the discussion lasts [Wagner et al. 2012a, 2012b]. Therefore, for forums, Wagner et al. [2012a] argue that ignorance is not bliss: understanding the behavioral patterns peculiar to individual communities shows which posts attract a community's attention and stimulate long dialogues in a forum. ...
Article
User-generated content (UGC) on the Web, especially on social media platforms, facilitates the association of additional information with digital resources; thus, it can provide valuable supplementary content. However, UGC varies in quality and, consequently, raises the challenge of how to maximize its utility for a variety of end-users. This study aims to provide researchers and Web data curators with comprehensive answers to the following questions: What are the existing approaches and methods for assessing and ranking UGC? What features and metrics have been used successfully to assess and predict UGC value across a range of application domains? What methods can be effectively employed to maximize that value? This survey is composed of a systematic review of approaches for assessing and ranking UGC: results are obtained by identifying and comparing methodologies within the context of short text-based UGC on the Web. Existing assessment and ranking approaches adopt one of four framework types: the community-based framework takes into consideration the value assigned to content by a crowd of humans, the end-user-based framework adapts and personalizes the assessment and ranking process with respect to a single end-user, the designer-based framework encodes the software designer's values in the assessment and ranking method, and the hybrid framework employs methods from more than one of these types. This survey suggests a need for further experimentation and encourages the development of new approaches for the assessment and ranking of UGC.
... In [3], the authors showed that the level of peer-to-peer messaging is a strong indicator of social interactions and social tie strength. With respect to OHCs, if two users exchange information frequently, it may imply that these users have similar interests or the same health problems [9]. Besides, the structural information is also effective in detecting and differentiating the polarisation of users' opinions [12]. ...
... Furthermore, the dynamics of peer-to-peer messaging in OHCs also uncover the dynamics of users' online behaviours and the evolution of social networks. From the perspective of individual users, the actions such as the message posting and receiving, community visiting frequency and duration, provide valuable clues to characterize the activeness and the role of the user [9,13]. From the other perspective of the overall community, the dynamic messaging reveals the evolution of user interactions and the development of the community [16]. ...
Conference Paper
Online Health Communities (OHCs) have become more and more prevalent with the advance of web 2.0 and social media. These platforms provide free, open and wide-sourced places for people to publicly discuss health-related problems, especially some mental health problems, such as depression. This paper aims to characterize the unique structural and dynamic patterns of users’ interactions in depression related OHCs. Through the topological analyses of social networks, we identify the unique highly sticky structure of depression related OHCs as compared with other social communities. Besides, users in these communities spend relatively longer time on closely peer-to-peer messaging. Moreover, the evolutionary trends show that depression related OHCs present distinctive growth patterns in terms of user addition and user activeness, which could be further applied in differentiating the community types and the development stages.
... Sousa and colleagues showed that Twitter replies by users with smaller social networks are more driven by social aspects than users with larger ego-networks [17]. Nevertheless, as with retweets, combining social and content features produced the best predictions of replies on Twitter [14,20,15]. However, when analysing replies on Boards.ie, it was found that content features are better for predictions than social features, thus contradicting the findings obtained from predicting replies on Twitter, and instead agreeing with the predictions of retweets [13]. ...
... This highlights the role played by the type and goal of the communities on their engagement dynamics and associated features. For example, the presence of a URL in posts on Boards.ie is only good for generating a reply in general forums, and users with low forum entropy, account age, and #posts, are less likely to get replies in support communities [20]. Other variations in dynamics were observed across topics on Yahoo! ...
Article
Understanding what attracts users to engage with social media content (i.e. reply-to, share, favourite) is important in domains such as market analytics, advertising, and community management. To date, many pieces of work have examined engagement dynamics in isolated platforms with little consideration or assessment of how these dynamics might vary between disparate social media systems. Additionally, such explorations have often used different features and notions of engagement, thus rendering the cross-platform comparison of engagement dynamics limited. In this paper we define a common framework of engagement analysis and examine and compare engagement dynamics, using replying as our chosen engagement modality, across five social media platforms: Facebook, Twitter, Boards.ie, Stack Overflow and the SAP Community Network. We define a variety of common features (social and content) to capture the dynamics that correlate with engagement in multiple social media platforms, and present an evaluation pipeline intended to enable cross-platform comparison. Our comparison results demonstrate the varying factors at play in different platforms, while also exposing several similarities.
... To derive content patterns we make use of a set of features which have been successfully used in the past for modelling engagement in social media [34] [29]. These features include: ...
Conference Paper
Online paedophile activity in social media has become a major concern in society as Internet access is easily available to a broader younger population. One common form of online child exploitation is child grooming, where adults and minors exchange sexual text and media via social media platforms. Such behaviour involves a number of stages performed by a predator (adult) with the final goal of approaching a victim (minor) in person. This paper presents a study of such online grooming stages from a machine learning perspective. We propose to characterise such stages by a series of features covering sentiment polarity, content, and psycho-linguistic and discourse patterns. Our experiments with online chatroom conversations show good results in automatically classifying chatlines into various grooming stages. Such a deeper understanding and tracking of predatory behaviour is vital for building robust systems for detecting grooming conversations and potential predators on social media.
Article
Online communities provide technical support for organisations on a range of products and services. These communities are managed by dedicated online community managers who nurture and help the community grow. While visual analytics are increasingly used to support a range of data-intensive management processes, similar techniques have not been adopted into the community management field. Although relevant tools exist, the majority is developed in the lab, without conducting a domain analysis or eliciting user requirements, or is designed to support more general analytic tasks. In this chapter, the authors describe a case study in which we design, develop, and evaluate a visual analytics application with the help of Symantec's online community management team. The authors suggest that the approach and the resulting application, called Petri, is an important step to promoting online community management as a strategic and data-driven process.
Conference Paper
For community managers and hosts it is not only important to identify the current key topics of a community but also to assess the specificity level of the community for (a) creating sub-communities and (b) anticipating community behaviour and topical evolution. In this paper we present an approach that empirically characterises the topical specificity of online community forums by measuring the abstraction of semantic concepts discussed within such forums. We present a range of concept abstraction measures that function over concept graphs (i.e. resource type-hierarchies and SKOS category structures) and demonstrate the efficacy of our method with an empirical evaluation using a ground truth ranking of forums. Our results show that the proposed approach outperforms a random baseline and that resource type-hierarchies work well when predicting the topical specificity of any forum with various abstraction measures.
Conference Paper
Content injection methods rely on understanding community dynamics (i.e. attention factors) in order to publish content that community users will engage with (e.g. product-related posts), however such methods require re-training should the community's discussed topics change. In this paper we present an examination of the semantic evolution of community forums by measuring the topical specificity of online community forums and then tracking changes in the concepts discussed within the forums over time. Our results indicate that general discussion communities tend to diverge in their semantics, while topically-specific communities do not. These findings inform content injection methods on model longevity and the need for adaptation for general communities.
Article
Positing that organizational phenomena result from both individual human intentionality and natural causes independent of individuals' intended behavior, the need for a quasi-natural organization science is identified. The paradigm war is defined in terms of positivism and postpositivism, with the suggestion that a more relevant epistemology might be scientific realism. The current unconstructive paradigm proliferation is seen as resulting from an underlying cause, idiosyncratic organizational microstates, phenomena identified by postmodernists. The article develops quasi-natural organization science as an antidote to multiparadigmaticism by recognizing that mathematically, computationally, and experimentally intense twentieth century natural sciences all have microstate idiosyncrasy assumptions similar to those postmodernists suggest are true of organizational phenomena. By framing a quasi-natural organization science focusing on microstates, my intent is not to deny the relevance of either intentionality and subjectivity or natural science and objectivity. The article attacks the microstate idiosyncrasy problem on four frontiers: micro- and macroevolutionary theory, semantic conception epistemology, analytical mechanics, and complexity theory. The first frontier develops the natural side of quasi-natural organization science to explain natural pattern or order. This "order" arguably results from multilevel coevolutionary behavior in a selectionist competitive context in the form of multi-level selectionist effects. The second frontier reviews the historic role of idealized models, as understood by historical realists and the "semantic conception of theories" - idealized constructs such as point masses or the rational actor assumption - that currently successful sciences, such as physics and economics, drew upon early in their life-cycles to sidestep the idiosyncrasy problem. Organization scientists are encouraged to develop theories in terms of idealized models. The third frontier attends to the role of "instrumental conveniences" as essential constructs in the early life-cycle stages of sciences and the importance of studying rates. For example, a construct such as a pressure vessel acts as a container translating idiosyncratic gas particle movements into a directed pressure stream where particles emerge at some rate. Drawing on Sommerhoff's "directive correlation" concept as an analogous "container" in firms, this section argues that such containers can be used in organizational analysis to translate idiosyncratic microstates into probabilistic rates of occurrence, thereby allowing the use of intrafirm rate models and Hempel's deductive-statistical model of explanation. An example is given showing how human resource variables can be translated into rate concepts and then used in the context of the directive correlation and the deductive statistical model. The fourth frontier draws on complexity theory as a computational/analytical approach that directly incorporates idiosyncrasy by use of dynamical (nonlinear) methods. Complex adaptive systems, kinds of complexity, the causal role of complexity, and levels of adaptive tension likely to foster self-organization are discussed. An example shows how a complexity theory approach differs from a conventional explanation of why participative management decision making styles have failed to proliferate. The combined effect of rate dynamics, statistical mechanics, and dynamical analysis lays the platform for a realist, predictive, and generalizable quasi-natural organization science, thereby offering a possible resolution of the paradigm war. The mitigation of idiosyncrasy effects allows a re-emphasis of background laws in organization science, as opposed to the further emphasis of contingent details advocated by postmodernists.
Article
Discussion forums are a central part of Web 2.0 and Enterprise 2.0 infrastructures. The health and sustainability of forums is dependent on the information exchange behaviour of its members. Such behaviour needs to be better understood and characterised so that forums can be better managed, new services delivered and opportunities and risks detected. In this paper, we present a method for analysing user communication roles in discussion forums. We analyse the composition of several forums from a medium-sized national bulletin board in terms of these roles, demonstrating similarities between forums based on underlying user behaviour rather than topic. We suggest that analysing the evolution of role composition is an important step in developing a predictive model of forum health.
Article
On the microblogging site Twitter, users can forward any message they receive to all of their followers. This is called a retweet and is usually done when users find a message particularly interesting and worth sharing with others. Thus, retweets reflect what the Twitter community considers interesting on a global scale, and can be used as a function of interestingness to generate a model to describe the content-based characteristics of retweets. In this paper, we analyze a set of high- and low-level content-based features on several large collections of Twitter messages. We train a prediction model to forecast for a given tweet its likelihood of being retweeted based on its contents. From the parameters learned by the model we deduce what are the influential content features that contribute to the likelihood of a retweet. As a result we obtain insights into what makes a message on Twitter worth retweeting and, thus, interesting.
Article
The process of exchange is almost continual in human interactions, and appears to have characteristics peculiar to itself, and to generate affect, motivation, and behavior that cannot be predicted unless exchange processes are understood. This chapter describes two major concepts relating to the perception of justice and injustice; the concept of relative deprivation and the complementary concept of relative gratification. All dissatisfaction and low morale are related to a person's suffering injustice in social exchanges. However, a significant portion of cases can be usefully explained by invoking injustice as an explanatory concept. In the theory of inequity, both the antecedents and consequences of perceived injustice have been stated in terms that permit quite specific predictions to be made about the behavior of persons entering social exchanges. Relative deprivation and distributive justice, as theoretical concepts, specify some of the conditions that arouse perceptions of injustice and complementarily, the conditions that lead men to feel that their relations with others are just. The need for much additional research notwithstanding, the theoretical analyses that have been made of injustice in social exchanges should result not only in a better general understanding of the phenomenon, but should lead to a degree of social control not previously possible. The experience of injustice need not be an accepted fact of life.
Article
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
Article
Viral diffusion allows a piece of information to spread widely and quickly within the network of users through word-of-mouth. In this paper, we study the problem of modeling both item and user factors that contribute to viral diffusion in the Twitter network. We identify three behavioral factors, namely user virality, user susceptibility and item virality, that contribute to viral diffusion. Instead of modeling these factors independently as done in previous research, we propose a model that measures all the factors simultaneously considering their mutual dependencies. The model has been evaluated on both synthetic and real datasets. The experiments show that our model outperforms the existing ones for synthetic data with ground truth labels. Our model also performs well for predicting the hashtags that have higher retweet likelihood. We finally present case examples that illustrate how the models differ from one another.
Article
In recent years, social media services have become a global phenomenon on the Internet. The popularity of these services provides an opportunity to study the characteristics of online social networks and the communities that emerge in them. This paper presents an analysis of the users' interactions in the implicit network derived from tweet replies of a specific dataset obtained from a popular micro-blogging service, Twitter. We analyze the influence of the topics of the tweet messages on the interaction among users, to determine if the social aspect prevails over the topic in the moment of interaction. Thus, the main goal of this paper is to investigate if people selectively choose whom to reply to based on the topic or, otherwise, if they reply to anyone about anything. We found that the social aspect predominantly conditions users' interactions. For users with larger and denser ego-centric networks, we observed a slight tendency for separating their connections depending on the topics discussed.
Article
The past decade has seen a massive rise in Web services and applications that let users create, collaborate, and share various forms of data including articles (blogs), pictures (Flickr), video (YouTube), and status updates (Twitter). Social bookmarking Web sites such as Delicious.com, Slashdot.org, and Digg.com let users submit links to Web content they find interesting along with a short description. Users in these online communities can comment on the posted content (initiating discussions) and rate the articles they find interesting. Thus, social bookmarking sites serve as data aggregators, Web-based discussion forums, and an online collaborative filtering system that can collectively determine popular online content.