On the Relationship between Novelty and Popularity of
User-Generated Content
DAVID CARMEL and HAGGAI ROITMAN, IBM Research Haifa
ELAD YOM-TOV, IBM Research
This work deals with the task of predicting the popularity of user-generated content. We demonstrate how
the novelty of newly published content plays an important role in affecting its popularity. More specifically,
we study three dimensions of novelty. The first one, termed contemporaneous novelty, models the relative
novelty embedded in a new post with respect to contemporary content that was generated by others. The
second type of novelty, termed self novelty, models the relative novelty with respect to the user’s own contri-
bution history. The third type of novelty, termed discussion novelty, relates to the novelty of the comments
contributed by readers with respect to the post content. We demonstrate the contribution of the new novelty
measures to estimating blog-post popularity by predicting the number of comments expected for a fresh
post. We further demonstrate how novelty-based measures can be utilized for predicting the citation volume
of academic papers.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and
Retrieval
General Terms: Algorithms, Experimentation
Additional Key Words and Phrases: Popularity, novelty, user-generated content
ACM Reference Format:
Carmel, D., Roitman, H., and Yom-Tov, E. 2012. On the relationship between novelty and popularity of
user-generated content. ACM Trans. Intell. Syst. Technol. 3, 4, Article 69 (September 2012), 19 pages.
DOI = 10.1145/2337542.2337554 http://doi.acm.org/10.1145/2337542.2337554
1. INTRODUCTION
Recent years have witnessed a tremendous increase in the amount of user-generated
content (UGC) available on the Web. Many social media (Web 2.0) applications have
emerged to provide an open stage for users to contribute content and to publicly share
their ideas and opinions with others. The most notable source for UGC on the web
nowadays is the Blogosphere, where blogging web services such as Blogger^1 and
ReadWriteWeb^2 have become popular media for personal content publication. More
recently, microblogging services, such as Twitter,^3 let users publish short comments
1. www.blogger.com
2. www.readwriteweb.com
3. www.twitter.com
Portions of the work reported here were previously presented in a short conference paper [Carmel et al.
2010].
E. Yom-Tov is currently affiliated with Yahoo! Research.
Author's address: H. Roitman, IBM Research, Haifa 31905, Israel; email: haggai@il.ibm.com.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights
for components of this work owned by others than ACM must be honored. Abstracting with credit is per-
mitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component
of this work in other works requires prior specific permission and/or a fee. Permission may be requested
from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701, USA, fax +1 (212)
869-0481, or permissions@acm.org.
© 2012 ACM 2157-6904/2012/09-ART69 $15.00
DOI 10.1145/2337542.2337554 http://doi.acm.org/10.1145/2337542.2337554
about breaking world news as well as daily updates about themselves. Other notable
sources are collaborative services such as Wikipedia^4 and YouTube^5, where UGC is
collaboratively contributed by the community (articles on Wikipedia and video uploads
on YouTube) and triggers public discussions around it. Collaborative bookmarking
services such as Delicious^6 and Digg^7 are additional sources of UGC, allowing users
to annotate content published on the Web.
As the popularity of social media services increases, identifying UGC with high
“quality” becomes more difficult due to the enormous amount of new content that is
continually published on those sites. Typically, due to the large amounts of data, only
a small fraction of new posts are expected to gain popularity. Consequently, only a
few posts will be read, commented, or rated by others, while most of the posts will
be ignored. Mishne and Glance [2006] revealed that only 15% of the blog posts are
commented on, with an average of two comments per post. They note that the number
of comments per post follows a power-law distribution, with a small number of posts
containing a high number of comments, and a long tail of posts with only a few or no
comments. Therefore, it is extremely important for blog services to be able to identify
those sporadic "good" posts to recommend to their users.
The quality of UGC is usually measured in several dimensions such as the author’s
reputation, objectivity, and reliability, as well as content relevancy, completeness, and
accuracy [Chai et al. 2009]. The most successful indicators for UGC quality are the
amount of user feedback and the number of citations the content has [Hsu et al. 2009;
Mishne and Glance 2006; Tsagkias et al. 2009]. These approaches usually rely on
explicit and publicly available feedback such as comments, ratings, recommendations,
and tagging [Hsu et al. 2009; Tsagkias et al. 2009], as well as implicit feedback such as
click-through data [Lerman and Hogg 2010; Szabo and Huberman 2010]. In addition,
following the successful link analysis techniques for measuring web site authority, the
user’s “authority” can also be inferred by link analysis. Estimating the author’s au-
thority in the Blogosphere has been intensively studied recently [Agarwal et al. 2008;
Kempe et al. 2003; Song et al. 2007]. Content-based features, such as writing style and
the absence of spelling errors, can additionally be used for quality analysis [Agichtein et al.
2008; Hasan Dalip et al. 2009].
User feedback is indeed very valuable for identifying high quality content. However,
there still remains a fundamental gap in evaluation of UGC quality when no feedback
is available, especially for freshly published content. When user feedback does not ex-
ist, evaluation approaches are mostly based on the author’s reputation, as reflected by
the popularity of the author’s previous content contributions [Hsu et al. 2009; Tsagkias
et al. 2009]. However, such methods are strongly biased toward contributors who were
popular in the past, and they underestimate fresh content published by unfamiliar contributors.
1.1. The Relationship between Novelty and Popularity
In this work we deal with the task of predicting UGC popularity, as reflected by the
amount of expected user feedback, focusing on new content (post) that has not yet re-
ceived feedback. We address a new dimension for content quality evaluation based on
measuring the novelty of the published content and demonstrate how the novelty of a
new post plays an important role in affecting its popularity. More specifically, we study
three dimensions of novelty. The first type, termed contemporaneous novelty,
4. www.wikipedia.com
5. www.youtube.com
6. www.delicious.com
7. www.digg.com
models the relative novelty embedded in a new post with respect to contemporary con-
tent generated by others in the same time period. We hypothesize that non-novel posts
are less popular since they fail to offer new valuable information to their readers.
The second type of novelty, termed self novelty, models the relative novelty em-
bedded in new UGC with respect to the user’s own contribution history. Self novelty
measures the novelty of a new post with respect to previous contributions published
on the same source. We show that self-novelty also contributes to content popularity,
probably due to the fact that authors who repeat themselves and fail to innovate lose
their readers over time.
The third novelty type, termed discussion novelty, relates to the novelty of the com-
ments contributed by readers with respect to the post's original content. The comments
are compared to their associated post for measuring the amount of information they
add to the post. For example, controversial and provocative posts are expected to have
high discussion novelty. In order to predict the discussion novelty before the post has
been commented on, we measure the average discussion novelty of comments to previous
posts published by the same author. The assumption behind this measure is that a
new post, contributed by an author with a history of high discussion novelty, is also
likely to initiate a stimulating online discussion that will affect the post’s popularity.
The novelty-based features we have described do not require existing user feedback;
therefore, they can enhance existing popularity estimation techniques for new posts,
which are currently based primarily on the author’s reputation and on textual anal-
ysis. Furthermore, the contemporaneous novelty feature can even be used to predict
the popularity of new posts provided by unfamiliar contributors with no history at all.
Such an estimation of fresh content quality, prior to the availability of any user feed-
back, is of extreme importance. For example, the success of commercial marketing
campaigns in the blogosphere strongly depends on identifying those blog posts that
are expected to have a high potential for reaching large audiences and influencing
their readers. This is also true for search systems that look for high quality new items
that are relevant to their user needs, and for recommendation services, which recom-
mend interesting posts to their customers. Most existing blogging services recommend
a short list of high-quality blog posts on their main page. This list is usually composed
of the latest top-rated and commented posts, as well as new posts of authors who con-
tributed popular posts in the past. The ability to predict the expected number
of comments for new posts will enable better identification of high-quality posts in
advance, when no feedback is available yet. It will also allow those recommendation
tools to identify interesting posts immediately after publication, independently of their
future recognition by the community.
1.2. Main Results
Using two real-world datasets of blogs and academic papers, we demonstrate the ef-
fectiveness of the novelty measures described in this work in assisting with the task
of estimating the number of comments expected for new posts. For the blog data, we
show that the novelty-based features significantly improve the prediction accuracy,
especially for posts published on blogs with a short history. For such blogs the novelty-
based features are extremely important in the absence of previous data. The improve-
ment in prediction accuracy is less impressive (but still significant) for posts published
by authors with a long publication history. This is expected because a long history
provides enough evidence for the blog’s popularity, hence novelty-based features are
less significant for prediction. Similarly, in the academic papers domain, we obtain
significant improvement in accuracy of the citation number prediction as a result of
using the new novelty-based features.
Novelty by itself is not sufficient to gain popularity. Novel content that is not rel-
evant or interesting will not receive public attention. Moreover, spammers can
easily feign novelty, for instance, by adding randomness to their posts [Mishne
et al. 2005]. Our results indeed show that novelty-based features are not enough to
predict popularity; however, they can significantly enhance the prediction of post pop-
ularity when combined with other valuable features such as content quality and the
author’s reputation.
The rest of this article is organized as follows. In Section 2 we provide general
background and review several works that address the task of estimating content pop-
ularity. In Section 3 we describe the new novelty-based features and implementation
details. In Section 4 and Section 5 we illustrate the effectiveness of these features in
the blog domain and the academic papers domain. Section 6 concludes our work and
discusses future directions.
2. RELATED WORK
Several recent works studied the reasons why users contribute content and participate
in online discussions. Nardi et al. [2004] conducted an ethnographic investigation
about the reasons that drive bloggers “to document their lives, provide commentary
and opinions, express deeply felt emotions, articulate ideas through writing, and form
and maintain community forums”.
Cha et al. [2007] analyzed the popularity distribution of user-generated videos and
its evolution on YouTube. De Choudhury et al. [2009] further studied what drives in-
dividuals to participate in online conversations on YouTube. They developed a model,
based on a mixed random walk, that was used to estimate the interestingness of con-
based on mixed random walk, that was used to estimate the interestingness of con-
versations that emerged around YouTube videos. They found that users tend to par-
ticipate in conversations that have “interesting themes,” and are commented on by
familiar users with high social impact.
A parallel body of studies, only indirectly related to the scope of our article, fo-
cuses on the spread of information in the blogosphere [Gruhl et al. 2004; Kumar et al.
2005], and the influence of one author on others in social media sites [Agarwal et al.
2008; Kempe et al. 2003; Song et al. 2007]. Song et al. [2007] modeled influence using
InfluenceRank, a measure that considers novel information diffusion between differ-
ent users. According to this model, influencers are those that initiate novel ideas
in the blogosphere, which are then cited by many others. Similar to our work, this
work also measures novelty as an important feature for identifying influencers. How-
ever, it is mostly based on citation analysis and thus is not effective for analyzing
new content contributions. Moreover, Song et al. [2007] did not consider other pos-
sible novelty features, such as self-novelty, as a factor that can assist in estimating
“interestingness.”
Chai et al. [2009] provide a comprehensive survey on various content quality
characteristics that were already considered in the literature, which can be used to
estimate the quality of social media content. The set of features can be categorized into
two main classes: the author’s features (reputation, objectivity) and the content-based
features (amount of user feedback, relevancy, completeness, accuracy, understand-
ability, consistency). Several works tried to use similar features to predict the quality
of UGC and its popularity [Agichtein et al. 2008; Hasan Dalip et al. 2009; Hsu et al.
2009; Khabiri et al. 2009; Mishne and Glance 2006; Tsagkias et al. 2009]. Mishne
and Glance [2006] showed that blog popularity is highly correlated with the volume
of blog comments. Agichtein et al. [2008] measured the quality of questions and
answers in Yahoo! Answers using a classifier trained on several structural, textual,
and community-based features. Hasan Dalip et al. [2009] estimated the quality of
Wikipedia articles by training a classifier on several textual, revisional, and structural
features extracted from the articles’ content.
More recently, Khabiri et al. [2009] studied the popularity factor of Digg comments.
They showed that popular comments were those contributed by users with a good
reputation (i.e., active users with a history of highly ranked comments), and those with
high textual quality. These authors further attempted to predict the popularity of
Digg comments using support-vector regression over these features [Hsu et al. 2009].
Tsagkias et al. [2009] addressed a similar prediction task for online news and classified
articles as having low or high potential to be commented on, using similar features to
those in Hsu et al. [2009]. Szabo and Huberman [2010] and Lerman and Hogg [2010]
further used click-through data to enhance the popularity prediction of Digg news.
Focusing on a similar prediction task [Hsu et al. 2009; Lerman and Hogg 2010;
Szabo and Huberman 2010; Tsagkias et al. 2009], this article extends our preliminary
study in [Carmel et al. 2010] and shows how novelty factors can assist in predicting
the expected volume of comments for new UGC, in the absence of user feedback.
Novelty detection has been the focus of several IR tasks [Allan et al. 2003; Soboroff
and Harman 2005]. Most detection approaches are based on measuring the dissim-
ilarity between new content and previously published content. We follow the same
approach; however, we measure novelty by the normalized compression distance
(NCD) [Cilibrasi and Vitányi 2005], which assesses the similarity between a pair of
strings by measuring the improvement achieved by compressing one string using the
information found in the other string.
Predicting the number of expected comments for a given UGC has some characteris-
tics in common with predicting the volume of academic paper citations. This task was
studied intensively in the 2003 KDD Cup [Gehrke et al. 2003]. Furthermore, Castillo
et al. [2007] estimated the number of expected citations based on the paper authors’
reputations. Dietz et al. [2007] also considered the paper content and the topic flows
between papers using an extension of the LDA model. In this article we further demon-
strate how the novelty features developed for the UGC comment prediction task are
beneficial for predicting the volume of academic paper citations.
A related task in the music domain was recently studied by Bischoff et al. [2009].
This work introduced a new method for predicting the potential of music tracks to
become hits. Instead of relying directly on the intrinsic characteristics of the tracks,
it uses data mined from music social network sites and the existing relationships be-
tween tracks, artists, and albums. Following the results of our work, it will be inter-
esting to explore the relations between novelty and music popularity. To the best of
our knowledge, this question has not been researched yet and should be further
investigated.
3. NOVELTY MEASUREMENT
We start this section with some definitions and then derive several novelty-based fea-
tures for a given post, whose popularity we predict in a later section.
3.1. Definitions
Let U = {u_1, u_2, ..., u_n} be a set of n sources of UGC, where each source u ∈ U is
modeled as a stream of user contributed content updates, termed hereinafter as posts.
Such content is usually contributed by the same author, or a community of authors
who post updates on the same source, for instance, a blog on the same topic. Let p_u
denote a single post published on source u. We assume each post p_u has a timestamp
t(p_u) that captures its publication time.
Fig. 1. An illustration of the three novelty sets with respect to a given new post that is illustrated at the
bottom left (gray rectangle).
Each post can further have zero to many comments from its readers.^8 We denote
the sequence of comments to a post p_u by C(p_u) = {c_1, ..., c_k}, where each comment
c_i has its own publication time t(c_i). Obviously, the publication time for any comment
c ∈ C(p_u) satisfies t(c) ≥ t(p_u).
We now define three sets that will be used later on to derive the proposed novelty-
based features. Figure 1 provides an illustration of the three sets, where a single new
post is illustrated at the bottom left (gray rectangle).
Definition 3.1 (Self-novelty set). Given a post p_u, we define SN(p_u, T) to be the set
of all posts that were published on the same source u previous to post p_u, within a
given time window T. Formally:
SN(p_u, T) = \{ p'_u \mid t(p_u) - T \le t(p'_u) < t(p_u) \}.   (1)
The self-novelty set is illustrated by the left (black) rectangle region in Figure 1.
Definition 3.2 (Contemporaneous-novelty set). Given a post p_u and a time window
T, we define CN(p_u, T) to be the set of all posts p'_{u'} that were published contemporaneously
with post p_u on other sources within the time window T. Formally:
CN(p_u, T) = \{ p'_{u'} \mid u' \in U \setminus \{u\} \wedge t(p_u) - T \le t(p'_{u'}) \le t(p_u) \}.   (2)
The contemporaneous-novelty set contains all posts that were published on other
sources, contemporaneously with post p_u, in the given time window. This set of posts is illus-
trated by the bottom (blue) rectangle region in Figure 1.
8. In this work we focus on comments as representatives of UGC popularity. Other types of user feedback
(ratings, tags, citations) can be used for that purpose in a similar manner to comments.
Definition 3.3 (Discussion-novelty set). Given a post p_u, a sequence of comments to
the post C(p_u), and a timestamp t, we define DN(p_u, t) to be the set of all comments to
post p_u that were submitted up to time t. Formally:
DN(p_u, t) = \{ c \mid c \in C(p_u) \wedge t(c) \le t \}.   (3)
The discussion-novelty set for a given post is magnified in the upper left corner of
Figure 1.
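To make the three set definitions concrete, the following minimal sketch builds SN, CN, and DN from timestamped post records. The Post and Comment records and the helper names are our own illustrative assumptions (Python is used for all sketches in this article); they are not part of the original system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

@dataclass
class Comment:
    text: str
    time: datetime                         # comment publication time t(c)

@dataclass
class Post:
    source: str                            # the source u (e.g., a blog) of the post
    text: str
    time: datetime                         # post publication time t(p_u)
    comments: List[Comment] = field(default_factory=list)

def self_novelty_set(post: Post, posts: List[Post], window: timedelta) -> List[Post]:
    """SN(p_u, T): earlier posts from the same source inside the time window (Eq. 1)."""
    return [q for q in posts
            if q.source == post.source and post.time - window <= q.time < post.time]

def contemporaneous_novelty_set(post: Post, posts: List[Post], window: timedelta) -> List[Post]:
    """CN(p_u, T): posts from other sources published inside the time window (Eq. 2)."""
    return [q for q in posts
            if q.source != post.source and post.time - window <= q.time <= post.time]

def discussion_novelty_set(post: Post, up_to: datetime) -> List[Comment]:
    """DN(p_u, t): comments on the post that were submitted up to time t (Eq. 3)."""
    return [c for c in post.comments if c.time <= up_to]
```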
3.2. Novelty Measurement
We measure the novelty of a single source-post over three contextual dimensions,
which we hypothesize as contributing to predicting its popularity. Many novelty mea-
sures have been proposed over the years; most are based on measuring dissimilarity
of the new content from previously published content [Allan et al. 2003; Soboroff and
Harman 2005]. In this work we utilize a distance measure that is derived from infor-
mation theory, known as the normalized compression distance (NCD), suggested (and
given a formal justification) by [Cilibrasi and Vitányi 2005].
Given a compressor M and two strings x and y, the NCD is defined as:
NCD(x, y) = \frac{M(xy) - \min\{M(x), M(y)\}}{\max\{M(x), M(y)\}},   (4)
where M(x), M(y), and M(xy) are the bitwise sizes of the resulting sequences when
using M to compress x, y, and the concatenation of x and y, respectively. In this work,
we used the 7ZA compression algorithm [Salomon 2004] as our choice for M, due to its
ability to utilize a large buffer that is well suited for large text.
NCD estimates the distance between two text strings by measuring the improve-
ment achieved by compressing one string using the information found in the other
string. It has been proven to serve as an approximation of Kolmogorov complexity
[Cilibrasi and Vitányi 2005]. NCD was shown in previous studies to be effective in
identifying distances between different types of information items, including music
[Cilibrasi et al. 2004], authors [Amitay et al. 2007], and languages [Benedetto et al.
2002].
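As an illustration of Equation (4), NCD can be computed directly from compressed sizes. The sketch below uses Python's standard lzma module as a stand-in for the 7ZA compressor mentioned above; this substitution, and the UTF-8 encoding, are assumptions of the sketch rather than the paper's exact setup.

```python
import lzma

def compressed_size(data: bytes) -> int:
    """Bitwise size M(.) of the LZMA-compressed data (stand-in for the paper's 7ZA compressor)."""
    return 8 * len(lzma.compress(data))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings (Eq. 4)."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    mx, my, mxy = compressed_size(bx), compressed_size(by), compressed_size(bx + by)
    return (mxy - min(mx, my)) / max(mx, my)

# Small sanity check: a near-duplicate should be closer than an unrelated text.
if __name__ == "__main__":
    a = "novelty drives the popularity of user generated content " * 20
    b = a.replace("popularity", "attention")
    c = "an entirely different passage about data compression algorithms " * 20
    assert ncd(a, b) < ncd(a, c)
```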
3.3. Novelty-Based Features
We now suggest several novelty-based features for predicting UGC popularity. This
set of features can be further used along with more traditional features [Chai et al.
2009]. We measure novelty along three contextual dimensions, coined self novelty,
contemporaneous novelty, and discussion novelty.
3.3.1. Self Novelty. This measure models the relative novelty embedded in new UGC
with respect to the user’s own contribution history. Self novelty measures the novelty
of a new post with respect to previous posts published on the same source. We hy-
pothesize that self-novelty contributes to post popularity, probably due to the fact that
authors who fail to innovate and to “surprise” their readers, lose their popularity over
time.
Given a post p_u of some source u ∈ U, and a time window T_s, let SN(p_u, T_s) be the
corresponding self-novelty set according to Definition 3.1. The self-novelty of post p_u
with respect to SN(p_u, T_s) is given by:
nov_s(p_u, T_s) = NCD(p_u, concat(SN(p_u, T_s)));   (5)
p_u represents the content of the given post and concat(SN(p_u, T_s)) is the concatena-
tion of all post contents in the SN set.
3.3.2. Contemporaneous Novelty. This measure evaluates the novel contribution of a
single source-post with respect to other posts submitted in the same time period. We
hypothesize that non-novel source-posts are less popular as they fail to offer new
valuable information to their readers.
Given a post p_u of some source u ∈ U, and a time window T_c, let CN(p_u, T_c) be the
corresponding contemporaneous-novelty set according to Definition 3.2. The contem-
poraneous novelty of post p_u with respect to CN(p_u, T_c) is given by:
nov_c(p_u, T_c) = NCD(p_u, concat(CN(p_u, T_c))).   (6)
3.3.3. Discussion Novelty. This measure relates to the novelty of the comments con-
tributed by readers to previous posts on the same source. The comments of each
previous post are compared to the original post content to measure their novelty in
terms of the amount of information they add to the original post. The intuition behind
this measure is that a post that receives comments adding new content to the post
itself (e.g., more details on the original post's topic, new points of view, sentiments,
etc.) results in a high discussion novelty. Therefore, such a post
is more likely to initiate an interesting fruitful discussion that will affect the post’s
popularity.
Given a post p_u of some source u ∈ U, let b = \{ p'_u \in u \mid t(p'_u) < t(p_u) \} be the set of all
posts in u prior to p_u. The discussion novelty of p_u is measured by the average novelty
of comments on previous posts on the same source. More formally:
nov_d(p_u) = \frac{1}{|b|} \sum_{p'_u \in b} NCD(p'_u, concat(DN(p'_u, t(p_u)))),   (7)
where |b| represents the number of previous posts on u, and concat(DN(p'_u, t(p_u))) is
the concatenation of all comments to post p'_u that were given prior to the publication time
of p_u.
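Putting the pieces together, the three features of Equations (5)–(7) can be sketched as follows. This continues the illustrative sketches above (the Post record, the three set constructors, and the ncd helper are assumed from Sections 3.1 and 3.2); the values returned when a set is empty are our own assumption, since the paper does not specify that edge case.

```python
from datetime import timedelta
from statistics import mean
from typing import List

def concat_texts(posts: List["Post"]) -> str:
    return " ".join(p.text for p in posts)

def self_novelty(post: "Post", all_posts: List["Post"], window: timedelta) -> float:
    """nov_s(p_u, T_s) = NCD(p_u, concat(SN(p_u, T_s)))   (Eq. 5)."""
    sn = self_novelty_set(post, all_posts, window)
    return ncd(post.text, concat_texts(sn)) if sn else 1.0   # no history: treat as fully novel

def contemporaneous_novelty(post: "Post", all_posts: List["Post"], window: timedelta) -> float:
    """nov_c(p_u, T_c) = NCD(p_u, concat(CN(p_u, T_c)))   (Eq. 6)."""
    cn = contemporaneous_novelty_set(post, all_posts, window)
    return ncd(post.text, concat_texts(cn)) if cn else 1.0

def discussion_novelty(post: "Post", all_posts: List["Post"]) -> float:
    """nov_d(p_u): average NCD between each earlier post on the same source and the
    concatenation of its comments submitted before t(p_u)   (Eq. 7)."""
    history = [q for q in all_posts if q.source == post.source and q.time < post.time]
    if not history:
        return 0.0                                           # undefined without history
    scores = []
    for q in history:
        comment_text = " ".join(c.text for c in discussion_novelty_set(q, up_to=post.time))
        scores.append(ncd(q.text, comment_text) if comment_text else 0.0)
    return mean(scores)
```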
Finally, it is important to note that the amount of source post history that can be
kept depends on the underlying application's capabilities. For example, most blog sys-
tems nowadays keep the whole history of their UGC. The size of source post history
may be controlled by tuning the number of previous posts considered for the discussion
novelty calculation in our model. Furthermore, efficient techniques for maintaining
fresh UGC source post histories with bounded size may be employed [Roitman et al.
2008].
The following sections provide experimental results that validate the use of the
novelty-based features for predicting post popularity, as indicated by the number of
its expected comments. In Section 4 we focus on the task of predicting the number of
comments for a blog post. In Section 5 we study the prediction of the number of citations
for academic papers.
4. PREDICTING BLOG POST POPULARITY
4.1. Blog Dataset
We collected data from an internal IBM blogging system [Huh et al. 2007]. We obtained
a total of 42,502 blog posts written by 4,416 unique bloggers. Figure 2(a) demonstrates
a power-law distribution of the number of posts per blog in our dataset with a power
factor k = 1.74 (R² = 0.89).
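For illustration only, a power factor and R² of this kind can be estimated by a least-squares fit on the log-log frequency plot; the sketch below (NumPy assumed) shows one common way to do it, not necessarily the fitting procedure the authors used.

```python
from collections import Counter
import numpy as np

def fit_power_law(values):
    """Estimate a power-law exponent k and the R^2 of a least-squares fit on the
    log-log frequency plot of the given values (e.g., posts per blog)."""
    freq = Counter(values)                                   # value -> how many blogs have it
    xs = np.log(np.array(sorted(freq), dtype=float))
    ys = np.log(np.array([freq[v] for v in sorted(freq)], dtype=float))
    slope, intercept = np.polyfit(xs, ys, 1)
    fitted = slope * xs + intercept
    r2 = 1.0 - np.sum((ys - fitted) ** 2) / np.sum((ys - ys.mean()) ** 2)
    return -slope, r2                                        # k is the magnitude of the slope

# Usage (hypothetical): k, r2 = fit_power_law([len(posts) for posts in blog_posts.values()])
```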
We extracted several features from each blog post to be used by the popularity pre-
dictor. These features include the post’s raw text, the time and date of its publication,
the number of comments it received as well as the comments’ raw text and publication
time, the number of tags the post was tagged with as well as its average rating value.
Fig. 2. Power-law distribution of (a) number of blog posts per blog and (b) number of comments per blog
post.
We further measured the average and standard deviation of the relative activity (relative
publication rate) of the blog's author. Table I provides the full list of features that we
used, including the newly proposed novelty-based features described in Section 3.
4.2. Experimental Methodology
We measured blog post’s popularity as the total number of comments it received until
the time we obtained the dataset, as in [Hsu et al. 2009; Mishne and Glance 2006;
Tsagkias et al. 2009]. The number of comments per blog post in our dataset is power-
lawed (similar to what was shown in [Mishne and Glance 2006]) and is demonstrated
in Figure 2(b) with a power factor k=2.55 (R2=0.95).
Our goal is to predict whether a new blog post will receive a total of N or more
comments in the future, by learning a predictor that is based on the features described
in Table I. We tested two classifiers for prediction: a Linear Regression and a Decision
Tree. The goal of the classifier is to separate blog posts with N or more comments from
all other posts. Given a new blog post, the predictor will then classify it into either of
the two classes. Both classifiers were trained using the given blog data. Tenfold cross
validation was used to reduce the chance of overfitting.
The accuracy of prediction was measured by the area under the corresponding ROC
curve. The ROC curve shows how well items with N or more comments can be identi-
fied, by plotting the true positive classification rate as a function of the false positive
rate, that is, the fraction of correctly identified blog posts versus the fraction of posts
that would incorrectly be labeled as having N or more comments.
Since both classifiers predict the number of comments per post directly, they can ad-
ditionally be evaluated by Spearman's ρ rank correlation coefficient, which measures pre-
diction quality by the correlation between the ranking of the posts according to their
"true" number of comments and the ranking of the posts according to their predicted
number of comments. Therefore, we evaluate the predictors by the area under the ROC
curve, when used as binary predictors, and by the rank correlation (Spearman's ρ),
when used as multiple value predictors. The linear regression classifier gave superior
performance for this task; thus we provide an analysis of its results in the following
text.
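The evaluation protocol just described can be sketched as follows. The use of scikit-learn and SciPy is an assumption of this sketch (the paper does not name a toolkit), and X (the feature matrix of Table I) and y_comments (the observed comment counts) are placeholders.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

def evaluate_predictor(X: np.ndarray, y_comments: np.ndarray, n_threshold: int = 10):
    """Tenfold cross-validated evaluation of a linear-regression comment-count predictor:
    ROC area for the binary task 'at least N comments' and Spearman rank correlation."""
    aucs, predicted, actual = [], [], []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y_comments[train_idx])
        scores = model.predict(X[test_idx])
        labels = (y_comments[test_idx] >= n_threshold).astype(int)
        if labels.min() != labels.max():                     # AUC needs both classes in the fold
            aucs.append(roc_auc_score(labels, scores))
        predicted.extend(scores)
        actual.extend(y_comments[test_idx])
    rho, _ = spearmanr(actual, predicted)
    return float(np.mean(aucs)), float(rho)
```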
4.3. The Effectiveness of Novelty-Based Features
In the first set of experiments, we analyzed the contribution of the proposed novelty-
based features over the remaining features to the accuracy of the prediction (see
Table I).
Fig. 3. Top: Area under the ROC curve of the linear regression classifiers for predicting posts with at least
10 comments, learned for different blog history sizes using (1) only the novelty-based features, (2) only the
traditional features, and (3) all features together. Bottom: Rank correlation of the same classifiers when
used as multiple value predictors.
We split the training blogs into three groups according to the length of their posting
history: short (with 1–5 posts), moderate (with 6–10 posts), and long
(with 11+ posts). We learned three predictors for each set: the first one with only the
novelty-based features, the second with only the traditional blog and post features,
and the third with all features together. The predictor’s task was to identify posts with
at least 10 comments. For the novelty-based features we set T_s (the self-novelty time
window) to 120 days, and T_c (the contemporaneous-novelty time window) to 10 days.
Figure 3 shows the area under the ROC curve for the linear regression classifier in
the three settings.
As the figure illustrates, considering only the novelty-based features for prediction
leads to poor prediction accuracy, and the best strategy is to combine all features
together. The proposed novelty-based features add significant gain to the overall per-
formance of the predictors. This is most evident for blogs with a short history: the
novelty-based features add approximately 7% to the prediction accuracy, as measured
by ROC, and 14% to the rank correlation, for blogs with five or fewer posts. The con-
tribution of the novelty-based features for blogs with moderate and long histories is
smaller but still significant. This reduction in gain is expected because a long history
provides enough evidence for the blog’s popularity, hence novelty-based features are
less significant for prediction.
The results clearly reveal that novelty-based features are indeed valuable for pre-
dicting blog post popularity, especially for blogs with a short history, according to the
two evaluation metrics. In the following we take a deeper look into the prediction task.
Table I. List of Features
Feature name | Description
Blog features:
  num comments | Average and standard deviation of the number of comments given to previous posts on the same blog
  rating | Average and standard deviation of the rating given to previous posts on the same blog
  num tags | Average and standard deviation of the number of tags to previous posts published on the same blog
  num recommendations | Average and standard deviation of the number of recommendations given to previous posts published on the same blog
  num unique commenters | Average and standard deviation of the number of unique commenters to previous posts published on the same blog
  blogger activity | Relative publication rate of the blog's author
  avg post length | Average and standard deviation of the length of previous posts published on the same blog
Post features:
  post length | Post length
  timestamp | Post publication time and date
Novelty-based features:
  self novelty | Self-novelty of a post given its SN set
  contemporaneous novelty | Contemporaneous-novelty of a post given its CN set
  discussion novelty | Discussion-novelty of a post given its DN set
Full list of features that were used for predicting post popularity, including the post's blog history features,
basic post features, and proposed novelty-based post features.
4.4. Predictor Performance Analysis
Having successfully verified the effectiveness of using the proposed novelty-based fea-
tures together with the baseline features, we now analyze the performance of the linear
regression classifier for different parameter settings.
In the first experiment we studied the effect of the window size used by the novelty-
based features on prediction quality. Recall that, in order to obtain the self-novelty
and contemporaneous novelty features, we must first determine the time windows for
which we derive the SN and CN sets respectively. We trained the linear regression
classifier, using all features and all data (with tenfold cross validation), for different
threshold values and different window sizes. Figure 4 shows the ROC area obtained
by the classifier, for predicting at least N comments, while increasing the T_s window size
(in days) with a fixed window size T_c = 10 days. We can observe that the area tends to
saturate at around 120 days for every value of N that we tested, and any improvement
thereafter is relatively small. This indicates that the effective horizon for blog posts is
approximately 4 months, with longer horizons adding relatively little information.
Fig. 4. Effect of T_s on the accuracy of predicting at least N comments per post, as measured by the ROC
area, while setting T_c to 10 days.
Table II. Area under the ROC Curve for Various Thresholds (blogs)
Threshold N | Percentage of posts with at least N comments | Area under the ROC curve for the learned predictor
1 | 31% | 0.66
2 | 18% | 0.68
5 | 4% | 0.73
10 | 1% | 0.77
Threshold for the number of expected comments, the percentage of blog posts at each threshold, and the area
under the ROC for blog post comments prediction.
The effect of T_c was also examined in a similar manner, and was found to reach a plateau
at T_c = 10 days. Its overall effect on the ROC area was negligible.
Table II shows, for each threshold value (N), the fraction of blog posts with at least
this number of comments, and the area under the ROC curve that the linear regression
predictor reached. Note that for N ≥ 1 the predictor also predicts which posts will receive
no comments at all. Interestingly, it is easier to identify blog posts with many expected
comments than to distinguish blog posts with no comments from those posts that will have
at least one comment (as is also evident from Figure 4). We attribute this to the fact that most
blog posts that receive many comments tend to come from blogs that have a persistent
following over time, and are thus easier to identify.
When applying the linear regression predictor as a multiple value predictor, the rank
correlation between the actual and predicted rankings (according to the actual number
of comments and the predicted number of comments) in this setting was ρ = 0.28
(p < 10^-4).
Figure 5 further shows the ROC curves obtained by the predictor using the same
parameter settings and threshold values as in Table II. Different points on the ROC curve
represent different thresholds for deciding if a blog post will receive N or more comments.
The curves are relatively smooth, indicating that there are no specific groups of blogs
that are easier (or more difficult) to identify.
Fig. 5. ROC curves of the linear regression predictor, using different threshold (N) values, with T_s = 120
days and T_c = 10 days.
Table III. Relative Contribution of Each Feature for Predicting the Number of Blog Post Comments
Feature name | Area under ROC curve (N ≥ 1, N ≥ 2, N ≥ 5, N ≥ 10) | Spearman's ρ
discussion novelty | 0.60, 0.62, 0.65, 0.66 | 0.17
self novelty | 0.63 (6%), 0.64 (3%), 0.67 (3%), 0.70 (6%) | 0.22 (27%)
avg. num comments | 0.65 (3%), 0.67 (4%), 0.72 (7%), 0.77 (11%) | 0.28 (28%)
std. discussion novelty | 0.65 (0%), 0.67 (0%), 0.72 (0%), 0.78 (1%) | 0.28 (0%)
4.5. Influential Features
Finally, we identified the most influential features used by the predictor by normaliz-
ing each feature to zero mean and unit variance, and ranking the weights learned by
the predictor (for N = 5) in decreasing order of absolute magnitude (a minimal sketch
of this procedure appears after the list below). Using this method,
the most influential features (in decreasing order of importance) were:
(1) discussion-novelty,
(2) self novelty,
(3) average number of comments to previous posts, and
(4) standard deviation of discussion-novelty.
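A minimal sketch of this influence-ranking procedure is given below; scikit-learn is assumed, and X, y, and feature_names stand for the feature matrix, comment counts, and feature labels used in this section.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rank_feature_influence(X: np.ndarray, y: np.ndarray, feature_names):
    """Standardize each feature to zero mean and unit variance, fit a linear regression,
    and rank the features by the absolute magnitude of their learned weights."""
    Xz = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)      # epsilon guards constant features
    model = LinearRegression().fit(Xz, y)
    order = np.argsort(-np.abs(model.coef_))
    return [(feature_names[i], float(model.coef_[i])) for i in order]
```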
The influential features were all positively correlated with the comment count. We
estimated the relative contribution of each of these features by building a predictor
with the most influential feature, the two most influential features, etc. The area un-
der the ROC curve, for different threshold values, and the rank correlation, are given in
Table III. According to this table, even when using a single feature, it is easier to iden-
tify posts that will receive many (more than 10) comments, compared to identifying
posts that will receive no comments. Furthermore, the average number of comments
to previous posts is most influential for identifying blog posts that will receive 10 or
more comments, and self-novelty also contributes to this prediction task. This is in
line with our previous observation on the relative ease of identifying such posts (see
Section 4.3), which mostly belong to blogs with a long history. Note also the improvement
in rank correlation when self novelty is added to the feature set. This again provides
evidence of the value of adding the novelty-based features for the prediction task. In
contrast, contemporaneous novelty is not one of the most influential features in the
blog domain. However, in the following we will show the dominance of this feature for
popularity prediction in the academic papers domain.
5. PREDICTING ACADEMIC PAPERS POPULARITY
We now further demonstrate how the novelty-based predictor can be used for the task
of citation prediction for academic papers. We first describe the papers dataset that
we used. We then show that the number of citations to a given paper is correlated
with the number of the paper's bookmarks, demonstrating that using paper citations for
popularity measurement is indeed valid. Finally, we describe how the novelty-based
predictor can be used for this task.
5.1. Academic Papers Dataset
Our dataset consists of all papers published in SIGIR between 1991 and 2002. We
chose this dataset because it represents the publications made by a single academic
community, and because enough time has passed since the publication of the last paper in
the collection to allow reasonable exposure time. This dataset consists of 710 papers.
Since we tracked publication histories of authors, we only used papers of authors who
had published more than one paper in that time period. This resulted in 499 papers
published by 663 authors. We measured the number of citations made to a paper, as
provided by Google Scholar at the end of 2009. The number of papers per author and
the number of citations per paper in our dataset follow power-law distributions with
power-law factors k = 2.3 (R² = 0.96) for papers and k = 1.2 (R² = 0.63) for citations,
respectively.
An alternative measure for paper popularity might be the number of times a paper
was bookmarked by readers on a bookmarking website such as CiteULike,^9 which
allows its users to bookmark academic articles. For the 499 papers considered in
this study, we tested this relationship and found a Spearman correlation of 0.44
(p < 10^-10) between the number of citations a paper has and the number of times
it was bookmarked in CiteULike. This relatively high correlation lends strength to our
assumption that paper popularity can be measured by the number of its citations.
For each paper, we extracted similar features to the ones described in Table I. The
first set of features includes the average and standard deviation of the number of ci-
tations for previously published papers by the same author, the number of unique au-
thors who cited papers by the same author, and the paper length. If there are several
coauthors for a paper, we took the maximum, minimum, and average of each attribute
over the coauthors.
Given a paper and time windows T_s and T_c, we then calculated the paper's self-
novelty and contemporaneous-novelty features. It is worth noting that for papers we
did not measure the discussion novelty, since paper citations are not accompanied by
text, in contrast to blog post comments.^10
5.2. Effectiveness of the Novelty-Based Features
Similarly to the blog domain, we attempted to predict whether the number of citations
of a paper will be N or more.
9. www.citeulike.com
10. Actually, the text around a citation might be used for measuring discussion novelty, as anchor text is used
for Web search. We leave this direction to future work.
Table IV. ROC Area and Rank Correlation of the Decision Tree Classifier
            | Novelty-based features | Traditional features | All features
ROC area    | 0.54 | 0.64 | 0.70 (9.3%)
Correlation | 0.09 | 0.26 | 0.35 (35%)
ROC area and rank correlation of the decision tree classifier, when trained to predict N ≥ 10 citations,
with (1) only the novelty-based features, (2) only the traditional features, and (3) all features together.
Table V. Area under the ROC Curve for Various Thresholds (papers)
Threshold N | Percentage of papers | Area under ROC curve
1 | 83% | 0.64
2 | 75% | 0.67
5 | 54% | 0.70
10 | 42% | 0.70
Threshold for the number of predicted citations, the percentage of papers at each threshold, and the area
under the ROC for paper citations prediction.
Because of the relatively short time span of the data, we set T_s = 5 years and T_c = 1 year.
We tested the same two classifiers for prediction: Linear
Regression and a Decision Tree. The classifiers were trained using all data; tenfold
cross validation was used to reduce the chance of overfitting. In this case, the decision
tree predictor gave superior performance for this task, and thus we provide an analysis
of its results below.
In the first set of experiments, we analyzed the contribution of the newly proposed
novelty-based features over the remaining features for prediction. Table IV shows the
ROC area (for predicting 10 or more citations) and the rank correlation of the decision
tree classifier, when trained with (1) novelty-based features, (2) traditional features,
and (3) all features.
Similarly to the results in the blog domain, the novelty-based features contribute
significantly to the prediction accuracy of the expected number of citations, and the
best performance is obtained when all features are used for prediction.
Table V shows different thresholds at which the decision tree predictor was evalu-
ated, as well as the resulting area under the ROC curve, while setting T_s = 5 years and
T_c = 1 year. We can observe from the table that, overall, we obtain similar results
to those in the blog domain, with relatively high prediction accuracy for the number
of citations. We can also see that prediction accuracy increases with N, that is, it is
easier to identify papers with many citations than to identify papers with no citations.
Figure 6 shows the ROC curves obtained using the decision tree predictor, with the
same thresholds as in Table V.
The ROC curves are not as "smooth" as the ROC curves in the blog domain, probably
due to data sparseness in this domain.
Finally, we identified the most influential features for citation prediction by per-
forming sequential forward feature selection [Duda et al. 2001]. Table VI lists the
most influential features, in decreasing order. As this table shows, contemporaneous
novelty is the most influential factor for paper citations.
Fig. 6. Citation prediction accuracy for different thresholds, as measured by the area under the ROC curve,
for T_s = 5 years and T_c = 1 year.
Table VI. Relative Contribution of Each Feature for Predicting the Number of Paper Citations
Feature name | Area under ROC curve (N ≥ 1, N ≥ 2, N ≥ 5, N ≥ 10) | Spearman ρ
min(avg. contemporaneous novelty) | 0.56, 0.56, 0.57, 0.57 | 0.07
max(author activity) | 0.63 (12%), 0.64 (14%), 0.66 (16%), 0.66 (18%) | 0.27 (400%)
min(std. self novelty) | 0.64 (0%), 0.67 (5%), 0.70 (5%), 0.70 (5%) | 0.35 (26%)
max(std. contemporaneous novelty) | 0.64 (0%), 0.67 (1%), 0.70 (0%), 0.70 (0%) | 0.35 (0%)
min(std. contemporaneous novelty) | 0.64 (0%), 0.67 (0%), 0.70 (0%), 0.70 (0%) | 0.35 (0%)
The notation max(m)/min(m) denotes the maximum/minimum of the metric m between the paper
authors.
This is in contrast to the blog domain, where this feature is not found among the top influential features. Interest-
ingly, it is also negatively correlated with citation count, that is, high contemporaneous
novelty indicates low citation count. The number of papers published by the authors
and self novelty are the second and third most influential features. This means that
papers that are similar to other contemporaneous papers, and are published by prolific
authors, are most likely to be cited by peers. The additional improvement in prediction
beyond the first three features is small (less than 1% improvement).
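For completeness, a generic greedy form of sequential forward feature selection [Duda et al. 2001] is sketched below; the score callback (e.g., the cross-validated ROC area of the decision tree trained on a candidate feature subset) is an assumption of the sketch.

```python
def sequential_forward_selection(features, score, max_features=None):
    """Greedy sequential forward selection: starting from the empty set, repeatedly add the
    single feature that most improves the score of the selected subset; stop when no
    candidate improves it (or when max_features is reached)."""
    selected, remaining = [], list(features)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        new_score, best_feature = max((score(selected + [f]), f) for f in remaining)
        if new_score <= best_score:
            break
        best_score, selected = new_score, selected + [best_feature]
        remaining.remove(best_feature)
    return selected, best_score
```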
6. SUMMARY
This work studies the relationship between novelty and popularity of user-generated
content. Our results show that people are more likely to gain public attention by pub-
lishing novel content. Attractive posts are those that are novel with respect to previous
content published by the same author, as well as novel with respect to contemporane-
ously published content, at least in some domains. In addition, stimulating posts that
are able to trigger online discussion have high potential to become popular.
By measuring the three novelty dimensions of a blog post, we demonstrated how
the novelty-based features can contribute to the task of predicting the number of com-
ments to the post, which reflects its popularity. We also demonstrated how the same
novelty-based features can assist in predicting the citation volume of academic papers.
Interestingly, in contrast to the blog domain, contemporaneous-novelty was found to
be negatively correlated with the paper’s citation count. This result suggests that in
order to gain popularity, academic articles should not be radically different from other
articles published in the same time period. In other words, popular articles are those
that address issues of current interest to the community.
The datasets we used in this work are free of the spam that is prevalent in user-
generated content, especially in the blog domain [Agichtein et al. 2008]. Spam, in the
context of this work, can hinder the identification of good-quality UGC, as spammers
can easily feign novelty by "injecting" some randomness into their posts [Mishne
et al. 2005]. We leave the separation of spam from novel content for future work.
Furthermore, we note that the overall prediction accuracy of our predictors is not
very high, even when using the novelty-based features, probably due to the high com-
plexity of the popularity prediction task. While novelty-based features significantly
contribute to the prediction accuracy, it seems that high-quality prediction of post pop-
ularity is still an open challenge and deserves further exploration of more valuable
features for that prediction task.
The positive correlation between content novelty and popularity might seem
obvious, as good content is expected to be novel. However, this is the first study, to the
best of our knowledge, in which this relationship has been examined and verified system-
atically, in the context of the blogosphere and academic writing.
Our work can be extended in several directions. First, we note that there might
be more effective ways of calculating novelty. A comparative study is required to
investigate the applicability of different novelty measures for the task of popularity
prediction. Second, we observe that in the blogosphere, self-novelty seems to dominate
contemporaneous-novelty with respect to its importance for popularity prediction; for
scientific papers, we observe the opposite trend. Therefore, it might be interesting to
explore in more depth the differences and relationships between the two. Furthermore,
following the discrepancy between the two domains studied in this article, it would
be interesting to study UGC popularity in other domains such as social media (e.g.,
Facebook, Twitter) and Q/A (e.g., Yahoo! Answers, Quora). Finally, we wish to examine
new novelty measures and their relationship with popularity in more domains (e.g.,
audio-visual content) for which such novelty measures can be captured. Furthermore,
it would be interesting to examine novelty measures that also consider content
semantics.
REFERENCES
AGARWAL, N., LIU, H., TANG, L., AND YU, P. S. 2008. Identifying the influential bloggers in a community.
In Proceedings of the International Conference on Web Search and Web Data Mining (WSDM'08). ACM,
New York, NY, 207–218.
AGICHTEIN, E., CASTILLO, C., DONATO, D., GIONIS, A., AND MISHNE, G. 2008. Finding high-quality content
in social media. In Proceedings of the International Conference on Web Search and Web Data Mining
(WSDM'08). ACM, New York, NY, 183–194.
ALLAN, J., WADE, C., AND BOLIVAR, A. 2003. Retrieval and novelty detection at the sentence level. In
Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR'03). ACM, New York, NY, 314–321.
AMITAY, E., YOGEV, S., AND YOM-TOV, E. 2007. Serial sharers: Detecting split identities of web authors.
In Proceedings of the SIGIR'07 Workshop on Plagiarism Analysis, Authorship Identification, and Near-
Duplicate Detection.
BENEDETTO, D., CAGLIOTI, E., AND LORETO, V. 2002. Language trees and zipping. Phys. Rev. Lett. 88, 4.
BISCHOFF, K., FIRAN, C. S., GEORGESCU, M., NEJDL, W., AND PAIU, R. 2009. Social knowledge-driven
music hit prediction. In Proceedings of the 5th International Conference on Advanced Data Mining and
Applications (ADMA'09). Springer, 43–54.
CARMEL, D., ROITMAN, H., AND YOM-TOV, E. 2010. On the relationship between novelty and popularity of
user-generated content. In Proceedings of the 19th ACM International Conference on Information and
Knowledge Management (CIKM'10).
CASTILLO, C., DONATO, D., AND GIONIS, A. 2007. Estimating number of citations using author reputation.
In Proceedings of the 14th International Conference on String Processing and Information Retrieval
(SPIRE'07). Springer, 107–117.
CHA, M., KWAK, H., RODRIGUEZ, P., AHN, Y.-Y., AND MOON, S. 2007. I tube, you tube, everybody tubes:
Analyzing the world's largest user generated content video system. In Proceedings of the 7th ACM
SIGCOMM Conference on Internet Measurement (IMC'07). ACM, New York, NY, 1–14.
CHAI, K., POTDAR, V., AND DILLON, T. 2009. Content quality assessment related frameworks for social
media. In Proceedings of the International Conference on Computational Science and Its Applications
(ICCSA'09). Springer, 791–805.
CILIBRASI, R. AND VITÁNYI, P. M. B. 2005. Clustering by compression. IEEE Trans. Inf. Theory 51,
1523–1545.
CILIBRASI, R., VITÁNYI, P., AND DE WOLF, R. 2004. Algorithmic clustering of music based on string
compression. Comput. Music J. 28, 4, 49–67.
DE CHOUDHURY, M., SUNDARAM, H., JOHN, A., AND SELIGMANN, D. D. 2009. What makes conversations
interesting? Themes, participants and consequences of conversations in online social media. In
Proceedings of the 18th International Conference on World Wide Web (WWW'09). ACM, New York, NY,
331–340.
DIETZ, L., BICKEL, S., AND SCHEFFER, T. 2007. Unsupervised prediction of citation influences. In
Proceedings of the 24th International Conference on Machine Learning (ICML'07). ACM, New York, NY,
233–240.
DUDA, R., HART, P., AND STORK, D. 2001. Pattern Classification. John Wiley and Sons, Inc., New York, NY.
GEHRKE, J., GINSPARG, P., AND KLEINBERG, J. M. 2003. Overview of the 2003 KDD Cup. SIGKDD Explor.
5, 2, 149–151.
GRUHL, D., GUHA, R., LIBEN-NOWELL, D., AND TOMKINS, A. 2004. Information diffusion through
blogspace. In Proceedings of the 13th International Conference on World Wide Web (WWW'04). ACM,
New York, NY, 491–501.
HASAN DALIP, D., ANDRÉ GONÇALVES, M., CRISTO, M., AND CALADO, P. 2009. Automatic quality assess-
ment of content created collaboratively by web communities: A case study of Wikipedia. In Proceed-
ings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'09). ACM, New York, NY,
295–304.
HSU, C.-F., KHABIRI, E., AND CAVERLEE, J. 2009. Ranking comments on the social web. In Proceedings of
the International Conference on Computational Science and Engineering. Vol. 4, 90–97.
HUH, J., JONES, L., ERICKSON, T., KELLOGG, W. A., BELLAMY, R. K. E., AND THOMAS, J. C. 2007.
BlogCentral: The role of internal blogs at work. In CHI'07 Extended Abstracts on Human Factors in
Computing Systems. ACM, New York, NY, 2447–2452.
KEMPE, D., KLEINBERG, J., AND TARDOS, E. 2003. Maximizing the spread of influence through a social
network. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (KDD'03). ACM, New York, NY, 137–146.
KHABIRI, E., HSU, C.-F., AND CAVERLEE, J. 2009. Analyzing and predicting community preference of so-
cially generated metadata: A case study on comments in the Digg community. In Proceedings of the AAAI
International Conference on Weblogs and Social Media (ICWSM'09).
KUMAR, R., NOVAK, J., RAGHAVAN, P., AND TOMKINS, A. 2005. On the bursty evolution of blogspace. World
Wide Web 8, 2, 159–178.
LERMAN, K. AND HOGG, T. 2010. Using a model of social dynamics to predict popularity of news. In Pro-
ceedings of the 19th International Conference on World Wide Web (WWW'10). ACM, New York, NY,
621–630.
MISHNE, G. AND GLANCE, N. 2006. Leave a reply: An analysis of weblog comments. In Proceedings of the
3rd Annual Workshop on the Weblogging Ecosystem (WWW'06).
MISHNE, G., CARMEL, D., AND LEMPEL, R. 2005. Blocking blog spam with language model disagree-
ment. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web
(AIRWeb).
NARDI, B. A., SCHIANO, D. J., GUMBRECHT, M., AND SWARTZ, L. 2004. Why we blog. Comm. ACM 47, 12,
41–46.
ROITMAN, H., CARMEL, D., AND YOM-TOV, E. 2008. Maintaining dynamic channel profiles on the web.
Proc. VLDB Endow. 1, 151–162.
SALOMON, D. 2004. Data Compression: The Complete Reference. Springer.
SOBOROFF, I. AND HARMAN, D. 2005. Novelty detection: The TREC experience. In Proceedings of the Human Language Technology Conference (HLT’05). Association for Computational Linguistics, Morristown, NJ, 105–112.
SONG, X., CHI, Y., HINO, K., AND TSENG, B. 2007. Identifying opinion leaders in the blogosphere. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM’07). ACM, New York, NY, 971–974.
SZABO, G. AND HUBERMAN, B. A. 2010. Predicting the popularity of online content. Comm. ACM 53, 80–88.
TSAGKIAS, M., WEERKAMP, W., AND DE RIJKE, M. 2009. Predicting the volume of comments on online news stories. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM’09). ACM, New York, NY, 1765–1768.
Received November 2010; revised March 2011; accepted May 2011