Diversionary Comments under Political Blog Posts
Jing Wang
University of Illinois at Chicago
jwang69@uic.edu
Clement T. Yu
University of Illinois at Chicago
yu@cs.uic.edu
Philip S. Yu
University of Illinois at Chicago
psyu@uic.edu
Bing Liu
University of Illinois at Chicago
liub@cs.uic.edu
Weiyi Meng
SUNY at Binghamton
meng@cs.binghamton.edu
ABSTRACT
An important issue that has been neglected so far is the iden-
tification of diversionary comments. Diversionary comments
under political blog posts are defined as comments that de-
liberately twist the bloggers’ intention and divert the topic
to another one. The purpose is to distract readers from the
original topic and draw attention to a new topic. Given that
political blogs have significant impact on the society, we be-
lieve it is imperative to identify such comments. We then
categorize diversionary comments into 5 types, and propose
an effective technique to rank comments in descending or-
der of being diversionary. To the best of our knowledge, the
problem of detecting diversionary comments has not been
studied so far. Our evaluation on 2,109 comments under 20
different blog posts from Digg.com shows that the proposed
method achieves the high mean average precision (MAP) of
92.6%. Sensitivity analysis indicates that the effectiveness
of the method is stable under different parameter settings.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval-Information filtering, retrieval models,
selection processes
Keywords
diversionary comments, spam, topic model, LDA, corefer-
ence resolution, extraction from Wikipedia
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CIKM’12, October 29–November 2, 2012, Maui, HI, USA.
Copyright 2012 ACM 978-1-4503-1156-4/12/10 ...$15.00.

1. INTRODUCTION
As a strong force shaping public opinion, blog comments attract attention from people with different backgrounds. Ideally, commentators write their truthful opinions to help shape and build the content of blog posts. However, in practice, various types of unrelated comments are also written by people with different intentions. For instance, companies post advertisements to promote products, and politicians or their supporters leave comments to divert the readers' concerns to another political issue. Many kinds of these unrelated comments in the blogosphere have drawn interest from researchers. One type of unrelated comments has hyperlinks to commercial-oriented pages, and is defined as comment spam [4]. Various initiatives have been taken to reduce comment spam. However, we did not find any study on detecting comments that try to deliberately divert readers' attention to another topic. Based on a study of 10,513 comments for 115 political blog posts from Digg.com, we found 39.5% of comments trying to divert the discussion topic. Furthermore, according to research by Brigham Young University [1], most people who closely follow both political blogs and traditional news media tend to believe that the content in the blogosphere is more trustworthy. Given such significant impact of political blog posts and comments, the existence of a large number of diversionary comments would have a considerably negative effect, since they not only twist the blogger's original intention, but also confuse the public. Therefore, we believe it is imperative to identify such comments, especially under the political category.
In this paper, we define comments diverting to unrelated
topics as diversionary comments, and we focus our work
on comments under political blog posts. Based on our ob-
servation, there are generally five types of diversionary com-
ments, which are listed below (the type distribution among
diversionary comments is also given based on a manually
labeled data set of 2,109 comments for 20 randomly chosen
blog posts):
Type 1 (63.1%) (Comments diverting to different topics):
Those that twist the post content's intention or purposely divert the discussion to a topic that is different from the content of the post. One example of this type is that, given a post which discusses the risky rush to cut defense spending, commentators write about reducing social security spending without mentioning defense spending. This tries to steal the public's attention away from defense spending and direct it to social security spending.
Type 2 (19.5%) (Comments making personal attacks on commentators):
Those that comment on the behavior of some preceding commentators without discussing anything related to the topic of the post. An example of this type is "What's the matter with you? Are you only posting at the very lowest level of threads so you don't deal with responses?"
Type 3 (7.3%) (Comments with little content):
Those that lack content and only contain words such as "lol" and "hahaha". Even though they might express agreement or disagreement with the preceding commentators or the content of the blog post, their relatedness to the post content is not clear, and they are therefore considered diversions.
Type 4 (5.8%) (Comments about the hosting website only):
Those that complain about the poor functioning of the blog hosting website. We consider them unrelated to the post content. An example diversionary comment of this type is "Everyone should boycott Digg on Monday". In this comment, "Digg" is the hosting website.
Type 5 (4.3%) (Advertisements):
Those that introduce products or refer to companies, both of which are unrelated to the post content.
This paper proposes an effective unsupervised method,
which aims to rank comments in descending order of be-
ing diversionary. The method is based on the intuition that
the relatedness between two documents can be measured by
their similarity. A related comment should have high simi-
larity with the post content or with the preceding comment
it replies to, while a diversionary comment should have low
similarities with both the post content and the preceding
comment. Our approach tries to first represent each com-
ment and post by a vector, then to use a similarity function
to compute the relatedness between each comment and the
post, and that between each comment and the comment it
replies to, and finally rank comments based on these similar-
ities. However, the following reasons make it a challenging
task.
(1) It is difficult to find an accurate representation for
each comment and the post. Comments are relatively short
and can only offer limited literal information. A simplistic
way of applying term frequencies to build document vectors
would yield low accuracies, because a related comment may
not share enough words with the post, while a diversionary
comment may share significant words with the post.
(2) Pronouns and hidden knowledge in the comments and
post are other obstacles to accurate representations. Firstly,
many commentators use pronouns to represent the person
or issue mentioned in the post. Without mapping pronouns
to their corresponding proper nouns or phrases, the number
of occurrences of the person or issue cannot be captured
accurately. Secondly, comments under political posts often
indicate political figures or events, which are not explicitly
mentioned in the post but are closely related to the post
content. Thirdly, many words or phrases, though different,
may refer to the same topics. Thus when two comments
contain different words but refer to the same topics, their
representations are different but ideally should be similar.
(3) A commentator may write to reply to the post directly,
but may also write to follow a preceding comment. Most
blog hosting websites offer a reply-to hierarchy for commen-
tators. However, many comments do not follow the hierar-
chy, which makes it difficult to find what a comment replies
to.
The main contributions of this paper are as follows:
(1) It proposes the new problem of identifying diversionary
comments and makes the first attempt to solve the problem
in the political blogosphere.
(2) It introduces several rules to accurately locate what a
comment replies to.
(3) It proposes an effective approach to identify diversionary comments, which first applies coreference resolution [3] to replace pronouns with corresponding proper nouns or phrases, extracts related information from Wikipedia [10] for proper nouns in comments and the post, utilizes the topic modeling method LDA [6] to group related terms into the same topics, represents comments and the post by their topic distributions, and then ranks comments in descending order of being diversionary.
(4) A data set, which consists of 2,109 comments under 20
different political blog posts from Digg.com, was annotated
by 5 annotators with substantial agreement. Experiments
based on the data set are performed to verify the effective-
ness of the proposed approach versus various baseline meth-
ods. The proposed method achieves 92.6% in mean average
precision (MAP)[2]. In addition, its effectiveness remains
high under different parameter settings.
2. RELATED WORK
By analyzing different types of diversionary comments, we
realize that types 2, 4 and 5 belong to the traditional spam
in different contexts. Therefore, we investigate related work
on various types of spam detection in this section.
The most investigated types of spam are the web spam
[7, 8] and email spam [5, 9]. Web spam can be classified
into content spam and link spam. Content spam involves
adding irrelevant words in pages to fool search engines. In
the environment of our study, the commentators do not add
irrelevant words as they want to keep their comments read-
able. Link spam is the spam of hyperlinks, and comment
spam [4, 14] is a form of it, but as we discussed in the
previous section, diversionary comments seldom contain hy-
perlinks. Email spam targets individual users with direct
mail messages, and is usually sent as unsolicited and nearly
identical commercial advertisements. However, diversionary
comments are mostly not commercial oriented and may not
contain the same kind of features. In addition, comments are
written with a context of the post and preceding comments,
while emails are written independently.
Another related line of research is opinion spam detection [11],
though it is not conducted in the blogosphere. Jindal and
Liu regard untruthful or fake reviews aiming at promoting
or demoting products as opinion spam. They detected spam
reviews based on supervised learning and manually labeled
examples. They detected untruthful reviews by using dupli-
cate and near-duplicate reviews as the training data. How-
ever, diversionary comments are different because they are
not necessarily untruthful or fake. In addition, we aim to
automatically identify all types of diversionary comments
without incurring the expensive task of manually collecting
and labeling of training data.
3. COMMENTS DATA ANALYSIS
A standard hierarchy of post-comments in Digg is as fol-
lows. Under each post, each comment consists of 4 features
(username, written time, comment level, comment content).
Comments with “comment level” of (n+1) are designed to
reply to preceding comments of level n. In addition, if a
comment’s level is 0, then it is supposed to reply to the post
content directly.
Under such a hierarchy, we believe that a relevant com-
ment can be one related to the post content directly, and can
also have a close relation with the preceding comment that
it replies to, while a diversionary comment is unrelated to
both the post content and the comment it replies to. There-
fore, finding what a comment replies to is necessary for the
identification of diversionary comments.
3.1 Finding what a comment replies to
In most cases, a comment at level 0 replies to the post
content, and a comment at level (n + 1) replies to a comment
at level n. However, in practice, not all commentators follow
such rules when writing comments. Therefore, besides the
feature of “level”, we need to combine other features such as
written time and username to locate a comment’s reply-to
comment. We use the following heuristics to find a com-
ment’s potential reply-to comments.
Assume comment A is at level n and written at time t, while a candidate reply-to comment is written at time t′.
(1) If comment A's content contains username information such as "@username_j", then among the comments which precede comment A and are written by "username_j", the reply-to comment of A is the one that has the smallest positive value of (t − t′);
(2) Among all comments which precede comment A and have level (n − 1), the reply-to comment of A may be the one that has the smallest positive value of (t − t′);
(3) Among all comments which precede comment A and have level n, the reply-to comment of A may be the one that has the smallest positive value of (t − t′);
(4) Among all comments which precede comment A, the reply-to comment of A may be the one that has the smallest positive value of (t − t′), no matter what its level is.
(5) If a comment B satisfies condition (1), then B is A's reply-to comment; otherwise, all comments which satisfy any of conditions (2), (3) or (4) are considered potential reply-to comments. If there is only one potential reply-to comment, we consider it the final reply-to comment. However, if there are multiple potential reply-to comments, we compare the similarities between the comment and all of its potential reply-to comments, and among all potential ones, we choose the one that has the largest similarity.
However, some comments reply to the post content di-
rectly instead of to other comments. The first comment of
the post definitely replies to the post. For each of the other
comments which have the level of 0, when its similarity with
the post is greater than its similarity with its potential reply-
to comments, and is greater than a specified threshold t, we
consider it replying to the post directly.
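The heuristics above can be sketched as follows. This is our own minimal illustration, not the paper's code: the Comment record and the pairwise similarity function sim are assumptions we introduce for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Comment:
    author: str
    time: float   # written time
    level: int    # reply-to hierarchy level
    text: str

def find_reply_to(i: int, comments: List[Comment],
                  sim: Callable[[str, str], float]) -> Optional[int]:
    """Return the index of comment i's reply-to comment, or None."""
    a = comments[i]
    earlier = [j for j in range(i) if comments[j].time < a.time]

    # Rule (1): an explicit "@username" mention wins outright;
    # among that author's comments, pick the closest in time.
    for j in sorted(earlier, key=lambda j: a.time - comments[j].time):
        if "@" + comments[j].author in a.text:
            return j

    # Rules (2)-(4): the closest-in-time earlier comment at level n-1,
    # at level n, and at any level are all potential reply-to comments.
    candidates = set()
    for pred in (lambda j: comments[j].level == a.level - 1,
                 lambda j: comments[j].level == a.level,
                 lambda j: True):
        pool = [j for j in earlier if pred(j)]
        if pool:
            candidates.add(min(pool, key=lambda j: a.time - comments[j].time))

    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates.pop()
    # Rule (5): break ties by content similarity.
    return max(candidates, key=lambda j: sim(a.text, comments[j].text))
```

A simple word-overlap function can stand in for sim when trying the sketch out; the paper's actual similarity is computed over topic distributions, as described in Section 4.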
4. DIVERSIONARY COMMENTS IDENTIFICATION
In this section, we present the proposed techniques for
identifying diversionary comments. We first explain each
strategy that we use to exploit the pronouns and hidden
knowledge, and the algorithm we use to rank comments.
We then discuss the pipeline of our method.
4.1 Techniques
As we mentioned in the previous section, a diversionary
comment is unrelated to both the post content and the reply-
to comment. Typical similarity functions such as the Co-
sine [15] function and the KL-Divergence [12] can be used
to measure the relatedness between two documents. However, our experiments show that a simplistic application of these similarity functions yields inaccurate results; therefore, we add the following techniques to compute the pairwise relatedness more accurately.
4.1.1 Coreference Resolution
Coreference resolution groups all the mentioned entities in
a document into equivalence classes so that all the mentions
in a class refer to the same entity. By applying coreference
resolution, pronouns are mapped into the proper nouns or
other noun phrases. If we replace pronouns with their cor-
responding words or phrases, then the entities become more
frequent. For example, a post which talks about President
Obama’s accomplishments, only mentions “Obama” once,
but uses “he” several times. Without coreference resolution,
the word “Obama” only occurs once. However, with corefer-
ence resolution, “he” will be replaced by “Obama”, and the
frequency of “Obama” increases. In this paper, we use the
Illinois coreference package [3].
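The substitution step can be illustrated with a small sketch. We do not reproduce the Illinois coreference package's API here; the pronoun-to-mention mapping is assumed to come from a resolver and is supplied directly.

```python
import re

def substitute_pronouns(text: str, clusters: dict) -> str:
    """Replace pronoun tokens with the representative mention of their
    coreference cluster, e.g. {"he": "Obama"}. The mapping would come
    from a coreference resolver; here it is given as input."""
    if not clusters:
        return text
    # match whole tokens only, case-insensitively
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, clusters)) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: clusters.get(m.group(0).lower(), m.group(0)),
                       text)
```

After this substitution, the term frequency of "Obama" in the running example increases by one for every replaced "he", which is exactly the effect the paper relies on.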
4.1.2 Extraction from Wikipedia
When a post talks about President Hu Jintao’s visit to
U.S., a comment which discusses the foreign policy of China
will be considered relevant. However, the post does not men-
tion the word “China”, and it does not share any words with
the comment. A similarity function such as Cosine which
utilizes words in common would yield a small value be-
tween the post and the comment. Wikipedia comes to help
here, which offers a vast amount of domain-specific world
knowledge. In the above example, if we search “President
Hu Jintao” in Wikipedia, we will find the information that
President Hu Jintao is the current president of the People’s
Republic of China. However, Wikipedia offers much more
knowledge than is needed in the analysis of the post or com-
ments. In order to avoid adding noise, we only pick up an-
chor texts in the first paragraph from the searched webpage
since this information is believed to be most related.
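The anchor-text extraction can be sketched with the standard-library HTML parser (our own illustration; fetching the page and handling Wikipedia's disambiguation pages are out of scope here).

```python
from html.parser import HTMLParser

class FirstParagraphAnchors(HTMLParser):
    """Collect the text of <a> tags inside the first <p> element only."""
    def __init__(self):
        super().__init__()
        self.in_first_p = False  # currently inside the first <p>
        self.done = False        # first <p> already closed
        self.in_a = False        # currently inside an <a> in the first <p>
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.in_first_p = True
        elif tag == "a" and self.in_first_p:
            self.in_a = True
            self.anchors.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.in_first_p:
            self.in_first_p, self.done = False, True
        elif tag == "a":
            self.in_a = False

    def handle_data(self, data):
        if self.in_a:
            self.anchors[-1] += data

def anchor_terms(page_html: str) -> list:
    """Return the anchor texts of the first paragraph of a page."""
    parser = FirstParagraphAnchors()
    parser.feed(page_html)
    return parser.anchors
```

The returned anchor texts ("People's Republic of China" in the Hu Jintao example) are then added to the document's term list before LDA is run.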
4.1.3 Latent Dirichlet Allocation (LDA) [6]
LDA places different terms, which are related and co-
occur frequently, into the same topics with high probabil-
ities. Each term can be represented as a vector of topics.
Thus, two related terms which share some topics together
will have a positive similarity.
In general, a document-topic distribution can be obtained in the LDA model using Gibbs sampling, and it is given by formula (1) [16]:

    θ_{dj} = (C^{DT}_{dj} + α) / (Σ_{k=1}^{T} C^{DT}_{dk} + T·α)    (1)

Here, D and T stand for the documents and the number of topics respectively, C^{DT}_{dj} is the number of occurrences of terms in document d which have been assigned to topic j, and α is a smoothing constant. Based on formula (1), the distribution of a document over a set of topics can be estimated. Then we can compute the similarity between two documents by using their topic distributions as vectors.
Using Gibbs sampling, a term-topic distribution is also obtained, and it is given by formula (2) [16]:

    φ_{ij} = (C^{WT}_{ij} + β) / (Σ_{k=1}^{W} C^{WT}_{kj} + W·β)    (2)

Here, W and T stand for the number of terms and topics respectively, C^{WT}_{ij} is the number of times that term i has been assigned to topic j, and β is a smoothing constant.
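Formulas (1) and (2) can be computed directly from the Gibbs-sampler count matrices; the sketch below is our own illustration, with plain Python lists standing in for those counts.

```python
def doc_topic_dist(C_DT, alpha):
    """Formula (1): theta_dj = (C_dj + alpha) / (sum_k C_dk + T*alpha).
    C_DT[d][j] = occurrences in document d assigned to topic j."""
    T = len(C_DT[0])
    return [[(row[j] + alpha) / (sum(row) + T * alpha) for j in range(T)]
            for row in C_DT]

def term_topic_dist(C_WT, beta):
    """Formula (2): phi_ij = (C_ij + beta) / (sum_k C_kj + W*beta).
    C_WT[i][j] = times term i was assigned to topic j."""
    W, T = len(C_WT), len(C_WT[0])
    col_sums = [sum(C_WT[i][j] for i in range(W)) for j in range(T)]
    return [[(C_WT[i][j] + beta) / (col_sums[j] + W * beta) for j in range(T)]
            for i in range(W)]
```

Each row of theta sums to one (a per-document distribution over topics), and each column of phi sums to one (a per-topic distribution over terms), which makes the rows of theta directly usable as document vectors.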
1791
4.1.4 LDA on training and test data
In order to build an accurate LDA model, a substan-
tial amount of data is required. A post and its associated
comments usually have limited amount of data. To obtain
enough data, we submit the title of the post as a query to
search engines and obtain the first 600 documents as pre-
liminary data to build an LDA model. We denote this data
as the training data. Then we build another LDA model
on the test data (which is the set of comments of the post),
but the term-topic distribution from the LDA model on the
training data is utilized in the following way: when running
Gibbs sampling to determine the topic assignment for each
term occurrence in the test data, if the term has appeared
in the training data, its term-topic distribution from the
LDA model on the training data is used, but if the term
only appears in the test data, the above formula (2) is ap-
plied to decide the topic assignment. At the same time, the
document-topic distribution for documents in the test data
is obtained based on the above formula (1). Then after this
LDA model is built, we can use the topic distributions as
the document vectors to compute pairwise similarities.
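The folding-in step described above can be sketched as one topic draw for a single term occurrence. This is our own simplified illustration, not the authors' implementation; the data structures (a dict of training-term distributions, per-document and per-word topic counts) are assumptions.

```python
import random

def sample_topic(word, doc_counts, test_word_counts, train_phi,
                 alpha, beta, n_topics, vocab_size, rng=random.random):
    """One Gibbs-sampling topic draw for a single term occurrence.
    For words seen in the training corpus, the training model's
    term-topic distribution is used; otherwise we fall back on
    formula (2) computed from the test-data counts."""
    weights = []
    for j in range(n_topics):
        if word in train_phi:                  # word known from training
            phi = train_phi[word][j]
        else:                                  # formula (2) on test data
            cnt = test_word_counts.get(word, [0] * n_topics)
            total_j = sum(test_word_counts[w][j] for w in test_word_counts)
            phi = (cnt[j] + beta) / (total_j + vocab_size * beta)
        weights.append(phi * (doc_counts[j] + alpha))  # p(z=j) up to a constant
    # draw topic j proportional to the weights
    r, acc = rng() * sum(weights), 0.0
    for j, w in enumerate(weights):
        acc += w
        if r <= acc:
            return j
    return n_topics - 1
```

Running such draws over all term occurrences in the test corpus, while keeping per-document counts updated, yields the document-topic distributions via formula (1).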
4.1.5 Rank comments in descending order of being
diversionary
Algorithm 1 Rank comments in descending order of being diversionary
Constants t, t1, t2, t3, t4, where t > 0, t1 ≤ t3, and t2 ≤ t4
for each comment do
    C1 = the similarity between the comment and the post;
    C2 = the similarity between the comment and its reply-to comment;
    if its level = 0 and C1 > C2 and C1 > t then
        C2 = C1;
    end if
    if (C1 < t1 and C2 < t2) then
        Put the comment into the potential diversionary list (PDL);
    else if (C1 > t3 or C2 > t4) then
        Put the comment into the potential non-diversionary list (PNDL);
    else
        Put the comment into the intermediate list (IL);
    end if
end for
Sort comments in PDL in ascending order of sum(C1, C2);
Sort comments in IL in ascending order of max(C1 − t1, C2 − t2);
Sort comments in PNDL in ascending order of max(C1 − t3, C2 − t4);
Output comments in PDL, followed by comments in IL, followed by comments in PNDL.
According to the property that a diversionary comment is unrelated to both the post content and its reply-to comment, if a comment has small similarities with both the post and the reply-to comment, there is a high probability for it to be diversionary. As a consequence, we set two thresholds t1 and t2 such that if a comment's similarity with the post (C1) is less than t1 and its similarity with the reply-to comment (C2) is less than t2, then it is placed into a list called the potential diversionary list (PDL). In contrast, if a comment has a big enough similarity either with the post or with its reply-to comment, it is very unlikely to be diversionary. As a result, we set two thresholds t3 and t4 such that if the similarity of a comment with the post is higher than t3, or its similarity with its reply-to comment is higher than t4, then it is placed into a list called the potential non-diversionary list (PNDL). Comments which belong to neither of the above two lists are placed into an intermediate list (IL). Comments in this list do not have high probabilities of being diversionary relative to those in PDL; nor do they have high probabilities of being non-diversionary compared to those in PNDL. Thus, comments in PDL are placed ahead of comments in IL, which are ahead of comments in PNDL. Based on the above analysis, we use Algorithm 1 to rank comments.
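The three-list ranking can be sketched in a few lines. This is our own minimal illustration: comments are (id, C1, C2) tuples, the fixed numeric thresholds stand in for the paper's percentile-based ones, and the level-0 adjustment (setting C2 = C1) from Algorithm 1 is omitted for brevity.

```python
def rank_diversionary(comments):
    """Rank comments in descending order of being diversionary.
    `comments` is a list of (id, C1, C2) tuples, where C1 is the
    similarity with the post and C2 with the reply-to comment."""
    t1, t2, t3, t4 = 0.10, 0.20, 0.50, 0.90   # t1 <= t3, t2 <= t4
    pdl, il, pndl = [], [], []
    for cid, c1, c2 in comments:
        if c1 < t1 and c2 < t2:
            pdl.append((cid, c1, c2))          # likely diversionary
        elif c1 > t3 or c2 > t4:
            pndl.append((cid, c1, c2))         # likely non-diversionary
        else:
            il.append((cid, c1, c2))           # intermediate
    pdl.sort(key=lambda x: x[1] + x[2])
    il.sort(key=lambda x: max(x[1] - t1, x[2] - t2))
    pndl.sort(key=lambda x: max(x[1] - t3, x[2] - t4))
    return [cid for cid, _, _ in pdl + il + pndl]
```

Within PDL, the comment with the smallest combined similarity comes first, i.e. the one most likely to be diversionary heads the output ranking.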
4.2 Pipeline of the Proposed Method
Our proposed method combines all the techniques dis-
cussed above to identify diversionary comments. Each step
in the procedure is described below:
(1) Submit each post title as a query to search engines
and retrieve the first 600 web pages. We extract contents
from them as the training corpus. The test corpus consists
of each post and the associated comments.
(2) Apply coreference resolution to each document in the
training corpus and the test corpus separately, and replace
pronouns with their corresponding proper nouns or phrases.
(3) Identify proper nouns in the test data and search them
through Wikipedia. For each proper noun, if an unambigu-
ous page is returned, terms in the anchor texts in the first
paragraph of the page are added into the document.
(4) Build an LDA model based on the training and test
data as discussed in section 4.1.4. The document-topic dis-
tribution of each document in the test corpus is obtained.
(5) According to the rules described in Section 3.1, com-
pute the similarity between each comment and the post,
and the similarities between each comment and its potential
reply-to comments in the test corpus and then decide what
a comment replies to.
(6) Rank comments (the test corpus) based on the algo-
rithm in 4.1.5.
5. EVALUATION
For the experiments of this work, we use a data set col-
lected from the politics category in Digg.com, which contains
2,109 comments under 20 different political posts. Each post
contains around 100 comments. The corpus is annotated by
5 annotators, and they resolve their disagreement in the an-
notations together. We consider the final annotation as the
gold standard.
5.1 Diversionary Comments Distribution
In this section, we report diversionary comments distri-
bution variation. Based on the gold standard, there are 834
diversionary comment in the test corpus, which account for
39.5% of all comments. We observe that most posts contain
35% to 45% of diversionary comments, and among all diver-
sionary comments, type 1 is the most significant while type
5 takes a relatively low percentage, which also indicates that
diversionary comments studied in this work are not commer-
cially oriented, but focus on those deliberately diverting to
other topics.
5.2 Experimental Results
As we proposed in section 4, our method consists of sev-
eral techniques. In order to test the necessity of combin-
ing them, we perform experiments by comparing our final
method with baseline methods which only apply one tech-
nique or combine fewer techniques. The effectiveness of each
method is measured by mean average precision (MAP) [13].
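As a reference point, MAP over ranked comment lists can be computed as follows (our own sketch; a comment labeled diversionary counts as the "relevant" item here).

```python
def average_precision(ranked_labels):
    """AP for one post: ranked_labels[i] is True if the i-th ranked
    comment is diversionary."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)   # precision at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """MAP over all posts: the mean of the per-post average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

A perfect ranking, with all diversionary comments ahead of all non-diversionary ones, gives a MAP of 1.0.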
In order to keep consistency among all the methods to be compared, we set the parameters t, t1, t2, t3 and t4, which are required in the ranking algorithm presented in Algorithm 1, using fixed percentiles. In the section below, we set t to 50%, t1 to 10%, t2 to 20%, t3 to 50%, and t4 to 90%. When LDA is applied, the number of topics is set to 10, α to 0.1, and β to 0.01.

Table 1: MAP for Cosine with different techniques
Techniques / MAP (%)            Baseline  With Coref  With Wiki  With Coref+Wiki
Term Frequency                    69.9       70.4        71.7        72.2
LDA on test data                  57.4       58.0        61.1        60.1
LDA on training and test data     75.4       76.7        80.8        92.6
5.2.1 Comparison Results
We compare the following methods firstly: Cosine sim-
ilarity with term frequency, Cosine similarity with coref-
erence resolution, Cosine similarity with extraction from
Wikipedia, and Cosine similarity with both coreference res-
olution and extraction from Wikipedia. All these methods
represent comments and the post by building vectors based
on term frequencies. From the first row of Table 1, we ob-
serve that Cosine similarity with term frequency has the low-
est MAP value (69.9%), while Cosine similarity with both
coreference resolution and extraction from Wikipedia per-
forms the best (72.2%). Yet, even the best result is far from
being acceptable. The reasons for these poor results are ob-
vious. The Cosine similarity is incapable of matching a doc-
ument with another document if they have related but differ-
ent terms. This mismatch can be alleviated to some extent
by coreference resolution and extraction from Wikipedia.
However, many related terms remain unmatched.
In the second row of Table 1, the LDA model is built on the test data alone, and we represent comments and the post by their topic distributions. However, the results are also poor. When coreference resolution, extraction from Wikipedia, or both of them are combined with LDA, better results are obtained. However, even the best result has a mean average precision of only 61.1%. The reason for such a poor result is that the amount of test data is too small for LDA to identify related terms.
In the third row of Table 1, the LDA model is built on the training data and the test data, and we rank comments based on the similarities of their topic distributions. When coreference resolution and extraction from Wikipedia are individually added to LDA, there are notable improvements, but the largest improvement comes when LDA and the two techniques are combined, yielding 92.6% mean average precision.
When the Cosine similarity function is replaced by the symmetric KL function, the results (89.1%) turn out to be close to those in the third row of Table 1, where the Cosine similarity function is applied.
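For concreteness, the two similarity measures can be sketched as follows. This is our own illustration; in particular, how the paper converts the symmetric KL divergence (which is smaller for more similar distributions) into a similarity score is not specified, so the conversion suggested in the comment is an assumption.

```python
import math

def cosine(p, q):
    """Cosine similarity between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence: KL(p||q) + KL(q||p); eps guards against
    zero probabilities. Smaller means more similar, so a similarity can
    be derived as e.g. 1 / (1 + sym_kl(p, q))."""
    kl = lambda a, b: sum(x * math.log((x + eps) / (y + eps))
                          for x, y in zip(a, b))
    return kl(p, q) + kl(q, p)
```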
5.2.2 Sensitivity Analysis
In order to test the stability of our method, we compare its
effectiveness by setting different parameters. We first test
its sensitivity by setting different numbers of topics while
keeping other parameter values unchanged. The number
of topics is set to 8, 10 and 12, but similar mean average
precisions are obtained. Thus, the method is believed to be
stable with different but reasonable number of topics.
Secondly, we test the method’s stability by setting differ-
ent values for ranking algorithm parameters while keeping
the number of topics as 10. To make the comparison simple,
we set t1 and t2 to be the same percentile, and t4 to be the percentage of t3 plus 10%. t1 and t2 are set in the range from 0.1 to 0.45, while t3 changes from 0.2 to 0.55 and t4 changes from 0.3 to 0.65. The average MAP values based on the Cosine function and symmetric KL are 89.5% and 86.1% respectively. We find that with such wide ranges of threshold
values, there is little change in the effectiveness of identify-
ing diversionary comments. Therefore, we conclude that the
method is stable with reasonable threshold values.
6. CONCLUSIONS
This paper presented a study on identifying diversionary
comments under political posts. In our evaluation data set,
39.5% of comments were annotated as diversions. To the
best of our knowledge, this problem has not been researched
in the literature. This paper first identified 5 types of diver-
sionary comments, and then introduced rules to determine
what a comment replies to under a hierarchy of the post and
its associated comments. It then proposed a method to compute the relatedness between a comment and the post content, and the relatedness between a comment and its reply-to comment, which involves coreference resolution, extraction
from Wikipedia and topic modeling. We demonstrated the
effectiveness of the proposed method using the mean aver-
age precision (MAP) measure. Comparisons with baseline
methods showed that the proposed method outperformed
them considerably.
7. REFERENCES
[1] Brigham young university.
http://news.byu.edu/archive09-may-blogs.aspx.
[2] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern
Information Retrieval. 1999.
[3] E. Bengtson and D. Roth. Understanding the value of
features for coreference resolution. In EMNLP 2008.
[4] A. Bhattarai, V. Rus, and D. Dasgupta. Characterizing
comment spam in the blogosphere through content
analysis. Distribution, 2009.
[5] E. Blanzieri and A. Bryl. A survey of learning-based
techniques of email spam filtering. Artificial Intelligence
Review, 2009.
[6] D. Blei, A. Y. Ng, and M. Jordan. Latent dirichlet
allocation. J. Mach. Learn. Res. 2003.
[7] C. Castillo and B. D. Davison. Adversarial web search.
Foundations and Trends in Information Retrieval, 2010.
[8] C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi,
M. Santini, and S. Vigna. A reference collection for web
spam. SIGIR Forum, 2006.
[9] G. V. Cormack. Email spam filtering: A systematic review.
Found. Trends Inf. Retr., 2008.
[10] E. Gabrilovich and S. Markovitch. Computing semantic
relatedness using wikipedia-based explicit semantic
analysis. In IJCAI 2007.
[11] N. Jindal and B. Liu. Opinion spam and analysis. In
WSDM 2008.
[12] S. Kullback. Information Theory and Statistics. Wiley
1959.
[13] C. D. Manning, P. Raghavan, and H. Schütze. Introduction
to Information Retrieval. 2008.
[14] G. Mishne. Blocking blog spam with language model
disagreement. In AIRWeb 2005.
[15] G. Salton, A. Wong, and C. S. Yang. A vector space model
for automatic indexing. Commun. ACM, 1975.
[16] M. Steyvers and T. Griffiths. Probabilistic Topic Models.
Lawrence Erlbaum Associates 2007.
1793
... Spam content is a specific concept throughout the emails, web-page, blog posts, and comments. Short text type spam such as spam comments following posts in blogs and social networks has attracted further attention [16], [17]. Mishne et al. [18] followed a language-based model to create a statistical model for text generation to identify spam comments in blogs. ...
... Bhattarai et al. [19] investigated the characteristics of spam comments in the blogosphere based on their content, with an effort to extract the features of the blog spam comments and classify them by applying a semi-supervised and supervised learning method. Wang et al. [16] aimed to identify diversionary comments as comments designed to deliberately divert readers' attention to another topic on political blog posts. They applied a combination of co-reference resolution and Wikipedia embedding to replace pronouns with corresponding nouns and used the topic modeling method LDA to group related terms in the same topics. ...
... Considering all the previously mentioned studies on identifying related/unrelated comments following a post, we believe our model goes beyond the state of the art by using a combination of syntactic, topical, and semantic features to measure similarity between short texts. Unlike previous studies [3], [16], our model does not rely on the full text of a post or on the content of external web pages related to it; instead, we leverage a word-embedding approach to enrich the short-text corpus. It can therefore be applied in social-media settings where only short texts are available to categorize as related or unrelated content. ...
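The pronoun-replacement step that the excerpt attributes to Wang et al. can be illustrated with a deliberately crude heuristic: substitute third-person pronouns with the most recently seen capitalized token, a stand-in for the nearest named entity. A real system would use a full co-reference resolver; this toy (all names invented) only sketches the idea:

```python
# Toy stand-in for a co-reference resolution step: replace pronouns with
# the most recently seen capitalized token. Illustrative only.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}

def replace_pronouns(text):
    resolved, last_entity = [], None
    for token in text.split():
        bare = token.strip(".,!?").lower()
        if bare in PRONOUNS and last_entity:
            resolved.append(last_entity)
        else:
            resolved.append(token)
            if token[0].isupper():
                last_entity = token.strip(".,!?")
    return " ".join(resolved)

print(replace_pronouns("Obama signed the bill, and critics say he ignored it."))
# → Obama signed the bill, and critics say Obama ignored Obama
```

After this substitution, topic models such as LDA see the referring expressions as repeated content words rather than stop-word pronouns, which is the point of the preprocessing.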
Conference Paper
Written comments on social media posts are an important measure of followers' feedback on the posts' content. However, the large number of unrelated comments under each post can harm user engagement as well as the visibility of the actual post. Comments related to a post's topic usually give readers more insight into the post's content and can attract their attention. Unrelated comments, on the other hand, distract readers from the original topic of the post or disturb them with worthless content, and can mislead them or sway their opinion. In this paper, we propose an effective framework that measures the content similarity of comments to a post and distinguishes comments that are related to the actual post from those that are not. Toward that end, the framework introduces novel feature engineering that combines syntactic, topical, and semantic features and leverages a word-embeddings approach. A machine-learning classifier is used to label the related and unrelated comments of each post. The framework is evaluated on a dataset of 33,921 comments written under 30 posts from the BBC News agency page on Facebook. The evaluation indicates that our model achieves an average precision of 86% in identifying related and unrelated comments, a 9.6% improvement in accuracy over previous work, without relying on the full article of each post or on the content of external web pages related to it. As a case study, the learned classifier is applied to a larger dataset of 278,370 comments written under 332 posts, where we observe that almost 60% of the written comments are not related to the actual posts' content.
Investigating the content of both groups of comments with respect to the topics of their posts shows that most related comments are objective and discuss the posts' content in terms of its topics, whereas unrelated comments usually contain subjective and very general words expressing feedback without any focus on the subject of the posts.
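The abstract above describes combining several feature families before classification. A minimal sketch of that pattern, with invented texts and labels and two illustrative feature families (tf-idf terms plus crude syntactic counts), might look like:

```python
# Sketch: combine topical (tf-idf) and syntactic (length, punctuation)
# feature families, then fit a classifier. Data and features are
# illustrative, not the paper's actual feature set.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

post = "government announces new tax policy for small businesses"
comments = [
    "the tax policy will help small businesses grow",
    "this policy hurts businesses that pay tax",
    "lol first!!!",
    "follow me for free gifts!!!",
]
labels = [1, 1, 0, 0]  # 1 = related, 0 = unrelated

vec = TfidfVectorizer().fit([post] + comments)
topical = vec.transform(comments)                     # topical overlap features
syntactic = csr_matrix(np.array(
    [[len(c.split()), c.count("!")] for c in comments], dtype=float
))                                                    # crude syntactic features
X = hstack([topical, syntactic])

clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```

In practice each family would be scaled and validated separately; the point is only that heterogeneous features are concatenated into one design matrix.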
... Additionally, the problem of comment relevance is also addressed [33,41,107], with the latter assessing the degree of pertinence of comments by comparing their tf-idf vectors to the articles' in the New York Times. Detecting comments that shift the main article topic and change the article's focus at Digg.com is tackled by Wang et al. [168], while Zhang and Setty [179] identify sets of topic-wise diverse user comments in Reddit news articles. Recent research focuses not only on comment moderation, but also on identifying how user comments can be helpful to journalists [97]. ...
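The tf-idf comparison mentioned in the excerpt reduces to a cosine similarity between a comment's vector and the article's. A minimal sketch, with made-up texts:

```python
# Score each comment by cosine similarity of its tf-idf vector to the
# article's tf-idf vector. Texts are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article = "senate debates the new healthcare reform bill"
comments = [
    "the healthcare bill needs more debate in the senate",
    "buy cheap watches online",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform([article] + comments)
scores = cosine_similarity(matrix[0], matrix[1:])[0]
print(scores)  # the on-topic comment scores higher than the off-topic one
```

A relevance threshold on such scores is one simple way to flag comments that drift from the article.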
Thesis
As part of our everyday life we consume breaking news and interpret it based on our own viewpoints and beliefs. We have easy access to online social networking platforms and news media websites, where we inform ourselves about current affairs and often post about our own views, such as in news comments or social media posts. The media ecosystem enables opinions and facts to travel from news sources to news readers, from news article commenters to other readers, from social network users to their followers, etc. The views of the world many of us have depend on the information we receive via online news and social media. Hence, it is essential to maintain accurate, reliable and objective online content to ensure democracy and verity on the Web. To this end, we contribute to a trustworthy media ecosystem by analyzing news and social media in the context of politics to ensure that media serves the public interest. In this thesis, we use text mining, natural language processing and machine learning techniques to reveal underlying patterns in political news articles and political discourse in social networks. Mainstream news sources typically cover a great amount of the same news stories every day, but they often place them in a different context or report them from different perspectives. In this thesis, we are interested in how distinct and predictable newspaper journalists are, in the way they report the news, as a means to understand and identify their different political beliefs. To this end, we propose two models that classify text from news articles to their respective original news source, i.e., reported speech and also news comments. Our goal is to capture systematic quoting and commenting patterns by journalists and news commenters respectively, which can lead us to the newspaper where the quotes and comments are originally published. 
Predicting news sources can help us understand the potential subjective nature behind news storytelling and the magnitude of this phenomenon. Revealing this hidden knowledge can restore our trust in media by advancing transparency and diversity in the news. Media bias can be expressed in various subtle ways in the text and it is often challenging to identify these bias manifestations correctly, even for humans. However, media experts, e.g., journalists, are a powerful resource that can help us overcome the vague definition of political media bias and they can also assist automatic learners to find the hidden bias in the text. Due to the enormous technological advances in artificial intelligence, we hypothesize that identifying political bias in the news could be achieved through the combination of sophisticated deep learning models and domain expertise. Therefore, our second contribution is a high-quality and reliable news dataset annotated by journalists for political bias and a state-of-the-art solution for this task based on curriculum learning. Our aim is to discover whether domain expertise is necessary for this task and to provide an automatic solution for this traditionally manually-solved problem. User generated content is fundamentally different from news articles, e.g., messages are shorter, they are often personal and opinionated, they refer to specific topics and persons, etc. Regarding political and socio-economic news, individuals in online communities make use of social networks to keep their peers up-to-date and to share their own views on ongoing affairs. We believe that social media is as powerful an instrument for information flow as news sources are, and we use its unique characteristic of rapid news coverage for two applications. We analyze Twitter messages and debate transcripts during live political presidential debates to automatically predict the topics that Twitter users discuss.
Our goal is to discover the favoured topics in online communities on the dates of political events as a way to understand the political subjects of public interest. With the up-to-dateness of microblogs, an additional opportunity emerges, namely to use social media posts and leverage the real-time verity about discussed individuals to find their locations. That is, given a person of interest that is mentioned in online discussions, we use the wisdom of the crowd to automatically track her physical locations over time. We evaluate our approach in the context of politics, i.e., we predict the locations of US politicians as a proof of concept for important use cases, such as to track people that are national risks, e.g., warlords and wanted criminals.
... Active users of the system can also introduce bias by adding irrelevant content to a post to distract attention from the main topic to another. Wang et al. [101] develop a framework to detect diversionary comments on political blogs. The method is based on textual features and involves co-reference resolution, Wikipedia first paragraphs to provide more data points for the topics, and LDA. ...
Article
Computational Politics is the study of computational methods to analyze and moderate users' behaviors related to political activities such as election campaign persuasion, political affiliation, and opinion mining. With the rapid development and ease of access to the Internet, Information Communication Technologies (ICT) have given rise to massive numbers of users joining online communities and the digitization of political practices such as debates. These communities and digitized data contain both explicit and latent information about users and their behaviors related to politics and social movements. For researchers, it is essential to utilize data from these sources to develop and design systems that not only provide solutions to computational politics but also help other businesses, such as marketers, to increase users' participation and interactions. In this survey, we attempt to categorize the main areas in computational politics and summarize the prominent studies in one place to better understand computational politics across different and multidimensional platforms, e.g., online social networks, online forums, and political debates. We then conclude this study by highlighting future research directions, opportunities, and challenges.
... In the context of blogs, J. Wang, Yu, Yu, Liu, and Meng (2012) describe different types of diversionary comments under political blog posts. User activity on Reddit has also been studied (Ferraz Costa, Yamaguchi, Juci Machado Traina, Traina, & Faloutsos, 2015). ...
Article
Many news outlets allow users to contribute comments on topics about daily world events. News articles are the seeds that spring users' interest to contribute content, that is, comments. A news outlet may allow users to contribute comments on all their articles or a selected number of them. The topic of an article may lead to an apathetic user commenting activity (several tens of comments) or to a spontaneous fervent one (several thousands of comments). This environment creates a social dynamic that is little studied. The social dynamics around articles have the potential to reveal interesting facets of the user population at a news outlet. In this paper, we report the salient findings about these social media from 15 months' worth of data collected from 17 news outlets, comprising over 38,000 news articles and about 21 million user comments. Analysis of the data reveals interesting insights, such as an uneven relationship between news outlets and their user populations across outlets. Such observations, among others, have not been revealed before, to our knowledge. We believe our analysis in this paper can contribute to news predictive analytics (e.g., predicting user reaction to a news article or the volume of comments posted on an article). This article is categorized under: Internet > Society and Culture; Ensemble Methods > Web Mining; Fundamental Concepts of Data and Knowledge > Human Centricity and User Interaction.
... Although blogging is recognized as an effective learning format [5,9,11,14-16], it is less used in health care education than in other areas of academia. Health care professionals continue to underuse Web 2.0 for education, with blogging as just one example [17]. ...
Article
Introduction: The value of a blog as an educational tool is thought to be underestimated by health care professionals. This research aimed to explore the MRI educational utility of blogs, and to determine who was participating in writing those blogs. It was hoped that this research would increase awareness of alternative education formats that would be useful for MRI technologists. Methods: Between March and April of 2017, an online blog search was performed using MRI-related keywords. Strict exclusion criteria were then applied. Two coders independently used lean coding to analyse selected blog posts and organized the codes into themes. Data were tested for intercoder reliability. Results: Researchers analysed 39 posts from 9 blogs and identified the following themes: focus on MRI techniques and technologies, knowledge dissemination, sharing of experience, collaborative learning, authorship, and informal writing. Bloggers, self-identified as practitioners or scholars, communicated about research projects and used an informal writing style. Evidence of intentional teaching of MRI-specific content and sharing of professional and personal experiences was found. Communication between authors and readers from most of the MRI professions was observed, with the exception of MRI technologists. Conclusions: This research found that MRI-related blogs provide a credible and accessible forum for the sharing and discussion of knowledge, experiences, and ideas. Although many MRI professionals author blogs, MRI technologists do not seem to participate in this form of communication. As social media gains in popularity within the medical radiation technologist profession, it is hoped that more MRI technologists will make use of blogging to facilitate learning, collaboration, and communication.
... When discussion participants are regularly off-topic, or move away from topics too quickly, the behavior leads to an incoherent discussion that cannot deeply consider a policy issue. Too much attention on a single topic is also limiting, although in practice online policy discussions often result in far more than one topic being deeply considered [12,14,28,31,43,71]. ...
Conference Paper
Public concern related to a policy may span a range of topics. As a result, policy discussions struggle to deeply examine any one topic before moving to the next. In policy deliberation research, this is referred to as a problem of topical coherence. In an experiment, we curated the comments in a policy discussion to prioritize arguments for or against a policy proposal, and examined how this curation and participants' initial positions of support or opposition to the policy affected the coherence of their contributions to existing topics. We found an asymmetric interaction between participants' initial positions and comment curation: participants with different initial positions had unequal reactions to curation that foregrounded comments with which they disagreed. This asymmetry implies that the factors underlying coherence are more nuanced than prioritizing participants' agreement or disagreement. We discuss how this finding relates to curating for coherent disagreement, and for curation more generally in deliberative processes.
... In [7], the authors studied spam from an NLP perspective. Irrelevance of review texts is also considered an indicative feature of spam [23], [26]. Behavioral features have proved quite important in spam detection, since spammers have learned to write reviews that sound more realistic, rendering traditional text-based detection algorithms less helpful. ...
Conference Paper
Millions of ratings and reviews on online review websites are influential over business revenues and customer experiences. However, spammers are posting fake reviews in order to gain financial benefits, at the cost of harming honest businesses and customers. Such fake reviews can be illegal, and it is important to detect spamming attacks to eliminate unjust ratings and reviews. However, most current approaches are limited: they either utilize data from individual websites independently, or fail to detect subtler attacks even when they can fuse data from multiple sources. Further, the revealed evidence fails to explain the more complicated real-world spamming attacks, hindering detection processes that usually have human experts in the loop. We close this gap by introducing a novel framework that can jointly detect and explain potential attacks. The framework mines both macroscopic-level temporal sentiment patterns and microscopic-level features from multiple review websites. We construct multiple sentiment time series to detect atomic dynamics, based on which we mine various cross-site temporal sentiment patterns that can explain various attacking scenarios. To further identify individual spam reviews within the attacks with more evidence, we study and identify effective microscopic textual and behavioral features that are indicative of spam. We demonstrate via human annotations that this simple and effective framework can spot a sizable collection of spam reviews that have bypassed one of the current commercial anti-spam systems.
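The macroscopic idea in the abstract above, detecting "atomic dynamics" in a sentiment time series, can be sketched with a simple rolling z-score detector. The series, window, and threshold below are all invented; real detectors mine cross-site patterns:

```python
# Flag days where a sentiment series jumps abnormally relative to its
# recent history, via a rolling z-score. Synthetic data, toy detector.
import statistics

sentiment = [0.1, 0.12, 0.09, 0.11, 0.10, 0.13, 0.85, 0.11]  # daily averages

def anomalous_days(series, window=5, threshold=3.0):
    flags = []
    for t in range(window, len(series)):
        hist = series[t - window:t]
        mu, sigma = statistics.mean(hist), statistics.stdev(hist)
        if sigma > 0 and abs(series[t] - mu) / sigma > threshold:
            flags.append(t)
    return flags

print(anomalous_days(sentiment))  # → [6], the day of the sudden spike
```

Days flagged this way would then be examined with the microscopic textual and behavioral features the paper describes.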
Article
Social networks such as Facebook, LinkedIn, and Twitter have been a crucial source of information for a wide spectrum of users. In Twitter, popular information that is deemed important by the community propagates through the network. Studying the characteristics of content in the messages becomes important for a number of tasks, such as breaking news detection, personalized message recommendation, friends recommendation, sentiment analysis and others. While many researchers wish to use standard text mining tools to understand messages on Twitter, the restricted length of those messages prevents them from being employed to their full potential. We address the problem of using standard topic models in micro-blogging environments by studying how the models can be trained on the dataset. We propose several schemes to train a standard topic model and compare their quality and effectiveness through a set of carefully designed experiments from both qualitative and quantitative perspectives. We show that by training a topic model on aggregated messages we can obtain a higher quality of learned model which results in significantly better performance in two real-world classification problems. We also discuss how the state-of-the-art Author-Topic model fails to model hierarchical relationships between entities in Social Media.
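The message-aggregation scheme described above can be shown in miniature: pool all messages by the same author into one pseudo-document before topic-model training, so the model sees longer texts. The tweets below are invented:

```python
# Pool short messages by author into one pseudo-document each, the
# aggregation step used before training a topic model on microblogs.
from collections import defaultdict

tweets = [
    ("alice", "new phone launch today"),
    ("bob", "election results are in"),
    ("alice", "camera on the new phone is great"),
    ("bob", "candidates debate the results"),
]

def pool_by_author(messages):
    pooled = defaultdict(list)
    for author, text in messages:
        pooled[author].append(text)
    # one aggregated pseudo-document per author
    return {author: " ".join(texts) for author, texts in pooled.items()}

docs = pool_by_author(tweets)
print(docs["alice"])  # → new phone launch today camera on the new phone is great
```

The resulting pseudo-documents, rather than the individual tweets, are what get fed to LDA or a similar model.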
Article
We consider problems involving groups of data where each observation within a group is a draw from a mixture model and where it is desirable to share mixture components between groups. We assume that the number of mixture components is unknown a priori and is to be inferred from the data. In this setting it is natural to consider sets of Dirichlet processes, one for each group, where the well-known clustering property of the Dirichlet process provides a nonparametric prior for the number of mixture components within each group. Given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child Dirichlet processes is itself distributed according to a Dirichlet process. Such a base measure being discrete, the child Dirichlet processes necessarily share atoms. Thus, as desired, the mixture models in the different groups necessarily share mixture components. We discuss representations of hierarchical Dirichlet processes in terms of a stick-breaking process, and a generalization of the Chinese restaurant process that we refer to as the "Chinese restaurant franchise." We present Markov chain Monte Carlo algorithms for posterior inference in hierarchical Dirichlet process mixtures and describe applications to problems in information retrieval and text modeling.
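The stick-breaking representation mentioned in the abstract above can be sketched numerically: weights are built as beta_k = v_k * prod_{j<k}(1 - v_j) with v_k ~ Beta(1, alpha), truncated at K atoms. Parameters here are illustrative:

```python
# Truncated stick-breaking draw of Dirichlet-process weights.
# beta_k = v_k * prod_{j<k}(1 - v_j), with v_k ~ Beta(1, alpha).
import random

def stick_breaking(alpha, K, seed=0):
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(K):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)  # break off a fraction of the stick
        remaining *= 1.0 - v           # what is left for later atoms
    return weights

w = stick_breaking(alpha=1.0, K=1000)
print(sum(w))  # nearly all of the unit stick is used at this truncation
```

Larger alpha spreads mass over more atoms, which is how the Dirichlet process controls the effective number of mixture components.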
Conference Paper
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
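The LDA model summarized above can be made concrete with a tiny collapsed Gibbs sampler on a toy corpus. This is a didactic sketch with invented documents, not an efficient inference implementation (the paper itself uses variational methods):

```python
# Minimal collapsed Gibbs sampler for LDA on a toy corpus.
import random

docs = [["apple", "banana", "apple", "fruit"],
        ["vote", "election", "vote", "senate"],
        ["banana", "fruit", "apple"],
        ["senate", "election", "debate"]]
K, alpha, beta = 2, 0.1, 0.01
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

rng = random.Random(42)
z = [[rng.randrange(K) for _ in d] for d in docs]   # per-token topic assignments
ndk = [[0] * K for _ in docs]                       # doc-topic counts
nkw = [[0] * V for _ in range(K)]                   # topic-word counts
nk = [0] * K                                        # topic totals
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d][k] += 1; nkw[k][vocab.index(w)] += 1; nk[k] += 1

for _ in range(200):                                # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k, wid = z[d][i], vocab.index(w)
            ndk[d][k] -= 1; nkw[k][wid] -= 1; nk[k] -= 1
            probs = [(ndk[d][t] + alpha) * (nkw[t][wid] + beta) / (nk[t] + V * beta)
                     for t in range(K)]
            r = rng.random() * sum(probs)
            for k in range(K):                      # sample new topic for token
                r -= probs[k]
                if r <= 0:
                    break
            z[d][i] = k
            ndk[d][k] += 1; nkw[k][wid] += 1; nk[k] += 1

# posterior mean of each document's topic distribution
theta = [[(c + alpha) / (sum(row) + K * alpha) for c in row] for row in ndk]
print([max(range(K), key=lambda t: row[t]) for row in theta])
```

With the fruit and politics vocabularies disjoint, the sampler tends to separate the two themes; the print shows each document's dominant topic.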
Conference Paper
In recent years, opinion mining has attracted a great deal of research attention. However, limited work has been done on detecting opinion spam (or fake reviews). The problem is analogous to spam in Web search [1, 9-11]. However, review spam is harder to detect because it is very hard, if not impossible, to recognize fake reviews by manually reading them [2]. This paper deals with a restricted problem, i.e., identifying unusual review patterns that can represent suspicious reviewer behaviors. We formulate the problem as finding unexpected rules. The technique is domain independent. Using it, we analyzed an Amazon.com review dataset and found many unexpected rules and rule groups that indicate spam activities.
Conference Paper
Probabilistic topic modeling provides a suite of tools for the unsupervised analysis of large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. This analysis can be used for corpus exploration, document search, and a variety of prediction problems. In this tutorial, I will review the state-of-the-art in probabilistic topic models. I will describe the three components of topic modeling: (1) Topic modeling assumptions (2) Algorithms for computing with topic models (3) Applications of topic models In (1), I will describe latent Dirichlet allocation (LDA), which is one of the simplest topic models, and then describe a variety of ways that we can build on it. These include dynamic topic models, correlated topic models, supervised topic models, author-topic models, bursty topic models, Bayesian nonparametric topic models, and others. I will also discuss some of the fundamental statistical ideas that are used in building topic models, such as distributions on the simplex, hierarchical Bayesian modeling, and models of mixed-membership. In (2), I will review how we compute with topic models. I will describe approximate posterior inference for directed graphical models using both sampling and variational inference, and I will discuss the practical issues and pitfalls in developing these algorithms for topic models. Finally, I will describe some of our most recent work on building algorithms that can scale to millions of documents and documents arriving in a stream. In (3), I will discuss applications of topic models. These include applications to images, music, social networks, and other data in which we hope to uncover hidden patterns. I will describe some of our recent work on adapting topic modeling algorithms to collaborative filtering, legislative modeling, and bibliometrics without citations. 
Finally, I will discuss some future directions and open research problems in topic models.
Conference Paper
Spam is no longer limited to emails and web pages. The increasing penetration of spam in the form of comments in blogs and social networks has become a nuisance and a potential threat. In this work, we explore the challenges posed by this type of spam in the blogosphere, with substantial generalization to other social media. We investigate the characteristics of comment spam in blogs based on content. The framework uses some previously explored methods to effectively extract the features of blog spam, and also introduces a novel method of active learning from raw data that requires no training instances. This makes the approach more flexible and realistic for such applications. We also incorporate co-training for supervised learning to obtain accurate results. A preliminary evaluation of the proposed framework shows promising results.
Conference Paper
Clustering of short texts, such as snippets, presents great challenges in existing aggregated search techniques due to the problem of data sparseness and the complex semantics of natural language. As short texts do not provide sufficient term-occurrence information, traditional text representation methods, such as the "bag of words" model, have several limitations when directly applied to short-text tasks. In this paper, we propose a novel framework to improve the performance of short-text clustering by exploiting the internal semantics of the original text and external concepts from world knowledge. The proposed method employs a hierarchical three-level structure to tackle the data-sparsity problem of original short texts and reconstructs the corresponding feature space with the integration of multiple semantic knowledge bases, Wikipedia and WordNet. Empirical evaluation with Reuters and a real web dataset demonstrates that our approach achieves significant improvement over the state-of-the-art methods.
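The enrichment idea above, expanding sparse short texts with external concepts before clustering, can be sketched as follows. The tiny concept dictionary is a hypothetical stand-in for knowledge bases such as Wikipedia or WordNet, and all snippets are invented:

```python
# Expand each short text with related concepts from a (hypothetical)
# knowledge-base lookup, then cluster the enriched texts.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

concepts = {  # stand-in for a Wikipedia/WordNet concept lookup
    "striker": "football sport",
    "goal": "football sport",
    "senate": "politics government",
    "ballot": "politics government",
}

snippets = ["striker scores late", "last minute goal",
            "senate passes law", "ballot count begins"]

def enrich(text):
    extra = [concepts[w] for w in text.split() if w in concepts]
    return text + " " + " ".join(extra)

enriched = [enrich(s) for s in snippets]
X = TfidfVectorizer().fit_transform(enriched)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # the two sport snippets cluster together, as do the politics ones
```

Without enrichment these four snippets share no words at all, so no term-based clustering can group them; the added concept terms are what make the two themes separable.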