Conference Paper

Using Inter-comment Similarity for Comment Spam Detection in Chinese Blogs

Abstract

Blogs have become one of the most popular means of communication in social communities, since blog posts can easily be replied to, commented on, and shared with other users. All posts and comments, good or bad, have to be managed manually by blog owners. To prevent comment spam, most blog sites provide challenge-response tests such as CAPTCHA to ensure that a response comes from a human rather than being generated automatically by a computer. However, these tests cannot stop spammers from leaving spam messages manually. Existing studies of Chinese blog comment spam focus only on comments containing hyperlinks, which account for only a small portion of blog comment spam. In this paper, we propose including inter-comment Jaccard similarity as a feature, in addition to post-comment similarity, stopword ratio, and comment length, for blog comment classification. To verify the effect of the inter-comment similarity features, we compared several classification algorithms, including C4.5, Naïve Bayes, and Neural Network. Experimental results showed that the combination of inter-comment and post-comment similarity features with the C4.5 classifier achieves the best performance. This shows the effectiveness of the proposed inter-comment similarity feature for Chinese blog comment spam classification.
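
The four features named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how comments are tokenized or how the inter-comment score is aggregated, so several choices here are assumptions made only for illustration: character bigrams as tokens (Chinese text has no whitespace word boundaries), Jaccard similarity for the post-comment feature as well, the maximum similarity to any other comment as the inter-comment score, and stopwords treated as single characters. The function names are hypothetical and are not taken from the paper.

```python
def char_bigrams(text):
    """Return the set of character bigrams of text, ignoring whitespace.
    Assumption: bigrams stand in for a real Chinese word segmenter."""
    chars = "".join(text.split())
    return {chars[i:i + 2] for i in range(len(chars) - 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets.
    Two empty sets are given similarity 0.0 by convention here."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def comment_features(post, comments, index, stopwords):
    """Compute illustrative features for comments[index] under one post."""
    target = comments[index]
    grams = char_bigrams(target)

    # Inter-comment similarity: highest Jaccard score against any other
    # comment on the same post (the aggregation by max is an assumption).
    others = [char_bigrams(c) for i, c in enumerate(comments) if i != index]
    inter_sim = max((jaccard(grams, o) for o in others), default=0.0)

    # Post-comment similarity: Jaccard score against the post body.
    post_sim = jaccard(grams, char_bigrams(post))

    # Stopword ratio: share of characters that are single-character stopwords.
    chars = "".join(target.split())
    stop_ratio = sum(ch in stopwords for ch in chars) / len(chars) if chars else 0.0

    return {
        "inter_comment_sim": inter_sim,
        "post_comment_sim": post_sim,
        "stopword_ratio": stop_ratio,
        "length": len(chars),
    }
```

A feature dictionary like this could then be passed to any standard classifier. Near-duplicate comments repeated under one post score close to 1.0 on the inter-comment feature even when they contain no hyperlink, which is the case the abstract says link-based methods miss.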

... Sblogs and general web spam: Spam blogs (Sblogs) (Wang and Lin, 2011; Abu-Nimeh and Chen, 2010); general web spam (Spirin and Han, 2012). The main concept behind Sblogs is to link to other websites. ...
Article
Social networks have become popular and continue to grow rapidly. Networks such as Twitter, Facebook, and LinkedIn are now part of many people's daily routine, and they serve as a convenient medium for sharing posts, videos, and messages. A major problem for users of networks such as Twitter and Facebook, however, is spam: unwanted activity by third parties who deliberately set out to damage users' trust in and opinion of one another. Spam is also used to steal information about social network users. In this paper, we study and analyse spam in social networks and the machine learning algorithms used to detect it. The paper also examines the detection rate and false positive rate of these ML algorithms over different datasets.
... Maalouf (2011) reviewed the most important aspects of logistic regression being used in data analysis, specifically from an algorithmic and machine learning perspective and how logistic regression can be applied to imbalanced and rare events data. Wang and Lin (2011) proposed an approach to classify blogs by including inter-comment Jaccard similarity in the features in addition to the stop-words ratio, comment length, and post-comment similarity. Lau et al. (2011) integrated a novel text mining model and a semantic language model for the detection of untruthful reviews. ...
Article
As an important platform of electronic commerce, blogs can greatly influence internet users' purchasing decisions. Spam, however, can substantially reduce blogs' positive impact on electronic commerce. This paper introduces SK, an alternative algorithm combining supervised learning (SVM) and unsupervised learning (K-means++) to detect blog spam. If either classifies a blog as spam, then the blog is assigned to the spam category. Feature selection includes term frequency, inverse document frequency, binary representation, stop words, outgoing links, advertiser content, and burst with keywords. Accuracy of each model was tested and compared in experiments with 3,000 blog pages from the University of Maryland and 3,560 internet blogs. Findings suggest that combining the SVM algorithm and K-means++ clustering can increase spam-filtering accuracy by about 7% compared with using just one of these methods. Strengths and weaknesses of various spam-filtering methods were discussed, providing considerations for businesses when choosing a spam filter.
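
The "either classifies as spam" rule described in this abstract amounts to a logical OR over the two models' outputs. The Python sketch below shows only that combination step, assuming each model has already produced one label per blog, with 1 meaning spam and 0 meaning not spam. The function name `combine_either` and the label encoding are hypothetical and are not taken from the SK paper.

```python
def combine_either(svm_labels, cluster_labels):
    """Flag a blog as spam (1) if either model flagged it, else 0.

    svm_labels:     per-blog labels from the supervised model
    cluster_labels: per-blog labels derived from the clustering step
    Both lists must be aligned on the same blogs.
    """
    if len(svm_labels) != len(cluster_labels):
        raise ValueError("label lists must have the same length")
    return [int(s == 1 or c == 1) for s, c in zip(svm_labels, cluster_labels)]
```

Because the combined spam set is the union of the two individual spam sets, this rule can only catch as much or more spam than either model alone, and it also inherits the false positives of both.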
... Suffer from problem with sparse context links.

ID, Title, Year, Ref
A56, The potential of social media in delivering transport policy goals, 2014, [74]
A57, The social media genome: modeling individual topic-specific behavior in social media, 2013, [75]
A58, Topic-sensitive influencer mining in interest-based social media networks via hypergraph learning, 2014, [76]
A59, Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media, 2012, [77]
A60, Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams, 2012, [78]
A61, Using explicit linguistic expressions of preference in social media to predict voting behavior, 2013, [79]
A62, Using inter-comment similarity for comment spam detection in Chinese blogs, 2011, [80]
A63, Using Sentiment to Detect Bots on Twitter: Are Humans more Opinionated than Bots?, 2014, [81]
A64, Using social media to enhance emergency situation awareness, 2012, [82]
A65, Web data extraction, applications and techniques: A survey, 2014, [83]
A66, What's in twitter: I know what parties are popular and who you are supporting now!, 2012, [84]

Platform, Frequency, Ref
Blogs, 8: A6, A12, A18, A22, A31, A37, A48, A55
Forums and Discussion Boards, 9: A4, A5, A6, A24, A25, A27, A31, A44, A52
Microblogging, 31: A1, A2, A5, A7, A9, A11, A13, A15, A17, A21, A28, A30, A32, A35, A39, A41, A42, A45, A47, A49, A50, A51, A54, A57, A59, A60, A61, A62, A63, A64, A66
Product Reviews, 1: A33 ...
Article
Today, the use of social networks is growing ceaselessly and rapidly. More alarming is the fact that these networks have become a substantial pool of unstructured data belonging to a host of domains, including business, government, and health. The increasing reliance on social networks calls for data mining techniques that can help restructure this unstructured data and place it within a systematic pattern. The goal of the present survey is to analyze the data mining techniques that were applied to social media networks between 2003 and 2015. Using criterion-based research strategies, 66 articles were identified as the source material for the present paper. After a careful review of these articles, we found that 19 data mining techniques have been used with social media data to address 9 different research objectives in 6 different industrial and service domains. However, data mining applications in social media are still immature and require more effort from academia and industry to perform the job adequately. We suggest that both academia and industry conduct more research, since the studies done so far do not cover data mining techniques exhaustively.
Article
In today's world, the issue of identifying spammers has received increasing attention because of its practical relevance in the field of social network analysis. The growing popularity of social networking sites has made them prime targets for spammers. By allowing users to publicize and share their independently generated content, online social networks become susceptible to different types of malicious and opportunistic user actions. Because of spammers' activity, social network users are fed irrelevant information while browsing. Spam pervades information systems of every kind, including e-mail, the web, social networks, blogs, and review platforms. This study therefore reviews various spam detection frameworks that deal with detecting and eliminating spam from these sources.
Conference Paper
In the Web 2.0 era, individual Internet users can also act as information providers, conveniently releasing information or making comments. However, some participants may spread irresponsible remarks or post irrelevant comments for commercial interests. This kind of so-called comment spam severely hurts information quality. This paper tries to detect comment spam automatically through content analysis, using some previously undescribed features. Experiments on a real data set show that our combined heuristics can correctly identify comment spam with high precision (90.4%) and recall (84.5%).
Conference Paper
Spams are no longer limited to emails and Web-pages. The increasing penetration of spam in the form of comments in blogs and social networks has started becoming a nuisance and potential threat. In this work, we explore the challenges posed by this type of spam in the blogosphere with substantial generalization regarding other social media. Thus, we investigate the characteristics of comment spam in blogs based on their content. The framework uses some of the previously explored methods developed to effectively extract the features of the blog spam and also introduces a novel method of active learning from the raw data without requiring training instances. This makes the approach more flexible and realistic for such applications. We also incorporate the concept of co-training for supervised learning to get accurate results. The preliminary evaluation of the proposed framework shows promising results.
Conference Paper
Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.
Conference Paper
Many Web information services utilize techniques of information extraction (IE) to collect important facts from the Web. To create more advanced services, one possible method is to discover thematic information from the collected facts through text classification. However, most conventional text classification techniques rely on manually labelled corpora and are thus ill-suited to cooperate with Web information services with open domains. In this work, we present a system named LiveClassifier that can automatically train classifiers through Web corpora based on user-defined topic hierarchies. Due to its flexibility and convenience, LiveClassifier can be easily adapted for various purposes. New Web information services can be created to fully exploit it; human users can use it to create classifiers for their personal applications. The effectiveness of classifiers created by LiveClassifier is well supported by empirical evidence.
Conference Paper
Determining the similarity of short text snippets, such as search queries, works poorly with traditional document similarity measures (e.g., cosine), since there are often few, if any, terms in common between two short text snippets. We address this problem by introducing a novel method for measuring the similarity between short text snippets (even those without any overlapping terms) by leveraging web search results to provide greater context for the short texts. In this paper, we define such a similarity kernel function, mathematically analyze some of its properties, and provide examples of its efficacy. We also show the use of this kernel function in a large-scale system for suggesting related queries to search engine users.
Conference Paper
Access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. This overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. In this paper we present a large-scale study of weblog comments and their relation to the posts. Using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access.
Book
Class-tested and coherent, this textbook teaches classical and web information retrieval, including web search and the related areas of text classification and text clustering from basic concepts. It gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured in order to make teaching more natural and effective. Slides and additional exercises (with solutions for lecturers) are also available through the book's supporting website to help course instructors prepare their lectures.
Detecting Comment Spam in Chinese Blog
  • H Wang