Article

Astroturfing detection in social media: a binary n-gram-based approach


Abstract

Astroturfing is appearing in numerous contexts in social media, with individuals posting product reviews or political commentary under a number of different names, and is of concern because of the intended deception. An astroturfer works with the aim of making it seem that a large number of people hold the same opinion, promoting a consensus based on the astroturfer's intentions. It is generally done for commercial or political advantage, often by paid or ideologically motivated writers. This paper brings the notion of authorship attribution to bear on the astroturfing problem, collecting quantities of data from public social media sites and analyzing the putative individual authors to see if they appear to be the same person. The analysis comprises a binary n-gram method, previously shown to be effective at accurately identifying authors on a training set from the same authors; this paper shows how putatively distinct authors on different social media can be revealed to be the same person. The method has identified numerous instances where multiple accounts are apparently being operated by a single individual.


... Unlike other forms of spam [2]–[4], it is challenging to identify fake opinions, as one may need to also understand the context of the postings in order to determine whether a particular opinion is deceptive [5]–[7]. For example, how can one reliably determine whether the online review postings about a particular business (e.g. ...
... individual purporting to be different persons, a practice also known as astroturfing [1], [6], [7])? ...
Article
Full-text available
With more consumers using online opinion reviews to inform their service decision making, opinion reviews have an economic impact on the bottom line of businesses. Unsurprisingly, opportunistic individuals or groups have attempted to abuse or manipulate online opinion reviews (e.g. spam reviews) for profit, and detecting deceptive and fake opinion reviews is consequently a topic of ongoing research interest. In this paper, we explain how semi-supervised learning methods can be used to detect spam reviews, prior to demonstrating their utility using a dataset of hotel reviews.
... As bit-level n-grams also have a fixed length of 1 bit with only two possible outcomes, many researchers (Peng, Detchon, Choo, & Ashman, 2016) have relied on this bit-level categorization for text analysis. Table 2 shows the bit-level 3-gram representation of "We". ...
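To make the bit-level representation concrete, here is a minimal sketch (our own illustration, not code from the cited work) that converts ASCII text to its bit string and extracts the overlapping 3-grams, reproducing the idea behind the Table 2 example for "We":

```python
def bit_ngrams(text: str, n: int = 3) -> list:
    """Return all overlapping n-bit grams of the ASCII bit string of text."""
    # Each character becomes its 8-bit binary form, e.g. 'W' -> '01010111'
    bits = "".join(f"{byte:08b}" for byte in text.encode("ascii"))
    # Slide an n-bit window across the concatenated bit string
    return [bits[i:i + n] for i in range(len(bits) - n + 1)]

# "We" is 16 bits ('W' = 01010111, 'e' = 01100101), giving 14 overlapping 3-grams,
# of which at most 8 distinct values (000..111) can ever occur
grams = bit_ngrams("We")
```

Because only 2^n distinct grams exist at the bit level, the feature space is tiny and language-independent, which is what makes this categorization attractive for short social media texts.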
... Additionally, the research also did not address the usual issue of astroturfing that is performed by a crowd within a short period of time. The authors extended their work and documented the extensions in a different paper [24], where they gathered additional data and author profiles to demonstrate the performance of the proposed model, but the extension did not address the mentioned limitations. ...
Article
Astroturfing is one of the most impactful threats on today’s internet. It is the process of masking a doctored message and portraying it to the general population as though it originated from the grass-roots level. The concept of astroturfing detection has started to gain popularity among researchers in social media, e-commerce and politics. With the recent growth of crowdsourcing systems, astroturfing is also having a profound impact on people’s opinions. Political blogs, news portals and review websites are being flooded with astroturfs. Some groups use astroturfing to promote their own interests, and some use it to demote the interests of competitors. Researchers have adopted many approaches to detect astroturfing on the web, including content analysis techniques, individual and group identification techniques, analysis of linguistic features, authorship attribution techniques, machine learning and so on. We present a taxonomy of these approaches based on the key issues in online astroturfing detection and discuss the relevant approaches in each category. The paper also summarises the discussed literature and highlights research challenges and directions for future work that the currently available research has not yet addressed.
... persecution). It allows people with hidden agenda and malicious intention to masquerade their opinions [10] as independent members of society, leading to the posting of fake reviews, the discrediting of legitimate products and giving of a false impression which leads to opinion spamming or astroturfing [27][28][29]. Another challenge to sentiment analysis is that users typically specify mixed feelings. ...
Article
Full-text available
Sentiment analysis has applications in diverse contexts, such as in the gathering and analysis of the opinions of individuals about various products, issues, and social and political events. Understanding public opinion can help improve decision making. Opinion mining is a way of retrieving information via search engines, blogs, microblogs and social networks. Individual opinions are unique to each person, and Twitter tweets are an invaluable source of this type of data. However, the huge volume and unstructured nature of text/opinion data pose a challenge to analyzing the data efficiently. Accordingly, proficient algorithms/computational strategies are required for mining and condensing tweets as well as finding sentiment-bearing words. Most existing computational methods/models/algorithms in the literature for identifying sentiments from such unstructured data rely on machine learning techniques with the bag-of-words approach as their basis. In this work, we use both unsupervised and supervised approaches on various datasets. The unsupervised approach is used for the automatic identification of sentiment for tweets acquired from the Twitter public domain. Different machine learning algorithms such as Multinomial Naive Bayes (MNB), Maximum Entropy and Support Vector Machines are applied for sentiment identification of tweets as well as to examine the effectiveness of various feature combinations. In our experiments on tweets, we achieve an accuracy of 80.68% using the proposed unsupervised approach, in comparison to the lexicon-based approach (the latter gives an accuracy of 75.20%). In our experiments, the supervised approach, in which we combine unigram, bigram and part-of-speech features, is effective in finding the emotion and sentiment of unstructured data. For short message services, using the unigram feature with the MNB classifier allows us to achieve an accuracy of 67%.
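As an illustration of the bag-of-words MNB classification described above, the following self-contained sketch implements a minimal Multinomial Naive Bayes with Laplace smoothing over unigram features. The toy "tweets" and labels are invented for demonstration and are not from the paper's dataset:

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs):
    """docs: list of (tokens, label) pairs. Returns log-priors and log-likelihoods."""
    class_tokens = defaultdict(list)
    for tokens, label in docs:
        class_tokens[label].extend(tokens)
    vocab = {w for tokens, _ in docs for w in tokens}
    n_docs = len(docs)
    log_prior = {c: math.log(sum(1 for _, l in docs if l == c) / n_docs)
                 for c in class_tokens}
    log_lik = {}
    for c, tokens in class_tokens.items():
        counts = Counter(tokens)
        denom = len(tokens) + len(vocab)  # Laplace (add-one) smoothing
        log_lik[c] = {w: math.log((counts[w] + 1) / denom) for w in vocab}
    return log_prior, log_lik, vocab

def classify(tokens, log_prior, log_lik, vocab):
    # Score each class by its prior plus summed unigram log-likelihoods
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

# Invented toy training data standing in for labelled tweets
train = [
    ("love this phone great battery".split(), "pos"),
    ("awesome camera love it".split(), "pos"),
    ("terrible screen hate this phone".split(), "neg"),
    ("awful battery hate it".split(), "neg"),
]
log_prior, log_lik, vocab = train_mnb(train)
label = classify("love great camera".split(), log_prior, log_lik, vocab)
```

Swapping the tokenizer to emit bigrams or part-of-speech tags alongside unigrams reproduces the feature combinations the paper compares.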
... Peng et al. [24] bring the notion of authorship attribution to bear on the astroturfing problem, collecting quantities of data from public social media sites and analyzing the putative individual authors to see if they appear to be the same person. The analysis comprises a binary n-gram method, which was previously shown to be effective at accurately identifying authors on a training set from the same authors, while their paper shows how authors on different social media can turn out to be the same author. ...
Article
Full-text available
Authorship attribution is concerned with identifying the authors of disputed or anonymous documents, which is potentially decisive in legal, criminal and civil cases, threatening letters and terroristic communications, as well as in computer forensics. There are two basic approaches to authorship attribution: instance-based (treat each training text individually) and profile-based (treat each author's training texts cumulatively). Both of these methods have their own advantages and disadvantages. The present paper proposes a new region-based document model for authorship identification, to address the dimensionality problem of instance-based approaches and the scalability problem of profile-based approaches. The proposed model concatenates a set of 'n' individual instance documents of an author into a single region-based instance document (RID). A compression-based similarity distance method is then used on the RID. Compression-based methods require no pre-processing and are easy to apply. This paper uses the Gzip compression algorithm with two compression-based similarity measures, NCD and CDM. The proposed compression model is character-based and can automatically capture non-word features such as word stems, punctuation, etc. The main disadvantage of compression models is their high computational complexity; the proposed RID approach addresses this issue by reducing the repeated words in the document. The approach is evaluated on English editorial columns, achieving approximately 98% accuracy in identifying the author.
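The NCD measure mentioned above can be sketched in a few lines. Here Python's zlib (the DEFLATE algorithm underlying Gzip) stands in for the paper's Gzip setup, and the sample texts are invented placeholders for author writing samples:

```python
import zlib

def c_len(s: bytes) -> int:
    # Compressed length under DEFLATE (the algorithm used by gzip), max compression
    return len(zlib.compress(s, 9))

def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance: lower means more stylistically similar."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy, cxy = c_len(bx), c_len(by), c_len(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy texts standing in for author writing samples (invented for illustration)
sample_a = "the quick brown fox jumps over the lazy dog " * 5
sample_b = "lorem ipsum dolor sit amet consectetur adipiscing elit " * 5
same_author_dist = ncd(sample_a, sample_a)
diff_author_dist = ncd(sample_a, sample_b)
```

NCD approximates shared information: if concatenating two documents compresses almost as well as one alone, the texts are likely stylistically close, which is the intuition behind attributing a disputed document to the author whose RID yields the smallest distance.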
... Moreover, recent works are exploring the integration of different techniques to analyse the content of social networks' interactions in order to identify real user profiles (Peng et al., 2016a, 2016b). An interesting example is the work by Vuong et al. (2016), proposing a hybrid approach relying on a Maximum Entropy model for classifying user comments as either spam or non-spam, based on comment content and the user's social behaviour. ...
Article
Precision farming technologies have been increasingly recognized for their potential ability for improving agricultural productivity, reducing production cost, and minimizing damage to the environment. In this context, the main goals of this paper are the following: First, we present a methodology that can be applied to extract semantic information, more specifically some vegetative indices, from plants, in order to further improve the vegetation representation and health by means of a specific semantic robotic system; then, we study in detail the tracked robot’s behavior, by emulating the real settings in a field and analytically analyze the simulation of the robot on an up and down slope path.
... The approach consisted of extracting the user's writing style, using the k-Nearest Neighbors algorithm (k-NN) to evaluate the post content and identify the user, and continuously updating the user baseline to account for trends and seasonality in the user's posts. The findings reported by the authors in this special issue, as well as those of Peng et al. [14] and Peng et al. [15], demonstrate the potential to identify users based on the textual contents of their postings on social media. ...
... Similarly, classification methods such as K-Nearest Neighbors [21], Gradient Boosted Decision Trees [23] and Support Vector Machines [17] have been used for identifying users based on their profiles. In addition, some works follow an information retrieval approach and estimate distance metrics for ranking possible users for a given content [22,33]. In this work, we employ a single-label multi-class model approach, where each class corresponds to a specific user, and the content can be assigned exclusively to one user. ...
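A minimal version of such a single-label multi-class identification scheme can be sketched as a 1-nearest-neighbour classifier over character trigram profiles. The user names and sample posts below are invented, and the cited works' actual features and classifiers may differ:

```python
import math
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Profile a text as counts of its overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify(post: str, profiles: dict) -> str:
    """Single-label multi-class decision: assign the post to the nearest user profile."""
    grams = char_ngrams(post)
    return max(profiles, key=lambda user: cosine(grams, profiles[user]))

# Invented user profiles built from previous posts
profiles = {
    "alice": char_ngrams("i absolutely adore sunny mornings and long hikes"),
    "bob":   char_ngrams("lol gonna grab sum pizza l8r m8, cant wait!!"),
}
who = identify("adore long sunny hikes in the mornings", profiles)
```

Replacing the argmax over profiles with a distance-sorted ranking gives the information-retrieval variant mentioned above.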
Article
Full-text available
User identification in social media is of crucial interest for companies and organizations for purposes of marketing, e-commerce, security and demographics. In this paper, we aim to identify users from Pinterest, a platform where users post pins, a combination of an image and a short text. This type of multi-modal content is very common nowadays, since it is a natural way in which users express their interests, emotions and opinions. Thus, the goal is to identify the user that would post a particular pin. For solving the problem, we propose a two-phase classification model. In the first phase, we train independent classifiers from image data, using a deep learning representation, and from text data, using a bag-of-words representation. During testing we apply a cascade fusion of the classifiers. In the second phase, we refine the output of the cascade for each test pin by selecting the most likely users for the test pin and re-weighting their corresponding output in the cascade by their similarity with the test pin. Our experiments show that the problem is very hard, for several reasons related to the data distribution, but they also show promising results.
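The abstract does not give the exact fusion rule; one plausible reading of a confidence-based cascade over per-user scores from the image and text classifiers can be sketched as follows (the threshold and scores are hypothetical):

```python
def cascade_fusion(image_scores: dict, text_scores: dict, threshold: float = 0.8) -> str:
    """Cascade: accept the image classifier's answer when it is confident,
    otherwise fuse per-user scores from both modalities by their product."""
    best = max(image_scores, key=image_scores.get)
    if image_scores[best] >= threshold:
        return best
    users = set(image_scores) | set(text_scores)
    fused = {u: image_scores.get(u, 0.0) * text_scores.get(u, 0.0) for u in users}
    return max(fused, key=fused.get)

# Hypothetical per-user probabilities from the two classifiers for one test pin
confident = cascade_fusion({"u1": 0.9, "u2": 0.1}, {"u1": 0.1, "u2": 0.9})
fused = cascade_fusion({"u1": 0.50, "u2": 0.45}, {"u1": 0.2, "u2": 0.9})
```

The second phase described in the abstract would then re-weight the top candidates of `fused` by their similarity to the test pin.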
... Peng et al. discussed the related content of forensic authorship analysis [5]. They analyzed user profiling in intrusion detection [6] and conducted a thorough study of astroturfing detection in media [7,8]. Osanaiye et al. studied defenses against DDoS attacks, presenting a taxonomy of the different types of cloud DDoS attacks and a corresponding taxonomy of DDoS defenses [9]. ...
Article
Full-text available
Application layer firewalls protect the trusted area network against information security risks. However, firewall performance may affect user experience. Therefore, performance analysis plays a significant role in the evaluation of application layer firewalls. This paper presents an analytic model of the application layer firewall, based on a system analysis to evaluate the capability of the firewall. In order to enable users to improve the performance of the application layer firewall with limited resources, resource allocation was evaluated to obtain the optimal resource allocation scheme in terms of throughput, delay, and packet loss rate. The proposed model employs the Erlangian queuing model to analyze the performance parameters of the system with regard to the three layers (network, transport, and application layers). Then, the analysis results of all the layers are combined to obtain the overall system performance indicators. A discrete event simulation method was used to evaluate the proposed model. Finally, limited service desk resources were allocated to obtain the values of the performance indicators under different resource allocation scenarios in order to determine the optimal allocation scheme. Under limited resource allocation, this scheme enables users to maximize the performance of the application layer firewall.
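The paper's layered Erlangian model is not reproduced in the abstract; as an illustrative building block, the classic Erlang B recurrence below computes the blocking (loss) probability for a station with m servers at offered load a Erlangs, both hypothetical parameters here:

```python
def erlang_b(servers: int, load: float) -> float:
    """Blocking probability B(m, a) for m servers at offered load a (Erlangs),
    computed with the numerically stable Erlang B recurrence:
    B(m, a) = a*B(m-1, a) / (m + a*B(m-1, a)), with B(0, a) = 1."""
    b = 1.0
    for m in range(1, servers + 1):
        b = load * b / (m + load * b)
    return b

# Example: loss probability for a hypothetical station with 4 servers at 2.5 Erlangs
loss = erlang_b(4, 2.5)
```

Adding servers at fixed load strictly lowers the blocking probability, which is the kind of trade-off the paper's resource allocation analysis explores across throughput, delay and packet loss rate.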
... The authors consider the specifics of technologies for both positive and negative influence of social networks on public opinion. The peculiarities of astroturfing technologies as the creation of artificial public opinion are considered by J. Peng, S. Detchon, K. R. Choo, H. Ashman [17]. R. Korzh, A. Peleshchyshyn, S. Fedushko, Y. Syerov ...
Article
Full-text available
The article presents the dynamics of the growth in social network users relative to the total world population from 2010 to 2018. It also identifies the most popular social networks in Ukraine. The systematic risk indicator of using social networks relative to the total number of Internet resource users is determined. Types of social intercourse in the process of creating a higher education institution's image are presented. The peculiarities of using social networks in the formation of a positive image of an educational institution are highlighted. The statistical indicators of user actions in the official group of the Faculty of Mathematics and Information Technologies of Vasyl Stus Donetsk National University in January, February and March 2019 are presented, as well as the average attraction coefficient of users depending on the subject of publications. The main technologies of astroturfing in the process of creating a negative image of a higher education institution are considered.
... The IRA's accounts have been created in such a way that they are portrayed as real American accounts. Masking the sponsor of a message so that it appears to originate from, and be supported by, grassroots participants is also known as astroturfing (Peng et al., 2017). Based on a 2018 Pew Report, 53% of Americans participated in some form of civic or political activity on social media during the year (Anderson et al., 2018). ...
... While a false front group is needed for a mass media campaign (i.e. for the media buy and speaker placement), social media provides practitioners with the ability to do astroturfing without being publicly connected to even a fake organization (Peng, Detchon, Choo, & Ashman, 2016; Zhang, Carpenter, & Ko, 2013). Indeed, the affordances provided by social media have significantly enhanced the opportunity for astroturfing: the anonymity provided by social networking platforms makes it difficult to authenticate an individual's true identity or purpose, while also reducing the costs required to manufacture fake public support for an issue or cause. ...
Article
Full-text available
This article uses social media network analysis (SMNA) to examine whether there was an astroturfing campaign on Twitter in support of the Adani Carmichael coal mine in 2017. It shows that SMNA can be used to visualize and analyze outsider lobbying activity in issue arenas and is capable of identifying networks of fake opinion. This study found that in April 2017, there was a small network of accounts that made a series of suspiciously similar pro‐Adani tweets that could be considered a form of duplicitous lobbying. However, this study concludes that these posts were likely a weak influence on public opinion in Australia and largely ineffectual as a lobbying tactic. Nevertheless, this analysis shows how communitas public interests can be subverted by covert social media campaigns used in support of corporatas goals, as well as the role digital research methods can play in protecting the integrity of public debates by exposing disingenuous actors.
... Author attribute recognition focuses on the notion of authorship attribution, which collects data from public social media sites and analyzes the putative authors to see if they are the same person. The analysis comprises a binary n-gram method, which was previously shown to be effective at accurately identifying authors on a training set from the same authors, while [28] shows how authors on different social media can turn out to be the same author. ...
Article
Full-text available
With the extensive development of big data and social networks, the user profile field has received much attention. User profiling is essential for understanding the characteristics of various users, contributing to better understanding of their requirements in specific scenarios. User-generated contents which directly reflect people’s thoughts and intention are a valuable source for profiling users, among which user reviews by nature are invaluable sources for acquiring user requirements and have drawn increasing attention from both academia and industry. However, review-based user profiling (RBUP), as an emerging research direction, has not been systematically reviewed, hindering researchers from further investigation. In this work, we carry out a systematic mapping study on review-based user profiling, with an emphasis on investigating the generic analysis process of RBUP and identifying potential research directions. Specifically, 51 out of 2478 papers were carefully selected for investigation under a standardized and systematic procedure. By carrying out in-depth analysis over such papers, we have identified a generic process that should be followed to perform review-based user profiling. In addition, we perform multi-dimensional analysis on each step of the process in order to review current research progress and identify challenges and potential research directions. The results show that although traditional methods have been continuously improved, they are not sufficient to unleash the full potential of large-scale user reviews, especially the use of heterogeneous data for multi-dimensional user profiling.
... The measure of a sockpuppet's success is the extent to which audiences are fooled into believing the false identity is the real one. A related term that is applied mostly to attempts to create false grassroots organizations is astroturfing, which occurs both on and off social media (Peng et al., 2017;Ratkiewicz et al., 2011). Most of the currently available information on the IRA's sockpuppet identities is descriptive, with both Howard et al. (2018) and DiResta et al. (2018) listing examples of conservative, progressive, African American, and Muslim American identities. ...
Article
The recent rise of disinformation and propaganda on social media has attracted strong interest from social scientists. Research on the topic has repeatedly observed ideological asymmetries in disinformation content and reception, wherein conservatives are more likely to view, redistribute, and believe such content. However, preliminary evidence has suggested that race may also play a substantial role in determining the targeting and consumption of disinformation content. Such racial asymmetries may exist alongside, or even instead of, ideological ones. Our computational analysis of 5.2 million tweets by the Russian government-funded “troll farm” known as the Internet Research Agency sheds light on these possibilities. We find stark differences in the numbers of unique accounts and tweets originating from ostensibly liberal, conservative, and Black left-leaning individuals. But diverging from prior empirical accounts, we find racial presentation—specifically, presenting as a Black activist—to be the most effective predictor of disinformation engagement by far. Importantly, these results could only be detected once we disaggregated Black-presenting accounts from non-Black liberal accounts. In addition to its contributions to the study of ideological asymmetry in disinformation content and reception, this study also underscores the general relevance of race to disinformation studies.
... The recent use of seemingly "ordinary citizens" as deflective sources in disinformation campaigns, as with the IRA's troll accounts, is also not new, nor is it limited to the Kremlin. Researchers within media studies and political communication, for example, commonly use the term "astroturfing" to describe the use of online fake accounts as a deflective source to mimic spontaneous grass roots activity (Howard and Kollanyi 2016;Peng et al. 2017;Ratkiewicz et al. 2011). Finally, to maximize success, authoritarian regimes and liberal democracies have both historically relied on what the propaganda literature refers to as pre-propaganda: propaganda that is not directly related to the political message of the propagandist (Ellul 1973). ...
Article
This paper investigates online propaganda strategies of the Internet Research Agency (IRA)—Russian “trolls”—during the 2016 U.S. presidential election. We assess claims that the IRA sought either to (1) support Donald Trump or (2) sow discord among the U.S. public by analyzing hyperlinks contained in 108,781 IRA tweets. Our results show that although IRA accounts promoted links to both sides of the ideological spectrum, “conservative” trolls were more active than “liberal” ones. The IRA also shared content across social media platforms, particularly YouTube—the second-most linked destination among IRA tweets. Although overall news content shared by trolls leaned moderate to conservative, we find troll accounts on both sides of the ideological spectrum, and these accounts maintain their political alignment. Links to YouTube videos were decidedly conservative, however. While mixed, this evidence is consistent with the IRA’s supporting the Republican campaign, but the IRA’s strategy was multifaceted, with an ideological division of labor among accounts. We contextualize these results as consistent with a pre-propaganda strategy. This work demonstrates the need to view political communication in the context of the broader media ecology, as governments exploit the interconnected information ecosystem to pursue covert propaganda strategies.
... Gang members treated social media as a "virtual street corner" and "an electronic graffiti wall", and spent a large amount of time online [74]. Due to the anonymity of social media, some people manipulate multiple social media platform accounts or multiple accounts under the same platform to engage in some illegal activities [75]. ...
Article
Full-text available
In recent years, the news industry, digital rights management, digital advertising, content production and social media have been facing inevitable problems, such as fake news spreading, press freedom conflicts, advertising fraud, difficulties in digital rights management, defects in content production mechanisms, and rumor spreading aggravated by social media. Fortunately, the development of blockchain technology, characterized by decentralization, security, and tamper-resistance, offers a brand-new perspective on the above problems. Existing research and projects on this topic firmly believe in the new opportunities for the media industry brought by blockchain technology. This study mainly focuses on the following parts: (1) Describe current difficulties faced by the media; (2) Review the relevant research and projects on blockchain applications in the media industry, and discuss the possibility of developing “Blockchain + Media”; (3) Summarize the shortcomings and challenges of blockchain application in media. Through this investigation, it is found that blockchain has great potential for media development, but current studies on the topic are still at an early stage, and the results remain to be observed.
... Moreover, recent works are exploring the integration of different techniques to analyse the content of social networks' interactions in order to identify real user profiles (Peng et al., 2016a, 2016b). An interesting example is the work by Vuong et al. (2016), proposing a hybrid approach relying on a Maximum Entropy model for classifying user comments as either spam or non-spam, based on comment content and the user's social behaviour. ...
Article
Online Social Networks (OSNs) have become a primary area of interest for cutting-edge cybersecurity applications, due to their ever increasing popularity and to the variety of data their interaction models allow for. In this perspective, most of the existing anomaly detection techniques rely on models of normal users' behaviour as defined by domain experts. However, the identification of “bad” behaviour as a probable deviation of normality still remains an open issue. Here, we propose a method for identifying human behaviour in a social network, based on a “two-step” detection strategy. In particular, we first train Markov chains on a certain number of models of normal human behaviour from social network data; then, we exploit an activity detection framework to identify unexplained activities on the basis of the normal behaviour models. Finally, the validity of our approach is tested through a set of experiments run on data extracted from Facebook.
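The two-step strategy described above can be sketched as follows: train a first-order Markov chain on sequences of normal user actions, then flag sequences whose average transition log-probability falls well below that of normal behaviour. The action names and floor probability are our own illustrative choices, not the paper's:

```python
import math
from collections import defaultdict

def train_markov(sequences):
    """Estimate first-order transition probabilities from action sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    probs = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        probs[a] = {b: c / total for b, c in nxt.items()}
    return probs

def log_likelihood(seq, probs, floor=1e-6):
    """Average log-probability of the transitions in seq; low values suggest
    unexplained (anomalous) activity under the normal-behaviour model."""
    trans = list(zip(seq, seq[1:]))
    ll = sum(math.log(probs.get(a, {}).get(b, floor)) for a, b in trans)
    return ll / max(len(trans), 1)

# Invented "normal" sessions standing in for observed social network activity
normal = [["login", "browse", "like", "comment", "logout"]] * 20
model = train_markov(normal)
ok = log_likelihood(["login", "browse", "like", "logout"], model)
odd = log_likelihood(["login", "post", "post", "post", "logout"], model)
```

A deployment would set an anomaly threshold on this score from held-out normal sessions rather than comparing two hand-picked sequences.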
Article
Full-text available
This article reflects on the conceptualization and the salient features of the ecology of e-democracy. The authors identify four distinct waves marked by technological innovations and studied under the control–participation dichotomy. In the first wave, during the 1990s, political actors begin to establish their online presence but without any other notable changes in communication. The second wave takes place from 2004 to 2008 and features the consolidation of social networks and the increasing commodification of audience engagement. The third wave begins to take shape during Obama’s 2008 election campaign, which featured micro-segmentation and the use of big data. The fourth wave, starting in 2016 with the Brexit campaign and the Cambridge Analytica scandal, has been defined by the front and center use of Artificial Intelligence. Some recent phenomena that challenge or buttress the make-up of critical public opinion are the following: a) digital platforms as political actors; b) the marked use of Artificial Intelligence and big data; c) the use of falsehoods as a political strategy, as well as other fake news and deep fake phenomena; d) the combination of hyperlocal and supranational issues; e) technological determinism; f) the search for audience engagement and co-production processes; and g) trends that threaten democracy, to wit, the polarization of opinions, astroturfing, echo chambers and bubble filters. Finally, the authors identify several challenges in research, pedagogy and politics that could strengthen democratic values, and conclude that democracy needs to be reimagined both under new research and political action frameworks, as well as through the creation of a social imaginary on democracy.
Conference Paper
To improve the speed and efficiency of access to the excessive volume of inefficiently organized micro-blog data, this paper presents a method for building the knowledge flow of a micro-blog topic. The core task in building the knowledge flow of a micro-blog topic is analyzing the information of each micro-blog (e.g., the number of likes, the number of forwards and the freshness of the micro-blog) to realize the organization of the micro-blog topic. First, we collect and process micro-blog information, including the number of likes, the number of forwards and the freshness. Then, based on the information obtained for each micro-blog, we filter the micro-blogs to keep those that are interesting/meaningful. Finally, we sort all the kept micro-blog messages browsed by a user to generate a knowledge flow of the micro-blog topic. The experimental results show that the proposed algorithm has high accuracy.
Conference Paper
Traditional methods for discovering new words in micro-blogs have difficulty extracting compound words effectively. To address this problem, this paper proposes an extraction method for new micro-blog words based on an improved Position-Word Probability (PWP) and the N-increment algorithm. First, a long micro-blog text is composed from all micro-blogs within a single topic over a given time period, and then pre-processed. Then, the extension direction of frequent strings is judged using the improved position-word probability during the query process of the N-increment algorithm. Finally, redundant strings are reduced by pruning the frequent-string set. The experimental results show that the proposed algorithm can effectively extract compound words among new micro-blog words.
Conference Paper
The accuracy of textual keyword extraction is a major factor influencing text semantic processing, and there is still much room to improve its precision. To address this, this paper proposes a method for optimizing textual keywords using prior knowledge. First, some prior knowledge useful for keyword extraction is discussed. Then, a keyword quality evaluation method based on the semantic distance between keywords is proposed to judge whether a keyword is good or bad. Next, a textual keyword optimization method based on this evaluation is proposed. Finally, experiments are carried out, the results of which show that the proposed method can improve the accuracy of keyword extraction on domain texts.
Article
Full-text available
Communication and public relations practitioners are sometimes criticized for their use of inauthentic strategies. Among these, astroturfing consists in simulating public opinion while keeping one's identity secret. This issue raises ethical questions in democratic societies, and a scientific literature is developing on the subject. However, many definitions of the concept of astroturfing coexist, complicating its study.
This article proposes a comprehensive definition of astroturfing that overcomes some of the pitfalls identified in a systematic literature review. Based on this definition, a clear typology of astroturfing tactics is established. Next, a contingency model is proposed to situate communication tactics on a continuum ranging from fully grassroots to fully astroturf. This model is then illustrated by three case studies demonstrating the hybridity of certain groups and communication operations. This hybridity raises the broader question of authenticity in communication, with consequences for how public relations practitioners are perceived in society. Disambiguating the concept and analyzing different cases pave the way for solutions to identify such tactics in the future.
Article
Twitter has a significant user base, with reportedly over 300 million active user accounts. Twitter, a microblog service, limits the length of each tweet, keeping them short and concise. The contents of tweets include news, trending topics, emotions, and opinions. This makes Twitter a source of data for social science, marketing, psychology and news. Twitter users tend to use emojis, slang, and acronyms in order to fit more content within the character limit. The use of emojis in tweets complicates efforts in text mining and emotion analysis, as such emojis can also be used to express sarcasm when used in different contexts. In this paper, we use the Twitter API to mine tweets that were geotagged by users and apply text analytics to them. We also develop a system to detect events using the geospatial emotion vector in the area we are monitoring. Combining graph theory, machine learning semantics, and statistics with the geospatial emotion vectors, we track trending topics during times of extreme emotion. Our findings suggest that Robert Park's theory of expressive groups and Gustave Le Bon's theory of social contagion hold true in the Twittersphere.
Article
Full-text available
Mining temporal association patterns from time-stamped temporal databases, first introduced in 2009, remains an active area of research. A pattern is temporally similar when it satisfies certain specified subset constraints. The naive and Apriori algorithms designed for non-temporal databases cannot be extended to find similar temporal patterns in the context of temporal databases. The brute-force approach requires performing 2^n true support computations for n items, and is hence an NP-class problem. Likewise, Apriori- or FP-tree-based algorithms designed for static databases are not directly extendable to temporal databases for retrieving temporal patterns similar to a reference prevalence of user interest. This is because the support of patterns violates the monotonicity property in temporal databases: in our case, support is a vector of values and not a single value. In this paper, we present a novel approach to retrieve temporal association patterns whose prevalence values are similar to those of a user-specified reference. This allows us to significantly reduce support computations by defining novel expressions to estimate support bounds, eliminating computational overhead in finding similar temporal patterns. We then introduce a novel dissimilarity measure, a fuzzy Gaussian-based dissimilarity measure, which also holds the monotonicity property. Our evaluations demonstrate that the proposed method outperforms brute-force and sequential approaches. We also compare its performance with that of SPAMINE, which uses the Euclidean measure. The proposed approach uses the monotonicity property to prune temporal patterns without computing unnecessary true supports and distances.
Article
Full-text available
With the development of remote sensing technologies, especially improvements in spatial, temporal, and spectral resolution, the volume of remote sensing data keeps growing. Meanwhile, the textures of the same ground object present different features at various temporal and spatial scales, making it difficult to describe the overall features of remote sensing big data across differing resolutions. To represent big-data features more conveniently and intuitively than classical methods, we propose several texture descriptors, each capturing a different aspect, based on wavelet transforms. These include a statistical descriptor based on the mean, variance, skewness, and kurtosis; a directional descriptor based on a gradient histogram; a periodicity descriptor based on auto-correlation; and a low-frequency statistical descriptor based on a Gaussian mixture model. We analyze three different types of remote sensing textures and contrast the similarities and differences of the results in three analysis domains to demonstrate the validity of the texture descriptors. Moreover, we select three factors representing texture distributions in the wavelet transform domain to verify that the descriptors improve the classification of texture types. Consequently, the texture descriptors are well suited to describing the overall features of remote sensing big data, with simple calculation and intuitive meaning.
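The statistical descriptor (mean, variance, skewness, kurtosis) over a band of coefficients can be computed in its textbook form as below. Applying it per wavelet sub-band is assumed; the paper's exact normalisation is not specified here:

```python
import math

def statistical_descriptor(values):
    """Mean, variance, skewness, and kurtosis of a band of
    (e.g. wavelet sub-band) coefficients, in textbook form."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    std = math.sqrt(var)
    if std == 0:
        # Constant band: no spread, so higher moments are zero.
        return (mean, 0.0, 0.0, 0.0)
    skew = sum(((x - mean) / std) ** 3 for x in values) / n
    kurt = sum(((x - mean) / std) ** 4 for x in values) / n
    return (mean, var, skew, kurt)
```

Computing this tuple per sub-band and concatenating the results gives a compact fixed-length texture signature.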
Article
Full-text available
Instant messaging (IM) has changed the way people communicate with each other. However, the interactive and instant nature of these applications (apps) made them an attractive choice for malicious cyber activities such as phishing. The forensic examination of IM apps for modern Windows 8.1 (or later) has been largely unexplored, as the platform is relatively new. In this paper, we seek to determine the data remnants from the use of two popular Windows Store application software for instant messaging, namely Facebook and Skype on a Windows 8.1 client machine. This research contributes to an in-depth understanding of the types of terrestrial artefacts that are likely to remain after the use of instant messaging services and application software on a contemporary Windows operating system. Potential artefacts detected during the research include data relating to the installation or uninstallation of the instant messaging application software, log-in and log-off information, contact lists, conversations, and transferred files.
Article
Full-text available
Cloud storage services are popular with both individuals and businesses as they offer cost-effective, large capacity storage and multi-functional services on a wide range of devices such as personal computers (PCs), Mac computers, and smart mobile devices (e.g. iPhones). However, cloud services have also been known to be exploited by criminals, and digital forensics in the cloud remains a challenge, partly due to the diverse range of cloud services and devices that can be used to access such services. Using SugarSync (a popular cloud storage service) as a case study, research was undertaken to determine the types and nature of volatile and non-volatile data that can be recovered from Windows 8, Mac OS X 10.9, Android 4 and iOS 7 devices when a user has carried out different activities such as upload and download of files and folders. We then document the various digital artefacts that could be recovered from the respective devices.
Article
Full-text available
One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transform-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks, with accuracy as high as 94% when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual author-level feature sets generally outperformed the use of a single group of attributes.
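A minimal sketch of the kind of lexical and structural attributes such feature sets draw on. This is not the Writeprints feature set itself, and the feature names below are illustrative:

```python
import re

def lexical_features(text: str) -> dict:
    # A few illustrative lexical/structural attributes of the kind
    # used in stylometric feature sets (not the full Writeprints set).
    words = re.findall(r"[A-Za-z']+", text)
    chars = len(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    vocab = set(w.lower() for w in words)
    return {
        "char_count": chars,
        "word_count": len(words),
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "type_token_ratio": len(vocab) / max(len(words), 1),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "digit_ratio": sum(c.isdigit() for c in text) / max(chars, 1),
    }
```

Vectors like this, computed per document, are what the classifier or transform then compares across candidate identities.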
Article
Full-text available
A large group of dictionary learning algorithms focus on adaptive sparse representation of data. Almost all of them fix the number of atoms in iterations and use unfeasible schemes to update atoms in the dictionary learning process. It's difficult, therefore, for them to train a dictionary from Big Data. A new dictionary learning algorithm is proposed here by extending the classical K-SVD method. In the proposed method, when each new batch of data samples is added to the training process, a number of new atoms are selectively introduced into the dictionary. Furthermore, only a small group of new atoms as subspace controls the current orthogonal matching pursuit, construction of error matrix, and SVD decomposition process in every training cycle. The information, from both old and new samples, is explored in the proposed incremental K-SVD (IK-SVD) algorithm, but only the current atoms are adaptively updated. This makes the dictionary better represent all the samples without the influence of redundant information from old samples.
Article
Full-text available
The task of intrinsic plagiarism detection deals with cases where no reference corpus is available, so detection is based exclusively on stylistic changes or inconsistencies within a given document. In this paper a new method is presented that attempts to quantify the style variation within a document using character n-gram profiles and a style change function based on an appropriate dissimilarity measure originally proposed for author identification. In addition, we propose a set of heuristic rules that attempt to detect plagiarism-free documents and plagiarized passages, as well as to reduce the effect of irrelevant style changes within a document. The proposed approach is evaluated on the recently available corpus of the 1st International Competition on Plagiarism Detection with promising results.
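A sketch of the core idea: profile each sliding window with character n-gram frequencies and compare it against the whole-document profile via a normalized dissimilarity. The measure below is an illustrative CNG-style choice, not necessarily the exact function used in the paper:

```python
from collections import Counter

def ngram_profile(text: str, n: int = 3) -> Counter:
    # Relative frequencies of character n-grams.
    grams = Counter(text[i:i+n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return Counter({g: c / total for g, c in grams.items()})

def dissimilarity(p1, p2) -> float:
    # Normalized profile dissimilarity; each n-gram present in only
    # one profile contributes the maximum term value of 4, so dividing
    # by 4*len(keys) maps the result into [0, 1].
    keys = set(p1) | set(p2)
    return sum(((p1[g] - p2[g]) / ((p1[g] + p2[g]) / 2)) ** 2
               for g in keys) / (4 * len(keys))

def style_change(document: str, window: int = 200, step: int = 100, n: int = 3):
    # Style change function: dissimilarity of each sliding window
    # against the profile of the whole document; peaks suggest
    # stylistically inconsistent (possibly plagiarized) passages.
    whole = ngram_profile(document, n)
    return [dissimilarity(ngram_profile(document[i:i+window], n), whole)
            for i in range(0, max(len(document) - window, 0) + 1, step)]
```

A stylistically uniform document yields a flat, near-zero curve; inserted foreign passages show up as local peaks.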
Conference Paper
Full-text available
As social networks grow in terms of users, resources and interactions, a user may become lost or unable to find useful information. Social elements such as social annotations (tags), which are becoming more and more popular, can help avoid this disorientation. Representing a user based on these social annotations has shown its utility in building an accurate user profile that can be used for recommendation. In this paper, we give a state of the art of the characteristics of the social user and of the techniques that model and update a tag-based profile. We show how to treat social annotations, and the utility of modelling tag-based profiles for recommendation purposes.
Article
Full-text available
There is an alarming increase in the number of cybercrime incidents conducted through anonymous e-mails. The problem of e-mail authorship attribution is to identify the most plausible author of an anonymous e-mail from a group of potential suspects. Most previous contributions employed a traditional classification approach, such as decision trees or Support Vector Machines (SVM), to identify the author, and studied the effects of different writing-style features on classification accuracy. However, little attention has been given to ensuring the quality of the evidence. In this paper, we introduce an innovative data mining method to capture the write-print of every suspect and model it as combinations of features that occur frequently in the suspect's e-mails. This notion, called a frequent pattern, has proven effective in many data mining applications, but this is the first time it has been applied to the problem of authorship attribution. Unlike the traditional approach, the write-print extracted by our method is unique among the suspects and therefore provides convincing and credible evidence for presentation in a court of law. Experiments on real-life e-mails suggest that the proposed method can effectively identify the author, and the results are supported by strong evidence.
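The frequent-pattern write-print idea can be sketched as follows, assuming each e-mail has already been reduced to a set of discrete stylistic feature items. The item names, the simple support counting, and the uniqueness filter are all illustrative simplifications of the paper's method:

```python
from itertools import combinations
from collections import Counter

def frequent_patterns(emails, min_support=0.5, max_size=2):
    # emails: list of feature-item sets, e.g. {"short_sentences",
    # "greeting:hi"}.  Returns itemsets (up to max_size items)
    # occurring in at least min_support of the e-mails.
    n = len(emails)
    counts = Counter()
    for items in emails:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[combo] += 1
    return {p for p, c in counts.items() if c / n >= min_support}

def writeprint(suspect_emails, other_emails, min_support=0.5):
    # Write-print idea: keep only patterns frequent in the suspect's
    # mail but not frequent in anyone else's, so the set is unique
    # to the suspect.
    mine = frequent_patterns(suspect_emails, min_support)
    others = frequent_patterns(other_emails, min_support)
    return mine - others
```

Patterns shared with other suspects are discarded, which is what makes the surviving combination set discriminating evidence.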
Conference Paper
Full-text available
Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Character n-grams are a very successful approach to representing text for stylistic purposes, since they are able to capture nuances at the lexical, syntactic, and structural levels. So far, character n-grams of fixed length have been used for authorship identification. In this paper, we propose a variable-length n-gram approach inspired by previous work on selecting variable-length word sequences. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed approach is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few common members. Moreover, we explore the significance of digits for distinguishing between authors, showing that an increase in performance can be achieved with simple text pre-processing.
Conference Paper
Full-text available
The automatic detection of shared content in written documents, which includes text reuse and its unacknowledged form, plagiarism, has become an important problem in Information Retrieval. This task requires exhaustive comparison of texts in order to determine how similar they are. However, such comparison is impossible in cases where the number of documents is too high. We have therefore designed a model for the pre-selection of closely related documents, so that the exhaustive comparison can be performed afterwards on a reduced set. We use a similarity measure based on word-level n-grams, which has proved quite effective in many applications. As this approach normally becomes impracticable for real-world large datasets, we propose a method based on a preliminary word-length encoding of the texts, substituting each word by its length, which provides three important advantages: (i) with the alphabet of the documents reduced to nine symbols, the space needed to store n-gram lists is reduced; (ii) computation times are decreased; and (iii) length n-grams can be represented in a trie, allowing more flexible and faster comparison. We show experimentally, on the basis of the perplexity measure, that the noise introduced by the length encoding does not significantly reduce the expressiveness of the text. The method is then tested on two large datasets of co-derivatives and simulated plagiarism.
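The word-length encoding itself is straightforward to sketch. Capping lengths at nine keeps the alphabet at nine symbols as described; the Jaccard overlap used below as a pre-selection score is an illustrative choice, not the paper's exact measure:

```python
def length_encode(text: str) -> str:
    # Replace each word by its length (capped at 9, so the alphabet
    # stays at nine symbols as in the approach described above).
    return "".join(str(min(len(w), 9)) for w in text.split())

def length_ngrams(text: str, n: int = 3) -> set:
    code = length_encode(text)
    return {code[i:i+n] for i in range(len(code) - n + 1)}

def overlap(a: str, b: str, n: int = 3) -> float:
    # Jaccard overlap of length n-grams: a cheap pre-selection score
    # for picking candidate pairs before any exhaustive comparison.
    ga, gb = length_ngrams(a, n), length_ngrams(b, n)
    return len(ga & gb) / max(len(ga | gb), 1)
```

Only document pairs whose cheap overlap score is high would proceed to the expensive word-level comparison.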
Conference Paper
Full-text available
Source code author identification deals with the task of identifying the most likely author of a computer program, given a set of predefined candidate authors. It is usually based on the analysis of other program samples of undisputed authorship by the same programmer. There are several cases where such a method could be of major benefit, such as authorship disputes, proof of authorship in court, and tracing the source of code left in a system after a cyber attack. We present a new approach, called the SCAP (Source Code Author Profiles) approach, based on byte-level n-gram profiles that represent a source code author's style. Experiments on data sets of different programming languages (Java or C++) and varying difficulty (6 to 30 candidate authors) demonstrate the effectiveness of the proposed approach. A comparison with a previous source code authorship identification study based on more complicated information shows that the SCAP approach is language independent and that n-gram author profiles are better able to capture the idiosyncrasies of source code authors. Moreover, the SCAP approach deals surprisingly well with cases where only a limited amount of very short programs per programmer is available for training. It is also demonstrated that the effectiveness of the proposed model is not affected by the absence of comments in the source code, a condition usually met in cyber-crime cases.
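A sketch of the SCAP idea: an author profile is the set of the L most frequent byte-level n-grams of that author's code, and attribution goes to the candidate whose profile overlaps most with the unknown program's profile (the simplified profile intersection). The parameter values and sample programs below are illustrative:

```python
from collections import Counter

def scap_profile(source: bytes, n: int = 4, L: int = 1500) -> set:
    # Profile = the L most frequent byte-level n-grams of the code.
    grams = Counter(source[i:i+n] for i in range(len(source) - n + 1))
    return {g for g, _ in grams.most_common(L)}

def spi(profile_a: set, profile_b: set) -> int:
    # Simplified Profile Intersection: size of the overlap between
    # two profiles.
    return len(profile_a & profile_b)

def attribute(unknown: bytes, candidates: dict, n: int = 4, L: int = 1500):
    # Attribute the unknown program to the candidate with the
    # largest profile intersection.
    target = scap_profile(unknown, n, L)
    return max(candidates,
               key=lambda a: spi(target, scap_profile(candidates[a], n, L)))
```

Because the profiles are built from raw bytes, the same code works unchanged on Java, C++, or any other language's source.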
Article
Full-text available
Statistical authorship attribution has a long history, culminating in the use of modern machine learning classification methods. Nevertheless, most of this work suffers from the limitation of assuming a small closed set of candidate authors and essentially unlimited training text for each. Real-life authorship attribution problems, however, typically fall short of this ideal. Thus, following detailed discussion of previous work, three scenarios are considered here for which solutions to the basic attribution problem are inadequate. In the first variant, the profiling problem, there is no candidate set at all; in this case, the challenge is to provide as much demographic or psychological information as possible about the author. In the second variant, the needle-in-a-haystack problem, there are many thousands of candidates for each of whom we might have a very limited writing sample. In the third variant, the verification problem, there is no closed candidate set but there is one suspect; in this case, the challenge is to determine if the suspect is or is not the author. For each variant, it is shown how machine learning methods can be adapted to handle the special challenges of that variant. © 2009 Wiley Periodicals, Inc.
Article
Full-text available
The threat of malware on mobile devices is gaining attention recently. It is important to provide security solutions to these devices before these threats cause widespread damage. However, mobile devices have severe resource constraints in terms of memory and power. Hence, even though there are well developed techniques for malware detection on the PC domain, it requires considerable effort to adapt these techniques for mobile devices. In this paper, we outline the considerations for malware detection on mobile devices and propose a signature based malware detection method. Specifically, we detail a signature matching algorithm that is well suited for use in mobile device scanning due to its low memory requirements. Additionally, the matching algorithm is shown to have high scanning speed which makes it unobtrusive to users. Our evaluation and comparison study with the well known Clam-AV scanner shows that our solution consumes less than 50% of the memory used by Clam-AV while maintaining a fast scanning rate.
Article
Full-text available
We initiate a systematic study to help distinguish a special group of online users, called hidden paid posters, or termed "Internet water army" in China, from the legitimate ones. On the Internet, the paid posters represent a new type of online job opportunity. They get paid for posting comments and new threads or articles on different online communities and websites for some hidden purposes, e.g., to influence the opinion of other people towards certain social events or business markets. Though an interesting strategy in business marketing, paid posters may create a significant negative effect on the online communities, since the information from paid posters is usually not trustworthy. When two competitive companies hire paid posters to post fake news or negative comments about each other, normal online users may feel overwhelmed and find it difficult to put any trust in the information they acquire from the Internet. In this paper, we thoroughly investigate the behavioral pattern of online paid posters based on real-world trace data. We design and validate a new detection mechanism, using both non-semantic analysis and semantic analysis, to identify potential online paid posters. Our test results with real-world datasets show a very promising performance.
Article
Full-text available
We present a method for authorship discrimination that is based on the frequency of bigrams of syntactic labels that arise from partial parsing of the text. We show that this method, alone or combined with other classification features, achieves a high accuracy on discrimination of the work of Anne and Charlotte Brontë, which is very difficult to do by traditional methods. Moreover, high accuracies are achieved even on fragments of text little more than 200 words long. © The Author 2007. Published by Oxford University Press on behalf of ALLC and ACH. All rights reserved.
Conference Paper
Full-text available
This paper deals with the problem of author identification. The common N-grams (CNG) method [6] is a language-independent profile-based approach with good results in many author identification experiments so far. A variation of this approach is presented based on new distance measures that are quite stable for large profile length values. Special emphasis is given to the degree upon which the effectiveness of the method is affected by the available training text samples per author. Experiments based on text samples on the same topic from the Reuters Corpus Volume 1 are presented using both balanced and imbalanced training corpora. The results show that CNG with the proposed distance measures is more accurate when only limited training text samples are available, at least for some of the candidate authors, a realistic condition in author identification problems.
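For reference, the original CNG method compares normalized frequencies of the most common character n-grams in two profiles using a relative distance; the paper above proposes variant measures that remain stable for large profile lengths L. A minimal sketch of the original scheme, with illustrative parameter values and sample texts:

```python
from collections import Counter

def profile(text: str, n: int = 3, L: int = 500) -> dict:
    # Author profile: normalized frequencies of the L most common
    # character n-grams.
    grams = Counter(text[i:i+n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(L)}

def cng_distance(p1: dict, p2: dict) -> float:
    # Original CNG relative distance: sum over the union of n-grams
    # of the squared relative frequency difference.
    d = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        d += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return d

def identify(unknown: str, authors: dict, n: int = 3, L: int = 500):
    # Attribute the unknown text to the author with the closest profile.
    target = profile(unknown, n, L)
    return min(authors,
               key=lambda a: cng_distance(target, profile(authors[a], n, L)))
```

Each n-gram that appears in only one profile contributes a fixed 4 to the distance, which is why very large L values can destabilize the measure and motivates the paper's alternative distances.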
Conference Paper
Astroturfing is appearing in numerous contexts in social media, with individuals posting product reviews or political commentary under a number of different names, and is of concern because of the intended deception. An astroturfer works with the aim of making it seem that a large number of people hold the same opinion, promoting a consensus based on the astroturfer's intentions. It is generally done for commercial or political advantage, often by paid writers or ideologically-motivated writers. This paper brings the notion of authorship attribution to bear on the astroturfing problem, collecting quantities of data from public social media sites and analysing the putative individual authors to see if they appear to be the same person. The analysis comprises a binary n-gram method which was previously shown to be effective at accurately identifying authors on a training set from the same authors, while this paper shows how authors on different social media turn out to be the same author.
Article
Intrusion detection systems are important for detecting and reacting to the presence of unauthorised users of a network or system. They observe the actions of the system and its users and make decisions about the legitimacy of the activity and the users. Much work on intrusion detection has focused on analysing the actions triggered by users, on the basis that atypical or disallowed actions may represent unauthorised use. It is also feasible to observe users' own behaviour to see whether they are acting in their 'usual' way, and to report any sufficiently aberrant behaviour. Doing this requires a user profile, a feature found more often in marketing and education, but increasingly in security contexts. In this paper, we survey the literature on intrusion detection and prevention systems from the viewpoint of exploiting the behaviour of the user, in the context of their user profile, to confirm or deny the legitimacy of their presence on the system; that is, a review of intrusion detection and prevention systems aimed at user profiling. User behaviour can be measured with behavioural biometrics, such as keystroke speeds or mouse use, but also with psychometrics, which measure higher-order cognitive functions such as language and preferences.
Article
The increasing use of smartphones and cloud storage apps allows users to access their data anywhere, anytime. Due to the potential of mobile devices being used and/or targeted by criminals, such devices are an important source of evidence in investigations of both cybercrime and traditional crimes, such as drug trafficking. In this paper, we study the MEGA cloud client app, an increasingly popular alternative to Google Drive, Dropbox and OneDrive, on both Android and iOS platforms. In our study, we identify a range of artefacts arising from user activities, such as login, uploading, downloading, deletion, and the sharing of files, which could be forensically recovered, as well as findings such as modification of files’ timestamps. Our findings contribute to an up-to-date understanding of cloud storage forensics.
Article
Users interact with social media in a number of ways, providing a variety of data, from ratings and approvals to quantities of text. Public discussion for hotspots in particular generates significant volume and velocity of user-contributed text, frequently attributable to a user identifier or nom de plume. It may be feasible to determine authorship of various tracts of text on social media using n-gram analysis on the bit-level rendition of the text. This paper explores the facility of bit-level n-gram analysis with other statistical classification approaches for determining authorship on two months of captured user postings from an online news and opinion website with moderated discussion. The results show that this approach can achieve a good recognition rate with a low false negative rate.
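A bit-level rendition differs from character-level n-grams in that the sliding window need not align with byte boundaries. A minimal sketch of extracting bit n-grams from a posting, plus one illustrative similarity a downstream classifier could consume (the cosine measure is an assumption, not necessarily what the paper's classifiers use):

```python
from collections import Counter

def bit_ngrams(text: str, n: int = 8) -> Counter:
    # Render the text as a bit string and slide an n-bit window over
    # it; windows may straddle byte boundaries, which is what makes
    # this finer-grained than character-level n-grams.
    bits = "".join(f"{byte:08b}" for byte in text.encode("utf-8"))
    return Counter(bits[i:i+n] for i in range(len(bits) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity of two bit n-gram frequency vectors,
    # usable as input to a statistical classifier.
    dot = sum(a[g] * b[g] for g in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0
```

Comparing a candidate posting's bit n-gram vector against per-author reference vectors then reduces authorship attribution to a nearest-vector decision.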
Article
To reduce the risk of digital forensic evidence being called into question in judicial proceedings, it is important to have a rigorous methodology and set of procedures for conducting digital forensic investigations and examinations. Digital forensic investigation in the cloud computing environment, however, is in infancy due to the comparatively recent prevalence of cloud computing. Cloud Storage Forensics presents the first evidence-based cloud forensic framework. Using three popular cloud storage services and one private cloud storage service as case studies, the authors show you how their framework can be used to undertake research into the data remnants on both cloud storage servers and client devices when a user undertakes a variety of methods to store, upload, and access data in the cloud. By determining the data remnants on client devices, you gain a better understanding of the types of terrestrial artifacts that are likely to remain at the Identification stage of an investigation. Once it is determined that a cloud storage service account has potential evidence of relevance to an investigation, you can communicate this to legal liaison points within service providers to enable them to respond and secure evidence in a timely manner.
Article
The rapid growth in usage and application of Social Networking (SN) platforms make them a potential target by cyber criminals to conduct malicious activities such as identity theft, piracy, illegal trading, sexual harassment, cyber stalking and cyber terrorism. Many SN platforms are extending their services to mobile platforms, making them an important source of evidence in cyber investigation cases. Therefore, understanding the types of potential evidence of users’ SN activities available on mobile devices is crucial to forensic investigation and research. In this paper, we examine four popular SN applications: Facebook, Twitter, LinkedIn and Google+, on Android and iOS platforms, to detect remnants of users’ activities that are of forensic interest. We detect a variety of artefacts (e.g. usernames, passwords, login information, personal information, uploaded posts, exchanged messages and uploaded comments from SN applications) that could facilitate a criminal investigation.
Article
Cloud storage has been identified as an emerging challenge to digital forensic researchers and practitioners in a range of literature. There are various types of cloud storage services with each type having a potentially different use in criminal activity. One area of difficulty is the identification, acquisition, and preservation of evidential data when disparate services can be utilised by criminals. Not knowing if a cloud service is being used, or which cloud service, can potentially impede an investigation. It would take additional time to contact all service providers to determine if data is being stored within their cloud service. Using Dropbox™ as a case study, research was undertaken to determine the data remnants on a Windows 7 computer and an Apple iPhone 3G when a user undertakes a variety of methods to store, upload, and access data in the cloud. By determining the data remnants on client devices, we contribute to a better understanding of the types of terrestrial artifacts that are likely to remain for digital forensics practitioners and examiners. Potential information sources identified during the research include client software files, prefetch files, link files, network traffic capture, and memory captures, with many data remnants available subsequent to the use of Dropbox by a user.
Article
Cloud storage is an emerging challenge to digital forensic examiners. The services are increasingly used by consumers, business, and government, and can potentially store large amounts of data. The retrieval of digital evidence from cloud storage services (particularly from offshore providers) can be a challenge in a digital forensic investigation, due to virtualisation, lack of knowledge on location of digital evidence, privacy issues, and legal or jurisdictional boundaries. Google Drive is a popular service, providing users a cost-effective, and in some cases free, ability to access, store, collaborate, and disseminate data. Using Google Drive as a case study, artefacts were identified that are likely to remain after the use of cloud storage, in the context of the experiments, on a computer hard drive and Apple iPhone 3G, and the potential access point(s) for digital forensics examiners to secure evidence.
Article
In digital forensics, questions often arise about the authors of documents: their identity, demographic background, and whether they can be linked to other documents. The field of stylometry uses linguistic features and machine learning techniques to answer these questions. While stylometry techniques can identify authors with high accuracy in non-adversarial scenarios, their accuracy is reduced to random guessing when faced with authors who intentionally obfuscate their writing style or attempt to imitate that of another author. While these results are good for privacy, they raise concerns about fraud. We argue that some linguistic features change when people hide their writing style and by identifying those features, stylistic deception can be recognized. The major contribution of this work is a method for detecting stylistic deception in written documents. We show that using a large feature set, it is possible to distinguish regular documents from deceptive documents with 96.6% accuracy (F-measure). We also present an analysis of linguistic features that can be modified to hide writing style.
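The abstract above describes classifying documents by stylometric features. As an illustration only (this is not the paper's feature set, which is far larger), a minimal sketch of extracting a small stylometric feature vector might look like this; the function-word list and feature names are hypothetical:

```python
from collections import Counter
import re

# A tiny illustrative subset of English function words; real stylometric
# systems use hundreds of features of many kinds.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is"]

def stylometric_features(text):
    """Extract a small stylometric feature vector: average word length,
    type-token ratio, and relative function-word frequencies."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    n = len(words)
    feats = {
        "avg_word_len": sum(len(w) for w in words) / n,
        "type_token_ratio": len(counts) / n,
    }
    for fw in FUNCTION_WORDS:
        feats["fw_" + fw] = counts[fw] / n
    return feats
```

Vectors like these would then be fed to a classifier trained to separate regular from deceptive writing.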
Article
Recently, the computational requirements for large-scale data-intensive analysis of scientific data have grown significantly. In High Energy Physics (HEP) for example, the Large Hadron Collider (LHC) produced 13 petabytes of data in 2010. This huge amount of data is processed on more than 140 computing centers distributed across 34 countries. The MapReduce paradigm has emerged as a highly successful programming model for large-scale data-intensive computing applications. However, current MapReduce implementations are developed to operate on single cluster environments and cannot be leveraged for large-scale distributed data processing across multiple clusters. On the other hand, workflow systems are used for distributed data processing across data centers. It has been reported that the workflow paradigm has some limitations for distributed data processing, such as reliability and efficiency. In this paper, we present the design and implementation of G-Hadoop, a MapReduce framework that aims to enable large-scale distributed computing across multiple clusters.
Article
To overcome the lack of security provided by passwords for authentication and access control, some researchers have investigated the field of biometrics for individual identification, such as voice recognition, fingerprints or handwritten signatures. Because of the significant amount of processing and memory space required by those approaches, implementing them in a physically secure environment such as a smart card remains difficult. Moreover, biometric systems based on physiological criteria suffer from the possibility that a potential intruder could imitate these features. In this paper we propose a new system of biometric identification based on the behavioral recognition of keyboard signature. Such a system provides the user with more security in the sense that a behavior is difficult to copy. The simplicity and reliability of our approach, compared to the high power of discrimination it provides, make it suitable for built-in smart card applications. A neural network implementation through supervised and self-organizing techniques is also discussed in this paper and evaluated in terms of efficiency and performance. Because this new biometric system must be further evaluated with a large community of users, we finally discuss a proposal for testing purposes on the World-Wide Web using the emerging Java language.
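The core signal in keystroke-dynamics biometrics is the timing between keypresses. A minimal sketch, assuming keypress timestamps in milliseconds and using a simple Euclidean-distance threshold as a toy stand-in for the neural classifiers the paper discusses (the function names and tolerance value are hypothetical):

```python
import math

def latency_vector(timestamps):
    """Inter-key latencies (ms) computed from consecutive keypress timestamps."""
    return [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]

def euclidean(a, b):
    """Euclidean distance between two equal-length latency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def matches_template(sample, template, tolerance=50.0):
    """Accept the sample if its latency pattern lies within `tolerance`
    of the enrolled template's pattern."""
    return euclidean(latency_vector(sample), latency_vector(template)) <= tolerance
```

A real deployment would enroll several typing samples per user and learn per-user tolerances rather than fix one globally.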
Article
Current intrusion detection systems (IDS) examine all data features to detect intrusion or misuse patterns. Some of the features may be redundant or contribute little (if anything) to the detection process. The purpose of this study is to identify important input features in building an IDS that is computationally efficient and effective. We investigated the performance of two feature selection algorithms involving Bayesian networks (BN) and Classification and Regression Trees (CART) and an ensemble of BN and CART. Empirical results indicate that significant input feature selection is important to design an IDS that is lightweight, efficient and effective for real-world detection systems. Finally, we propose a hybrid architecture for combining different feature selection algorithms for real-world intrusion detection.
Article
We survey the authorship attribution of documents given some prior stylistic characteristics of the author's writing extracted from a corpus of known works, e.g., authentication of disputed documents or literary works. Although the pioneering paper based on word length histograms appeared at the very end of the nineteenth century, the resolution power of this and other stylometry approaches is yet to be studied both theoretically and on case studies, such that additional information can assist in finding the correct attribution. We survey several theoretical approaches, including ones approximating the apparently nearly optimal one based on Kolmogorov conditional complexity, and some case studies: attributing the Shakespeare canon and newly discovered works as well as allegedly M. Twain's newly discovered works, binary discrimination of the Federalist papers (Madison vs. Hamilton) using Naive Bayes and other classifiers, and steganography presence testing. The latter topic is complemented by a sketch of an anagram ambiguity study based on the Shannon cryptography theory.
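The Naive Bayes discrimination mentioned above (Madison vs. Hamilton on the Federalist papers) can be sketched from scratch with word counts and add-one smoothing. This is a generic multinomial Naive Bayes illustration, not the survey's specific setup; the toy training documents below are invented:

```python
import math
from collections import Counter

def train_nb(docs_by_author):
    """Multinomial Naive Bayes over word counts, with add-one smoothing.
    `docs_by_author` maps an author label to a list of text strings."""
    vocab = {w for docs in docs_by_author.values() for d in docs for w in d.split()}
    model = {}
    for author, docs in docs_by_author.items():
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        model[author] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return model

def attribute(model, text):
    """Return the author maximizing the log-likelihood of the text;
    words outside the vocabulary fall back to the smallest known probability."""
    def score(author):
        probs = model[author]
        floor = min(probs.values())
        return sum(math.log(probs.get(w, floor)) for w in text.split())
    return max(model, key=score)
```

In the historical study, discriminating function words such as "whilst" vs. "while" carried much of the signal, which is why even such a simple model works well.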
Article
Data preprocessing is widely recognized as an important stage in anomaly detection. This paper reviews the data preprocessing techniques used by anomaly-based network intrusion detection systems (NIDS), concentrating on which aspects of the network traffic are analyzed, and what feature construction and selection methods have been used. Motivation for the paper comes from the large impact data preprocessing has on the accuracy and capability of anomaly-based NIDS. The review finds that many NIDS limit their view of network traffic to the TCP/IP packet headers. Time-based statistics can be derived from these headers to detect network scans, network worm behavior, and denial of service attacks. A number of other NIDS perform deeper inspection of request packets to detect attacks against network services and network applications. More recent approaches analyze full service responses to detect attacks targeting clients. The review covers a wide range of NIDS, highlighting which classes of attack are detectable by each of these approaches. Data preprocessing is found to predominantly rely on expert domain knowledge for identifying the most relevant parts of network traffic and for constructing the initial candidate set of traffic features. On the other hand, automated methods have been widely used for feature extraction to reduce data dimensionality, and feature selection to find the most relevant subset of features from this candidate set. The review shows a trend toward deeper packet inspection to construct more relevant features through targeted content parsing. These context-sensitive features are required to detect current attacks.
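A time-based statistic derived from packet headers, of the kind the review describes for scan detection, can be sketched with a sliding window over distinct destination ports per source. This is a generic illustration, not any surveyed system; the window and threshold values are arbitrary:

```python
from collections import defaultdict, deque

def scan_detector(packets, window=10.0, max_ports=5):
    """Flag source IPs contacting more than `max_ports` distinct destination
    ports within a sliding time window of `window` seconds.
    `packets` is an iterable of (timestamp, src_ip, dst_port) header tuples."""
    recent = defaultdict(deque)  # src_ip -> deque of (timestamp, port)
    flagged = set()
    for ts, src, port in packets:
        q = recent[src]
        q.append((ts, port))
        # Drop header records that have aged out of the window.
        while q and ts - q[0][0] > window:
            q.popleft()
        if len({p for _, p in q}) > max_ports:
            flagged.add(src)
    return flagged
```

Note that only header fields (timestamp, addresses, ports) are consulted, matching the review's observation that many NIDS never look past the TCP/IP headers.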
Article
We have been developing signature-based methods in the telecommunications industry for the past 5 years. In this paper, we describe our work as it evolved due to improvements in technology and our aggressive attitude toward scale. We discuss the types of features that our signatures contain, nuances of how these are updated through time, our treatment of outliers, and the trade-off between time-driven and event-driven processing. We provide a number of examples, all drawn from the application of signatures to toll fraud detection.
Article
Authorship attribution supported by statistical or computational methods has a long history starting from the 19th century and is marked by the seminal study of Mosteller and Wallace (1964) on the authorship of the disputed “Federalist Papers.” During the last decade, this scientific field has been developed substantially, taking advantage of research advances in areas such as machine learning, information retrieval, and natural language processing. The plethora of available electronic texts (e.g., e-mail messages, online forum messages, blogs, source code, etc.) indicates a wide variety of applications of this technology, provided it is able to handle short and noisy text from multiple candidate authors. In this article, a survey of recent advances of the automated approaches to attributing authorship is presented, examining their characteristics for both text representation and text classification. The focus of this survey is on computational requirements and settings rather than on linguistic or literary issues. We also discuss evaluation methodologies and criteria for authorship attribution studies and list open questions that will attract future work in this area.
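Among the text representations this survey covers, character n-gram profiles are a common choice (and are related to the n-gram method of the main paper). A minimal sketch of building a profile and comparing two texts by cosine similarity, purely illustrative:

```python
import math
from collections import Counter

def ngram_profile(text, n=3):
    """Relative-frequency profile of the character n-grams in a text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return {g: counts[g] / total for g in counts}

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[g] * q.get(g, 0.0) for g in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)
```

Attribution then reduces to assigning a disputed text to the candidate author whose profile is most similar; the survey discusses many refinements over this baseline.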
Article
Many criminals exploit the convenience of anonymity in the cyber world to conduct illegal activities. E-mail is the most commonly used medium for such activities. Extracting knowledge and information from e-mail text has become an important step for cybercrime investigation and evidence collection. Yet, it is one of the most challenging and time-consuming tasks due to special characteristics of e-mail datasets. In this paper, we focus on the problem of mining the writing styles from a collection of e-mails written by multiple anonymous authors. The general idea is to first cluster the anonymous e-mails by their stylometric features and then extract the writeprint, i.e., the unique writing style, from each cluster. We emphasize that the presented problem, together with our proposed solution, is different from the traditional problem of authorship identification, which assumes training data is available for building a classifier. Our proposed method is particularly useful in the initial stage of investigation, in which the investigator usually has very little information about the case and the true authors of the suspicious e-mail collection. Experiments on a real-life dataset suggest that clustering by writing style is a promising approach for grouping e-mails written by the same author.
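The grouping step in the abstract above (cluster anonymous messages by style before extracting a writeprint) can be sketched with a greedy single-pass clusterer. This is a deliberately simplified illustration, not the paper's clustering algorithm; the similarity function and threshold are supplied by the caller:

```python
def cluster_by_style(profiles, similarity, threshold=0.9):
    """Greedy single-pass clustering: assign each document to the first
    cluster whose representative profile is similar enough, otherwise
    start a new cluster. `profiles` is a list of (doc_id, profile) pairs;
    `similarity(a, b)` returns a score in [0, 1]."""
    clusters = []  # list of (representative_profile, [doc_ids])
    for doc_id, prof in profiles:
        for rep, members in clusters:
            if similarity(rep, prof) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((prof, [doc_id]))
    return [members for _, members in clusters]
```

In practice one would use proper stylometric profiles and a standard clustering algorithm (e.g. k-means or agglomerative clustering) rather than this first-fit heuristic.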
In this article, we study the problem of Web user profiling, which is aimed at finding, extracting, and fusing the “semantic”-based user profile from the Web. Previously, Web user profiling was often undertaken by creating a list of keywords for the user, which is (sometimes even highly) insufficient for main applications. This article formalizes the profiling problem as several subtasks: profile extraction, profile integration, and user interest discovery. We propose a combination approach to deal with the profiling tasks. Specifically, we employ a classification model to identify relevant documents for a user from the Web and propose a Tree-Structured Conditional Random Fields (TCRF) to extract the profile information from the identified documents; we propose a unified probabilistic model to deal with the name ambiguity problem (several users with the same name) when integrating the profile information extracted from different sources; finally, we use a probabilistic topic model to model the extracted user profiles, and construct the user interest model. Experimental results on an online system show that the combination approach to different profiling tasks clearly outperforms several baseline methods. The extracted profiles have been applied to expert finding, an important application on the Web. Experiments show that the accuracy of expert finding can be improved (ranging from +6% to +26% in terms of MAP) by taking advantage of the profiles.
Conference Paper
Many researchers have applied statistical analysis techniques to email for classification purposes, such as identifying spam messages. Such approaches can be highly effective, however many examine incoming email exclusively — which does not provide detailed information about an individual user's behavior. Only by analyzing outgoing messages can a user's behavior be ascertained. Our contributions are: the use of empirical analysis to select an optimum, novel collection of behavioral features of a user's email traffic that enables the rapid detection of abnormal email activity; and a demonstration of the effectiveness of outgoing email analysis using an application that detects worm propagation.
Article
Research in automatic text plagiarism detection focuses on algorithms that compare suspicious documents against a collection of reference documents. Recent approaches perform well in identifying copied or modified foreign sections, but they assume a closed world where a reference collection is given. This article investigates the question whether plagiarism can be detected by a computer program if no reference can be provided, e.g., if the foreign sections stem from a book that is not available in digital form. We call this problem class intrinsic plagiarism analysis; it is closely related to the problem of authorship verification. Our contributions are threefold. (1) We organize the algorithmic building blocks for intrinsic plagiarism analysis and authorship verification and survey the state of the art. (2) We show how the meta learning approach of Koppel and Schler, termed “unmasking”, can be employed to post-process unreliable stylometric analysis results. (3) We operationalize and evaluate an analysis chain that combines document chunking, style model computation, one-class classification, and meta learning.
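The intrinsic analysis chain above (chunk the document, compute a style model per chunk, flag deviating chunks) can be sketched with a single style statistic and a z-score test. This is a toy illustration of the idea, not the paper's one-class classification and unmasking pipeline; the statistic and threshold are placeholders:

```python
import math

def flag_outlier_chunks(chunks, feature, z_thresh=2.0):
    """Intrinsic-analysis sketch: compute one style statistic per chunk and
    flag the indices of chunks deviating from the document mean by more
    than `z_thresh` standard deviations. `feature(chunk)` returns a float."""
    vals = [feature(c) for c in chunks]
    mean = sum(vals) / len(vals)
    sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    if sd == 0:
        return []  # perfectly uniform style: nothing to flag
    return [i for i, v in enumerate(vals) if abs(v - mean) > z_thresh * sd]
```

The flagged chunks are candidates for plagiarized or foreign-authored passages; the full chain in the paper replaces the single statistic with a style model and post-processes the unreliable raw decisions via meta learning.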
Detecting and tracking the spread of astroturf memes in microblog streams
  • Ratkiewicz J
  • Conover M
  • Meiss M
Notebook for PAN at CLEF 2013
  • Shrestha P
  • Solorio T
An Android application sandbox system for suspicious software detection. In 5th International Conference on Malicious and Unwanted Software (MALWARE 2010)
  • Blasing T
  • Schmidt A-D
  • Batyuk L
  • Camtepe SA
  • Albayrak S
Word length n-grams for text re-use detection. Computational Linguistics and Intelligent Text Processing
  • Barrón-Cedeño A
  • Basile C
  • Degli Esposti M
  • Rosso P
A user profile modeling using social annotations: a survey. In Proceedings of the 21st International Conference Companion on World Wide Web (WWW ’12 Companion)
  • Mezghani M
  • Zayani CA
  • Amous I
  • Gargouri F