Conference Paper

Abstract

Product and service reviews can markedly influence consumer purchase decisions, leading to financial gains or losses for businesses. There is therefore growing interest in techniques for identifying reviews that could negatively or positively bias new customers. To this end, we propose a visual analysis of reviews that enables quick elicitation of interesting patterns and singularities. The proposed approach is based on a theoretically sound framework, and its effectiveness and viability are demonstrated by its application to real data extracted from Tripadvisor and Booking.com.




... The work in [17], which focuses on review manipulation, exploits reviewer-centric and hotel-centric features to identify outliers: it compares hotel reviews and related features across different review sites, detecting suspicious hotels more effectively than checking the reviews on each site in isolation. Relying on visualization tools, the authors of [6] highlight suspicious changes in review scores, while the work in [7] proposes new score aggregators that make review systems robust against the injection of fake scores. ...
Conference Paper
Full-text available
In this paper, we analyse a dataset of hotel reviews. In detail, we enrich the review dataset by extracting additional features, consisting of information on the reviewers' profiles and the reviewed hotels. We argue that the enriched data can yield insights into the factors that most influence consumers when composing reviews (e.g., whether the appreciation for a certain kind of hotel is tied to specific users' profiles). Thus, we apply statistical analyses to reveal whether specific characteristics of reviewers are (almost) always related to specific characteristics of hotels. Our experiments are carried out on a very large dataset, consisting of around 190k hotel reviews collected from the Tripadvisor website.
... Carvalho & Chaves [5] [4] follow an identical approach, i.e., analyse hotel reviews using relevant adjectives through a concept ontology and provide three visualisation techniques: a bubble tree, tree map and plot visualisation. Colantonio et al. [6] propose a matrix-based visualisation approach reusing the Access Data visualiseR (ADVISER) algorithm [7] in order to identify singularities or trends. ...
Conference Paper
Full-text available
Tourist behaviour has changed significantly over recent decades due to technological advancement (e.g., ubiquitous access to the Web) and Web 2.0 approaches (e.g., crowdsourcing). Tourism crowdsourcing includes experience sharing in the form of ratings and reviews (evaluation-based), pages (wiki-based), likes, posts, images or videos (social-network-based). The main contribution of this paper is a tourist-centred off-line and on-line analysis, using hotel ratings and reviews, to discover and present relevant trends and patterns to tourists and businesses. On the one hand, on-line, we provide a list of the top ten hotels, according to the user query, ordered by overall rating, price and the ratio between positive and negative review word clouds. On the other hand, off-line, we apply Multiple Linear Regression to identify the most relevant ratings that influence the hotel overall rating, and generate hotel clusters based on these ratings.
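The off-line regression step can be sketched with ordinary least squares. The paper's actual feature set is not given here, so the two aspect ratings and the synthetic data below are illustrative assumptions:

```python
import numpy as np

def fit_overall_rating(aspect_ratings, overall):
    """Ordinary least squares: overall ~ intercept + coefficients . aspects.

    aspect_ratings: (n_hotels, n_aspects) array of sub-ratings
                    (e.g. cleanliness, service -- hypothetical aspects).
    overall:        (n_hotels,) array of overall ratings.
    Returns (intercept, coefficient array).
    """
    X = np.column_stack([np.ones(len(overall)), aspect_ratings])
    beta, *_ = np.linalg.lstsq(X, overall, rcond=None)
    return beta[0], beta[1:]

# Synthetic data: overall = 1 + 0.5*aspect1 + 0.3*aspect2, no noise,
# so the fit should recover the coefficients exactly.
rng = np.random.default_rng(0)
aspects = rng.uniform(1, 5, size=(50, 2))
overall = 1.0 + aspects @ np.array([0.5, 0.3])
intercept, coef = fit_overall_rating(aspects, overall)
```

The fitted coefficients indicate which sub-ratings most influence the overall rating; the same coefficients could then feed the clustering step.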
Conference Paper
Full-text available
Popular Internet services in recent years have shown that remarkable things can be achieved by harnessing the power of the masses using crowd-sourcing systems. However, crowd-sourcing systems can also pose a real challenge to existing security mechanisms deployed to protect Internet services. Many of these security techniques rely on the assumption that malicious activity is generated automatically by automated programs. Thus they would perform poorly or be easily bypassed when attacks are generated by real users working in a crowd-sourcing system. Through measurements, we have found surprising evidence showing that not only do malicious crowd-sourcing systems exist, but they are rapidly growing in both user base and total revenue. We describe in this paper a significant effort to study and understand these crowdturfing systems in today's Internet. We use detailed crawls to extract data about the size and operational structure of these crowdturfing systems. We analyze details of campaigns offered and performed in these sites, and evaluate their end-to-end effectiveness by running active, benign campaigns of our own. Finally, we study and compare the source of workers on crowdturfing sites in different countries. Our results suggest that campaigns on these systems are highly effective at reaching users, and their continuing growth poses a concrete threat to online communities both in the US and elsewhere.
Article
Full-text available
Firms' incentives to manufacture biased user reviews impede review usefulness. We examine the differences in reviews for a given hotel between two sites: Expedia.com (only a customer can post a review) and TripAdvisor.com (anyone can post). We argue that the net gains from promotional reviewing are highest for independent hotels with single-unit owners and lowest for branded chain hotels with multiunit owners. We demonstrate that the hotel neighbors of hotels with a high incentive to fake have more negative reviews on TripAdvisor relative to Expedia; hotels with a high incentive to fake have more positive reviews on TripAdvisor relative to Expedia.
Article
Full-text available
A tension exists between the increasingly rich semantic models in knowledge management systems and the continuing prevalence of human language materials in large organisations. The process of tying semantic models and natural language together is referred to as semantic annotation, which may also be char-acterized as the dynamic creation of bidirectional relationships between ontologies and unstructured and semi-structured documents. Information extraction (IE) takes unseen texts as input and produces fixed-format, unambiguous data as output. It involves processing text to identify selected infor-mation, such as particular named entities or relations among them from text docu-ments. Named entities include people, organizations, locations and so on, while relations typically include physical relations (located, near, part-whole, etc.), per-sonal or social relations (business, family, etc.), and membership (employ-staff, member-of-group, etc.). Ontology-based information extraction (OBIE) can be adapted specifically for semantic annotation tasks. An important difference between traditional IE and OBIE is the latter's closely coupled use of an ontology as one of the system's resources – the ontology serves not only as a schema or list of classifications in the output, but also as input data – its structure affects the training and tagging processes. We present here two ontology-based developments for information extrac-tion. OBIE experiments demonstrate clearly that the integration of ontologies as a knowledge source within HLT applications leads to improved perform-ance. Another important finding is that computational efficiency of the under-lying machine learning methods is especially important for HLT tasks, as the system may need to train hundreds of classifiers depending on the size of the ontology.
Article
Full-text available
Opinionated social media such as product reviews are now widely used by individuals and organizations for their decision making. However, for profit or fame, people try to game the system by opinion spamming (e.g., writing fake reviews) to promote or demote some target products. For reviews to reflect genuine user experiences and opinions, such spam reviews should be detected. Prior work on opinion spam focused on detecting fake reviews and individual fake reviewers. However, a fake reviewer group (a group of reviewers who work collaboratively to write fake reviews) is even more damaging, as its size lets it take total control of the sentiment on the target product. This paper studies spam detection in the collaborative setting, i.e., discovering fake reviewer groups. The proposed method first uses a frequent itemset mining method to find a set of candidate groups. It then uses several behavioral models derived from the collusion phenomenon among fake reviewers, and relation models based on the relationships among groups, individual reviewers, and the products they reviewed, to detect fake reviewer groups. Additionally, we also built a labeled dataset of fake reviewer groups. Although labeling individual fake reviews and reviewers is very hard, to our surprise labeling fake reviewer groups is much easier. We also note that the proposed technique departs from the traditional supervised learning approach for spam detection because the inherent nature of our problem makes the classic supervised learning approach less effective. Experimental results show that the proposed method outperforms multiple strong baselines, including state-of-the-art supervised classification, regression, and learning-to-rank algorithms.
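The candidate-group step can be approximated at the pair level: reviewers who repeatedly review the same products together are collusion candidates. The paper mines frequent itemsets of arbitrary size (e.g. with Apriori or FP-growth); this sketch, with made-up reviewer and product ids, only counts pairs:

```python
from collections import defaultdict
from itertools import combinations

def candidate_groups(reviews, min_support=2):
    """Reviewer pairs that co-reviewed at least `min_support` products.

    reviews: iterable of (reviewer_id, product_id) pairs.
    """
    by_product = defaultdict(set)
    for reviewer, product in reviews:
        by_product[product].add(reviewer)
    support = defaultdict(int)
    for reviewers in by_product.values():
        for pair in combinations(sorted(reviewers), 2):
            support[pair] += 1
    return {pair for pair, s in support.items() if s >= min_support}

# Reviewers "a" and "b" co-review three products; "c" appears only once.
data = [("a", "p1"), ("b", "p1"), ("a", "p2"), ("b", "p2"),
        ("c", "p1"), ("a", "p3"), ("b", "p3")]
pairs = candidate_groups(data)  # {('a', 'b')}
```

The behavioral and relation models would then score these candidates rather than treat co-reviewing alone as evidence of collusion.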
Article
Full-text available
This paper offers a new role engineering approach to Role-Based Access Control (RBAC), referred to as visual role mining. The key idea is to graphically represent user-permission assignments to enable quick analysis and elicitation of meaningful roles. First, we formally define the problem by introducing a metric for the quality of the visualization. Then, we prove that finding the best representation according to the defined metric is a NP-hard problem. In turn, we propose two algorithms: ADVISER and EXTRACT. The former is a heuristic used to best represent the user-permission assignments of a given set of roles. The latter is a fast probabilistic algorithm that, when used in conjunction with ADVISER, allows for a visual elicitation of roles even in absence of pre-defined roles. Besides being rooted in sound theory, our proposal is supported by extensive simulations run over real data. Results confirm the quality of the proposal and demonstrate its viability in supporting role engineering decisions.
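As a rough illustration of matrix-based visual role mining (not the ADVISER heuristic itself, which optimizes a formally defined visualization metric), one can reorder a user-permission matrix so that similar assignment patterns become contiguous blocks; the user and permission names below are hypothetical:

```python
def reorder_matrix(assignments):
    """Group users with similar permission sets into contiguous blocks.

    assignments: dict user -> set of permissions.
    Returns (ordered users, ordered permissions, 0/1 matrix rows).
    """
    perms = sorted({p for ps in assignments.values() for p in ps})
    # Put popular permissions first, then sort users by their bit pattern,
    # so users sharing a role-like pattern end up adjacent.
    perms.sort(key=lambda p: -sum(p in ps for ps in assignments.values()))
    users = sorted(assignments,
                   key=lambda u: [p not in assignments[u] for p in perms])
    matrix = [[1 if p in assignments[u] else 0 for p in perms]
              for u in users]
    return users, perms, matrix

assignments = {"alice": {"read", "write"}, "bob": {"read"},
               "carol": {"read", "write"}, "dave": {"exec"}}
users, perms, matrix = reorder_matrix(assignments)
```

After reordering, alice and carol (who share the same permissions) sit next to each other, making the candidate role visually apparent.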
Conference Paper
Full-text available
Assessing the trustworthiness of reviews is a key issue for the maintainers of opinion sites such as TripAdvisor. In this paper we propose a distortion criterion for assessing the impact of methods for uncovering suspicious hotel reviews in TripAdvisor. The principle is that dishonest reviews will distort the overall popularity ranking for a collection of hotels. Thus a mechanism that deletes dishonest reviews will distort the popularity ranking significantly, when compared with the removal of a similar set of reviews at random. This distortion can be quantified by comparing popularity rankings before and after deletion, using rank correlation. We present an evaluation of this strategy in the assessment of shill detection mechanisms on a dataset of hotel reviews collected from TripAdvisor.
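The distortion criterion can be sketched directly: compare the popularity rankings before and after deleting suspect reviews with a rank correlation, and normalize so that 0 means no distortion. Kendall's tau is used here as an illustrative choice; the paper's exact correlation measure may differ:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two orderings of the same items."""
    pos_a = {x: i for i, x in enumerate(rank_a)}
    pos_b = {x: i for i, x in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        s = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) / 2
    return (concordant - discordant) / n_pairs

def distortion(ranking_before, ranking_after):
    """0 = ranking unchanged, 1 = ranking fully reversed."""
    return (1 - kendall_tau(ranking_before, ranking_after)) / 2

before = ["h1", "h2", "h3", "h4"]
after = ["h2", "h1", "h3", "h4"]   # deleting reviews swapped two hotels
d = distortion(before, after)
```

A shill-detection mechanism is judged effective if its deletions distort the ranking much more than deleting the same number of random reviews.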
Conference Paper
Full-text available
In the past few years, sentiment analysis and opinion mining have become popular and important tasks. These studies all assume that their opinion resources are real and trustworthy. However, they may encounter the problem of fake opinions, or opinion spam. In this paper, we study this issue in the context of our product review mining system. On product review sites, people may write fake reviews, called review spam, to promote their products or to defame their competitors' products. It is important to identify and filter out such review spam. Previous work focused only on heuristic rules, such as helpfulness voting or rating deviation, which limits the performance of this task. In this paper, we exploit machine learning methods to identify review spam. To this end, we manually build a spam collection from our crawled reviews. We first analyze the effect of various features in spam identification. We also observe that a review spammer consistently writes spam. This provides another view on the problem: we can identify whether the author of a review is a spammer. Based on this observation, we provide a two-view semi-supervised method, co-training, to exploit the large amount of unlabeled data. The experimental results show that our proposed method is effective and achieves significant improvements over the heuristic baselines.
Conference Paper
Full-text available
Online reviews provide valuable information about products and services to consumers. However, spammers are joining the community trying to mislead readers by writing fake reviews. Previous attempts for spammer detection used reviewers' behaviors, text similarity, linguistics features and rating patterns. Those studies are able to identify certain types of spammers, e.g., those who post many similar reviews about one target entity. However, in reality, there are other kinds of spammers who can manipulate their behaviors to act just like genuine reviewers, and thus cannot be detected by the available techniques. In this paper, we propose a novel concept of a heterogeneous review graph to capture the relationships among reviewers, reviews and stores that the reviewers have reviewed. We explore how interactions between nodes in this graph can reveal the cause of spam and propose an iterative model to identify suspicious reviewers. This is the first time such intricate relationships have been identified for review spam detection. We also develop an effective computation method to quantify the trustiness of reviewers, the honesty of reviews, and the reliability of stores. Different from existing approaches, we don't use review text information. Our model is thus complementary to existing approaches and able to find more difficult and subtle spamming activities, which are agreed upon by human judges after they evaluate our results.
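A minimal sketch of the iterative idea, with the three quantities mutually defined: a store's reliability is a trust-weighted average of its scores, and a reviewer is trusted if their scores track store reliability. The update rules and initial values below are simplified assumptions, not the paper's exact model:

```python
def propagate_trust(reviews, iters=20):
    """Iteratively estimate reviewer trust and store reliability.

    reviews: list of (reviewer, store, score) with score in [0, 1].
    """
    reviewers = {r for r, _, _ in reviews}
    stores = {s for _, s, _ in reviews}
    trust = {r: 1.0 for r in reviewers}
    reliability = {s: 0.5 for s in stores}
    for _ in range(iters):
        # Store reliability: trust-weighted average of its scores.
        for s in stores:
            num = sum(trust[r] * sc for r, st, sc in reviews if st == s)
            den = sum(trust[r] for r, st, sc in reviews if st == s)
            reliability[s] = num / den if den else 0.5
        # Reviewer trust: mean honesty, i.e. closeness to reliability.
        for r in reviewers:
            hs = [1 - abs(sc - reliability[st])
                  for rv, st, sc in reviews if rv == r]
            trust[r] = sum(hs) / len(hs)
    return trust, reliability

# Two consistent reviewers vs. one deviating reviewer of the same store.
reviews = [("a", "s", 0.9), ("b", "s", 0.9), ("z", "s", 0.1)]
trust, reliability = propagate_trust(reviews)
```

The deviating reviewer "z" ends up with lower trust, and the store's estimated reliability moves toward the consensus score.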
Article
Full-text available
We initiate a systematic study to help distinguish a special group of online users, called hidden paid posters, or termed "Internet water army" in China, from the legitimate ones. On the Internet, the paid posters represent a new type of online job opportunity. They get paid for posting comments and new threads or articles on different online communities and websites for some hidden purposes, e.g., to influence the opinion of other people towards certain social events or business markets. Though an interesting strategy in business marketing, paid posters may create a significant negative effect on the online communities, since the information from paid posters is usually not trustworthy. When two competitive companies hire paid posters to post fake news or negative comments about each other, normal online users may feel overwhelmed and find it difficult to put any trust in the information they acquire from the Internet. In this paper, we thoroughly investigate the behavioral pattern of online paid posters based on real-world trace data. We design and validate a new detection mechanism, using both non-semantic analysis and semantic analysis, to identify potential online paid posters. Our test results with real-world datasets show a very promising performance.
Article
Full-text available
Motivation: Most approaches to gene expression analysis use real-valued expression data, produced by high-throughput screening technologies such as microarrays. Often, some measure of similarity must be computed in order to extract meaningful information from the observed data. The choice of this similarity measure frequently has a profound effect on the results of the analysis, yet no standards exist to guide the researcher. Results: To address this issue, we propose to analyse gene expression data entirely in the binary domain. The natural measure of similarity becomes the Hamming distance and reflects the notion of similarity used by biologists. We also develop a novel data-dependent, optimization-based method, based on Genetic Algorithms (GAs), for normalizing gene expression data. This is a necessary step before quantizing gene expression data into the binary domain and, generally, for comparing data between different arrays. We then present an algorithm for binarizing gene expression data and illustrate the use of the above methods on two different sets of data. Using Multidimensional Scaling, we show that a reasonable degree of separation between different tumor types in each data set can be achieved by working solely in the binary domain. The binary approach offers several advantages, such as noise resilience and computational efficiency, making it a viable approach to extracting meaningful biological information from gene expression data.
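The binarization-plus-Hamming pipeline can be sketched as follows; thresholding at the median is a simplifying assumption standing in for the paper's GA-based normalization:

```python
def binarize(values, threshold=None):
    """Quantize real-valued expression levels to 0/1.

    With no explicit threshold, the median is used, so roughly half the
    genes are called "on" (1) and half "off" (0).
    """
    if threshold is None:
        ordered = sorted(values)
        mid = len(ordered) // 2
        if len(ordered) % 2 == 0:
            threshold = (ordered[mid - 1] + ordered[mid]) / 2
        else:
            threshold = ordered[mid]
    return [1 if v > threshold else 0 for v in values]

def hamming(a, b):
    """Number of positions where two binary expression profiles differ."""
    return sum(x != y for x, y in zip(a, b))

profile1 = binarize([2.1, 0.3, 5.0, 0.1])  # -> [1, 0, 1, 0]
profile2 = [1, 1, 1, 0]
dist = hamming(profile1, profile2)
```

The Hamming distances between binarized profiles can then drive downstream steps such as Multidimensional Scaling.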
Article
Full-text available
Given a collection of fossil sites with data about the taxa that occur in each site, the task in biochronology is to find good estimates for the ages or ordering of sites. We describe a full probabilistic model for fossil data. The parameters of the model are natural: the ordering of the sites, the origination and extinction times for each taxon, and the probabilities of different types of errors. We show that the posterior distributions of these parameters can be estimated reliably by using Markov chain Monte Carlo techniques. The posterior distributions of the model parameters can be used to answer many different questions about the data, including seriation (finding the best ordering of the sites) and outlier detection. We demonstrate the usefulness of the model and estimation method on synthetic data and on real data on large late Cenozoic mammals. As an example, for the sites with large number of occurrences of common genera, our methods give orderings, whose correlation with geochronologic ages is 0.95.
Article
Full-text available
The set of frequent closed itemsets uniquely determines the exact frequency of all itemsets, yet it can be orders of magnitude smaller than the set of all frequent itemsets. In this paper, we present CHARM, an efficient algorithm for mining all frequent closed itemsets. It enumerates closed sets using a dual itemset-tidset search tree, using an efficient hybrid search that skips many levels. It also uses a technique called diffsets to reduce the memory footprint of intermediate computations. Finally, it uses a fast hash-based approach to remove any "nonclosed" sets found during computation. We also present CHARM-L, an algorithm that outputs the closed itemset lattice, which is very useful for rule generation and visualization. An extensive experimental evaluation on a number of real and synthetic databases shows that CHARM is a state-of-the-art algorithm that outperforms previous methods. Further, CHARM-L explicitly generates the frequent closed itemset lattice.
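A closed itemset is a frequent itemset with no superset of equal support. A brute-force check of that definition (nothing like CHARM's dual itemset-tidset search tree or diffsets, but useful to pin the concept down) might look like:

```python
from itertools import combinations

def frequent_closed_itemsets(transactions, min_support):
    """Exhaustively enumerate frequent closed itemsets of a tiny database.

    transactions: list of item sets. Returns {itemset: support}.
    """
    items = sorted({i for t in transactions for i in t})

    def tidset(itemset):
        # Ids of the transactions containing every item of the itemset.
        return frozenset(idx for idx, t in enumerate(transactions)
                         if itemset <= t)

    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            tids = tidset(s)
            if len(tids) >= min_support:
                frequent[s] = tids
    # Closed: no proper superset occurs in exactly the same transactions.
    return {s: len(t) for s, t in frequent.items()
            if not any(s < s2 and t == t2 for s2, t2 in frequent.items())}

db = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
closed = frequent_closed_itemsets(db, 2)
```

Here {b} and {c} are frequent but not closed, because {a,b} and {a,c} occur in exactly the same transactions, so the closed sets fully determine all frequencies.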
Article
Crowdturfing has recently been identified as a sinister counterpart to the enormous positive opportunities of crowdsourcing. Crowdturfers leverage human-powered crowdsourcing platforms to spread malicious URLs in social media, form "astroturf" campaigns, and manipulate search engines, ultimately degrading the quality of online information and threatening the usefulness of these systems. In this paper we present a framework for "pulling back the curtain" on crowdturfers to reveal their underlying ecosystem. Concretely, we analyze the types of malicious tasks and the properties of requesters and workers in crowdsourcing sites such as Microworkers.com, ShortTask.com and Rapidworkers.com, and link these tasks (and their associated workers) on crowdsourcing sites to social media, by monitoring the activities of social media participants. Based on this linkage, we identify the relationship structure connecting these workers in social media, which can reveal the implicit power structure of crowdturfers identified on crowdsourcing sites. We identify three classes of crowdturfers -- professional workers, casual workers, and middlemen -- and we develop statistical user models to automatically differentiate these workers and regular social media users.
Article
In this paper, we consider 0/1 databases and provide an alternative way of extracting knowledge from such databases using tiles. A tile is a region in the database consisting solely of ones. The interestingness of a tile is measured by the number of ones it consists of, i.e., its area. We present an efficient method for extracting all tiles with area at least a given threshold. A collection of tiles constitutes a tiling. We regard tilings that have a large area and consist of a small number of tiles as appealing summaries of the large database. We analyze the computational complexity of several algorithmic tasks related to finding such tilings. We develop an approximation algorithm for finding tilings which approximates the optimal solution within reasonable factors. We present a preliminary experimental evaluation on real data sets.
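The notions of tile and area can be made concrete with an exhaustive search over column subsets. This is exponential and only viable for toy matrices, whereas the paper develops approximation algorithms for realistic sizes:

```python
from itertools import combinations

def max_area_tile(matrix):
    """Largest all-ones tile (row subset x column subset) of a 0/1 matrix.

    Returns (area, rows, cols). Exhaustive over column subsets: for each
    subset, the eligible rows are exactly those that are all ones there.
    """
    n_cols = len(matrix[0])
    best = (0, (), ())
    for k in range(1, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            rows = tuple(i for i, row in enumerate(matrix)
                         if all(row[c] == 1 for c in cols))
            area = len(rows) * len(cols)
            if area > best[0]:
                best = (area, rows, cols)
    return best

m = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
best = max_area_tile(m)  # area 4: rows {0, 1} x columns {0, 1}
```

A greedy tiling would repeatedly extract such a tile and continue on the remaining ones, which is the flavor of summary the paper evaluates.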
Article
Visual analytics employs interactive visualizations to integrate users' knowledge and inference capability into numerical/algorithmic data analysis processes. It is an active research field that has applications in many sectors, such as security, finance, and business. The growing popularity of visual analytics in recent years creates the need for a broad survey that reviews and assesses the recent developments in the field. This report reviews and classifies recent work into a set of application categories including space and time, multivariate, text, graph and network, and other applications. More importantly, this report presents the analytics space, inspired by the notion of a design space, which relates each application category to the key steps in visual analytics, including visual mapping, model-based analysis, and user interactions. We explore and discuss the analytics space to add to the current understanding of the field and to better identify its research trends.
Conference Paper
In Chap. 9, we studied the extraction of structured data from Web pages. The Web also contains a huge amount of information in unstructured texts. Analyzing these texts is of great importance as well and perhaps even more important than extracting structured data because of the sheer volume of valuable information of almost any imaginable type contained in text. In this chapter, we only focus on mining opinions which indicate positive or negative sentiments. The task is technically challenging and practically very useful. For example, businesses always want to find public or consumer opinions on their products and services. Potential customers also want to know the opinions of existing users before they use a service or purchase a product.
Article
The current centralized application (or app) markets provide convenient ways to distribute mobile apps. Their vendors maintain rating systems, which allow customers to leave ratings and reviews. Since positive ratings and reviews can lead to more downloads/installations and hence more monetary benefit, the rating systems have become a target of manipulation by collusion groups hired by app developers. In this paper, we thoroughly analyze the features of hidden collusion groups and propose a novel method called GroupTie to narrow down the suspect list of collusive reviewers for further investigation by app stores. As members of a hidden collusion group have to work together frequently and their ratings often deviate from the apps' quality, collusive actions enhance their relation over time. We build a relation graph named a tie graph and detect collusion groups by applying graph clustering. Simulation results show that the precision of GroupTie approaches 99.70% and the recall is about 91.50%. We also apply our method to detect hidden collusion groups among the reviewers of 89 apps in Apple's China App Store. A large number of reviewers are discovered to belong to one large collusion group and several small groups.
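The tie-building intuition can be sketched as follows: connect reviewers whose ratings of the same app both deviate strongly from the app's estimated quality, and take connected components of repeated co-deviations as suspect groups. The deviation threshold, quality estimates, and component-based clustering here are simplified assumptions, not GroupTie's actual edge weights or algorithm:

```python
from collections import defaultdict

def tie_graph_groups(ratings, quality, deviation=1.5, min_ties=2):
    """Suspect reviewer groups from repeated joint rating deviations.

    ratings: list of (reviewer, app, stars); quality: dict app -> stars.
    """
    by_app = defaultdict(list)
    for reviewer, app, stars in ratings:
        if abs(stars - quality[app]) >= deviation:
            by_app[app].append(reviewer)
    ties = defaultdict(int)
    for reviewers in by_app.values():
        for i, u in enumerate(reviewers):
            for v in reviewers[i + 1:]:
                ties[tuple(sorted((u, v)))] += 1
    # Union-find over edges with enough co-deviations.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (u, v), n in ties.items():
        if n >= min_ties:
            parent[find(u)] = find(v)
    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return [g for g in groups.values() if len(g) > 1]

quality = {"a1": 2, "a2": 2}
ratings = [("x", "a1", 5), ("y", "a1", 5), ("x", "a2", 5),
           ("y", "a2", 5), ("z", "a1", 2)]
groups = tie_graph_groups(ratings, quality)  # [{'x', 'y'}]
```

Reviewer "z" rates in line with app quality and never joins a group, while "x" and "y" are tied by two joint deviations.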
Conference Paper
How can web services that depend on user generated content discern fraudulent input by spammers from legitimate input? In this paper we focus on the social network Facebook and the problem of discerning ill-gotten Page Likes, made by spammers hoping to turn a profit, from legitimate Page Likes. Our method, which we refer to as CopyCatch, detects lockstep Page Like patterns on Facebook by analyzing only the social graph between users and Pages and the times at which the edges in the graph (the Likes) were created. We offer the following contributions: (1) We give a novel problem formulation, with a simple concrete definition of suspicious behavior in terms of graph structure and edge constraints. (2) We offer two algorithms to find such suspicious lockstep behavior - one provably-convergent iterative algorithm and one approximate, scalable MapReduce implementation. (3) We show that our method severely limits "greedy attacks" and analyze the bounds from the application of the Zarankiewicz problem to our setting. Finally, we demonstrate and discuss the effectiveness of CopyCatch at Facebook and on synthetic data, as well as potential extensions to anomaly detection problems in other domains. CopyCatch is actively in use at Facebook, searching for attacks on Facebook's social graph of over a billion users, many millions of Pages, and billions of Page Likes.
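Lockstep behavior can be illustrated at the pairwise level: flag user pairs who Liked several common Pages within a short time window. CopyCatch solves a much harder clustering problem over the full bipartite graph with provable guarantees; the window and threshold below are arbitrary illustrative values:

```python
from collections import defaultdict

def lockstep_candidates(likes, min_shared_pages=2, window=3600):
    """User pairs who Liked >= min_shared_pages common Pages within
    `window` seconds of each other.

    likes: iterable of (user, page, timestamp) triples.
    """
    by_page = defaultdict(list)
    for user, page, ts in likes:
        by_page[page].append((user, ts))
    shared = defaultdict(set)
    for page, entries in by_page.items():
        for i, (u1, t1) in enumerate(entries):
            for u2, t2 in entries[i + 1:]:
                if u1 != u2 and abs(t1 - t2) <= window:
                    shared[tuple(sorted((u1, u2)))].add(page)
    return {pair for pair, pages in shared.items()
            if len(pages) >= min_shared_pages}

likes = [("u1", "pA", 0), ("u2", "pA", 100), ("u1", "pB", 500),
         ("u2", "pB", 600), ("u3", "pA", 999999)]
suspects = lockstep_candidates(likes)  # {('u1', 'u2')}
```

u3 Likes one of the same Pages but far outside the window, so only the temporally correlated pair is flagged.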
Article
This paper reports initial findings from a study that used quantitative and qualitative research methods and custom-built software to investigate online economies of reputation and user practices in online product reviews at several leading ecommerce sites (primarily Amazon.com). We explore several cases in which book and CD reviews were copied in part or in whole from one item to another and show that hundreds of product reviews on Amazon.com might be copies of one another. We further explain the strategies involved in these suspect product reviews, and the ways in which the collapse of the barriers between authors and readers affect the ways in which these information goods are being produced, and exchanged. We report on techniques that are employed by authors, artists, editors, and readers to ensure they promote their agendas while they build their identities as experts. We suggest a framework for discussing the changes of the categories of authorship, creativity, expertise, and reputation that are being re-negotiated in this multi-tier reputation economy.
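Copied-review detection of the kind described can be sketched with word shingles and Jaccard similarity; the shingle length and threshold are illustrative choices, not the study's actual software:

```python
def shingles(text, k=3):
    """Set of k-word shingles from a review's text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap of two shingle sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(reviews, threshold=0.5, k=3):
    """All review-id pairs whose shingle overlap exceeds `threshold`.

    reviews: dict review_id -> text.
    """
    sh = {rid: shingles(text, k) for rid, text in reviews.items()}
    ids = sorted(sh)
    return [(x, y) for i, x in enumerate(ids) for y in ids[i + 1:]
            if jaccard(sh[x], sh[y]) >= threshold]

reviews = {
    "r1": "a wonderful book that everyone should read this year",
    "r2": "a wonderful book that everyone should read this year too",
    "r3": "the plot was slow and the characters were flat",
}
dupes = near_duplicates(reviews)  # [('r1', 'r2')]
```

Reviews copied in part or in whole across items would surface as high-overlap pairs, the phenomenon the study observed at scale on Amazon.com.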
Article
BicOverlapper is a tool to visualize biclusters from gene-expression matrices in a way that helps to compare biclustering methods, to unravel trends and to highlight relevant genes and conditions. A visual approach can complement biological and statistical analysis and reduce the time spent by specialists interpreting the results of biclustering algorithms. The technique is based on a force-directed graph where biclusters are represented as flexible overlapped groups of genes and conditions. Availability: The BicOverlapper software and supplementary material are available at http://vis.usal.es/bicoverlapper Contact: rodri{at}usal.es
Article
We study three corporate nonmarket strategies designed to influence the lobbying behavior of other special interest groups: (1) astroturf, in which the firm covertly subsidizes a group with similar views to lobby when it normally would not; (2) the bear hug, in which the firm overtly pays a group to alter its lobbying activities; and (3) self-regulation, in which the firm voluntarily limits the potential social harm from its activities. All three strategies reduce the informativeness of lobbying, and all reduce the payoff of the public decision-maker. We show that the decision-maker would benefit by requiring the public disclosure of funds spent on astroturf lobbying but that the availability of alternative influence strategies limits the impact of such a policy. Copyright Blackwell Publishing 2004.
Article
Researchers have made significant progress in disciplines such as scientific and information visualization, statistically based exploratory and confirmatory analysis, data and knowledge representations, and perceptual and cognitive sciences. Although some research is being done in this area, the pace at which new technologies and technical talents are becoming available is far too slow to meet the urgent need. National Visualization and Analytics Center's goal is to advance the state of the science to enable analysts to detect the expected and discover the unexpected from massive and dynamic information streams and databases consisting of data of multiple types and from multiple sources, even though the data are often conflicting and incomplete. Visual analytics is a multidisciplinary field that includes the following focus areas: (i) analytical reasoning techniques, (ii) visual representations and interaction techniques, (iii) data representations and transformations, (iv) techniques to support production, presentation, and dissemination of analytical results. The R&D agenda for visual analytics addresses technical needs for each of these focus areas, as well as recommendations for speeding the movement of promising technologies into practice. This article provides only the concise summary of the R&D agenda. We encourage reading, discussion, and debate as well as active innovation toward the agenda for visual analysis.
M. Ott, Y. Choi, C. Cardie, and J. T. Hancock. Finding deceptive opinion spam by any stretch of the imagination. In Proc. 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 309-319, 2011.