A platform for real-time opinion mining from social media and news streams

Nikos Tsirakis, Vasilis Poulopoulos,
Panagiotis Tsantilas
Palo LTD
Kokkoni Corinthias P/C 20002, Greece
{nt,pv,pt}@paloservices.com
Iraklis Varlamis
Dept. of Informatics and Telematics,
Harokopio University of Athens
Omirou 9, Tavros, Greece
varlamis@hua.gr
Abstract: Big data is a popular term used to describe the
exponential growth and availability of data, both structured and
unstructured. It is commonly characterized by three Vs: big
data is not just about Volume, but also about Velocity and
Variety. The demand for stream processing is growing, because
processing big volumes of data is often not enough on its own.
The increasing amount of opinionated data
that is published in social media, in combination with the variety
of data sources, has created a demanding ecosystem for stream
processing. To deliver high-quality
knowledge extraction services, several tasks of high complexity
must be accomplished, and the existing solutions and
architectures are not sufficient for processing huge volumes of
streamed data. When opinion mining is applied to social media in
order to cover the needs of businesses for brand monitoring,
heterogeneous data has to be processed fast, so that a firm can
react to changing business conditions in real time.
Keywords: opinion mining; news streams; social media
I. INTRODUCTION
The growing enthusiasm for social media has created a new
channel for companies that want to advertise their products
and services, or simply want to boost and monitor their brand
name. This has resulted in large amounts of data, which are
created daily in various social media and the news and contain
mentions of products and companies.
These data can be textual or audiovisual, can be presented
in a formal (e.g. product reviews) or informal way (e.g.
comments), can be neutral mentions or carry an opinion about
the company or product or an aspect of it [1][2].
The volume and complexity of the data that can be
acquired, stored and manipulated have created a flood of data:
90 per cent of all data were generated in the last two years
[3]. This creates a big challenge for companies that provide
social media analytics services and have to cope with data from
multiple data streams.
The notion of Big Data applies perfectly in this case,
with all the core big data issues impeding the task of useful
knowledge extraction in real time: scale, heterogeneity,
timeliness, complexity etc. The issues that must be confronted
cover the whole processing pipeline, starting from data
acquisition, when a decision must be made on what data to
keep and what to discard, how to store it and for how long.
The value of data explodes when they can be linked,
compared and analyzed on common ground. The problem
that arises here is the heterogeneity between unstructured text
data from social feeds such as Twitter and Facebook,
structured product reviews and ratings from sites such as
Epinions and lengthy news articles that contain references to
images and other multimedia. The gathering of images, videos
and other multimedia content is a decision that must be taken
with care, since their processing for knowledge extraction is a
hard task and their requirements for storage are increased. A
major challenge here is to integrate content and bring
everything in a common form for analysis and presentation.
Data analysis is the next bottleneck, since traditional
algorithms lack scalability and do not easily adapt to the
complexity of the data that needs to be analyzed. Finally, the
presentation of the extracted knowledge must be carefully
designed so that the results can be interpreted directly by
non-technical domain experts and assist them in obtaining valuable,
actionable knowledge.
Palo Ltd is a company specializing in information
extraction from the web. It started by gathering content from
news sites and blogs in Greece, about 5 years ago. The
analysis was limited to clustering articles based on content
similarity and presenting them in an aggregated form to the
end users. PaloPro is Palo’s social media analytics service,
which was launched primarily in Greece but now expands to
Serbia, Cyprus, Turkey and Romania. The service monitors
and analyzes data from the web and social media, giving
emphasis to entity extraction and sentiment analysis from text.
Within the same architecture, several modules for crawling, feed
aggregation, text clustering, multi-document summarization,
Named Entity Recognition, aspect extraction and opinion
mining make up the ecosystem of Palo services.
PaloPro can be described as a business intelligence
platform with social basis (social business intelligence) [4]
that takes advantage of the knowledge of the crowd (crowd
sourcing) as expressed in social media. The benefit is both for
companies, which are able to monitor the popularity of their
products and for buyers who receive long-term improved
services and products. Interest in such a platform is
high: for example, the mobile phone industry in Greece
numbers 13 million active subscribers, who are active on the
internet and comment on the products of the three main
competitors (packages, special offers, etc.). Although
information about the popularity of each of the three partners
has little value, knowledge about the course of their products
in social media and the opinion formed from every new
movement is valuable for any further advertisement campaign.
Section II reviews the scientific background and competing
systems. Section III provides an overview of the PaloPro service
and the infrastructure that supports it. Section IV discusses the
processing pipeline in more detail, and Section V summarizes the
open issues concerning the processing of big data in a real-time
environment.
II. BACKGROUND
A. Scientific background
The scientific interest in opinion mining and analysis is
huge and is reflected in numerous publications in leading
scientific conferences and journals in IT and marketing. The
concept of opinion mining for different aspects of an entity
(aspect-based sentiment mining) appeared in the literature in 2009.
Early research focused on multi-aspect entities such as movies
[5] (and the opinions provided in viewers' comments on
the different aspects that make up the final result: actors,
director, screenplay, music, etc.), electronic devices [6] and
hotels [7]. The main principles behind these works were: a)
extracting opinions or emotions and b) labeling entities and
their aspects (head terms) and the words that convey emotion
(modifiers). To identify the sentiment (or polarity) of a
comment, researchers used emotion dictionaries (comprising
mostly adjectives) [8], [9], statistical techniques based on the
co-occurrence of head terms and modifiers, classification
techniques such as SVM, Naïve Bayes, Maximum Entropy
etc. [10] and, in some cases, semantic and syntactic analyzers
[11], [12].
Moghaddam and Ester [13] gave a new impetus to opinion
mining on individual properties of commercial products from
customer reviews. A typical example is the evaluation of a
photo camera, where users evaluate separately the ease of use,
the image quality, the shutter lag and battery duration, and,
alongside their comments, attach a positive or negative score for
each aspect. This approach gives a new dimension to the
problem of extracting knowledge from texts, adding
granularity levels to the opinion or sentiment expressed in a text.
The fact that these opinion mining techniques were
applied to commercial products has increased the interest of
marketers and brand makers who want to manage the image of
a product or company on the market and understand the
preferences of potential customers. An interesting analysis of
the economic power of the comments filed by users on
product review sites has been carried out by Ghose et al. [14], who
studied the effect of good or bad reviews on the price of a
product sold on Amazon. The immediate consequence of the
increasing power of comments and opinions over
commercial products is the appearance of malicious comments
(spam) with positive or negative orientation that aim to alter
the real image of a product [15], [16].
B. Competitive systems
The strong interest of businesses is reflected in a number of
commercial tools that provide market analysis and monitoring
services. Blogmeter¹ is one such product
that has been developed by the Italian company CELI and
adapted for specific markets (telephony, food, fashion, etc.). It
offers tools for monitoring and reporting the image of
companies, products or services in social media such as
Facebook, Twitter, Google+, Pinterest etc. However, the system
requires that the company has a profile in the respective social
media and focuses only on analyzing the information posted
on the respective pages of the companies in each medium
(e.g. the likes, followers, retweets, pins, etc. of each
company). SentiMeter² is a tool that gathers data from
Twitter, Facebook, YouTube, Google+, Digg, Blogger,
Tumblr and other ATOM and RSS feeds, and then allows
users to create and monitor their own campaigns. It also
allows creating reports and controlling user access to them. The
Sentimarket³ API is yet another tool for analyzing content
derived from social media, which allows monitoring of a
market, alert creation, etc.
Other competitors in social media monitoring include
Brandwatch⁴, Sysomos⁵, Trackur⁶ and Engagor⁷. Their main
characteristics are: a) they primarily collect English content or
content in a single language and do not provide
cross-country social media monitoring solutions, b) they mainly
focus on social media monitoring and primarily target market
analysts with a good technological background, c) they provide
tools for monitoring the effect of ad campaigns on the brand's
image in social media but do not offer tools for interactive
campaign management.
C. Advantages of PaloPro
A major disadvantage of all the above tools is that they do
not support prioritization of sources. Not all references to a
company or product contribute equally to the overall
sentiment, and by incorporating the importance or influence of
each source we get better insights into how public opinion
will evolve. This prioritization of sources is done automatically
in PaloPro, by employing the metadata collected from
social media sources concerning users and their buzz in the
community.
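The effect of source prioritization on the overall sentiment can be illustrated with an influence-weighted average. This is only a minimal sketch: the function name `weighted_sentiment`, the polarity encoding (-1, 0, +1) and the influence weights are assumptions made for illustration and do not describe the scoring actually used in PaloPro.

```python
def weighted_sentiment(mentions):
    """Influence-weighted mean polarity (hypothetical scoring scheme).

    `mentions` is a list of (polarity, influence) pairs, with polarity in
    {-1, 0, +1} and influence a non-negative weight for the source.
    """
    total_weight = sum(weight for _, weight in mentions)
    if total_weight == 0:
        return 0.0
    return sum(polarity * weight for polarity, weight in mentions) / total_weight


# Two positive mentions from low-influence sources and one negative
# mention from a highly influential source (weights are made up).
mentions = [(+1, 1.0), (+1, 1.0), (-1, 8.0)]
score = weighted_sentiment(mentions)
```

With equal weights the three mentions would average to a mildly positive score, whereas the weighted score is negative, which is the kind of difference that prioritizing sources is meant to expose.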
An equally important shortcoming of competing tools is that they
do not analyze the factors that lead to the increase or decrease
in the popularity of an entity in social media. Although
internationally there are several sites where consumers can
comment on the products that interest them (e.g. epinions.com,
amazon.com, rateitall.com) and evaluate individual features or
services (e.g. tripadvisor.com), the existing social media
analysis tools do not provide such detail. The aspect extraction
that is performed in PaloPro allows us to perform aspect-based
sentiment analysis and provides the tools for in-depth analysis
of results.
1 http://www.blogmeter.eu
2 https://sentimeter.com/
3 http://www.sentimarket.com
4 http://www.brandwatch.com
5 http://www.sysomos.com/
6 http://www.trackur.com/
7 https://engagor.com/
Figure 1. The architecture of PaloPro
Finally, one of the key advantages of Palo is its crawling
and indexing mechanism and the efficient language-agnostic
techniques for Sentiment Analysis and Entity Recognition,
which allow us to deploy a clone of the PaloPro services in a new
country in a four-month cycle. Shifting to a new language (or
a country with multiple languages) is a problem that we
have already faced in Palo: we have developed
solutions for Serbia (palo.rs), Cyprus (palo.com.cy) and Greece
(palo.gr) and we are deploying our services to Turkey and
Romania. Our NER and Sentiment Analysis tools are based on
a unique knowledge building infrastructure, which exploits
open multilingual resources (e.g. Wikipedia) that are
available in almost any language, together with probabilistic (n-gram
based) language-agnostic techniques, and can be deployed and
fine-tuned with minimal user effort.
III. AN OVERVIEW OF PALOPRO
A. PaloPro media analytics services
The PaloPro service provides a tool for monitoring and analysis
of opinions about user-defined entities (e.g. products, persons,
locations), thus creating a Reputation Management System.
The user can view, in real time, the source
of the buzz, the parameters that affect the positive, negative or
neutral reputation of an organization, brand or person
and, ultimately, the overall sentiment polarity and trend on the
Web. This is achieved by gathering and processing all
references through natural language technologies that extract
entities and opinions about these entities. Since this is a commercial
subscription service, the requirements for accurate results are
high for the underlying linguistic processing infrastructure,
which aims at an accuracy of over 87% for both the
named-entity recognition and the polarity detection tasks.
The data are collected, filtered and processed by a set of
crawlers, which aggregate data from different sources,
including traditional news sites, blogs, forums, video
comments and social media such as Twitter and Facebook
posts and comments. The crawling and storage procedure is
fully distributed and controlled in such a way that the system
can provide near real-time analysis to the end user. In order
to achieve this, the crawling controller adjusts the frequency
of visits to each source and prioritizes sources that have a
higher update frequency. As a result, it provides an efficient
way to instantly locate and retrieve new content. Multiple
layers of spam filtering are deployed to ensure that clean data
are provided to the analysis modules. The number of
documents crawled in a typical day usually exceeds 3 million.
The lengthiest documents are collected from
thousands of different websites, while a huge number of short
texts come from social media networks and specifically from
Twitter. All the content that is collected is categorized into a
predefined set of news domains and is ranked for
importance based on a predefined ranking of
sources (e.g. news portals are ranked higher than blogs).
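The adaptive visiting policy described above can be sketched with a priority queue keyed on the next planned visit time. The class names, the halving and doubling policy and the interval bounds below are hypothetical choices for illustration; the paper does not specify how the crawling controller of Palo computes its visit frequencies.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledVisit:
    next_visit: float                        # time (seconds) of the next planned visit
    source: str = field(compare=False)       # source identifier, ignored when ordering
    interval: float = field(compare=False)   # current revisit interval in seconds


class CrawlScheduler:
    """Toy adaptive scheduler: sources that yield new content are revisited sooner."""

    def __init__(self, min_interval=60.0, max_interval=3600.0):
        self.min_interval = min_interval
        self.max_interval = max_interval
        self._queue = []

    def add_source(self, source, interval, now=0.0):
        heapq.heappush(self._queue, ScheduledVisit(now + interval, source, interval))

    def next_due(self):
        """Pop the source whose planned visit time is earliest."""
        return heapq.heappop(self._queue)

    def report(self, visit, found_new_content):
        """Reschedule a visited source, halving or doubling its interval."""
        if found_new_content:
            interval = max(self.min_interval, visit.interval / 2)
        else:
            interval = min(self.max_interval, visit.interval * 2)
        heapq.heappush(
            self._queue,
            ScheduledVisit(visit.next_visit + interval, visit.source, interval),
        )
        return interval
```

A source that keeps producing new items converges towards `min_interval`, while a quiet source drifts towards `max_interval`, so crawling capacity is spent mostly on frequently updated sources.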
The concept that dominates the design of PaloPro is the
"workspace", a dashboard that contains visualizations of
information collected for an entity of interest (e.g. a brand
name and its core products). The reputation of an entity is
measured on a set of user-selectable entities or user-specified
keywords, which are monitored across the different news sites
and social media. To create a new workspace, the user
selects one or more persons, companies, locations,
brands, or product names from a large database of monitored
entities, and/or defines a set of keywords, in case an entity is
not contained in the database of monitored objects. The user
may define any number of workspaces, all of which are visible
when the user logs into the system.
Through the system, a user can access information related
to different entities or keywords such as persons,
organizations, companies, brands, products, events etc. that
are monitored by the system in the crawled corpora, along
with aspect information about them. Automated alerts can be
set up so that the service may deliver instant notifications
whenever the data matches some predefined, user-specified
criteria, as new information is extracted or when the extracted
information exceeds certain user-configurable thresholds.
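A threshold-based alert of the kind described above might look like the following sketch. The `AlertRule` fields and the `triggered_alerts` helper are hypothetical names introduced only for illustration; the actual alert criteria supported by PaloPro are not detailed in this paper.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    entity: str      # monitored entity the rule applies to
    polarity: str    # "positive", "negative" or "neutral"
    threshold: int   # fire when the count in the window exceeds this value


def triggered_alerts(rules, mentions):
    """Return the rules whose threshold is exceeded by the given mentions.

    `mentions` is an iterable of (entity, polarity) pairs for one time window.
    """
    counts = {}
    for entity, polarity in mentions:
        counts[(entity, polarity)] = counts.get((entity, polarity), 0) + 1
    return [r for r in rules if counts.get((r.entity, r.polarity), 0) > r.threshold]


# Example window: three negative mentions of one brand, two of another.
rules = [AlertRule("BrandA", "negative", 2), AlertRule("BrandB", "negative", 2)]
window = [("BrandA", "negative")] * 3 + [("BrandB", "negative")] * 2
fired = triggered_alerts(rules, window)
```

Only the rule for the first brand fires here, because its count strictly exceeds the configured threshold.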
B. Infrastructure
PaloPro follows the major requirements of real-time
stream processing set out by [17] by using advanced custom
coding and infrastructure software.
In order to support the data collection, storage and real-
time procedures, Palo has a complex multi-level infrastructure,
which consists of crawling and analysis servers, database
(SQL and NoSQL) servers, web servers, caching and load
balancing servers. Figure 1 depicts the generic infrastructure.
Since data sources may reside anywhere in the world and
end users of PaloPro come from different countries, the initial
design comprised different servers per country. Recently we
interconnected the data collection services for all countries
and now all services use the same infrastructure, so data
collection, storage, analysis and presentation are done within
the same collection of servers.
In order to be able to handle the huge amounts of data
collected and served to our clients, we store data in two types
of databases. The first one is a Percona MySQL database
cluster⁸, while the second one is a set of
Elasticsearch⁹ nodes distributed across multiple servers.
IV. PALOPRO PROCESS FLOW
The process flow in PaloPro starts from raw data and ends
with useful business knowledge. It comprises several steps,
which are depicted in Figure 2 and are explained in the
following subsections. The volume, velocity and variety of
data affect the design of each step.
A. Data Acquisition and Recording
PaloPro starts with the collection of content from social
media, which is performed on a continuous basis (every few
minutes) and results in a huge repository of raw textual
content and associated metadata that describe the source and
the content itself (e.g. time information, location information,
author, social medium, etc.).
Social media content does not arise by itself: it is recorded
from some data generating source. Much of this data is of no
interest, and it can be filtered and compressed by orders of
magnitude. All these filters in Palo and PaloPro are
implemented using machine learning techniques and allow
new filters to be trained in such a way that they do not discard
useful information. A detailed description of the news
crawling mechanism of Palo is provided in [18]. This
mechanism allows the administrators of Palo to quickly feed
in news sources, when entering a new country and thus
quickly create an initial content repository. Data from popular
social media platforms is gathered using the provided APIs.
The next and most important step of PaloPro comprises the
semantic analysis of texts (e.g. named entity recognition,
sentiment analysis, aspect detection etc.) and the analysis of
associated metadata (e.g. influential users detection, social
medium impact etc.). The result of this step is a rich repository
of semantically enhanced content and information concerning
the social media sites and users and their influence to the
social media sphere.
8 http://www.percona.com/software/percona-xtradb-cluster
9 https://www.elastic.co/downloads
Figure 2. The information flow and the core processing tasks of PaloPro
B. Information Extraction and Cleaning
The challenge here is to extract useful content on the fly
from the original content that is aggregated from the
various sources. An information extraction process is needed that pulls
out the required information from the underlying sources and
expresses it in a structured form suitable for analysis.
For this purpose, Palo incorporates several highly
parallelizable algorithms for document and sentence
clustering, text summarization, and named entity, aspect and
opinion extraction. Using a very fast text clustering algorithm,
we manage to refresh our news every 3 minutes and to
automatically cluster them into themes, without human
intervention. Document and sentence clustering significantly
reduces the load on the remaining processing pipeline, since
news content is heavily reproduced across many sources.
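To illustrate how clustering reduces downstream load when the same story is reproduced by many sources, the sketch below groups near-duplicate texts with word shingles and Jaccard similarity in a single greedy pass. This is a stand-in technique chosen for brevity; the clustering algorithm actually used in Palo is not described here, and the function names and the 0.5 threshold are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lower-cased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_documents(docs, threshold=0.5):
    """Greedy single-pass clustering: a document joins the first cluster whose
    representative (its first member) is similar enough, else starts a new one."""
    clusters = []  # list of (representative shingle set, [document indices])
    for idx, doc in enumerate(docs):
        sh = shingles(doc)
        for rep, members in clusters:
            if jaccard(sh, rep) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((sh, [idx]))
    return [members for _, members in clusters]


docs = [
    "the prime minister announced new tax measures today",
    "the prime minister announced new tax measures today in athens",
    "local team wins the national football cup final",
]
groups = cluster_documents(docs)
```

Later stages such as entity extraction can then process one representative per group instead of every copy of the story.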
For the summarization of content, we employ an efficient
language-agnostic technique, which is based on n-gram graphs
[19] and produces results comparable to those of other
language-dependent techniques. Finally, for entity extraction and
opinion mining we implement a machine learning technique,
which can be easily trained for new languages [20]. A new,
highly parallelized alternative implementation, which is
automatically deployed to new languages, is currently under
development. The alternative takes advantage of structured,
collaboratively created content in order to train the respective
entity extraction and opinion mining models for a new
language.
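The n-gram graph representation mentioned above can be sketched as follows: character n-grams become nodes, and n-grams that occur close to each other are linked by weighted edges. This is a simplified illustration under stated assumptions, not the implementation of [19]; the window size, the unordered edges and the containment-style similarity are choices made here for brevity.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter mapping an (unordered)
    pair of n-grams to the number of times they co-occur within `window`
    positions of each other."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[tuple(sorted((g, grams[j])))] += 1
    return edges


def containment_similarity(g1, g2):
    """Share of edges of the smaller graph that also appear in the other graph."""
    if not g1 or not g2:
        return 0.0
    shared = len(set(g1) & set(g2))
    return shared / min(len(g1), len(g2))


g_a = ngram_graph("opinion mining")
g_b = ngram_graph("opinion miner")
similarity = containment_similarity(g_a, g_b)
```

Because the graph is built from raw characters, no tokenizer, stemmer or lexicon is needed, which is what makes this family of techniques attractive for new languages.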
C. Data Integration, Modelling, and Analysis
The acquisition of data and extraction of information are
the first steps towards business intelligence. However, due to
the heterogeneity of information it is necessary to properly
model the extracted information in order to further analyse it.
A problem with current Big Data analysis is the lack of
coordination between database systems, which host the data
for querying, and the analytics tools that perform data mining
tasks and statistical analyses. In PaloPro this binding is driven
by the business need for information. So starting from the
needs for visualization and information for the domain
experts, we properly orchestrate the underlying mechanisms in
order to be able to continuously feed the end-user dashboard
with up-to-date knowledge about his/her company or product.
D. Interpretation and Visualization
On top of the collected information and extracted
knowledge, we have developed the PaloPro dashboard, which
comprises sophisticated tools that allow us to depict the image
of an entity (e.g. a company, a person, a product) in social
media, to measure the effect of a certain action or event on an
entity's image in the long run, and to drill down to the details
that contributed to this result. Figure 3 provides a glimpse of
the PaloPro dashboard.
V. CHALLENGES
Bearing in mind the multiple phases of the PaloPro stream data
processing pipeline, it is interesting to consider some common
challenges that underlie the different phases.
A. Heterogeneity
Content in PaloPro is mainly textual. However, we also
collect multimedia content, which is worth
associating with text. In addition to this, for Greece, we collect
data from Diavgeia¹⁰, a governmental site that provides
metadata and data concerning all the payments made by
public organizations to companies and individuals. Using
these data, we are able to provide a detailed analysis of where
public money is spent, similar to [21]. Interest in such
analysis is higher among reporters and professionals from the
news industry. Although it is currently outside Palo's objectives,
the rich content that we continuously collect can be exploited
if properly integrated.
10 https://diavgeia.gov.gr/
Figure 3. PaloPro dashboard with real-time information on mentions’ polarity, top influencers and topics of interest
B. Scalability
A major issue that arose when we scaled up our solution
was the dilemma between cloud computing and private
servers. Currently PaloPro is using its own dedicated servers
and their workload is constantly increasing, thus
minimizing idle resources and making the choice of our own
servers more reasonable.
In terms of speed, the limits are set not by the demand for
increasing processing throughput but by the acquisition
rate in conjunction with the amount of collected data. Entering
a new country, such as Turkey for example, which provides 10
times the volume of content that Greece provides, does not change
the requirement to refresh the news sphere every three
minutes (or even less). So the scalability of algorithms and
architecture must be examined accordingly.
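A back-of-the-envelope calculation shows what a fixed three-minute refresh implies as volume grows. The sketch takes the figure of 3 million documents per day quoted in Section III as a baseline and applies the tenfold factor mentioned above; treating that daily figure as the baseline for a single market and assuming evenly spread arrivals are simplifications made purely for illustration.

```python
def docs_per_window(docs_per_day, window_minutes):
    """Average number of documents arriving within one refresh window,
    assuming arrivals are spread evenly over the day."""
    windows_per_day = 24 * 60 / window_minutes
    return docs_per_day / windows_per_day


baseline = docs_per_window(3_000_000, 3)     # daily volume quoted in Section III
scaled = docs_per_window(10 * 3_000_000, 3)  # hypothetical tenfold market
print(baseline, scaled)                      # 6250.0 62500.0
```

Under these assumptions each refresh cycle must absorb about 6,250 documents at the baseline volume and about 62,500 at ten times that volume, while the three-minute deadline stays the same.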
C. System training
Last but not least is the quality of the collected
content and consequently of the information and services
delivered. The quality standards have been defined through our
five-year presence in Greek social media analysis, but the same
standards must be met in a shorter period when entering a new
country. A set of language-agnostic methods guarantees that
the quality of some modules will be the same in all countries.
In the case of language-specific modules, a set of tools that
accelerate the training of the different models, using either
structured human-created content or human-annotated content,
allows fast deployment and high quality of services.
VI. CONCLUSIONS
PaloPro implements a holistic approach to social media
analysis and monitoring of brand awareness through an easy-to-use
dashboard, and supports content in multiple languages.
Using the business intelligence it provides, brands and
companies are able to monitor the outcomes of their
campaigns using real-time analysis of their impact in social
media and the news. This analysis creates high corporate
business value but at the same time raises several issues
that relate to the management of big data. In this work, we
presented the main challenges and summarized the
solutions that we implement.
ACKNOWLEDGMENTS
The project is partially funded by GSRT (ICT4Growth
project).
REFERENCES
[1] Thet, T. T., Na, J. C., & Khoo, C. S. (2010). Aspect-based sentiment
analysis of movie reviews on discussion boards. Journal of Information
Science, 36(6), 823-848.
[2] Pontiki, M., Papageorgiou, H., Galanis, D., Androutsopoulos, I.,
Pavlopoulos, J., & Manandhar, S. (2014, August). Semeval-2014 task 4:
Aspect based sentiment analysis. In Proceedings of the 8th International
Workshop on Semantic Evaluation (SemEval 2014) (pp. 27-35).
[3] SINTEF. "Big Data, for better or worse: 90% of world's data generated
over last two years." ScienceDaily. ScienceDaily, 22 May 2013.
<www.sciencedaily.com/releases/2013/05/130522085217.htm>.
[4] Dinter B., Lorenz, A. (2012). “Social Business Intelligence: a Literature
Review and Research Agenda”. In: Thirty Third International
Conference on Information Systems (ICIS 2012). Ed. by F. George Joey.
Orlando, Florida: Association for Information Systems. isbn: 978-0-615-
71843-9. url:
http://aisel.aisnet.org/icis2012/proceedings/ResearchInProgress/104/
[5] Thet, T. T., Na, J. C., & Khoo, C. S. (2010). Aspect-based sentiment
analysis of movie reviews on discussion boards. Journal of Information
Science, 36(6), 823-848.
[6] Hu, M., Liu, B. (2004). Mining and summarizing customer reviews,
Proceedings of the 10th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (2004) 168-177.
[7] Blair-Goldensohn, S., Hannan, K., McDonald, S., Neylon, T., Reis,
G.A., Reynar, J. (2008). Building a sentiment summarizer for local
service reviews, Proceedings of WWW 2008 Workshop: NLP
Challenges in the Information Explosion Era.
[8] Hatzivassiloglou, V., McKeown, K.R. (1997) Predicting the semantic
orientation of adjectives, Proceedings of the 35th Annual Meeting of the
ACL and the 8th Conference of the European Chapter of the ACL
(1997).
[9] Qiu, G., Liu, B., Bu, J., Chen, C. (2009). Expanding domain sentiment
lexicon through double propagation, Proceedings of the 21st
International Joint Conference on Artificial Intelligence (Morgan
Kaufmann, San Francisco, 2009) 1199-1204.
[10] Pang, B., Lee, L. (2005). Seeing stars: exploiting class relationships for
sentiment categorization with respect to rating scales, Proceedings of the
Association for Computational Linguistics (2005) 115-124.
[11] Yi, J., Nasukawa, T., Bunescu, R., Niblack, W. (2003). Sentiment
analyzer: extracting sentiments about a given topic using natural
language processing techniques, Proceedings of the 3rd IEEE
International Conference on Data Mining (2003) 427-434.
[12] Miyoshi, T., Nakagami, Y. (2007). Sentiment classification of customer
reviews on electric products, Proceedings of International Conference on
Systems, Man and Cybernetics (2007) 2028-2033.
[13] Moghaddam, S., Ester, M. (2012). Aspect-based opinion mining from
product reviews. In Proceedings of the 35th international ACM SIGIR
conference on Research and development in information retrieval
(SIGIR '12). ACM, New York, NY, USA, 1184-1184.
[14] Ghose, A., Ipeirotis, P., & Sundararajan, A. (2007, June). Opinion
mining using econometrics: A case study on reputation systems. In
annual meeting-association for computational linguistics (Vol. 45, No. 1,
p. 416).
[15] Mukherjee, A., Liu, B., Glance, N. (2012). Spotting Fake Reviewer
Groups in Consumer Reviews. International World Wide Web
Conference (WWW-2012), Lyon, France, April 16-20, 2012.
[16] Jindal, N., Liu, B. (2008). Opinion Spam and Analysis. Proceedings of
First ACM International Conference on Web Search and Data Mining
(WSDM-2008), Feb 11-12, 2008, Stanford University, Stanford,
California, USA.
[17] Stonebraker, M., Çetintemel, U., Zdonik, S. (2005). The 8 requirements
of real-time stream processing. SIGMOD Rec., 34(4):42-47, 2005.
[18] Varlamis, I., Tsirakis, N., Tsantilas, P., & Poulopoulos, V. (2014,
October). An automatic wrapper generation process for large scale
crawling of news websites. In Proceedings of the 18th Panhellenic
Conference on Informatics (pp. 1-6). ACM.
[19] Giannakopoulos, G., Kiomourtzis, G., & Karkaletsis, V. (2014).
NewSum: "N-Gram Graph"-Based. Innovative Document
Summarization Techniques: Revolutionizing Knowledge Understanding,
205.
[20] Petasis, G., Spiliotopoulos, D., Tsirakis, N., & Tsantilas, P. (2014).
Sentiment analysis for reputation management: Mining the greek web.
In Artificial Intelligence: Methods and Applications (pp. 327-340).
Springer International Publishing.
[21] Vafopoulos, M. N., Meimaris, M., Papantoniou, A., Anagnostopoulos,
I., Alexiou, G., Avraam, I., ... & Loumos, V. (2012). Public Spending:
Interconnecting and Visualizing Greek Public Expenditure Following
Linked Open Data Directives. Available at SSRN 2064517.
... For twitter data, sentiment analysis [13]- [15], opinion mining [16]- [18], are well-known techniques applied to extract trends in multitude of areas like elections, events and much more. In CRM systems, a semantic analysis on multisource unstructured data a semantic analysis is conducted to annotate, extract, and rate customer feedbacks. ...
Conference Paper
Full-text available
Big Data has gained an enormous momentum the past few years because of the tremendous volume of generated and processed Data from diverse application domains. Nowadays, it is estimated that 80% of all the generated data is unstructured. Evaluating the quality of Big data has been identified to be essential to guarantee data quality dimensions including for example completeness, and accuracy. Current initiatives for unstructured data quality evaluation are still under investigations. In this paper, we propose a quality evaluation model to handle quality of Unstructured Big Data (UBD). The later captures and discover first key properties of unstructured big data and its characteristics, provides some comprehensive mechanisms to sample, profile the UBD dataset and extract features and characteristics from heterogeneous data types in different formats. A Data Quality repository manage relationships between Data quality dimensions, quality Metrics, features extraction methods, mining methodologies, data types and data domains. An analysis of the samples provides a data profile of UBD. This profile is extended to a quality profile that contains the quality mapping with selected features for quality assessment. We developed an UBD quality assessment model that handles all the processes from the UBD profiling exploration to the Quality report. The model provides an initial blueprint for quality estimation of unstructured Big data. It also, states a set of quality characteristics and indicators that can be used to outline an initial data quality schema of UBD.
... Dealing with real-time mining of user generated contents on OSM might be practically impossible because of the huge size of the whole involved information. Choosing a statistically representative subset of users for evaluating an entire network can be a solution to this problem [14], [15]: a selection approach for users that represent monitoring points has been extensively validated on Twitter and Sina Weibo [16]. The required accommodation of such selection procedures yields a corresponding flexibility requirement in the design of a general crawler architecture. ...
Conference Paper
The creation and maintenance of a large-scale news content aggregator is a tedious task, which requires more than a simple RSS aggregator. Many news sites appear every day on the Internet, providing new content at different refresh rates; well-established news sites restrict access to their content to subscribers or online readers, without offering RSS feeds, whereas other sites update their CMS or website template and cause crawlers to run into fetch errors. The main problem that arises from this continuous generation and alteration of pages on the Internet is the automated discovery of appropriate and useful content, and the dynamic rules that crawlers need to apply in order not to become outdated. In this paper we present an innovative mechanism for extracting useful content (title, body and media) from news article web pages, based on automatic extraction of the patterns that form each domain. The system achieves high performance by combining information gathered while discovering the structure of a news site with the "knowledge" that it acquires at each crawling step, in order to improve the quality of the next steps of its own procedure. Additionally, the system can recognize changes in patterns in order to rebuild the domain rules whenever the domain changes structure. This system has been successfully implemented in palo.rs, the first news search engine in Serbia.
Conference Paper
Harvesting web and social web data is a meticulous and complex task. Applying the results to a successful business case such as brand monitoring requires high precision and recall for the opinion mining and entity recognition tasks. This work reports on a platform that integrates a state-of-the-art Named-entity Recognition and Classification (NERC) system with opinion mining methods, offered as a Software-as-a-Service (SaaS), fully automatic brand monitoring service for the Greek language. The service has been successfully deployed to the biggest search engine in Greece, powering the large-scale linguistic and sentiment analysis of about 80,000 resources per hour.
Conference Paper
The domains of Business Intelligence (BI) and social media have both become significant research fields. While BI aims at supporting an organization's decisions by providing relevant analytical data, social media is an emerging source of the personal and individual knowledge, opinions, and attitudes of stakeholders. For some time, a convergence of the two domains has been observable in real-world implementations and research, resulting in concepts like social BI. Many research questions still remain open or, even worse, have not yet been formulated. Therefore, the paper aims at articulating a research agenda for social BI. By means of a literature review we systematically explored previous work and developed a framework. It contrasts social media characteristics with BI design areas and is used to derive the social BI research agenda. Our results show that the integration of social media (data) into a BI system has an impact on almost all BI design objects.
Article
Opinionated social media such as product reviews are now widely used by individuals and organizations for their decision making. However, motivated by profit or fame, people try to game the system by opinion spamming (e.g., writing fake reviews) to promote or demote some target products. For reviews to reflect genuine user experiences and opinions, such spam reviews should be detected. Prior works on opinion spam focused on detecting fake reviews and individual fake reviewers. However, a fake reviewer group (a group of reviewers who work collaboratively to write fake reviews) is even more damaging, as its size allows it to take total control of the sentiment on the target product. This paper studies spam detection in the collaborative setting, i.e., the discovery of fake reviewer groups. The proposed method first uses a frequent itemset mining method to find a set of candidate groups. It then uses several behavioral models derived from the collusion phenomenon among fake reviewers, and relation models based on the relationships among groups, individual reviewers, and the products they reviewed, to detect fake reviewer groups. We also built a labeled dataset of fake reviewer groups. Although labeling individual fake reviews and reviewers is very hard, to our surprise labeling fake reviewer groups is much easier. We also note that the proposed technique departs from the traditional supervised learning approach for spam detection because the inherent nature of our problem makes the classic supervised learning approach less effective. Experimental results show that the proposed method outperforms multiple strong baselines including state-of-the-art supervised classification, regression, and learning-to-rank algorithms.
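The candidate-generation step described in this abstract can be illustrated with a minimal sketch. It assumes that each product is treated as one transaction whose items are the reviewers of that product, and it enumerates reviewer subsets by brute force rather than with a dedicated frequent itemset algorithm such as Apriori or FP-growth. The function name `candidate_groups`, its parameters, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def candidate_groups(reviews, min_support=2, max_group_size=3):
    """Return reviewer sets that co-reviewed at least `min_support` products.

    `reviews` is an iterable of (reviewer_id, product_id) pairs.
    """
    # One "transaction" per product: the set of reviewers who reviewed it.
    reviewers_by_product = defaultdict(set)
    for reviewer, product in reviews:
        reviewers_by_product[product].add(reviewer)

    # Count, for every reviewer subset of size 2..max_group_size,
    # how many products the whole subset reviewed together.
    support = defaultdict(int)
    for reviewers in reviewers_by_product.values():
        ordered = sorted(reviewers)
        for size in range(2, max_group_size + 1):
            for group in combinations(ordered, size):
                support[group] += 1

    # Keep only the groups that meet the support threshold.
    return {g: count for g, count in support.items() if count >= min_support}


# Hypothetical toy data: r1, r2 and r3 review the same two products.
reviews = [
    ("r1", "p1"), ("r2", "p1"), ("r3", "p1"),
    ("r1", "p2"), ("r2", "p2"), ("r3", "p2"),
    ("r1", "p3"), ("r4", "p3"),
]
groups = candidate_groups(reviews)
```

Groups found this way are only candidates; according to the abstract, the behavioral and relation models are what decide whether a candidate is actually a fake reviewer group.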
Conference Paper
Merchants selling products on the Web often ask their customers to review the products that they have purchased and the associated services. As e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. For a popular product, the number of reviews can be in the hundreds or even thousands. This makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. It also makes it difficult for the manufacturer of the product to keep track of and manage customer opinions. For the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. In this research, we aim to mine and to summarize all the customer reviews of a product. This summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. We do not summarize the reviews by selecting or rewriting a subset of the original sentences to capture the main points, as in classic text summarization. Our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. This paper proposes several novel techniques to perform these tasks. Our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques.
Conference Paper
In most sentiment analysis applications, the sentiment lexicon plays a key role. However, it is hard, if not impossible, to collect and maintain a universal sentiment lexicon for all application domains because different words may be used in different domains. The main existing technique extracts such sentiment words from a large domain corpus based on different conjunctions and the idea of sentiment coherency in a sentence. In this paper, we propose a novel propagation approach that exploits the relations between sentiment words and topics or product features that the sentiment words modify, and also sentiment words and product features themselves to extract new sentiment words. As the method propagates information through both sentiment words and features, we call it double propagation. The extraction rules are designed based on relations described in dependency trees. A new method is also proposed to assign polarities to newly discovered sentiment words in a domain. Experimental results show that our approach is able to extract a large number of new sentiment words. The polarity assignment method is also effective.
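The propagation idea in this abstract can be sketched as a fixed-point loop. The sketch below is a simplification under stated assumptions: the dependency parse is assumed to be already reduced to two sets of word pairs ("modifies" and "conjoined"), and only three illustrative rules are applied. The function name, the rule set, and the example words are hypothetical; the paper's actual extraction rules over dependency trees and its polarity assignment method are not reproduced here.

```python
def double_propagation(modifies, conjoined, seed_opinions):
    """Grow opinion-word and target sets until neither changes.

    modifies:  set of (opinion_word, target) pairs, e.g. adjective -> noun.
    conjoined: set of (word_a, word_b) pairs linked by a conjunction.
    """
    opinions = set(seed_opinions)
    targets = set()
    changed = True
    while changed:
        changed = False
        for opinion, target in modifies:
            # Rule 1: a known opinion word marks the noun it modifies as a target.
            if opinion in opinions and target not in targets:
                targets.add(target)
                changed = True
            # Rule 2: a word modifying a known target is an opinion word.
            if target in targets and opinion not in opinions:
                opinions.add(opinion)
                changed = True
        # Rule 3: a word conjoined with a known opinion word (or target)
        # is assumed to be of the same kind.
        for a, b in conjoined:
            for group in (opinions, targets):
                if (a in group) != (b in group):
                    group.update((a, b))
                    changed = True
    return opinions, targets


# Hypothetical relations, as if extracted from parsed camera reviews.
modifies = {
    ("great", "screen"), ("sharp", "screen"),
    ("sharp", "lens"), ("sturdy", "lens"),
}
conjoined = {("sturdy", "light")}
opinions, targets = double_propagation(modifies, conjoined, {"great"})
```

Starting from the single seed "great", the loop alternates between discovering targets from opinion words and opinion words from targets, which is the back-and-forth that gives the approach its name.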
Article
The provision of publicly available open data leads to transparency in several public sector exchanges, spending and decisions. However, this information is served massively and heterogeneously, mostly due to different bureaucratic procedures and paperwork formats, while its diffusion does not occur at regular or at least generally predictable time intervals. Thus, even though the information is made available by the public sector bodies involved, enterprises and citizens are overwhelmed by the size and inconsistency of the information they deal with. The scope of our publicly accessible Web point is two-fold. Firstly, it aims to promote clarity and enhance citizen awareness regarding public spending in Greece through easily consumed visualization diagrams. Information provision is based on semantic processing of real-time open data provided by the Greek government ('Diavgia') and the Greek Taxation Information System. Secondly, a proposed ontology for public spending in Greece functions on two distinct levels. It checks the validity of the publicly available data accessed by the system, cleaning and reconstructing false entries in parallel, and it will interconnect the data with existing ontological and data schemes derived from other similar initiatives worldwide and with core vocabularies.
Article
"What other people think" has always been an important piece of information for most of us during the decision-making process. Today people tend to make their opinions available to other people via the Internet. As a result, the Web has become an excellent source of consumer opinions. There are now numerous Web resources containing such opinions, e.g., product review forums, discussion groups, and blogs. However, it is difficult for a customer to read all of the reviews and make an informed decision on whether to purchase the product. It is also difficult for the manufacturer of the product to keep track of and manage customer opinions. In addition, user ratings (stars) alone are not a sufficient source of information for a user or the manufacturer to make decisions. Therefore, mining online reviews (opinion mining) has emerged as an interesting new research direction. Extracting aspects and the corresponding ratings is an important challenge in opinion mining. An aspect is an attribute or component of a product, e.g. 'zoom' for a digital camera. A rating is an intended interpretation of the user satisfaction in terms of numerical values. Reviewers usually express the rating of an aspect by a set of sentiments, e.g. 'great zoom'. In this tutorial we cover opinion mining in online product reviews with a focus on aspect-based opinion mining. This problem is a key task in the area of opinion mining and has recently attracted many researchers in the information retrieval community. Several opinion-related information retrieval tasks can benefit from the results of aspect-based opinion mining, and it is therefore considered a fundamental problem. This tutorial covers not only general opinion mining and retrieval tasks, but also state-of-the-art methods, challenges, applications, and future research directions of aspect-based opinion mining.
Conference Paper
Many terms represent the speaker's evaluation of an item. This evaluative character of a word is called its semantic orientation. A word with positive semantic orientation implies that the item is desirable, whereas a word with negative semantic orientation implies that the item is not desirable. We propose a method to estimate the semantic orientation (positive or negative) of Japanese product reviews, in which noun-adjective pairs are selected as the corpus for determining the semantic orientation of the reviews, and the semantic orientation score of each review is calculated according to this corpus. Words that change the semantic orientation of another word, called "contextual valence shifters", such as "not", "very", and "quite", are considered in the algorithm for determining the semantic orientation. In order to empirically evaluate the performance of the classification method for Japanese documents, 1400 reviews of two electronic products, an LCD and an MP3 music player, are classified into two categories (positive or negative).
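The role of contextual valence shifters can be illustrated with a small lexicon-based scorer. This is a sketch under stated assumptions, not the method of the paper: it uses English tokens instead of Japanese, a hand-made lexicon, arbitrary intensifier weights, and the rule that a shifter affects only the next lexicon word. All names and values (`LEXICON`, `NEGATORS`, `INTENSIFIERS`, `review_orientation`, `classify`) are hypothetical.

```python
# Hypothetical lexicon: +1 for desirable, -1 for undesirable evaluations.
LEXICON = {"good": 1.0, "clear": 1.0, "bad": -1.0, "noisy": -1.0}
NEGATORS = {"not"}                             # flip the orientation
INTENSIFIERS = {"very": 1.5, "quite": 1.25}    # scale it (assumed weights)


def review_orientation(tokens):
    """Sum lexicon scores, letting shifters modify the next lexicon word."""
    score = 0.0
    multiplier = 1.0
    for token in tokens:
        if token in NEGATORS:
            multiplier *= -1.0
        elif token in INTENSIFIERS:
            multiplier *= INTENSIFIERS[token]
        elif token in LEXICON:
            score += multiplier * LEXICON[token]
            multiplier = 1.0
        else:
            # Any other word ends the scope of pending shifters.
            multiplier = 1.0
    return score


def classify(tokens):
    """Two-way decision; a score of exactly zero is treated as positive."""
    return "positive" if review_orientation(tokens) >= 0 else "negative"


negative_review = "the screen is not good".split()
mixed_review = "sound is very clear but not noisy".split()
```

In this sketch "not good" contributes a negative score, while "not noisy" contributes a positive one, which is the kind of orientation change the abstract attributes to valence shifters.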