All content in this area was uploaded by Majdi Owda on Nov 25, 2020
Spatial Sentiment and Perception Analysis of BBC News Articles Using Twitter
Posts Mining
Farah Younas1 and Dr. Majdi Owda2
1 Department of Computer Science, Shaheed Zulfikar Ali Bhutto Institute of Science and Technology, Islamabad, Pakistan.
farah.younas@szabist-isb.edu.pk.
2 Department of Computing and Mathematics, Manchester Metropolitan University, Chester Street, Manchester, M1 5GD,
UK, M.Owda@mmu.ac.uk
Abstract. Over the past few decades, social media, online resources and microblogging websites such as Twitter
have grown exponentially. The result has been a surge of user-generated content, and the huge amount of data
produced through news and event sharing on these sites is no exception. Data generated by these resources
is a rich source of information for data mining. Sentiment analysis is a current and important research area that
attempts to determine the polarity of text. Determining the sentiment around events happening across the globe has
become extremely important. In this paper, a subjective lexicon-based approach is proposed to mine
unstructured data from a popular microblogging website, Twitter, into meaningful information, in order to
determine the semantic orientation of real-time reactions and opinions. The main focus is to extract the audience’s
sentiment related to BBC news articles being shared on Twitter. Firstly, our approach extracts all comments
on shared articles and determines their polarity. Secondly, it categorizes the extracted users based upon their
location and shows the collective opinion of users in different regions. Thirdly, a visualization tool has been
developed for viewing the obtained results.
Keywords: Microblogging, Big Data, Sentiment Analysis, Subjective Lexicon-based approach, Twitter, Spring
MVC, Hibernate.
1. Introduction
The process of identifying and mining subjective information from raw data is termed sentiment analysis. It is
closely related to the field of Natural Language Processing (NLP), which tries to narrow the gap between machine
and human by extracting and analysing useful information from natural language messages. In the past few years, with
the growth of web technology, there has been enormous growth in the use of microblogging platforms like Twitter. People
are not only using these web resources but also giving their feedback, consequently producing further useful
information. Social networking sites produce up to terabytes of data per week. Spurred by this growth in the amount of
user feedback, views and opinions, it is becoming essential to mine this data. Data collected on a daily basis is
wasted if it is not put to use. Companies are seeking ways to perform this task for better decision
making. The large volume of information, generating large and complex datasets on a day-to-day basis, presents many
challenges to analysts who want to extract meaningful information from the data. Traditional data processing
applications are inadequate for processing this data, and analysing such a large amount of data is a challenge.
Sentiment analysis, or opinion mining, is a Natural Language Processing application, and over the past few decades
this Information Extraction (IE) task has received growing attention. It is also called emotion analysis or mood
extraction, and the basic task is the classification of text polarity as positive, negative or neutral. Social media is an
area where sentiment analysis has been extensively applied. The aim of our approach is to use Twitter data
to perform sentiment analysis and to develop a tool for visualizing the results. The targeted domain is the comments
posted on news articles of the BBC website which people share on Twitter; the purpose is to analyse the impact of a
particular article on various regions around the globe. People from various parts of the world comment on the shared news
article and express their thoughts and opinions about it, which can be positive or negative. Due to the large amount of data,
it is nearly impossible to analyse it manually; hence an automated tool is required for its analysis.
There are three different approaches to performing sentiment analysis: (1) Subjective Lexicon - each word in the list is
allocated a score that specifies the word’s nature as positive (good), negative (bad), or neutral. (2) N-Gram Modelling -
different types of models (uni-gram, bi-gram, tri-gram or their combination) are built from training data to form an
N-gram model, which is then used for classification. (3) Machine Learning - supervised or semi-supervised learning is
performed by extracting features from the text and learning a model [1]. We use the first approach for this project,
i.e. Subjective Lexicon.
Sentiment analysis can be done at different levels, such as (1) Document Level Classification – this type of analysis is
performed on an entire review; the whole review is categorized on the basis of its overall opinion. (2) Sentence Level
Classification – this process is carried out in two steps: (i) a sentence is classified into one of two classes,
objective or subjective, which is known as subjectivity classification; (ii) a subjective sentence is classified into one of
two classes, positive or negative, which is referred to as sentiment classification. The latter is the approach adopted by
our study. The approach used is based on a Subjective Lexicon, which is an unsupervised technique as the data is not labelled.
The remaining sections of this paper are organized as follows. Section 2 covers related work. Sections 3 and 4 contain
the research methodology and implementation details respectively. Analysis and results are presented in Section 5. Section
6 lists the limitations and future research challenges. Finally, the conclusion is given in Section 7.
2. Related Work
“Sentiment analysis or opinion mining refers to the application of natural language
processing, computational linguistics and text analytics to identify and extract subjective information in source
materials” [2]. A support vector machine (SVM) was used in a study of emotion classification to investigate the
emotion grouping of web blog corpora. The study took the sentence context into account when performing emotion
classification. From this work it was concluded that the emotion in a document’s last sentence has the greatest
significance in determining the polarity of the surveyed document [3]. Read used emoticons such as ☺ to
build a training set for text classification. Texts containing emoticons from Usenet newsgroups
were used by the author as sources. A training set depends both on the topic of the domain and on the time
when the data was collected. Therefore, a training set from one domain, say domain A, can be applied to another
domain B only if both domains share domain-specific vocabulary. Domain-, time- and
topic-independent datasets are obtained when experiments are performed using emoticon-labelled training sets.
Satisfactory results were obtained by emoticon-trained classifiers [4].
Pang et al. used movie reviews as data to build a sentiment lexicon. They did not classify documents by topic but by
overall sentiment, categorizing the reviews as positive or negative. It was found that machine learning techniques
perform better than human-produced baselines. They applied the machine learning techniques Naïve Bayes,
maximum entropy classification and SVM, and found that these perform better on traditional topic-based categorization
than on sentiment classification [5]. An empirical method for building a sentiment lexicon of adjectives was first
developed by Hatzivassiloglou and McKeown [6]. They used a large corpus to identify and validate constraints from
conjunctions on the semantic orientation, i.e. positive or negative, of the conjoined adjectives. The nature of the
conjunction linking the adjectives is the key point. These constraints were used by a log-linear regression model to
determine whether conjoined adjectives have the same or different orientation, and 82% accuracy was obtained when each
conjunction was considered independently. A high level of performance was obtained in evaluation on real data and in
simulation experiments, with more than 90% classification precision for
adjectives [6]. For classifying reviews as positive or negative, Peter Turney presented a simple unsupervised algorithm
and proposed the idea of classifying reviews as recommended (thumbs up) or not recommended (thumbs
down). Phrases in a review that contain adjectives or adverbs are extracted, and their average semantic orientation
is used to predict the classification of the review. Experiments were conducted on a movie review corpus. A review is
placed in the recommended (thumbs up) category if the average semantic orientation of its phrases is positive; otherwise
it is classified as not recommended. An automated system was necessary for better formalization of the problem [7].
Turney’s algorithm uses PMI (Pointwise Mutual Information) between consecutive words to estimate their
polarity [7].
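As a brief worked form of the PMI-based semantic orientation mentioned above, the definitions below follow the usual statement of Turney's method [7]; the symbols $w_1$, $w_2$ and $\mathit{phrase}$ are notation introduced here for illustration and do not appear elsewhere in this paper.

```latex
\[
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1 \,\&\, w_2)}{p(w_1)\, p(w_2)}
\]
\[
\mathrm{SO}(\mathit{phrase}) = \mathrm{PMI}(\mathit{phrase}, \text{``excellent''})
                             - \mathrm{PMI}(\mathit{phrase}, \text{``poor''})
\]
```

A phrase that co-occurs more often with "excellent" than with "poor" receives a positive orientation, and a review is labelled as recommended when the average orientation over its phrases is positive.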
In the last few years, tools such as tweetfeel (https://www.tweetfeel.com), Twitratr
(https://www.twitrratr.com), Twitter Sentiment Analysis Tool (https://twittersentiment.appspot.com/) and Social Mention
(www.socialmention.com) have become available. There has also been a reasonable volume of research on how sentiments are
conveyed in genres like news articles and online reviews. Sentic Corner, a model based on Sentic Computing, was
developed [8]. This model is an intelligent user interface in which design and content are in harmony with the user’s
current frame of mind, so that users do not have to deal with continuously blasted ads and user-unfriendly interfaces. The
research is based on emotion representation and common sense. It dynamically collects audio, video and images related to
the user’s current activities and feelings in order to infer their emotional state over the web [9].
Users are interested in aspectual sentiment classification rather than a binary output of positive or negative.
Therefore, the sentiment analysis covered so far is insufficient to satisfy the end user.
As an example, consider a social worker who has developed a scheme and wants to see the change in society
before and after the operation of the scheme. The system used for sentiment analysis should therefore be
able to recognise and classify the aspectual sentiment that is present in the text. As a solution to this problem,
Das proposed a sentiment structuration technique based on the 5Ws: Who, What, When, Where and Why. A label
bias problem may occur due to some drawbacks of this technique [10]. To rectify this problem, the Maximum
Entropy Markov Model (MEMM) was introduced. Another theory, called Appraisal theory, was described by
Bloom; it characterises opinion into three categories: affect, appreciation and judgment [11]. Aggregation of
data is the foremost need of the end user. Sentiment Summarization-Visualization-Tracking can be done in two ways:
• Polarity Wise: An overall polarity-wise summary can be shown in the form of a Gantt chart produced by the system.
A user can find out more details by looking into the summary text [12].
• Topic Wise: Users can generate sentiment summaries for a customized topic based on the 5Ws. A user can
choose any dimension or a combination of multiple dimensions. Pang and Lee performed topic-wise sentiment
summarization [5].
A similar study was conducted on Twitter news articles, using a lexicon-based approach to perform sentiment
analysis. The experiments were performed on a BBC news dataset, demonstrating the applicability and validity
of the adopted approach. Opinion mining was performed on articles from 2004 and 2005 to analyse which categories of
news have more positive articles than others [13]. Wang et al. examined the relationship between crime statistics
and drug-associated tweets. Social media, including Twitter, has been shown to be a feasible tool for monitoring
and predicting public health events such as disease outbreaks. According to their study, Twitter can be used as a
tool for monitoring crime [14]. In our previous study, Experiment for Analysing the Impact of Financial Events
on Twitter, we analysed the problem of detecting irregularities in the financial market in real time
according to volume (as a sign of the importance of the irregularity) and to other features (as signs
of the potential origin of the irregularity) [15]. Furthermore, another study conducted by us used Twitter as a
decision-making tool and inspected the permeability of Twitter to financial events, as a way to provide evidence
that Twitter can be used as a real-time social sensor for the economy and the stock market [16].
3. Research Methodology
The proposed system has a 3-tier architecture and performs sentiment analysis on a news article about which the user
wants to find public opinion. The three layers of the system architecture are the Presentation, Business and Data Access
layers. Figure 1 shows the main modules of the system architecture. Figure 2 shows all the components and
subcomponents, together with the interactions between them. The input to the system is the URL of the BBC
news article under analysis. The Query Builder generates a search query based on the user input with the help of its
subcomponent, the filtering engine. Twitter is then queried by the Social Media Retrieval Engine, which gathers the
required information about the people who shared that article on Twitter and extracts the comments on that article. The
obtained results are stored in the data store for analysis. The collected data might contain a lot of noise, such as
unnecessary words, punctuation marks, emoticons etc., which is not helpful in analysis, so the data needs to be
pre-processed prior to analysis. Finally, the processed data is used to carry out the analysis, and the results are
presented in the client’s browser. The results contain the extracted comments, information about the users, a map
indicating their geographical location, and pie and bar graphs presenting the percentage results of the sentiment
analysis. Collecting data for the project manually from Twitter would be a tedious job, so Twitter4j, a Java library for
the Twitter API that allows data to be collected from Twitter, was used for data collection. The following sub-sections
describe the components of the proposed system.
Fig. 1. Modules of System
Fig. 2. System Architecture
3.1 Query Builder
This component generates a query based upon the user input, i.e. what the user wants to search for. Hence, the input
stream to this component is the input URL and the output stream is the generated query according to which the social media
retrieval engine will collect data for analysis. Its subcomponent is the Filtering Engine, which filters out the tweets
that we do not want to process and provides the required tweets and their associated comments. The characteristics of
the tweets to be filtered out are as follows:
a. Tweets written in any language other than English.
b. Tweets which do not contain the URL of the searched article.
c. Tweets posted outside the specified time period.
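The three filtering criteria can be sketched as a single predicate. This is a minimal illustration and not the system's actual code: the `TweetFilter` class, the `Tweet` holder and the example URL are hypothetical names introduced here, and the sketch assumes the tweet text already contains the expanded article URL and that the language is available as an ISO code such as "en".

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TweetFilter {
    /** Hypothetical minimal tweet holder; this is not the Twitter4j Status type. */
    static final class Tweet {
        final String text;
        final String lang;
        final Instant createdAt;

        Tweet(String text, String lang, Instant createdAt) {
            this.text = text;
            this.lang = lang;
            this.createdAt = createdAt;
        }
    }

    /** Returns true when a tweet survives criteria (a), (b) and (c). */
    static boolean keep(Tweet t, String articleUrl, Instant from, Instant to) {
        boolean english = "en".equals(t.lang);                 // (a) English only
        boolean mentionsArticle = t.text.contains(articleUrl); // (b) contains the article URL
        boolean inPeriod = !t.createdAt.isBefore(from)
                && !t.createdAt.isAfter(to);                   // (c) inside the time window
        return english && mentionsArticle && inPeriod;
    }

    /** Applies keep() to every tweet and returns the survivors in order. */
    static List<Tweet> filter(List<Tweet> tweets, String articleUrl, Instant from, Instant to) {
        List<Tweet> kept = new ArrayList<>();
        for (Tweet t : tweets) {
            if (keep(t, articleUrl, from, to)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String url = "https://www.bbc.co.uk/news/example"; // hypothetical article URL
        Instant from = Instant.parse("2019-12-01T00:00:00Z");
        Instant to = Instant.parse("2019-12-31T23:59:59Z");
        List<Tweet> tweets = Arrays.asList(
                new Tweet("Worth reading " + url, "en", Instant.parse("2019-12-14T10:00:00Z")),
                new Tweet("A lire " + url, "fr", Instant.parse("2019-12-14T11:00:00Z")),
                new Tweet("Unrelated post", "en", Instant.parse("2019-12-14T12:00:00Z")));
        System.out.println(filter(tweets, url, from, to).size() + " tweet(s) kept");
    }
}
```

In practice shared links are often shortened, so a real filter would probably need to compare expanded URLs rather than raw text.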
3.2 Social Media Retrieval Engine
The responsibility of this module is to gather information about the users who have commented on or shared the
searched article. To carry out this task, the Twitter4j library is used to query the Twitter API, as mentioned in Section 3,
generating tweet objects as the output stream of the component. Once a tweet and its relevant information are retrieved,
they are added to the outgoing stream and stored in the database. The Twitter API imposes several regulations and an
unavoidable rate limit on the number of tweets which can be extracted per hour, so it is very likely that
some tweets will be missed.
3.3 Data Pre-Processing Engine
To prepare the data for analysis, it is pre-processed. This process removes noise from the gathered data. In the first
step, every extracted tweet is split into individual words (tokens). In the second step, words which are not meaningful
for analysis, termed stop words, are removed from the extracted tweet sentences. This step reduces the meaningless
vocabulary in order to obtain better results. There is no universal list of such words. Table 1 lists the unwanted
content removed from the collected tweets and the action performed on each.
Table 1. Action performed on unwanted content
Unwanted Content             | Action
Punctuation (! ? , . ” : ; ) | Removed
#tag                         | Removed
@user                        | Removed
Uppercase characters         | Lowercase all content
Stop words                   | Removed
RT                           | Removed
BBC News                     | Removed
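The actions in Table 1 can be sketched as one pre-processing function. This is only an illustration under stated assumptions: the `TweetPreprocessor` class and its stop-word list are hypothetical (the paper does not publish its list), and the URL-stripping step is an addition not listed in Table 1, included so that shared links do not turn into junk tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TweetPreprocessor {
    // Illustrative stop words only; the actual list used by the system is not given in the paper.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "are", "was", "of", "to", "in", "on", "and", "it", "this", "that"));

    /** Cleans one tweet following Table 1 and returns the remaining tokens. */
    static List<String> preprocess(String tweet) {
        String s = tweet;
        s = s.replaceAll("https?://\\S+", " ");       // assumption: strip URLs (not in Table 1)
        s = s.replaceAll("(?i)bbc\\s+news", " ");     // "BBC News" removed
        s = s.replaceAll("\\bRT\\b", " ");            // retweet marker removed (before lowercasing)
        s = s.replaceAll("[@#]\\w+", " ");            // @user and #tag removed
        s = s.toLowerCase(Locale.ROOT);               // uppercase characters lowercased
        s = s.replaceAll("[\\p{Punct}\\p{P}]", " ");  // ASCII and Unicode punctuation removed

        List<String> tokens = new ArrayList<>();
        for (String token : s.trim().split("\\s+")) {
            // Skip the empty token produced by an all-whitespace string, and stop words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String raw = "RT @bob: BBC News - Labour is facing a LONG haul! #GE2019";
        System.out.println(preprocess(raw)); // prints [labour, facing, long, haul]
    }
}
```

The order of the steps matters in this sketch: the "RT" marker and the "BBC News" label are removed before lowercasing so that ordinary words are not affected.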
3.4 Data Store
A MySQL database was used in the development of the project for storing and managing the information about users and
tweets, along with the comments extracted by the retrieval engine. It also contains the list of sentiment words used for
performing the sentiment analysis on the collected information.
3.5 Reasoning Engine
The main task of this engine is to perform the core sentiment analysis, making it the most important part of the whole
architecture. The four subcomponents of this engine, shown in Figure 3, are as follows:
Sentiment Analysis. This subcomponent is responsible for comparing the obtained comments with the list of sentiment
words stored in the database to categorize the impact of the article as positive or negative. Each time it finds a
sentiment word, it increments the weight of the corresponding sentiment, either positive or negative.
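The word-matching step described for the Sentiment Analysis subcomponent can be sketched as follows. This is a minimal illustration, not the system's implementation: the `LexiconScorer` class and its tiny word lists are hypothetical stand-ins for the sentiment words held in the database, and labelling ties as "neutral" is an assumption made here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LexiconScorer {
    // Hypothetical sample lexicon; the real word list lives in the system's database.
    static final Set<String> POSITIVE = new HashSet<>(Arrays.asList(
            "good", "great", "hope", "win", "support"));
    static final Set<String> NEGATIVE = new HashSet<>(Arrays.asList(
            "bad", "terrible", "crisis", "fail", "lose"));

    /** Returns {positiveCount, negativeCount} for the given pre-processed tokens. */
    static int[] count(List<String> tokens) {
        int positive = 0;
        int negative = 0;
        for (String token : tokens) {
            if (POSITIVE.contains(token)) {
                positive++;   // increment the positive weight
            } else if (NEGATIVE.contains(token)) {
                negative++;   // increment the negative weight
            }
        }
        return new int[] {positive, negative};
    }

    /** Labels a comment by comparing the two counts; ties are treated as neutral (assumption). */
    static String polarity(List<String> tokens) {
        int[] c = count(tokens);
        if (c[0] > c[1]) {
            return "positive";
        }
        if (c[1] > c[0]) {
            return "negative";
        }
        return "neutral";
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("great", "win", "but", "crisis");
        System.out.println(polarity(tokens)); // prints positive (2 positive words vs 1 negative)
    }
}
```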
Aggregation Function. After obtaining the positive and negative counts from the sentiment analysis component, the
aggregation function $\%\,\mathrm{Aggregate} = \left(\sum \mathrm{count} / \mathrm{Total\ count}\right) \times 100$ is
applied to them to calculate the final percentage for both the positive and the negative counts.
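The aggregation function amounts to a one-line percentage calculation. The sketch below is illustrative only: the `Aggregation` class is a hypothetical name, and the counts 32 and 14 are made-up inputs, since the paper does not report raw counts.

```java
import java.util.Locale;

class Aggregation {
    /** Percentage of `count` within `total`; returns 0 when there is nothing to aggregate. */
    static double percentage(int count, int total) {
        if (total == 0) {
            return 0.0;
        }
        return 100.0 * count / total;
    }

    public static void main(String[] args) {
        int positive = 32;   // illustrative counts, not reported in the paper
        int negative = 14;
        int total = positive + negative;
        System.out.println(String.format(Locale.ROOT, "%.2f", percentage(positive, total))); // 69.57
        System.out.println(String.format(Locale.ROOT, "%.2f", percentage(negative, total))); // 30.43
    }
}
```

Guarding against a zero total avoids a meaningless result when no sentiment words were matched at all.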
Graph Generation. Values obtained from the aggregation function are used to plot a pie chart and a bar chart for
visualizing the results of the experiment.
Map Plotting. This component plots the results on a map using Google Maps, based on the user locations extracted in the
data gathering process.
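Before results can be placed on a map, the classified comments have to be grouped by location. The sketch below shows one possible grouping and is not taken from the system: the `RegionalSentiment` class and `Comment` holder are hypothetical, and it assumes each comment already carries a resolved country name and normalises percentages within each country, which is only one possible choice of denominator.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RegionalSentiment {
    /** Hypothetical classified comment: a country name plus its polarity. */
    static final class Comment {
        final String country;
        final boolean positive;

        Comment(String country, boolean positive) {
            this.country = country;
            this.positive = positive;
        }
    }

    /** Maps each country to {positive %, negative %} computed within that country. */
    static Map<String, double[]> percentagesByCountry(List<Comment> comments) {
        Map<String, int[]> counts = new LinkedHashMap<>();
        for (Comment c : comments) {
            int[] posNeg = counts.computeIfAbsent(c.country, k -> new int[2]);
            posNeg[c.positive ? 0 : 1]++;
        }
        Map<String, double[]> result = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            int pos = e.getValue()[0];
            int neg = e.getValue()[1];
            double total = pos + neg; // at least 1, since an entry exists only after a comment
            result.put(e.getKey(), new double[] {100.0 * pos / total, 100.0 * neg / total});
        }
        return result;
    }

    public static void main(String[] args) {
        List<Comment> comments = Arrays.asList(
                new Comment("United States", true),
                new Comment("United States", true),
                new Comment("United States", false),
                new Comment("United Kingdom", false));
        for (Map.Entry<String, double[]> e : percentagesByCountry(comments).entrySet()) {
            System.out.printf("%s: %.2f%% positive, %.2f%% negative%n",
                    e.getKey(), e.getValue()[0], e.getValue()[1]);
        }
    }
}
```

A `LinkedHashMap` keeps the countries in first-seen order, which makes the output stable for display in a table or chart.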
Fig. 3. Reasoning Engine
4. Implementation Details
To make the system as responsive as possible, the program is developed using Spring MVC integrated with Hibernate,
along with the Tomcat 7 web server. A MySQL database is used for data management and retrieval. The Twitter data model
used in this project is shown in Figure 4.
Fig. 4. Twitter Data Model
4.1. Tools and Environment
The developed web application runs on the following technology stack:
• Java EE IDE - Eclipse Mars Release 4.5.0
• JDK 8
• Tomcat v7.0
• Spring MVC 3.0
• Hibernate 3.0
• MySQL Workbench 6.3
• Platform - All processing was carried out on a 64-bit Windows operating system with an AMD A10 processor and
8 GB RAM.
4.2. User Interface
The designed interface allows the user to view the searched article. The extracted tweets, their comments and information
related to the users are displayed on the user interface. It also provides a facility to view the results in various forms,
such as the overall percentages of obtained results categorized as positive or negative. Furthermore, the different regions
and their associated percentages are also displayed for comparison.
5. Results and Analysis
For illustration purposes, a news article and a tweet with its comments are presented as a case study. For example, we
search for the following article:
BBC News - General election 2019: Labour facing long haul, warns
A few of the obtained tweets are shown in Figure 5.
Fig. 5. Extracted comments on Queried Tweet
The approach was applied to a set of tweets and comments on the article. At the time of query execution, a total of
590 people had shared this article on their Twitter profiles. The results showed that 71.88% of people in the United States
of America perceived it as positive news, whereas 28.57% considered it negative. Figures 6 and
7 show the results in the form of a map, indicating the percentage of positive and negative perception by the location
from which people responded to the news article. Table 2 shows the obtained percentages of perception for the extracted
regions. An overall ratio was also obtained as part of the analysis and is displayed in the form of a pie chart and a bar
chart, according to which 69.57% of people around the globe considered the news article positive and 30.43% considered
it negative, as shown in Figures 8 and 9.
Table 2. Percentages of perceptions in different regions.
Country        | Percentage of Positive Perception | Percentage of Negative Perception
Canada         | 9.4                               | 57.14
United States  | 71.88                             | 28.57
United Kingdom | 18.75                             | 14.28
Fig. 8. Pie Chart of Perceptions
Fig. 9. Bar Chart of Perceptions
Fig. 6. Locations’ Percentage of Positive Perception
Fig. 7. Locations’ Percentage of Negative Perception
6. Research Limitation and Challenges
The sentiment analysis performed was limited to comments written in English only. Another limitation and
challenge is the continual emergence of new vocabulary, which creates the need to keep our dictionary
updated. Furthermore, the limit imposed by the Twitter API on the number of tweets that can be extracted in an hour results
in some tweets being missed, which in turn affects the analysis. Since the analysis is performed on real-time data,
determining the accuracy of the results is a challenging task.
7. Conclusion
People around the globe consume more news these days than ever before. Mining the polarity of important
events happening around us is a useful way to get an idea of the impact of a certain event or news item on different
regions of the world; providing a spatial view of sentiment, or of how sentiment spreads, is therefore essential. This
paper explored this direction of sentiment analysis, out of the many directions possible: opinion mining was
performed on the comments posted by various users on BBC news articles shared on Twitter accounts, since opinions
differ according to the context of the news and the location of the audience. A BBC news article which was shared on Twitter
was used to demonstrate the experiment and its results. Further work will be based on extracting data from various
social media sources such as Facebook and integrating it with the results obtained from the proposed approach. This
will result in improved spatial sentiment analysis, as data will be collected and fused from diverse sources.
References
[1] L. Zhang, R. Ghosh, M. Dekhil, M. Hsu and B. Liu, "Combining Lexicon-based and Learning-based Methods,"
June 21, 2011.
[2] R. Tejwani, "Sentiment Analysis: A Survey," 2014.
[3] C. Yang, K. Hsin-Yih and H.-H. Chen, "Emotion Classification Using Web Blog Corpora," in
IEEE/WIC/ACM International Conference on Web Intelligence, National Taiwan University, Taipei, Taiwan, 2007.
[4] J. Read, "Using emoticons to reduce dependency in machine learning techniques for sentiment
classification," in The Association for Computer Linguistics, 2005.
[5] B. Pang, L. Lee and S. Vaithyanathan, "Thumbs up? Sentiment Classification using Machine
Learning," in EMNLP, USA, 2002.
[6] V. Hatzivassiloglou and K. McKeown, "Predicting the Semantic Orientation of Adjectives," May 2002.
[7] P. D. Turney, "Thumbs Up or Thumbs Down? Semantic Orientation Applied to," 40th Annual
Meeting of the Association for Computational Linguistics (ACL), pp. 417-424, July 2002.
[8] E. Cambria and A. Hussain, Sentic Computing, Techniques, Tools, and Applications, Springer, 2012.
[9] E. Cambria, A. Hussain and C. Eckl, "Taking Refuge in Your Personal Sentic Corner," in
Cambria2011TakingRI, 2011.
[10] A. Das, S. Bandyopadhyay and B. Gambäck, "The 5W Structure for Sentiment Summarization-Visualization-Tracking,"
in Proceedings of the 13th international conference on Computational Linguistics and Intelligent Text Processing,
March 2012.
[11] J. Martin and P. White, "The language of evaluation: Appraisal in English," London, 2005.
[12] M. A. Karim, Technical Challenges and Design Issues in Bangla Language Processing, IGI Global, 2013.
[13] S. Taj and A. F. Meghji, "Sentiment Analysis of News Articles: A lexicon Based Approach," in
2nd International Conference on Computing Mathematics & Engineering Technologies-2019 (iCoMET), February 2019.
[14] Y. Wang, W. Yu, S. Liu and S. Young, "The Relationship Between Social Media Data and Crime
Rates in the United States," Social Media + Society, vol. 5, no. 1, March 2019.
[15] M. Owda, K. Crockett and A. F. Vilas, "Experiment for Analysing the Impact of Financial
Events on Twitter," August 2017.
[16] A. F. Vilas, R. P. D. Redondo, K. Crockett, M. Owda and L. Evans, "Twitter permeability to
financial events: an experiment towards a model for sensing irregularities," Multimedia Tools
and Applications, vol. 78, no. 7, pp. 9217–9245, April 2019.