The Use of POS Sequence for Analyzing Sentence Pattern in Twitter Sentiment Analysis

Fajri Koto, and Mirna Adriani
Faculty of Computer Science
University of Indonesia
Depok, Jawa Barat, Indonesia 16423
Email: fajri91@ui.ac.id, mirna@cs.ui.ac.id
Abstract—As one of the largest social media platforms providing
public data every day, Twitter has attracted the attention of
researchers seeking to mine public opinion, a task known as
Sentiment Analysis. Consequently, many techniques and studies
related to Sentiment Analysis over Twitter have been proposed in
recent years. However, no study has discussed the sentence
patterns of positive/negative or subjective/objective sentences.
In this paper we propose POS sequences as a feature for
investigating the patterns, or word combinations, of tweets in
two domains of Sentiment Analysis: subjectivity and polarity.
Specifically, we use Information Gain to extract POS sequences
in three forms: sequences of 2 tags, 3 tags, and 5 tags. The
results reveal tendencies in sentence pattern that distinguish
positive, negative, subjective, and objective tweets. Our
approach also shows that POS sequence features can improve
Sentiment Analysis accuracy.
Keywords—social media, Twitter, sentiment analysis, POS sequence,
subjectivity, polarity
I. INTRODUCTION
Nowadays the growing popularity of social media brings in an
overwhelming amount of data published by people. According to
Statistic Brain¹, Twitter² is one of the world's largest social
media platforms, with more than 600 million active users in 2014.
Twitter is a microblogging environment that allows users to post
free text of up to 140 characters, called a tweet. Since 2014,
an average of 58 million tweets have been posted per day. Recent
estimates indicate that one out of five tweets discusses
products or brands [1]. The abundance of user-generated content
published through such social media renders automated
information monitoring tools crucial for today's business [2].
In information retrieval, the automatic prediction of the
sentiment of textual data is known as Sentiment Analysis. This
field draws on a broad area of natural language processing,
computational linguistics, and text mining. Typically, the goal
is to determine the polarity of natural language texts [2].
These sentiments can be categorized either into two categories,
positive and negative, or onto an n-point scale, e.g., very
good, good, satisfactory, bad, very bad. In this respect, a
sentiment analysis task can be interpreted as a classification
task where each category represents a sentiment [3]. In this
work, we follow Bravo-Marquez et al., who roughly divide these
tasks into two categories: 1) subjectivity classification and
2) polarity classification [4].
¹ http://www.statisticbrain.com/
² http://www.twitter.com
Subjectivity classification involves discriminating between
subjective and objective utterances. Formally, it can be defined
as follows: given a collection of tweets T and the set of binary
subjectivity classes S = {subjective, objective}, the goal is to
approximate the unknown target function F : T → S, which is
called the binary subjectivity classifier. An objective
utterance commonly contains facts, while a subjective one
reflects a private point of view, emotion, or belief [5]. In
polarity classification, three sentiment classes are used:
positive, negative, and neutral. According to Aisopos et al.,
the classification can be defined as two problems: 1) classes
P1 = {positive, negative} for binary polarity classification and
2) classes P2 = {positive, negative, neutral} for general
polarity classification. The goal is again to approximate the
unknown target functions F1 : T → P1 and F2 : T → P2 [6].
Many approaches have been proposed to approximate the unknown
target function of Sentiment Analysis. However, among the
existing approaches, no study discusses the sentence patterns of
positive/negative tweets or of subjective/objective tweets. We
hypothesize that there may be tendencies in pattern or word
combination that distinguish tweets in both the polarity and
subjectivity domains. For instance, people tend to write "I like
this phone" rather than "I don't hate this phone". It is
uncommon for people to use a negation word to express a positive
impression. In the subjectivity domain, the presence of adverbs
and adjectives also commonly differs between subjective and
objective sentences. People tend to use adjectives or adverbs
rather than nouns when expressing an opinion, emotion, or
belief.
Therefore, in this study we propose the Part of Speech (POS)
sequence to investigate this issue. A part of speech is a
linguistic category of words that is generally defined by
syntactic behavior. We combine consecutive POS tags and call
the result a POS sequence, in order to investigate the patterns
of word combination that commonly appear in tweets containing
sentiment. Specifically, we experiment with sequences of 2, 3,
and 5 tags. For each type of sequence, we calculate the
Information Gain and then use the top-k sequences. In addition,
we perform supervised classification in which we combine POS
sequences with a previous method in Twitter sentiment analysis.
The rest of this paper is structured as follows. Section II
reviews related work. Section III describes our approach to
using POS sequences as a feature for Sentiment Analysis. The
experimental set-up is given in Section IV, while Section V
describes the experiment results, consisting of sentence pattern
analysis and the performance of POS sequences in sentiment
classification. Finally, conclusions are drawn in Section VI.
2015 29th International Conference on Advanced Information Networking and Applications Workshops
978-1-4799-1775-4/15 $31.00 © 2015 IEEE
DOI 10.1109/WAINA.2015.58
II. RELATED WORK
The first investigation of tweet sentiment was done by Go et
al., who used emoticons to annotate tweets with sentiment labels
[7]. A subsequent study by Agarwal et al. used tweets manually
annotated with sentiment and applied a unigram model for
classification [8]. In other studies, Wang et al. used hashtags³
to perform graph-based classification [9], while Cui et al.
analyzed the emoticons in tweets with a graph propagation
algorithm for emoticon weighting [10]. Several lexical resources
for Sentiment Analysis have also been released, such as
OpinionFinder [11], SentiWordNet [12], ANEW [14], AFINN [13],
and the NRC emotion lexicon [15].
As discussed in the previous section, the purpose of our work
is to investigate the sentence patterns that distinguish tweets
in the sentiment domain. Although our effort is the first in
Twitter sentiment analysis, some previous work on POS sequences
has been published. Bendersky et al. used POS sequences as one
of the features for detecting memorable quotes in structured
documents such as books. Specifically, they used Information
Gain to select the top-i sequences and then performed supervised
quotable phrase detection using other lexical and syntactic
features [16]. Mukherjee et al. also proposed POS sequences as
a feature for gender classification of blog authors. The main
idea of their algorithm is to perform a level-wise search for
such patterns, which are POS sequences satisfying minsup and
minadherence thresholds [17].
We realize that generating POS sequences from unstructured
documents such as tweets is not a trivial matter. Applying a
POS tagger to Twitter needs care, since tagger accuracy drops
when it is trained on well-formed language and tested on Twitter
data. A study by Ritter et al. shows that POS tagger accuracy
drops by about 0.1-0.2 when it is applied to tweets. A key
reason for this drop is that Twitter contains far more Out of
Vocabulary (OOV) words than grammatical text [18]. However, this
is a research area in itself. As a preliminary study, we start
our investigation by using an existing POS tagger to generate
POS sequences. We argue that this reduction in accuracy is
still acceptable for conducting the investigation.
III. POS SEQUENCE AS SENTIMENT ANALYSIS FEATURE
A POS sequence is defined as a series of consecutive tags of a
fixed length. For example, the sentence "I went to school
yesterday", with POS tags PRP-VBD-TO-NN-ADV, produces three
sequences of 3 tags: PRP-VBD-TO, VBD-TO-NN, and TO-NN-ADV. To
retrieve the sequences that best represent the dataset, a
weighting mechanism is required. In this work, following
Bendersky et al., Information Gain (Eq. 1 and Eq. 2) is used to
select the top-i sequences.
$$IG(X, Y) = H(X) - H(X \mid Y) \tag{1}$$
$$H(X) = -\sum_{x} p(x) \log_2 p(x) \tag{2}$$
³ A word that starts with the # symbol and is used to mark keywords or topics
in a tweet.
Fig. 1. Sequence of n-tags Extraction
Technically, the POS sequence feature is written as #IGSeq[i],
which expresses the number of occurrences of the i-th POS
sequence in the tweet. In Eq. 1, X indicates the presence or
absence of the POS sequence in the current tweet, and Y
indicates the class of the tweet: positive or negative for
polarity classification, and subjective or objective for
subjectivity classification. Intuitively, each feature
#IGSeq[i] counts how often a tweet contains a POS sequence that
is indicative of a certain sentiment.
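As a minimal sketch of Eq. 1 and Eq. 2 (not the authors' implementation), the Information Gain of a single POS sequence can be computed from per-tweet presence flags X and class labels Y. The function names and the toy data below are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H (in bits) of a list of discrete outcomes (Eq. 2)."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())

def information_gain(x, y):
    """IG(X, Y) = H(X) - H(X|Y) (Eq. 1).

    x[i] is True when the POS sequence is present in tweet i;
    y[i] is the class label of tweet i.
    """
    total = len(x)
    conditional = 0.0
    for label in set(y):
        subset = [xi for xi, yi in zip(x, y) if yi == label]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(x) - conditional

# Toy data (hypothetical): eight tweets, four per class.
labels = ["subjective"] * 4 + ["objective"] * 4
perfect = [True, True, True, True, False, False, False, False]
useless = [True, True, False, False, True, True, False, False]

ig_perfect = information_gain(perfect, labels)  # sequence separates the classes
ig_useless = information_gain(useless, labels)  # sequence is equally common in both
```

A sequence that occurs only in one class receives the maximum score of 1 bit in this toy setting, while one that is equally frequent in both classes scores 0, which is why ranking by Information Gain favors class-indicative sequences.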
The POS sequences are generated by the procedure described in
Fig. 1. The preprocessing stage includes: 1) removing URLs and
Twitter @account mentions, 2) removing non-alphabetic symbols,
3) removing the RT phrase, and 4) converting the tweet to
lowercase. After that, the POS tagger is applied. The dataset
is then divided by sentiment class in order to calculate the
Information Gain of every sequence. Finally, we select the
top-i sequences and use them as Sentiment Analysis features.
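The preprocessing and extraction steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the regular expressions and the function names `preprocess` and `tag_sequences` are hypothetical, and the POS tags are passed in directly rather than produced by a tagger, so the example runs without any tagger model.

```python
import re

def preprocess(tweet):
    """Apply the four cleaning steps in the order they are listed."""
    text = re.sub(r"https?://\S+", " ", tweet)   # 1) remove URLs
    text = re.sub(r"@\w+", " ", text)            # 1) remove @account mentions
    text = re.sub(r"[^A-Za-z\s]", " ", text)     # 2) remove non-alphabetic symbols
    text = re.sub(r"\bRT\b", " ", text)          # 3) remove the RT marker
    return " ".join(text.lower().split())        # 4) lowercase, collapse spaces

def tag_sequences(tags, n):
    """Return every run of n consecutive POS tags, joined with '-'."""
    return ["-".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]

cleaned = preprocess("RT @user: I LOVE this phone!!! http://t.co/abc")

# Tags of the example sentence "I went to school yesterday" from the text.
trigrams = tag_sequences(["PRP", "VBD", "TO", "NN", "ADV"], 3)
```

Here `trigrams` reproduces the three 3-tag sequences of the example sentence. A tweet with fewer than n tags simply yields no sequence.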
IV. EXPERIMENTAL SET-UP
A. Datasets
The experiments were conducted in two domains, polarity and
subjectivity, and used five different datasets: 1) Stanford
Twitter Sentiment (STS)⁴, which was used by Go et al. [7],
2) Sanders⁵, 3) Health Care Reform (HCR)⁶ and 4) Obama-McCain
Debate (OMD)⁶, which were both used by Speriosu et al. [19],
and 5) the International Workshop SemEval 2013 (SemEval)⁷ data.
Each tweet in these datasets carries a positive, negative, or
neutral tag. A summary of all datasets is given in Table I.
TABLE I. DATASET STATISTICS
            STS   Sanders   HCR    OMD    SemEval
#negative   177   635       784    1582   896
#positive   182   555       368    844    2341
#neutral    139   2293      280    813    2256
#total      498   3483      1432   3239   5493
⁴ http://cs.standford.edu/people/alecmgo/trainingtestdata.zip
⁵ http://www.sanalytics.com/lab/twitter-sentiment
⁶ https://bitbucket.org/speriosu/updown/src/5de483437466/data?at=default
⁷ http://www.cs.york.ac.uk/semeval-2013/
TABLE II. BALANCED DATASET
Subjectivity   STS   Sanders   HCR   OMD    SemEval
#subjective    139   1190      280   800    2256
#objective     139   1190      280   800    2256
#total         278   2380      560   1600   4512
Polarity       STS   Sanders   HCR   OMD    SemEval
#negative      177   555       368   800    896
#positive      177   555       368   800    896
#total         354   1110      736   1600   1792
In this work, we perform binary classification and tackle class
imbalance by sampling the tweets in Table I. Polarity
classification uses only the positive and negative labels, while
our subjectivity classification treats neutral tweets as
objective and positive/negative tweets as subjective. For
example, the class imbalance in STS is tackled by sampling 177
of the 182 positive tweets for polarity classification, whereas
for subjectivity classification we use only 139 subjective and
139 objective tweets. A summary of these data is given in
Table II.
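The balancing step can be illustrated with a short sketch that down-samples every class to the size of the smallest one. The paper does not state how the sample was drawn, so uniform random sampling, the fixed seed, and the name `balance` are assumptions.

```python
import random

def balance(tweets_by_class, seed=0):
    """Down-sample every class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    size = min(len(items) for items in tweets_by_class.values())
    return {label: rng.sample(items, size)
            for label, items in tweets_by_class.items()}

# Placeholder tweet ids with the STS polarity counts from Table I.
sts = {"positive": list(range(182)), "negative": list(range(177))}
balanced = balance(sts)
sizes = {label: len(items) for label, items in balanced.items()}
```

With the STS counts, both classes end up with 177 tweets, matching Table II. For subjectivity, the positive and negative tweets would first be merged into one subjective class before balancing against the neutral (objective) tweets.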
B. Experiment Stage
Fig. 2. Stage of experiments
Experiments followed the stages described in Figure 2. First,
we used NLTK in Python [20] as the POS tagger to convert the
data into POS tag form. The tweet datasets were then split into
sequences of n tags following the steps in Figure 1. Our
experiment used three forms of POS sequence, n = 2, n = 3, and
n = 5, and for each we selected only the top 100 sequences of
n tags by Information Gain. After that, SVM weighting was
applied to the extracted data in order to filter them down to
the top 10. This procedure was applied to each of the five
datasets described in the previous subsection. Consequently,
five variations of top-10 sequences were generated. We then
selected the sequences that appear in two or more top-10 lists
in order to analyze sentence patterns in Twitter Sentiment
Analysis.
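The final selection step, keeping the sequences that occur in two or more top-10 lists, could be sketched like this. The function name and the shortened example lists are hypothetical and are not the lists obtained in the paper.

```python
from collections import Counter

def shared_sequences(top10_by_dataset, min_datasets=2):
    """Map each sequence to the number of datasets whose top list contains it,
    keeping only sequences found in at least `min_datasets` lists."""
    counts = Counter()
    for top10 in top10_by_dataset.values():
        counts.update(set(top10))  # count each dataset at most once
    return {seq: n for seq, n in counts.items() if n >= min_datasets}

# Hypothetical, shortened top lists for three datasets.
example = {
    "STS": ["RB-VBG", "RB-JJ", "NN-NNS"],
    "Sanders": ["RB-VBG", "PRP-VBP"],
    "HCR": ["RB-JJ", "RB-VBG"],
}
result = shared_sequences(example)
```

The returned count per sequence corresponds to what the result tables report in their #Dataset column.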
In addition, we conducted sentiment classification using the
top-100 POS sequences selected by Information Gain. We
performed 5-fold cross validation, with 80% of the tweets as the
training set and the remainder as the test set. For each fold,
the POS sequences are extracted from the training set only.
Here, we used LibSVM [21] in the open source tool RapidMiner⁸
[22] to classify the tweet datasets. As a baseline, we used
AFINN [13], a lexicon containing 2477 English words that was
constructed based on the Affective Norms for English Words
lexicon (ANEW) proposed by Bradley and Lang [14]. Bravo-Marquez
et al. also used this lexicon as a baseline in their latest
study [4]. The choice is motivated by its good performance in
sentiment classification over Twitter.
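The cross-validation protocol can be sketched in plain Python. The experiments themselves were run in RapidMiner, so this is only an illustration of the split. The interleaved, unshuffled assignment of tweets to folds and the name `k_fold_indices` are assumptions.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]  # every k-th item, offset by the fold number
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# With k = 5, each fold trains on 80% of the items and tests on 20%.
folds = list(k_fold_indices(10, k=5))
```

Inside each fold, the Information Gain ranking would be computed from the `train` indices only and then applied unchanged to the `test` indices, so that no information from the test tweets influences which sequences are selected.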
V. EXPERIMENT RESULT
A. Sentence Pattern of Tweet Containing Sentiment
After applying SVM weighting to all five datasets in the three
forms of sequence, following Figure 2, 3×5 variations of top-10
POS sequences are produced. For the analysis, we selected the
sequences that arise in two or more datasets. We provide the
results in Table III for the subjectivity domain and Table IV
for the polarity domain. These tables give the results for
sequences of 2 and 3 tags. The #Dataset column gives the number
of datasets in which a given sequence appears in the top 10.
Unlike sequences of 2 and 3 tags, the results for 5-tag
sequences are difficult to interpret. The reason is that a
higher value of n yields more possible combinations of POS tags.
As a result, the extracted POS sequence features are very sparse
and resemble a zero vector. Therefore, we do not show those
results in the tables in this paper.
We also provide a frequency column to show the tendency of word
combinations between the two classes. The column gives the
number of times a sequence appears across all datasets, by
sentiment class. For example, in Table III the sequence RB-VBG
has a subjective frequency of 258. This number shows that
RB-VBG appears 258 times across all of our subjective tweets.
TABLE III. WORD COMBINATION IN SUBJECTIVITY DOMAIN
Sequence of 2 tags   Description            #Dataset   Frequency (Subjective)   Frequency (Objective)
RB-VBG               Adverb-Verb            4          258                      124
RB-VB                Adverb-Verb            3          709                      381
RB-JJ                Adverb-Adj             3          477                      343
NN-PRP               Noun-Pronoun           2          1506                     981
VBZ-VBG              Verb-Verb              2          221                      177
PRP-VBP              Pronoun-Verb           2          1702                     1189
NN-NNS               Noun-Noun              2          436                      541
RB-NN                Adverb-Noun            2          296                      235
VBP-JJ               Verb-Adj               2          326                      269
Sequence of 3 tags   Description            #Dataset   Frequency (Subjective)   Frequency (Objective)
NN-NN-PRP            Noun-Noun-Pronoun      3          593                      374
RB-JJ-NN             Adverb-Adj-Noun        3          247                      181
PRP-VBP-JJ           Pronoun-Verb-Adj       3          209                      105
NN-NN-IN             Noun-Noun-Conj         2          1356                     1608
NN-PRP$-NN           Noun-Possessive-Noun   2          143                      96
IN-DT-JJ             Conj-Det-Adj           2          321                      337
MD-RB-VB             Modal-Adverb-Verb      2          333                      131
NN-NN-NN             Noun-Noun-Noun         2          2933                     3368
NN-VBZ-RB            Noun-Verb-Adverb       2          143                      98
1) Subjectivity: The results in Table III reveal differences in
word combination between subjective and objective tweets. Among
the sequences of 2 tags, RB-VBG, RB-VB, RB-NN, RB-JJ, VBZ-VBG,
and VBP-JJ tend to be POS sequences of subjective tweets.
Examples such as "am happy", "am itchy", "too big", "seriously
hate", "so amazing", and "so boring" agree with the purpose of a
subjective utterance, which reflects a private point of view,
emotion, or opinion. Moreover, these sequences contain
adjectives and adverbs, which is also in line with our
hypothesis. For objective tweets, by contrast, the table shows
only a tendency toward the POS sequence NN-NNS. This indicates
that objective tweets tend to have more nouns than subjective
tweets. Examples of this sequence are "city politics" and "house
correspondents".
⁸ http://www.rapidminer.com
The sequences selected among the 3-tag results are in line with
those of 2 tags. Subjective tweets again contain POS sequences
with adverbs and adjectives (RB-JJ-NN, PRP-VBP-JJ, MD-RB-VB, and
NN-VBZ-RB), while objective tweets again tend to have POS
sequences containing nouns (NN-NN-IN and NN-NN-NN).
TABLE IV. WORD COMBINATION IN POLARITY DOMAIN
Sequence of 2 tags   Description           #Dataset   Frequency (Positive)   Frequency (Negative)
RB-VB                Adverb-Verb           4          792                    1346
NN-DT                Noun-Det              2          476                    557
PRP-VBD              Pronoun-Verb          2          195                    292
NN-PRP               Noun-Pronoun          2          834                    853
PRP-RB               Pronoun-Adverb        2          156                    237
VBZ-DT               Verb-Det              2          124                    190
NN-WRB               Noun-WH(adverb)       2          52                     131
Sequence of 3 tags   Description           #Dataset   Frequency (Positive)   Frequency (Negative)
VBZ-DT-NN            Verb-Det-Noun         3          54                     116
PRP-VBP-RB           Pronoun-Verb-Adverb   3          123                    236
NN-NN-IN             Noun-Noun-Conj        2          691                    681
VBD-NN-NN            Verb-Noun-Noun        2          109                    95
MD-RB-VB             Modal-Adverb-Verb     2          133                    225
NN-DT-NN             Noun-Det-Noun         2          296                    350
2) Polarity: As shown in Table IV, the results for sequences of
2 tags reveal that negative tweets tend to have the POS
sequences RB-VB, PRP-RB, and PRP-VBD. Examples of RB-VB, such
as "not love", "firmly believe", "ugly love", and "not regret",
indicate that tweets with negative sentiment tend to have an
affirmation word before the verb. Affirmation is also shown by
the POS sequence PRP-RB, which uses an adverb to affirm the
negativity. Examples of this sequence are "I highly", "I
seriously", "I never", "me crazy", and "I just". On the other
hand, the result for PRP-VBD reveals that people prefer the past
tense to the present tense when expressing negativity.
Affirmation words are also found in the results for sequences of
3 tags. Negative tweets tend to have the POS sequences
PRP-VBP-RB and MD-RB-VB. In our datasets, the words most
commonly used to express negativity are "not" and "never".
Examples are "I am not", "can not wait", and "will never buy".
B. POS sequence to boost Sentiment Analysis Classification
To investigate the performance of POS sequences in sentiment
classification, we compared the AFINN lexicon alone against the
combination of POS sequences and the AFINN lexicon for both
classification tasks (subjectivity and polarity). The AFINN
lexicon is used by extracting two main features from each tweet,
called APO (AFINN Positivity) and ANE (AFINN Negativity). APO
is obtained by summing the scores of positive words (from 1 to
5), while ANE is obtained by summing the scores of negative
words (from -5 to -1). The strength of AFINN for sentiment
classification over Twitter is that its vocabulary includes
slang and obscene words as well as acronyms and web jargon.
AFINN and POS sequence features are combined simply by
concatenating them. Thus, 102 attributes were used for the
combination.
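A minimal sketch of this feature construction is given below. The lexicon entries are illustrative values chosen for the example, not entries taken from AFINN, and the function names and the placeholder sequence names are hypothetical.

```python
# Toy lexicon in the AFINN score range (-5 to 5); values are illustrative only.
TOY_LEXICON = {"love": 3, "amazing": 4, "hate": -3, "boring": -2}

def afinn_features(tokens, lexicon):
    """Return (APO, ANE): the sum of positive and of negative word scores."""
    scores = [lexicon.get(tok, 0) for tok in tokens]
    apo = sum(s for s in scores if s > 0)
    ane = sum(s for s in scores if s < 0)
    return apo, ane

def combined_vector(tokens, tweet_sequences, top_sequences, lexicon):
    """Concatenate APO, ANE, and one occurrence count per selected POS sequence."""
    apo, ane = afinn_features(tokens, lexicon)
    counts = [tweet_sequences.count(seq) for seq in top_sequences]
    return [apo, ane] + counts

tokens = "i love this amazing but boring phone".split()
top_100 = [f"SEQ{i}" for i in range(100)]  # placeholder names for the top-100 sequences
vector = combined_vector(tokens, ["SEQ1", "SEQ1", "SEQ5"], top_100, TOY_LEXICON)
```

With 100 selected sequences the vector has 102 entries: the two AFINN features followed by one #IGSeq[i] count per sequence.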
Fig. 3. The Accuracy of Polarity Classification using AFINN and incorporation of AFINN and Top-100 sequence for each dataset
Fig. 4. The Accuracy of Subjectivity Classification using AFINN and incorporation of AFINN and Top-100 sequence for each dataset
Given the results in the previous section, we discarded the
sequences of 5 tags and used only sequences of 3 tags, as in
the work of Bendersky et al. on classifying memorable quotes
[16]. As features for sentiment classification, we used the
top-100 sequences obtained from each training set. The results
of the experiment are shown in Figure 3 and Figure 4 and reveal
that combining AFINN with POS sequences boosts the accuracy of
the AFINN lexicon alone.
In polarity classification, all five datasets show an
improvement: 0.23%, 3.06%, 3.25%, 1.67%, and 2.37% for STS,
Sanders, HCR, SemEval, and OMD respectively. Subjectivity
classification also shows positive results. The accuracies
increase by 2.52%, 0.42%, 0.18%, 1.24%, and 11.44% for STS,
Sanders, HCR, SemEval, and OMD respectively. These results
allow us to affirm that POS sequences can be used for sentiment
classification over Twitter in both subjectivity and polarity
classification.
VI. CONCLUSION
In this study, we discuss the use of POS sequences in Sentiment
Analysis over Twitter in two domains: subjectivity and polarity.
To find the POS sequence form best suited to uncovering sentence
patterns, we conducted the study with three variations of POS
sequence (n = 2, n = 3, and n = 5). In addition, we performed
sentiment classification by combining the AFINN lexicon with POS
sequences.
In our first experiment, the results reveal that subjective
tweets tend to have word combinations consisting of adverbs and
adjectives. This is in line with the purpose of a subjective
utterance, which expresses emotion or a private point of view.
In contrast, objective tweets tend to have word combinations of
nouns, which basically aim to express a fact or neutrality
rather than emotion. In the polarity domain, negative tweets
tend to have word combinations with affirmation words, which
often appear as negation words. In the second experiment, the
results show that POS sequence features boost accuracy when
combined with AFINN. This affirms that POS sequences can be
used for Sentiment Analysis over Twitter.
REFERENCES
[1] B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury, “Twitter power:
Tweets as electronic word of mouth”. In Journal of the American society
for information science and technology 60.11, 2009, pp. 2169-2188.
[2] A. Hogenboom, D. Bal, F. Frasincar, M. Bal, F. de Jong, and U.
Kaymak, “Exploiting emoticons in sentiment analysis”. In Proc. of the
28th Annual ACM Symposium on Applied Computing, 2013, pp. 703-710.
[3] R. Prabowo, and M. Thelwall, “Sentiment Analysis: A Combined
Approach”. In Journal of Informetrics 3.2, 2009, pp. 143-157.
[4] F. Bravo-Marquez, M. Mendoza, and B. Poblete, “Combining strengths,
emotions and polarities for boosting Twitter sentiment analysis”. In
Proc. of the Second International Workshop on Issues of Sentiment
Discovery and Opinion Mining, 2013.
[5] S. Raaijmakers, and W. Kraaij, “A Shallow Approach to Subjectivity
Classification”. In ICWSM, 2008.
[6] F. Aisopos, G. Papadakis, K. Tserpes, and T. Varvarigou, “Content vs.
context for sentiment analysis: a comparative analysis over microblogs”.
In Proc. of the 23rd ACM conference on Hypertext and social media,
2012, pp. 187-196.
[7] A. Go, R. Bhayani, and L. Huang, “Twitter sentiment classification
using distant supervision.” In CS224N Project Report, Stanford, 2009,
pp. 1-12.
[8] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau,
“Sentiment analysis of twitter data”. In Proc. of the Workshop on
Languages in Social Media, 2011, pp. 30-38.
[9] X. Wang, F. Wei, X. Liu, M. Zhou, and M. Zhang, “Topic sentiment
analysis in twitter: a graph-based hashtag sentiment classification
approach”. In Proc. of the 20th ACM international conference on
Information and knowledge management, 2011, pp. 1031-1040.
[10] A. Cui, M. Zhang, Y. Liu, and S. Ma, “Emotion tokens: Bridging the gap
among multilingual twitter sentiment analysis”. In Proc. Information
retrieval technology, Springer, Berlin Heidelberg, 2011, pp. 238-249.
[11] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual polarity
in phrase-level sentiment analysis”. In Proc. of the conference on
human language technology and empirical methods in natural language
processing, 2005, pp. 347-354.
[12] A. Esuli, and F. Sebastiani, “Sentiwordnet: A publicly available lexical
resource for opinion mining”. In Proc. of LREC Vol. 6, 2006, pp. 417-
422.
[13] F. A. Nielsen, “A new ANEW: Evaluation of a word list
for sentiment analysis in microblogs”. 2011, Available at
http://arxiv.org/abs/1103.2903
[14] M. M. Bradley, and P. J. Lang, “Affective norms for English words
(ANEW): Instruction manual and affective ratings”. Technical Report
C-1, The Center for Research in Psychophysiology, University of
Florida, 1999.
[15] S. M. Mohammad, P. D. Turney, “Crowdsourcing a word-emotion
association lexicon”. In Computational Intelligence, 2013, pp. 436-465.
[16] M. Bendersky, and D. A. Smith, “A dictionary of wisdom and wit:
Learning to extract quotable phrases.” In Proc. NAACL-HLT 2012, 2012,
pp. 69.
[17] A. Mukherjee, and B. Liu, “Improving Gender Classification of Blog
Authors”. In Proc. of the 2010 Conference on Empirical Methods in
Natural Language Processing, pp. 207-217.
[18] A. Ritter, S. Clark, Mausam, and O. Etzioni, “Named Entity Recognition
in Tweets: An Experimental Study”. In Proc. of the 2011 Conference
on Empirical Methods in Natural Language Processing, pp. 1524-1534.
[19] M. Speriosu, N. Sudan, S. Upadhyay, J. Baldridge, “Twitter polarity
classification with label propagation over lexical links and the follower
graph”. In Proc. of the EMNLP First workshop on Unsupervised
Learning in NLP. Edinburgh, Scotland, 2011.
[20] S. Bird, “NLTK: the natural language toolkit”. In Proc. of the COL-
ING/ACL on Interactive presentation sessions, 2006, pp. 69-72.
[21] C. C. Chang, and C. J. Lin, “LIBSVM: a library for support vector
machines”. In ACM Transactions on Intelligent Systems and Technology
(TIST), 2011, 2(3), 27.
[22] F. Akthar, and C. Akthar, “RapidMiner 5 Operator Reference”, 2012.
551
... While, Koto et al. (2015) [4] study the tendencies of sentence pattern which distinguish between positive, negative, subjective and objective Tweets. Their approach also shows that Part of Speech (POS) sequence can improve Sentiment Analysis accuracy. ...
... Eliminating pos-tag sequences that diminish the Popularity As mentioned before, Koto et al. (2015) [4] finds that the POS tag sequence for a text determines whether it creates a positive sentiment or a negative sentiment [17]. The results from Koto 246 tend to have POS sequence of PRPRB (personal pronoun-adverb), PRP-VBD (personal pronoun-verb, past tense), RB-VB (adverb-verb, base form), PRP-VBP-RB (personal pronoun-verb, non-3rd person singular present-adverb), and MD-RB-VB (modaladverb-verb, base form). ...
... Eliminating pos-tag sequences that diminish the Popularity As mentioned before, Koto et al. (2015) [4] finds that the POS tag sequence for a text determines whether it creates a positive sentiment or a negative sentiment [17]. The results from Koto 246 tend to have POS sequence of PRPRB (personal pronoun-adverb), PRP-VBD (personal pronoun-verb, past tense), RB-VB (adverb-verb, base form), PRP-VBP-RB (personal pronoun-verb, non-3rd person singular present-adverb), and MD-RB-VB (modaladverb-verb, base form). ...
Conference Paper
Full-text available
Social media marketing is a form of Internet marketing that utilizes social networking websites as a marketing tool. Marketers post advertisements on social networking websites to promote their products and services. Most often, advertisements on social media are paid advertisements. However, not all advertisements reach the target audience. The gain obtained through advertisements is far less than the expenditure. This study proposes a model to increase the popularity of advertisements posted on social media. The cosmetics industry was taken as the case study and advertisements posted by cosmetics companies on Twitter were studied. This study identifies the most prominent features that impact Twitter advertisements to go viral. In order to reach a larger number of viewers, improvements to these features are suggested.
... This field refers to a broad area of natural language processing, computational linguistics, and text mining. These sentiments can be categorized into two categories: positive and negative; or on an n-point scale, for example, very good, good, satisfactory, bad, or very bad [42]. In this case, the task of sentiment analysis can be interpreted as a classification task in which each category represents a sentiment [43]. ...
... The specified time range for gathering these tweets is between March 2020 to November 2021. The data that has been obtained then analyzed using lexicon analysis with analytical instruments using part of speech (POS) Tagging [42]. POS Tagging is the process of marking words in a text/corpus according to a certain part of the speech, based on their definition and context. ...
Article
Full-text available
The objective of this research is to uncover Indonesians' perceptions of online learning during the COVID-19 pandemic by determining the polarity of language texts (positive, neutral, or negative) compiled from Twitter. The data required to reveal the Indonesian people's opinion on online learning during the COVID-19 pandemic is a tweet on Twitter with the hashtag #Pembelajaran daring (Online learning); #Pembelajaran jarak jauh (distance learning); #Belajar dari rumah (learning from home); #Belajar di rumah (learning in the home) (learning at home). The time frame for collecting these tweets is March 2020 to November 2021. The data was then analyzed using lexicon analysis and analytical tools that used Part of Speech Tagging. According to the results, 77.58% of the tweets are positive, 17.97% are negative, and the remainder are neutral. People prefer to refer to learning support, teachers, schools, education, students, and distance learning. Distance learning is the most positively received category among online learning. However, learning support is the most widely discussed topic among the general public. The overwhelming positive sentiment across all categories suggests that the majority of Indonesians have high hopes for online learning during the pandemic.
... However, the authors in [358] showcase the importance and potential of NLP within this domain, where they investigated the pattern or word combination of tweets in subjectivity and polarity by considering their POS sequence. Results reveal that subjective tweets tend to have word combinations consisting of adverb and adjective, whereas objective tweets tend to have a word combination of nouns. ...
... Studies subjectivity and sentiment polarity [508,429,338,396,213,398,214,403,16,426,210,181,513,334,358,376,98,381,106,374,413,328,199,326,325,196,234,321,121] sentiment polarity and emotion [543,428,248,383,425,62,342,360,71,555,365,491,125,87,57,85,51,558,505,149,148,447] sentiment polarity and mood [194] sentiment polarity and irony [404] sentiment polarity and sarcasm [511] sentiment polarity and affect [507] emotion and anger [446,445] irony and sarcasm [352] subjectivity, sentiment polarity and emotion [73] subjectivity, sentiment polarity, emotion and irony [310] In this domain, objective statements are usually classified as being neutral (in terms of polarity), whereas subjective statements are non-neutral. In the latter cases, sentiment analysis is performed to determine the polarity classification (more information on this below). ...
Preprint
Social media popularity and importance is on the increase, due to people using it for various types of social interaction across multiple channels. This social interaction by online users includes submission of feedback, opinions and recommendations about various individuals, entities, topics, and events. This systematic review focuses on the evolving research area of Social Opinion Mining, tasked with the identification of multiple opinion dimensions, such as subjectivity, sentiment polarity, emotion, affect, sarcasm and irony, from user-generated content represented across multiple social media platforms and in various media formats, like text, image, video and audio. Therefore, through Social Opinion Mining, natural language can be understood in terms of the different opinion dimensions, as expressed by humans. This contributes towards the evolution of Artificial Intelligence, which in turn helps the advancement of several real-world use cases, such as customer service and decision making. A thorough systematic review was carried out on Social Opinion Mining research which totals 485 studies and spans a period of twelve years between 2007 and 2018. The in-depth analysis focuses on the social media platforms, techniques, social datasets, language, modality, tools and technologies, natural language processing tasks and other aspects derived from the published studies. Such multi-source information fusion plays a fundamental role in mining of people's social opinions from social media platforms. These can be utilised in many application areas, ranging from marketing, advertising and sales for product/service management, and in multiple domains and industries, such as politics, technology, finance, healthcare, sports and government. Future research directions are presented, whereas further research and development has the potential of leaving a wider academic and societal impact.
... Linguistic pattern mining has been applied to a variety of research applications, such as the question answering system [36], sentiment analysis [37], and customer-aspect extraction [38]. In particular, mining of linguistic pattern has been useful in processing patent documents to perform a property-function network analysis [39]. ...
Article
In the age of digital economy, customers actively share their experiences and issues about products via online product reviews. Mining potential product improvement ideas from customer needs could provide valuable insights into new functionality expected by the markets. Numerous studies have attempted to identify customer needs using these reviews, but they paid less attention to the customer’s specific context in which the product was used. This study provides a novel approach for identifying customer needs based on both context information and product functions of target products. The context information and product functions are derived from online product reviews through linguistic pattern mining, whereby the customer needs are determined by the combination of extracted context information and product functions using a semantic embedding method and a clustering approach. A case study on the Amazon-Echo series was conducted to verify the applicability of the proposed approach. Consequently, we identified 1430 different customer needs, which could be used as an input for improving product design. This study is one of the first attempts to integrate context information for identifying customer needs. The proposed approach can be useful in the idea creation process for future product planning and is expected to add new empirical perspective for the e-commerce industry.
... Unlike applying the classifier to the corpus data which was filtered in the labeling process ensuring that they all refer to one location, when using the classifier with the application data, there is a need to previously identify which tweets refer to the user being in one location and which refer to other topics. For that purpose, with the tagged and untagged corpus data, we selected the PoS sequences with Bag-Of-PoS and identified the i-most frequent sequences that represent each data set, as it is done in [39]. Depending on the presence or not of the sequences of each class in the text ([it is/it is not] in a location), the tweet is classified by the MNB approach. ...
Article
The Internet generates large volumes of data at a high rate, in particular, posts on social networks. Although social network data have numerous semantic adulterations and are not intended to be a source of geo-spatial information, in the text of posts we find pieces of important information about how people relate to their environment, which can be used to identify interesting aspects of how human beings interact with portions of land based on their activities. This research proposes a methodology for the identification of land uses using Natural Language Processing (NLP) from the contents of the popular social network Twitter. It will be approached by identifying keywords with linguistic patterns from the text, and the geographical coordinates associated with the publication. Context-specific innovations are introduced to deal with data across South America and, in particular, in the city of Arequipa, Peru. The objective is to identify the five main land uses: residential, commercial, institutional-governmental, industrial-offices and unbuilt land. Within the framework of urban planning and sustainable urban management, the methodology contributes to the optimization of the identification techniques applied for the updating of land use cadastres, since the results achieved an accuracy of about 90%, which motivates its application in the real context. In addition, it would allow the identification of land use categories at a more detailed level, in situations such as a complex/mixed distribution building based on the amount of data collected. Finally, the methodology makes land use information available in a more up-to-date fashion and, above all, avoids the high economic cost of the non-automatic production of land use maps for cities, mostly in developing countries.
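The PoS-sequence selection mentioned in the citing passage above (collecting tag sequences per data set and keeping the i most frequent ones) can be sketched roughly as follows. This is a minimal illustration, not the cited authors' implementation: the function names `pos_ngrams` and `top_sequences`, the Penn-Treebank-style tags, and the toy data are all assumptions made for the example, and the tweets are assumed to be PoS-tagged already.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """Return all contiguous n-tag sequences from a list of PoS tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def top_sequences(tagged_tweets, n, i):
    """Count n-tag sequences over all tweets and return the i most frequent."""
    counts = Counter()
    for tags in tagged_tweets:
        counts.update(pos_ngrams(tags, n))
    return [seq for seq, _ in counts.most_common(i)]


# Hypothetical, hand-tagged toy data for one class ("user is in a location").
in_location = [
    ["PRP", "VBP", "IN", "NNP"],
    ["PRP", "VBP", "IN", "DT", "NN"],
]

print(top_sequences(in_location, n=2, i=2))
```

The presence or absence of each selected sequence in a new tweet could then serve as a binary feature for a classifier such as multinomial Naive Bayes, which is the role the passage assigns to MNB.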
... Unlike applying the classifier to the corpus data which was filtered in the labeling process ensuring that they all refer to one location, when using the classifier with the application data, there is a need to previously identify which tweets refer to the user being in one location and which refer to other topics. For that purpose, with the tagged and untagged corpus data, we selected the PoS sequences with Bag-Of-PoS and identified the i-most frequent sequences that represent each data set, as it is done in [Koto and Adriani, 2015]. Depending on the presence or not of the sequences of each class in the text ([it is / it is not] in a location), the tweet is classified by the MNB approach. ...
Preprint
The Internet generates large volumes of data at a high rate, in particular, posts on social networks. Although social network data has numerous semantic adulterations, and is not intended to be a source of geo-spatial information, in the text of posts we find pieces of important information about how people relate to their environment, which can be used to identify interesting aspects of how human beings interact with portions of land based on their activities. This research proposes a methodology for the identification of land uses using Natural Language Processing (NLP) from the contents of the popular social network Twitter. It will be approached by identifying keywords with linguistic patterns from the text, and the geographical coordinates associated with the publication. Context-specific innovations are introduced to deal with data across South America and, in particular, in the city of Arequipa, Peru. The objective is to identify the five main land uses: residential, commercial, institutional-governmental, industrial-offices and unbuilt land. Within the framework of urban planning and sustainable urban management, the methodology contributes to the optimization of the identification techniques applied for the updating of land use cadastres, since the results achieved an accuracy of about 90%, which motivates its application in the real context. In addition, it would allow the identification of land use categories at a more detailed level, in situations such as a complex/mixed distribution building based on the amount of data collected. Finally, the methodology makes land use information available in a more up-to-date fashion and, above all, avoids the high economic cost of the non-automatic production of land use maps for cities, mostly in developing countries.
Article
Social media popularity and importance is on the increase due to people using it for various types of social interaction across multiple channels. This systematic review focuses on the evolving research area of Social Opinion Mining, tasked with the identification of multiple opinion dimensions, such as subjectivity, sentiment polarity, emotion, affect, sarcasm and irony, from user-generated content represented across multiple social media platforms and in various media formats, like text, image, video and audio. Through Social Opinion Mining, natural language can be understood in terms of the different opinion dimensions, as expressed by humans. This contributes towards the evolution of Artificial Intelligence which in turn helps the advancement of several real-world use cases, such as customer service and decision making. A thorough systematic review was carried out on Social Opinion Mining research which totals 485 published studies and spans a period of twelve years between 2007 and 2018. The in-depth analysis focuses on the social media platforms, techniques, social datasets, language, modality, tools and technologies, and other aspects derived. Social Opinion Mining can be utilised in many application areas, ranging from marketing, advertising and sales for product/service management, and in multiple domains and industries, such as politics, technology, finance, healthcare, sports and government. The latest developments in Social Opinion Mining beyond 2018 are also presented together with future research directions, with the aim of leaving a wider academic and societal impact in several real-world applications.
Preprint
Although some linguists (Rusmali et al., 1985; Crouch, 2009) have fairly attempted to define the morphology and syntax of Minangkabau, information processing in this language is still absent due to the scarcity of the annotated resource. In this work, we release two Minangkabau corpora: sentiment analysis and machine translation that are harvested and constructed from Twitter and Wikipedia. We conduct the first computational linguistics in Minangkabau language employing classic machine learning and sequence-to-sequence models such as LSTM and Transformer. Our first experiments show that the classification performance over Minangkabau text significantly drops when tested with the model trained in Indonesian. Whereas, in the machine translation experiment, a simple word-to-word translation using a bilingual dictionary outperforms LSTM and Transformer model in terms of BLEU score.
Conference Paper
As people increasingly use emoticons in text in order to express, stress, or disambiguate their sentiment, it is crucial for automated sentiment analysis tools to correctly account for such graphical cues for sentiment. We analyze how emoticons typically convey sentiment and demonstrate how we can exploit this by using a novel, manually created emoticon sentiment lexicon in order to improve a state-of-the-art lexicon-based sentiment classification method. We evaluate our approach on 2,080 Dutch tweets and forum messages, which all contain emoticons and have been manually annotated for sentiment. On this corpus, paragraph-level accounting for sentiment implied by emoticons significantly improves sentiment classification accuracy. This indicates that whenever emoticons are used, their associated sentiment dominates the sentiment conveyed by textual cues and forms a good proxy for intended sentiment.
Conference Paper
There is high demand for automated tools that assign polarity to microblog content such as tweets (Twitter posts), but this is challenging due to the terseness and informality of tweets in addition to the wide variety and rapid evolution of language in Twitter. It is thus impractical to use standard supervised machine learning techniques dependent on annotated training examples. We do without such annotations by using label propagation to incorporate labels from a maximum entropy classifier trained on noisy labels and knowledge about word types encoded in a lexicon, in combination with the Twitter follower graph. Results on polarity classification for several datasets show that our label propagation approach rivals a model supervised with in-domain annotated tweets, and it outperforms the noisily supervised classifier it exploits as well as a lexicon-based polarity ratio classifier.
Article
Even though considerable attention has been given to the polarity of words (positive and negative) and the creation of large polarity lexicons, research in emotion analysis has had to rely on limited and small emotion lexicons. In this paper we show how the combined strength and wisdom of the crowds can be used to generate a large, high-quality, word-emotion and word-polarity association lexicon quickly and inexpensively. We enumerate the challenges in emotion annotation in a crowdsourcing scenario and propose solutions to address them. Most notably, in addition to questions about emotions associated with terms, we show how the inclusion of a word choice question can discourage malicious data entry, help identify instances where the annotator may not be familiar with the target term (allowing us to reject such annotations), and help obtain annotations at sense level (rather than at word level). We conducted experiments on how to formulate the emotion-annotation questions, and show that asking if a term is associated with an emotion leads to markedly higher inter-annotator agreement than that obtained by asking if a term evokes an emotion.
Article
We examine sentiment analysis on Twitter data. The contributions of this paper are: (1) We introduce POS-specific prior polarity features. (2) We explore the use of a tree kernel to obviate the need for tedious feature engineering. The new features (in conjunction with previously proposed features) and the tree kernel perform approximately at the same level, both outperforming the state-of-the-art baseline. ... kernel based model. For the feature based model we use some of the features proposed in past literature and propose new features. For the tree kernel based model we design a new tree representation for tweets. We use a unigram model, previously shown to work well for sentiment analysis for Twitter data, as our baseline. Our experiments show that a unigram model is indeed a hard baseline achieving over 20% over the chance baseline for both classification tasks. Our feature based model that uses only 100 features achieves similar accuracy as the unigram model that uses over 10,000 features. Our tree kernel based model outperforms both these models by a significant margin. We also experiment with a combination of models: combining unigrams with our features and combining our features with the tree kernel. Both these combinations outperform the unigram baseline by over 4% for both classification tasks. In this paper, we present extensive feature analysis of the 100 features we propose. Our experiments show that features that have to do with Twitter-specific features (emoticons, hashtags etc.) add value to the classifier but only marginally. Features that combine prior polarity of words with their parts-of-speech tags are most important for both the classification tasks. Thus, we see that standard natural language processing tools are useful even in a genre which is quite different from the genre on which they were trained (newswire).
Furthermore, we also show that the tree kernel model performs roughly as well as the best feature based models, even though it does not require detailed feature en-
Conference Paper
Twitter sentiment analysis or the task of automatically retrieving opinions from tweets has received an increasing interest from the web mining community. This is due to its importance in a wide range of fields such as business and politics. People express sentiments about specific topics or entities with different strengths and intensities, where these sentiments are strongly related to their personal feelings and emotions. A number of methods and lexical resources have been proposed to analyze sentiment from natural language texts, addressing different opinion dimensions. In this article, we propose an approach for boosting Twitter sentiment classification using different sentiment dimensions as meta-level features. We combine aspects such as opinion strength, emotion and polarity indicators, generated by existing sentiment analysis methods and resources. Our research shows that the combination of sentiment dimensions provides significant improvement in Twitter sentiment classification tasks such as polarity and subjectivity.
Conference Paper
Microblog content poses serious challenges to the applicability of traditional sentiment analysis and classification methods, due to its inherent characteristics. To tackle them, we introduce a method that relies on two orthogonal, but complementary sources of evidence: content-based features captured by n-gram graphs and context-based ones captured by polarity ratio. Both are language-neutral and noise-tolerant, guaranteeing high effectiveness and robustness in the settings we are considering. To ensure our approach can be integrated into practical applications with large volumes of data, we also aim at enhancing its time efficiency: we propose alternative sets of features with low extraction cost, explore dimensionality reduction and discretization techniques and experiment with multiple classification algorithms. We then evaluate our methods over a large, real-world data set extracted from Twitter, with the outcomes indicating significant improvements over the traditional techniques.
Article
LIBSVM is a library for support vector machines (SVM). Its goal is to help users to easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
Article
We introduce a novel approach for automatically classifying the sentiment of Twitter messages. These messages are classified as either positive or negative with respect to a query term. This is useful for consumers who want to research the sentiment of products before purchase, or companies that want to monitor the public sentiment of their brands. There is no previous research on classifying sentiment of messages on microblogging services like Twitter. We present the results of machine learning algorithms for classifying the sentiment of Twitter messages using distant supervision. Our training data consists of Twitter messages with emoticons, which are used as noisy labels. This type of training data is abundantly available and can be obtained through automated means. We show that machine learning algorithms (Naive Bayes, Maximum Entropy, and SVM) have accuracy above 80% when trained with emoticon data. This paper also describes the preprocessing steps needed in order to achieve high accuracy. The main contribution of this paper is the idea of using tweets with emoticons for distant supervised learning.
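The idea of using emoticons as noisy labels, described in the abstract above, can be illustrated with a small sketch. It is only an approximation under stated assumptions: the emoticon sets `POSITIVE` and `NEGATIVE`, the function name `noisy_label`, the whitespace tokenization, and the choice to strip emoticons from the text and to skip tweets with mixed or no emoticons are all hypothetical and are not taken from the abstract.

```python
# Assumed emoticon sets; the actual lists used in the cited work are not given here.
POSITIVE = {":)", ":-)", ":D", "=)"}
NEGATIVE = {":(", ":-("}


def noisy_label(tweet):
    """Return (text_without_emoticons, label), or None if no clear label exists."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    # Skip tweets with no emoticon or with conflicting emoticons.
    if has_pos == has_neg:
        return None
    # Remove the emoticons so a classifier cannot simply memorize them.
    text = " ".join(t for t in tokens if t not in POSITIVE | NEGATIVE)
    return text, ("positive" if has_pos else "negative")


print(noisy_label("great phone :)"))
print(noisy_label("missed my bus :("))
```

Tweets labeled this way could then be fed to any standard classifier; the labels are noisy because an emoticon does not always match the sentiment of the text.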