Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold
Hassan Saif1, Miriam Fernandez1, Yulan He2 and Harith Alani1
1Knowledge Media Institute, The Open University, United Kingdom
{h.saif, m.fernandez, h.alani}@open.ac.uk
2School of Engineering and Applied Science, Aston University, UK
y.he@cantab.net
Abstract.
Sentiment analysis over Twitter offers organisations and individuals a fast and effective way to monitor the public’s feelings towards them and their competitors. To assess the performance of sentiment analysis methods over Twitter, a small set of evaluation datasets have been released in the last few years. In this paper we present an overview of eight publicly available and manually annotated evaluation datasets for Twitter sentiment analysis. Based on this review, we show that a common limitation of most of these datasets, when assessing sentiment analysis at target (entity) level, is the lack of distinctive sentiment annotations between the tweets and the entities contained in them. For example, the tweet “I love iPhone, but I hate iPad” can be annotated with a mixed sentiment label, but the entity iPhone within this tweet should be annotated with a positive sentiment label. Aiming to overcome this limitation, and to complement current evaluation datasets, we present STS-Gold, a new evaluation dataset where tweets and targets (entities) are annotated individually and therefore may present different sentiment labels. This paper also provides a comparative study of the various datasets along several dimensions including: total number of tweets, vocabulary size and sparsity. We also investigate the pair-wise correlation among these dimensions as well as their correlations to the sentiment classification performance on different datasets.
Keywords: Sentiment Analysis, Twitter, Datasets
1 Introduction
With the emergence of social media, the performance of sentiment analysis tools
has become increasingly critical. In the current commercial competition, designers,
developers, vendors and sales representatives of new information products need to
carefully study whether and how their products offer competitive advantages.
Twitter, with over 500 million registered users and over 400 million messages per day,3 has become a gold mine for organisations to monitor their reputation and brands by extracting and analysing the sentiment of the tweets posted by the public about them, their markets, and competitors.

3 http://www.alexa.com/topsites
Developing accurate sentiment analysis methods requires the creation of
evaluation datasets that can be used to assess their performances. In the last few
years several evaluation datasets for Twitter sentiment analysis have been made
publicly available. The general evaluation dataset consists of a set of tweets, where
each tweet is annotated with a sentiment label [1, 8, 16, 22]. The most common sentiment labels are positive, negative and neutral, but some evaluation datasets consider additional sentiment labels such as mixed, other or irrelevant [1, 23]. Instead of a final sentiment label associated with each tweet, some datasets provide a numeric sentiment strength between -5 and 5, defining a range from negative to positive polarity [24, 25]. In addition to sentiment labels associated with the tweets, some evaluation datasets also provide sentiment labels associated with targets (entities) within the tweets. However, these datasets do not distinguish between the sentiment label of the tweet and the sentiment labels of the entities contained within it [23]. For example, the tweet “iPhone 5 is awesome, but I can’t upgrade :(” presents a negative sentiment. However, the entity “iPhone 5” should receive a positive sentiment.
Aiming to overcome this limitation, we present STS-Gold, an evaluation
dataset for Twitter sentiment analysis that targets sentiment annotation at
both tweet and entity levels. The annotation process allows a dissimilar polarity annotation between the tweet and the entities contained within it. To create this dataset, a subset of tweets was selected from the Stanford Twitter Sentiment Corpus [8] and entities were extracted from this subset of tweets by using a
third-party entity extraction tool. Tweets and entities were manually annotated
by three different human evaluators. The final evaluation dataset contains 2,206
tweets and 58 entities with associated sentiment labels. The purpose of this dataset
is therefore to complement current state of the art datasets by providing entity
sentiment labels, therefore supporting the evaluation of sentiment classification
models at entity as well as tweet level.
Along with the description of the STS-Gold dataset, this paper summarises
eight publicly available and manually annotated evaluation datasets for Twitter
sentiment analysis. Our goal is to provide the reader with an overview of the
existing evaluation datasets and their characteristics. To this aim, we provide
a comparison of these datasets along different dimensions including: the total
number of tweets, the vocabulary size and the degree of data sparsity. We also
investigate the pair-wise correlation among these dimensions as well as their
correlations to the sentiment classification performance on all datasets. Our
study shows that the correlation between the sparsity and the classification
performance is intrinsic, meaning that it might exist within the dataset itself, but not necessarily across the datasets. We also show that the correlations between sparsity, vocabulary size and number of tweets are all strong. However, a large number of tweets in a dataset is not always an indication of a large vocabulary size or a high sparsity degree.
The rest of the paper is structured as follows: Section 2 presents an overview of
the existing evaluation datasets for Twitter sentiment analysis. Section 3 describes
STS-Gold, our proposed evaluation dataset. Section 4 presents a comparison
study across the evaluation datasets. We conclude the paper in Section 5.
2 Twitter Sentiment Analysis Datasets
In this section we present 8 different datasets widely used in the Twitter sentiment analysis literature. We have focused our selection on those datasets that are: (i) publicly available to the research community, (ii) manually annotated, providing a reliable set of judgements over the tweets and, (iii) used to evaluate several sentiment analysis models. Tweets in these datasets have been annotated with different sentiment labels including: Negative, Neutral, Positive, Mixed, Other and Irrelevant. Table 1 displays the distribution of tweets in the
eight selected datasets according to these sentiment labels.
Variations of the evaluation datasets are due to the particularities of the
different sentiment analysis tasks. Sentiment analysis on Twitter spans multiple
tasks, such as polarity detection (positive vs. negative), subjectivity detection
(polar vs. neutral) or sentiment strength detection. These tasks can also be
performed either at tweet level or at target (entity) level. In the following
subsections, we provide an overview of the available evaluation datasets and the
different sentiment tasks for which they are used.
Dataset No. of Tweets #Negative #Neutral #Positive #Mixed #Other #Irrelevant
STS-Test 498 177 139 182 - - -
HCR 2,516 1,381 470 541 - 45 79
OMD 3,238 1,196 - 710 245 1,087 -
SS-Twitter 4,242 1,037 1,953 1,252 - - -
Sanders 5,513 654 2,503 570 - - 1,786
GASP 12,771 5,235 6,268 1,050 - 218 -
WAB 13,340 2,580 3,707 2,915 - 420 3,718
SemEval 13,975 2,186 6,440 5,349 - - -
Table 1. Total number of tweets and the tweet sentiment distribution in all datasets
Stanford Twitter Sentiment Test Set (STS-Test)
The Stanford Twitter sentiment corpus (http://help.sentiment140.com/), introduced by Go et al. [8], consists of two different sets, training and test. The training set contains 1.6 million tweets automatically labelled as positive or negative based on emoticons. For example, a tweet is labelled as positive if it contains :), :-), : ), :D, or =) and is labelled as negative if it contains :(, :-(, or : (.
Although automatic sentiment annotation of tweets using emoticons is fast, its
accuracy is arguable because emoticons might not reflect the actual sentiment
of tweets. In this study, we focus on those datasets that have been manually
annotated. Therefore, although we acknowledge the relevance of the STS training
dataset for building sentiment analysis models, we discard it from the rest of our
study.
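For concreteness, the emoticon-based distant labelling rule described above can be sketched as follows. This is a minimal Python illustration (not the authors' code); the emoticon lists come from the description above and the function name distant_label is our own.

```python
# Minimal sketch (not the original implementation) of the emoticon-based
# distant-supervision rule described above; emoticon lists follow the paper's text.
POSITIVE_EMOTICONS = [":)", ":-)", ": )", ":D", "=)"]
NEGATIVE_EMOTICONS = [":(", ":-(", ": ("]

def distant_label(tweet):
    """Return 'positive' or 'negative' based on emoticons, or None when the
    tweet has no emoticon or mixes positive and negative ones."""
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None

print(distant_label("just got my exam results back :)"))  # -> positive
```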
The test set (STS-Test), on the other hand, is manually annotated and contains 177 negative, 182 positive and 139 neutral tweets. These tweets were collected by searching the Twitter API with specific queries including names of products, companies and people. Although the STS-Test dataset is relatively small, it has been widely used in the literature in different evaluation tasks. For example, Go et al. [8], Saif et al. [19, 20], Speriosu et al. [23], and Bakliwal et al. [2] use it to evaluate their models for polarity classification (positive vs. negative). In addition to polarity classification, Marquez et al. [3] use this dataset for evaluating subjectivity classification (neutral vs. polar).
Health Care Reform (HCR)
The Health Care Reform (HCR) dataset was built by crawling tweets containing the hashtag “#hcr” (health care reform) in March 2010 [23]. A subset of this corpus was manually annotated by the authors with 5 labels (positive, negative, neutral, irrelevant, unsure (other)) and split into training (839 tweets), development (838 tweets) and test (839 tweets) sets. The authors also assigned sentiment labels to 8 different targets extracted from all three sets (Health Care Reform, Obama, Democrats, Republicans, Tea Party, Conservatives, Liberals, and Stupak). However, both the tweet and the targets within it were assigned the same sentiment label, as can be found in the published version of this dataset (https://bitbucket.org/speriosu/updown). In this paper, we consider all three subsets (training, development and test) as one unique dataset for the analysis (see Section 4). The final dataset, as shown in Table 1, consists of 2,516 tweets including 1,381 negative, 470 neutral and 541 positive tweets.
The HCR dataset has been used to evaluate polarity classification [23, 21] but can also be used to evaluate subjectivity classification since it identifies neutral tweets.
Obama-McCain Debate (OMD)
The Obama-McCain Debate (OMD) dataset was constructed from 3,238 tweets crawled during the first U.S. presidential TV debate in September 2008 [22]. Sentiment labels were acquired for these tweets using Amazon Mechanical Turk, where each tweet was rated by at least three annotators as either positive, negative, mixed, or other. The authors in [6] reported an inter-annotator agreement of 0.655, which shows a relatively good agreement between annotators. The dataset is provided at https://bitbucket.org/speriosu/updown along with the annotators’ votes on each tweet. We considered those sentiment labels on which two-thirds of the voters agreed as the final labels of the tweets. This resulted in a
set of 1,196 negative, 710 positive and 245 mixed tweets.
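The two-thirds agreement rule used to derive the final OMD labels can be sketched as below. This is a minimal Python illustration of the rule described above (our own, not the original processing script), where votes holds the per-annotator labels of one tweet.

```python
# Sketch of the two-thirds majority rule described above (illustrative only).
from collections import Counter

def final_label(votes, threshold=2 / 3):
    """Return the label chosen by at least `threshold` of the annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= threshold else None

print(final_label(["negative", "negative", "positive"]))  # -> negative
print(final_label(["negative", "positive", "mixed"]))     # -> None (no 2/3 agreement)
```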
The OMD dataset is a popular dataset, which has been used to evaluate various supervised learning methods [10, 23, 21], as well as unsupervised methods [9], for polarity classification of tweets. Tweets’ sentiments in this dataset were also used to characterize the Obama-McCain debate event in 2008 [6].
Sentiment Strength Twitter Dataset (SS-Tweet)
This dataset consists of 4,242 tweets manually labelled with their positive and negative sentiment strengths, i.e., a negative strength is a number between -1 (not negative) and -5 (extremely negative). Similarly, a positive strength is a number between 1 (not positive) and 5 (extremely positive). The dataset was constructed by [24] to evaluate SentiStrength (http://sentistrength.wlv.ac.uk/), a lexicon-based method for sentiment strength detection.
In this paper we propose re-annotating tweets in this dataset with sentiment labels (negative, positive, neutral) rather than sentiment strengths, which will allow using this dataset for subjectivity classification in addition to sentiment strength detection. To this end, we assign a single sentiment label to each tweet based on the following two rules inspired by the way SentiStrength works:4 (i) a tweet is considered neutral if the absolute value of the tweet’s negative to positive strength ratio equals 1, (ii) a tweet is positive if its positive sentiment strength is 1.5 times higher than the negative one, and negative otherwise. The final dataset, as shown in Table 1, consists of 1,037 negative, 1,953 neutral and
1,252 positive tweets.
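The two re-annotation rules can be expressed as a small mapping from strength scores to a single label. The sketch below is our reading of those rules; the exact comparison used for "1.5 times higher" (strict vs. non-strict) is an assumption, as the paper states the rules only in words.

```python
def strength_to_label(pos_strength, neg_strength):
    """Map SentiStrength-style scores (positive in [1, 5], negative in [-5, -1])
    to a single label following the two rules above."""
    if abs(neg_strength) == pos_strength:          # rule (i): ratio of 1 -> neutral
        return "neutral"
    if pos_strength >= 1.5 * abs(neg_strength):    # rule (ii): clearly stronger positive
        return "positive"
    return "negative"                              # otherwise negative

print(strength_to_label(4, -2))  # -> positive
print(strength_to_label(2, -2))  # -> neutral
print(strength_to_label(2, -4))  # -> negative
```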
The original dataset is publicly available at http://sentistrength.wlv.ac.uk/documentation/ along with 5 other datasets from different social media platforms including MySpace, Digg, BBC forum, Runners World forum, and YouTube.
Sanders Twitter Dataset
The Sanders dataset consists of 5,512 tweets on four different topics (Apple,
Google, Microsoft, Twitter). Each tweet was manually labelled by one annotator
as either positive, negative, neutral, or irrelevant with respect to the topic. The
annotation process resulted in 654 negative, 2,503 neutral, 570 positive and 1,786
irrelevant tweets.
The dataset has been used in [3, 12, 5] for polarity and subjectivity classification of tweets.
The Sanders dataset is available at http://www.sananalytics.com/lab
The Dialogue Earth Twitter Corpus
The Dialogue Earth Twitter corpus consists of three subsets of tweets. The
first two sets (WA, WB) contain 4,490 and 8,850 tweets respectively about
the weather, while the third set (GASP) contains 12,770 tweets about gas
prices. These datasets were constructed as a part of the Dialogue Earth Project5 (www.dialogueearth.org) and were hand-labelled by several annotators with five labels: positive, negative, neutral, not related and can’t tell (other). In this work we merge the two sets about the weather into one dataset (WAB) for our analysis study in Section 4. This results in 13,340 tweets with 2,580 negative, 3,707 neutral, and 2,915 positive tweets. The GASP dataset, on the other hand, consists of 5,235 negative, 6,268 neutral and 1,050 positive tweets.
The WAB and the GASP datasets have been used to evaluate several machine
learning classifiers (e.g., Naive Bayes, SVM, KNN) for polarity classification of
tweets [1].
4 http://sentistrength.wlv.ac.uk/documentation/SentiStrengthJavaManual.doc
5 Dialogue Earth is a former program of the Institute on the Environment at the University of Minnesota
SemEval-2013 Dataset (SemEval)
This dataset was constructed for the Twitter sentiment analysis task (Task 2) [16] in the Semantic Evaluation of Systems challenge (SemEval-2013).6 The original SemEval dataset consists of 20K tweets split into training, development and test sets. All the tweets were manually annotated by 5 Amazon Mechanical Turk workers with negative, positive and neutral labels. The turkers were also asked to annotate expressions within the tweets as subjective or objective. Using a list of the dataset’s tweet ids provided by [16], we managed to retrieve 13,975 tweets with 2,186 negative, 6,440 neutral and 5,349 positive tweets.
Participants in the SemEval-2013 Task 2 used this dataset to evaluate their systems for expression-level subjectivity detection [15, 4], as well as tweet-level subjectivity detection [14, 18].
Summary:
Based on the above reviews we can identify two main shortcomings
of these datasets when using them to assess the performance of Twitter sentiment
analysis models. The first shortcoming is the lack of specifications provided by
some datasets (e.g., STS-Test, HCR, Sanders) about the annotation methodology
used to assign sentiment labels to the tweets. For example, [8] do not report the number of annotators. Similarly, [23] do not report the annotation agreement among
annotators. The second shortcoming is that most of these datasets are focused
on assessing the performance of sentiment analysis models working at tweet level
but not at entity level (i.e., they provide human annotations for tweets but not
for entities). In the few cases where the annotation process also targets entities
as in the HCR dataset, these entities are assigned the same sentiment label as the tweet they belong to. Entity sentiment analysis is, however, a highly
relevant task, since it is closely related to the problem of mining the reputation
of individuals and brands in Twitter.
3 STS-Gold Dataset
In the following subsections we describe our proposed dataset, STS-Gold. The
goal of this dataset is to complement existing Twitter sentiment analysis evalua-
tion datasets by providing a new dataset where tweets and entities are annotated
independently, allowing for different sentiment labels between the tweet and
the entities contained within it. The purpose is to support the performance
assessment for entity-based sentiment analysis models, which is currently hardly
addressed in the datasets that have been released to date (see Section 2).
3.1 Data Acquisition
To construct this dataset, we first extracted all named entities from a collection
of 180K tweets randomly selected from the original Stanford Twitter corpus (see
Section 2). To this end, we used AlchemyAPI,7 an online service that allows for the extraction of entities from text along with their associated semantic concept class (e.g., Person, Company, City). After that, we identified the most frequent semantic concepts and selected, under each of them, the 2 most frequent and 2 mid-frequent entities. For example, for the semantic concept Person we selected the two most frequent entities (Taylor Swift and Obama) as well as two mid-frequent entities (Oprah and Lebron). This resulted in 28 different entities along with their 7 associated concepts, as shown in Table 2.

6 http://www.cs.york.ac.uk/semeval-2013/task2/
7 www.alchemyapi.com
Concept Top 2 Entities Mid 2 Entities
Person Taylor Swift, Obama Oprah, Lebron
Company Facebook, Youtube Starbucks, McDonalds
City London, Vegas Sydney, Seattle
Country England, US Brazil, Scotland
Organisation Lakers, Cavs Nasa, UN
Technology iPhone, iPod Xbox, PSP
HealthCondition Headache, Flu Cancer, Fever
Table 2. 28 Entities, with their semantic concepts, used to build STS-Gold.
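The per-concept selection of top-2 and mid-2 entities can be sketched as below. This is an illustrative Python snippet with made-up mention counts; the paper does not define "mid-frequent" precisely, so picking from the middle of the frequency ranking is our assumption.

```python
from collections import Counter

# Hypothetical input: extracted entity mentions grouped by their semantic concept.
mentions_by_concept = {
    "Person": ["Obama", "Obama", "Taylor Swift", "Obama", "Taylor Swift",
               "Oprah", "Lebron", "Oprah", "Kanye", "Bieber"],
}

def select_entities(mentions, n_top=2, n_mid=2):
    """Return the n_top most frequent entities and n_mid entities taken from the
    middle of the frequency ranking (one possible reading of 'mid-frequent')."""
    ranked = [entity for entity, _ in Counter(mentions).most_common()]
    top = ranked[:n_top]
    mid_start = max(n_top, (len(ranked) - n_mid) // 2)
    mid = ranked[mid_start:mid_start + n_mid]
    return top, mid

for concept, mentions in mentions_by_concept.items():
    print(concept, select_entities(mentions))
```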
The next step was to construct and prepare a collection of tweets for sentiment
annotation, ensuring that each tweet in the collection contains one or more of the
28 entities listed in Table 2. To this aim, we randomly selected 100 tweets from
the remaining part of the STS corpus for each of the 28 entities, i.e., a total of
2,800 tweets. We further added another 200 tweets without specific reference to
any entities to add up to a total of 3,000 tweets. Afterwards, we applied AlchemyAPI
on the selected 3,000 tweets. Apart from the initial 28 entities the extraction tool
returned 119 additional entities, providing a total of 147 entities for the 3,000
selected tweets.
3.2 Data Annotation
We asked three graduate students to manually label each of the 3,000 tweets with
one of the five classes (Negative, Positive, Neutral, Mixed and Other). The “Mixed” label was assigned to tweets containing mixed sentiment and “Other” to those that were difficult to decide on a proper label. The students
were also asked to annotate each entity contained in a tweet with the same five
sentiment classes. The students were provided with a booklet explaining both
the tweet-level and the entity-level annotation tasks. The booklet also contains a
list of key instructions as shown in this paper’s appendix. It is worth noting that
the annotation was done using Tweenator,8 an online tool that we previously built to annotate tweet messages [20].
We measured the inter-annotator agreement using Krippendorff’s alpha metric [11], obtaining an agreement of αt = 0.765 for the tweet-level annotation task. For the entity-level annotation task, if we measured the sentiment of an entity for each individual tweet, we only obtained αe = 0.416, which is relatively low for the annotated data to be used. However, if we measured the aggregated sentiment for each entity, we got a very high inter-annotator agreement of αe = 0.964.
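As an illustration of how such agreement scores can be computed, the sketch below uses the third-party krippendorff Python package (an assumption; the paper does not say which implementation was used). Labels are mapped to integers, each row of reliability_data holds one annotator's judgements, and the annotation data shown are purely hypothetical.

```python
import numpy as np
import krippendorff  # third-party package: pip install krippendorff (assumed available)

LABELS = {"negative": 0, "positive": 1, "neutral": 2, "mixed": 3, "other": 4}

# Hypothetical judgements: 3 annotators (rows) x 5 tweets (columns).
annotations = [
    ["negative", "positive", "neutral", "negative", "mixed"],
    ["negative", "positive", "neutral", "positive", "mixed"],
    ["negative", "positive", "neutral", "negative", "other"],
]
reliability_data = np.array([[LABELS[a] for a in row] for row in annotations], dtype=float)

# Sentiment labels are categories, so the nominal-level alpha is appropriate.
alpha_t = krippendorff.alpha(reliability_data=reliability_data,
                             level_of_measurement="nominal")
print(f"tweet-level alpha = {alpha_t:.3f}")
```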
To construct the final STS-Gold dataset we selected those tweets and entities
for which our three annotators agreed on the sentiment labels, discarding any
8 http://tweenator.com
possible noisy data from the constructed dataset. As shown in Table 3 the STS-
Gold dataset contains 13 negative, 27 positive and 18 neutral entities as well as
1,402 negative, 632 positive and 77 neutral tweets. The STS-Gold dataset contains
independent sentiment labels for tweets and entities, supporting the evaluation
of tweet-based as well as entity-based Twitter sentiment analysis models.
Class Negative Positive Neutral Mixed Other
No. of Entities 13 27 18 - -
No. of Tweets 1402 632 77 90 4
Table 3. Number of tweets and entities under each class
4 Comparative study of Twitter Sentiment Analysis
Datasets
In this section, we present a comparison of the described datasets according
to three different dimensions: the vocabulary size, the total number of tweets,
and the data sparsity. We also study the pair-wise intrinsic correlation between
these dimensions as well as their correlation with the sentiment classification
performance (correlations are computed using the Pearson correlation coefficient).
To this end, we perform a binary sentiment classification (positive vs. negative)
on all the datasets using a Maximum Entropy classifier (MaxEnt). Note that
no stemming or filtering was applied to the data since our aim by providing
this comparison is not to build better classifiers. Instead, we aim at showing
the particularities of each dataset and how these particularities may affect the
performance of sentiment classifiers.
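As a concrete illustration of the correlation computation, the snippet below obtains a Pearson coefficient with SciPy between two dataset-level dimensions. The tweet counts are taken from Table 1, but the vocabulary sizes are placeholders (the actual values appear only in Figure 1), so the resulting ρ is purely illustrative.

```python
from scipy.stats import pearsonr

# Number of tweets per dataset (from Table 1); vocabulary sizes below are
# placeholder values for illustration only, not the figures reported in the paper.
num_tweets = [498, 2516, 3238, 4242, 5513, 12771, 13340, 13975]
vocab_size = [2300, 7800, 7200, 11500, 13000, 23000, 26000, 30000]  # hypothetical

rho, p_value = pearsonr(num_tweets, vocab_size)
print(f"rho = {rho:.2f} (p = {p_value:.4f})")
```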
Vocabulary Size
The vocabulary size of a dataset is commonly determined by the number of unique word unigrams that the dataset contains. To extract the number of unigrams, we use the TweetNLP tokenizer [7], which is specifically built to work on tweet data.9 Note that we considered all tokens found in the tweets including words, numbers, URLs, emoticons, and special characters (e.g., question marks, intensifiers, hashtags, etc.).
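The paper computes the vocabulary with the (Java-based) TweetNLP tokenizer; the sketch below uses a much simpler regular-expression tokenizer as a stand-in, so the token boundaries, and hence the counts, will differ from those used in the paper.

```python
import re

# Crude stand-in for the TweetNLP tokenizer: keeps URLs, hashtags, user mentions,
# a few common emoticons, words and single punctuation marks as tokens.
TOKEN_RE = re.compile(r"https?://\S+|[@#]\w+|[:;=][-']?[()DPp]|\w+|[^\w\s]")

def vocabulary_size(tweets):
    """Number of unique unigrams (tokens) over a collection of tweets."""
    vocab = set()
    for tweet in tweets:
        vocab.update(TOKEN_RE.findall(tweet.lower()))
    return len(vocab)

print(vocabulary_size(["I love my iPhone :)", "#iphone is great!"]))  # -> 9
```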
Figure 1 depicts the correlation between the vocabulary size and the total number of tweets in the datasets. Although the correlation between the two quantities seems to be strongly positive (ρ = 0.95), increasing the number of tweets does not always lead to an increase in the vocabulary size. For example, the OMD dataset has a higher number of tweets than the HCR dataset, yet the former has a smaller vocabulary size than the latter.

Fig. 1. Total number of tweets and the vocabulary size of each dataset.

9 The TweetNLP tokenizer can be downloaded from http://www.ark.cs.cmu.edu/TweetNLP/

Data Sparsity
Dataset sparsity is an important factor that affects the overall performance of typical machine learning classifiers [17]. According to Saif et al. [20], tweet data are sparser than other types of data (e.g., movie review data) due to a large number of infrequent words in tweets.
In this section, we aim to compare the presented datasets with respect to
their sparsity. To calculate the sparsity degree of a given dataset we use the
following formula from [13]:
S_d = 1 - \frac{\sum_{i=1}^{n} N_i}{n \times |V|}    (1)

where N_i is the number of distinct words in tweet i, n is the number of tweets in the dataset and |V| is the vocabulary size.
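A minimal sketch of this computation, using a plain whitespace tokenizer for brevity rather than the TweetNLP tokenizer used in the paper, could look as follows:

```python
def sparsity_degree(tweets, tokenize=lambda t: t.lower().split()):
    """Compute S_d = 1 - (sum_i N_i) / (n * |V|), where N_i is the number of
    distinct tokens in tweet i, n the number of tweets and |V| the vocabulary size."""
    token_sets = [set(tokenize(t)) for t in tweets]
    n = len(token_sets)
    vocab = set().union(*token_sets)
    return 1 - sum(len(s) for s in token_sets) / (n * len(vocab))

# Toy example: three short, partly overlapping tweets.
tweets = ["I love my iPhone", "iphone is great", "headache again no iphone news"]
print(round(sparsity_degree(tweets), 3))  # -> 0.6
```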
According to Figure 2, all datasets have a high sparsity degree, with SemEval being the sparsest. It is also worth noticing that there is a strong correlation between the sparsity degree and the total number of tweets in a dataset (ρ = 0.71) and an even stronger correlation between the sparsity degree and the vocabulary size of the dataset (ρ = 0.77).
Fig. 2. Sparsity degree, vocabulary size and the total number of tweets across the datasets.
Classification Performance
We perform a binary sentiment classification on all the datasets using a MaxEnt classifier from Mallet.10 To this end, we selected for each dataset only the subset of positive and negative tweets.
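The paper runs this step with Mallet's (Java) MaxEnt implementation. As a rough equivalent only, the sketch below uses scikit-learn's logistic regression (a maximum-entropy classifier) over raw unigram counts with cross-validation; the tiny texts/labels lists are placeholders for a dataset's positive/negative subset, and the fold count is reduced so the toy example runs (the paper uses 10-fold cross-validation).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression   # MaxEnt-style classifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder data: in the paper this would be the positive/negative subset of a dataset.
texts = ["I love my iPhone :)", "worst headache ever :(",
         "great game by the Lakers", "the flu is killing me"]
labels = ["positive", "negative", "positive", "negative"]

clf = make_pipeline(CountVectorizer(lowercase=True),    # unigram counts, no stemming/filtering
                    LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, texts, labels, cv=2, scoring="accuracy")  # paper: cv=10
print(f"mean accuracy = {scores.mean():.3f}")
```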
Table 4 reports the classification results (using 10-fold cross validation) in
accuracy and the average F-measure (F-average) on all datasets. The highest
accuracy is achieved on the GASP dataset with 90.897%, while the highest
average F-measure of 84.621% is obtained on the WAB dataset. It is also worth
noticing that the per-class performance is highly affected by the distribution
of positive and negative tweets in the dataset. For example, F-measure for
detecting positive tweets (F-positive) is higher than F-measure for detecting
negative tweets (F-negative) for positive datasets (i.e., datasets that have higher
number of positive tweets than negative ones) such as STS-Test, SS-Twitter,
WAB and SemEval. Similarly, F-negative score is higher than F-positive for
negative datasets (i.e., datasets that have higher number of negative tweets than
positive ones). However, the average accuracy for negative datasets is 84.53%, while it is 80.37% for positive datasets, suggesting that detecting positive tweets is more difficult than detecting negative tweets.
Dataset STS-Test STS-Gold HCR OMD SS-Twitter Sanders GASP WAB SemEval
Accuracy 80.171 85.69 78.679 82.661 73.399 83.84 90.897 84.668 83.257
F-negative 79.405 89.999 85.698 86.617 69.179 84.964 94.617 83.745 68.668
F-positive 81.21 74.909 58.23 75.47 76.621 82.548 70.682 85.498 88.578
F-average 80.307 82.454 71.964 81.044 72.9 83.756 82.65 84.621 78.623
Table 4. Accuracy and the average harmonic mean (F-measure) obtained from identifying positive and negative sentiment.
Makrehchi and Kamel [13] showed that the performance trend of text classifiers
can be estimated using the sparsity degree of the dataset. In particular, they
found that reducing the sparsity of a given dataset enhances the performance of
a SVM classifier. Their observation is based on changing the sparsity degree of
the same dataset by removing/keeping specific terms.
Figure 3 illustrates the correlation across all datasets between accuracy and F-measure on the one hand, and the dataset sparsity on the other hand. As illustrated by this figure, there is almost no correlation (ρacc = −0.06, ρf1 = 0.23) between the classification performance and the sparsity degree across the datasets. In other words, the sparsity-performance correlation is intrinsic, meaning that it might exist within the dataset itself, but not necessarily across the datasets. This is not surprising given that there are other dataset characteristics in addition to data sparsity, such as polarity class distribution, which may also affect the overall performance, as we discussed earlier in this section.
10 http://mallet.cs.umass.edu/
Fig. 3. F-Measure and the Sparsity degree of the datasets
5 Conclusions
In this paper, we provided an overview of eight publicly available and manually
annotated evaluation datasets for Twitter sentiment analysis. Based on our
review, we found that, unlike at the tweet level, very little annotation effort has been spent on providing datasets for evaluating sentiment classifiers at the entity level. This motivated us to build a new evaluation dataset, STS-Gold, which
allows for the evaluation of sentiment classification models at both the entity and
the tweet levels. Our dataset, unlike most of the other datasets, distinguishes
between the sentiment of a tweet and the sentiment of entities mentioned within
it.
We also provided a comparative study across all the reported datasets in
terms of different characteristics including the vocabulary size, the total number
of tweets and the degree of sparsity. Finally, we studied the various pair-wise
correlations among these characteristics as well as the correlation between the
data sparsity degree and the sentiment classification performance across the
datasets. Our study showed that a large number of tweets in a dataset is not always an indication of a large vocabulary size, although the correlation between these two characteristics is relatively strong. We also showed that the sparsity-performance correlation is intrinsic, i.e., it might exist within the dataset itself, but not necessarily across the datasets.
Acknowledgment
The work of the authors was supported by the EU-FP7 projects: ROBUST (grant
no. 257859) and SENSE4US (grant no. 611242).
Appendix: Annotation Booklet
We need to manually annotate 3000 tweets with their sentiment label (Negative,
Positive, Neutral, Mixed) using the online annotation tool “Tweenator.com”.
The task consists of two subtasks:
Task A. Tweet-Level Sentiment Annotation: Given a tweet message, decide whether it has a positive, negative, neutral or mixed sentiment.
Task B. Entity-Level Sentiment Annotation: Given a tweet message and a named entity, decide whether the entity receives a negative, positive or neutral sentiment. The named entities to annotate are highlighted in yellow within the tweets.
Please note that:
– A tweet could have a different sentiment from an entity within it. For example, the tweet “iPhone 5 is very nice phone, but I can’t upgrade :(” has a negative sentiment. However, the entity “iPhone 5” receives a positive sentiment.
– The “Mixed” label refers to a tweet that has mixed sentiment. For example, the tweet “Kobe is the best in the world not Lebron” has a mixed sentiment.
– Some tweets might have emoticons such as :), :-), :(, or :-(. Please give less attention to the emoticons and focus more on the content of the tweets. Emoticons can be very misleading indicators sometimes.
– Try to be objective with your judgement and feel free to take a break whenever you feel tired or bored.
References
1. Asiaee T, A., Tepper, M., Banerjee, A., Sapiro, G.: If you are happy and you know it... tweet. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management. pp. 1602–1606. ACM (2012)
2. Bakliwal, A., Arora, P., Madhappan, S., Kapre, N., Singh, M., Varma, V.: Mining sentiments from tweets. In: Proceedings of WASSA 2012 (2012)
3. Bravo-Marquez, F., Mendoza, M., Poblete, B.: Combining strengths, emotions and polarities for boosting twitter sentiment analysis. In: Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining. ACM (2013)
4. Chalothorn, T., Ellman, J.: Tjp: Using twitter to analyze the polarity of contexts. In: Proceedings of the Seventh International Workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, June 2013 (2013)
5. Deitrick, W., Hu, W.: Mutually enhancing community detection and sentiment analysis on twitter networks. Journal of Data Analysis and Information Processing 1, 19–29 (2013)
6. Diakopoulos, N., Shamma, D.: Characterizing debate performance via aggregated twitter sentiment. In: Proceedings of the 28th International Conference on Human Factors in Computing Systems. ACM (2010)
7. Gimpel, K., Schneider, N., O’Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, J., Smith, N.A.: Part-of-speech tagging for twitter: Annotation, features, and experiments. Tech. rep., DTIC Document (2010)
8. Go, A., Bhayani, R., Huang, L.: Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford (2009)
9. Hu, X., Tang, J., Gao, H., Liu, H.: Unsupervised sentiment analysis with emotional signals. In: Proceedings of the 22nd International Conference on World Wide Web. pp. 607–618. International World Wide Web Conferences Steering Committee (2013)
10. Hu, X., Tang, L., Tang, J., Liu, H.: Exploiting social relations for sentiment analysis in microblogging. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. pp. 537–546. ACM (2013)
11. Krippendorff, K.: Content Analysis: An Introduction to Its Methodology (1980)
12. Liu, K.L., Li, W.J., Guo, M.: Emoticon smoothed language models for twitter sentiment analysis. In: AAAI (2012)
13. Makrehchi, M., Kamel, M.S.: Automatic extraction of domain-specific stopwords from labeled documents. In: Advances in Information Retrieval, pp. 222–233. Springer (2008)
14. Martínez-Cámara, E., Montejo-Ráez, A., Martín-Valdivia, M., Ureña-López, L.: Sinai: Machine learning and emotion of the crowd for sentiment analysis in microblogs (2013)
15. Mohammad, S.M., Kiritchenko, S., Zhu, X.: Nrc-canada: Building the state-of-the-art in sentiment analysis of tweets. In: Proceedings of the Seventh International Workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta, Georgia, USA, June 2013 (2013)
16. Nakov, P., Rosenthal, S., Kozareva, Z., Stoyanov, V., Ritter, A., Wilson, T.: Semeval-2013 task 2: Sentiment analysis in twitter. In: Proceedings of the 7th International Workshop on Semantic Evaluation. Association for Computational Linguistics (2013)
17. Phan, X.H., Nguyen, L.M., Horiguchi, S.: Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In: Proceedings of the 17th International Conference on World Wide Web. pp. 91–100. ACM (2008)
18. Remus, R.: Asvuniofleipzig: Sentiment analysis in twitter using data-driven machine learning techniques (2013)
19. Saif, H., He, Y., Alani, H.: Semantic smoothing for twitter sentiment analysis. In: Proceedings of the 10th International Semantic Web Conference (ISWC) (2011)
20. Saif, H., He, Y., Alani, H.: Alleviating data sparsity for twitter sentiment analysis. In: Proceedings of the 2nd Workshop on Making Sense of Microposts (#MSM2012), in conjunction with WWW 2012. Lyon, France (2012)
21. Saif, H., He, Y., Alani, H.: Semantic sentiment analysis of twitter. In: Proceedings of the 11th International Conference on The Semantic Web. Boston, MA (2012)
22. Shamma, D., Kennedy, L., Churchill, E.: Tweet the debates: understanding community annotation of uncollected sources. In: Proceedings of the First SIGMM Workshop on Social Media. pp. 3–10. ACM (2009)
23. Speriosu, M., Sudan, N., Upadhyay, S., Baldridge, J.: Twitter polarity classification with label propagation over lexical links and the follower graph. In: Proceedings of the EMNLP First Workshop on Unsupervised Learning in NLP. Edinburgh, Scotland (2011)
24. Thelwall, M., Buckley, K., Paltoglou, G.: Sentiment strength detection for the social web. Journal of the American Society for Information Science and Technology 63(1), 163–173 (2012)
25. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A.: Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology 61(12), 2544–2558 (2010)