Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts

Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers,
pages 69–78, Dublin, Ireland, August 23-29 2014.
Cícero Nogueira dos Santos
Brazilian Research Lab
IBM Research
cicerons@br.ibm.com
Maíra Gatti
Brazilian Research Lab
IBM Research
mairacg@br.ibm.com
Abstract
Sentiment analysis of short texts such as single sentences and Twitter messages is challenging
because of the limited contextual information that they normally contain. Effectively solving this
task requires strategies that combine the small text content with prior knowledge and use more
than just bag-of-words. In this work we propose a new deep convolutional neural network that
exploits information from the character level up to the sentence level to perform sentiment analysis
of short texts. We apply our approach to two corpora from two different domains: the Stanford
Sentiment Treebank (SSTb), which contains sentences from movie reviews; and the Stanford Twitter
Sentiment corpus (STS), which contains Twitter messages. For the SSTb corpus, our approach achieves
state-of-the-art results for single sentence sentiment prediction in both binary positive/negative
classification, with 85.7% accuracy, and fine-grained classification, with 48.3% accuracy. For the
STS corpus, our approach achieves a sentiment prediction accuracy of 86.4%.
1 Introduction
The advent of online social networks has produced a growing interest in the task of sentiment analysis for
short text messages (Go et al., 2009; Barbosa and Feng, 2010; Nakov et al., 2013). However, sentiment
analysis of short texts such as single sentences and microblogging posts, like Twitter messages, is
challenging because of the limited amount of contextual data in this type of text. Effectively solving this
task requires strategies that go beyond bag-of-words and extract information from the sentence/message
in a more disciplined way. Additionally, to fill the gap of contextual information in a scalable manner, it
is more suitable to use methods that can exploit prior knowledge from large sets of unlabeled texts.
In this work we propose a deep convolutional neural network that exploits information from the character
level up to the sentence level to perform sentiment analysis of short texts. The proposed network, named
Character to Sentence Convolutional Neural Network (CharSCNN), uses two convolutional layers to extract
relevant features from words and sentences of any size. The proposed network can easily exploit the richness
of word embeddings produced by unsupervised pre-training (Mikolov et al., 2013). We perform experiments
that show the effectiveness of CharSCNN for sentiment analysis of texts from two domains: movie
review sentences and Twitter messages (tweets). CharSCNN achieves state-of-the-art results for the two
domains. Additionally, in our experiments we provide information about the usefulness of unsupervised
pre-training, the contribution of character-level features, and the effectiveness of sentence-level features
to detect negation.
This work is organized as follows. In Section 2, we describe the proposed neural network architecture.
In Section 3, we discuss some related work. Section 4 details our experimental setup and results.
Finally, in Section 5 we present our final remarks.
2 Neural Network Architecture
Given a sentence, CharSCNN computes a score for each sentiment label $\tau \in T$. In order to score
a sentence, the network takes as input the sequence of words in the sentence, and passes it through
a sequence of layers where features with increasing levels of complexity are extracted. The network
extracts features from the character level up to the sentence level. The main novelty in our network
architecture is the inclusion of two convolutional layers, which allows it to handle words and sentences
of any size.
This work is licensed under a Creative Commons Attribution 4.0 International License. Page numbers and proceedings footer
are added by the organizers. License details: http://creativecommons.org/licenses/by/4.0/
2.1 Initial Representation Levels
The first layer of the network transforms words into real-valued feature vectors (embeddings) that capture
morphological, syntactic and semantic information about the words. We use a fixed-sized word
vocabulary $V^{wrd}$, and we consider that words are composed of characters from a fixed-sized character
vocabulary $V^{chr}$. Given a sentence consisting of $N$ words $\{w_1, w_2, ..., w_N\}$, every word $w_n$ is
converted into a vector $u_n = [r^{wrd}; r^{wch}]$, which is composed of two sub-vectors: the word-level embedding
$r^{wrd} \in \mathbb{R}^{d^{wrd}}$ and the character-level embedding $r^{wch} \in \mathbb{R}^{cl_u^0}$ of $w_n$. While word-level embeddings are
meant to capture syntactic and semantic information, character-level embeddings capture morphological
and shape information.
2.1.1 Word-Level Embeddings
Word-level embeddings are encoded by column vectors in an embedding matrix $W^{wrd} \in \mathbb{R}^{d^{wrd} \times |V^{wrd}|}$.
Each column $W^{wrd}_i \in \mathbb{R}^{d^{wrd}}$ corresponds to the word-level embedding of the $i$-th word in the vocabulary.
We transform a word $w$ into its word-level embedding $r^{wrd}$ by using the matrix-vector product:
$$r^{wrd} = W^{wrd} v^w \tag{1}$$
where $v^w$ is a vector of size $|V^{wrd}|$ which has value 1 at index $w$ and zero in all other positions. The
matrix $W^{wrd}$ is a parameter to be learned, and the size of the word-level embedding $d^{wrd}$ is a
hyper-parameter to be chosen by the user.
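As an illustration of Equation (1), the following minimal NumPy sketch checks that multiplying the embedding matrix by a one-hot vector is the same as reading out one column. The sizes, the random matrix and the word index are toy values assumed for the example, not settings from our experiments.

```python
import numpy as np

# Toy sizes, chosen only for illustration.
d_wrd, vocab_size = 4, 6
rng = np.random.default_rng(0)
W_wrd = rng.normal(size=(d_wrd, vocab_size))  # one column per vocabulary word

def embed_by_product(W, w):
    """Equation (1): multiply W by the one-hot vector v^w."""
    v = np.zeros(W.shape[1])
    v[w] = 1.0
    return W @ v

def embed_by_lookup(W, w):
    """Equivalent shortcut: read column w directly."""
    return W[:, w]

w = 3  # hypothetical index of some word in the vocabulary
r_product = embed_by_product(W_wrd, w)
r_lookup = embed_by_lookup(W_wrd, w)
```

Both functions return the same vector of size d_wrd; implementations usually prefer the column lookup, since it avoids building the one-hot vector.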
2.1.2 Character-Level Embeddings
Robust methods to extract morphological and shape information from words must take into consideration
all characters of the word and select which features are more important for the task at hand. For instance,
in the task of sentiment analysis of Twitter data, important information can appear in different parts
of a hash tag (e.g., “#SoSad”, “#ILikeIt”) and many informative adverbs end with the suffix “ly” (e.g.,
“beautifully”, “perfectly” and “badly”). We tackle this problem using the same strategy proposed in
(dos Santos and Zadrozny, 2014), which is based on a convolutional approach (Waibel et al., 1989). As
depicted in Fig. 1, the convolutional approach produces local features around each character of the word
and then combines them using a max operation to create a fixed-sized character-level embedding of the
word.
Given a word $w$ composed of $M$ characters $\{c_1, c_2, ..., c_M\}$, we first transform each character $c_m$ into
a character embedding $r^{chr}_m$. Character embeddings are encoded by column vectors in the embedding
matrix $W^{chr} \in \mathbb{R}^{d^{chr} \times |V^{chr}|}$. Given a character $c$, its embedding $r^{chr}$ is obtained by the matrix-vector
product:
$$r^{chr} = W^{chr} v^c \tag{2}$$
where $v^c$ is a vector of size $|V^{chr}|$ which has value 1 at index $c$ and zero in all other positions. The input
for the convolutional layer is the sequence of character embeddings $\{r^{chr}_1, r^{chr}_2, ..., r^{chr}_M\}$.
The convolutional layer applies a matrix-vector operation to each window of $k^{chr}$ successive
embeddings in the sequence $\{r^{chr}_1, r^{chr}_2, ..., r^{chr}_M\}$. Let us define the vector $z_m \in \mathbb{R}^{d^{chr} k^{chr}}$ as the
concatenation of the character embedding $m$, its $(k^{chr} - 1)/2$ left neighbors, and its $(k^{chr} - 1)/2$ right
neighbors¹:
$$z_m = \left( r^{chr}_{m-(k^{chr}-1)/2}, ..., r^{chr}_{m+(k^{chr}-1)/2} \right)^T$$
¹ We use a special padding character for the characters with indices outside of the word boundaries.
Figure 1: Convolutional approach to character-level feature extraction.
The convolutional layer computes the $j$-th element of the vector $r^{wch} \in \mathbb{R}^{cl_u^0}$, which is the character-level
embedding of $w$, as follows:
$$[r^{wch}]_j = \max_{1<m<M} \left[ W^0 z_m + b^0 \right]_j \tag{3}$$
where $W^0 \in \mathbb{R}^{cl_u^0 \times d^{chr} k^{chr}}$ is the weight matrix of the convolutional layer. The same matrix is used to
extract local features around each character window of the given word. Using the max over all character
windows of the word, we extract a “global” fixed-sized feature vector for the word.
Matrices $W^{chr}$ and $W^0$, and vector $b^0$ are parameters to be learned. The size of the character vector
$d^{chr}$, the number of convolutional units $cl_u^0$ (which corresponds to the size of the character-level embedding
of a word), and the size of the character context window $k^{chr}$ are hyper-parameters.
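The character-level convolution of Equation (3) can be sketched in NumPy as follows. This is an illustrative reading rather than our Theano implementation: the function name, the choice of the last vocabulary index as the padding character, and the random weights are assumptions of the sketch, and the max is taken over every padded window of the word.

```python
import numpy as np

def char_level_embedding(char_ids, W_chr, W0, b0, k_chr, pad_id):
    """Return r^wch for one word, following Equation (3).

    char_ids: list of character indices c_1..c_M
    W_chr:    (d_chr, |V_chr|) character embedding matrix
    W0, b0:   convolution weights (cl0_u, d_chr * k_chr) and bias (cl0_u,)
    pad_id:   index of the special padding character (assumed for this sketch)
    """
    half = (k_chr - 1) // 2
    padded = [pad_id] * half + list(char_ids) + [pad_id] * half
    R = W_chr[:, padded].T                    # one row per (padded) character
    M = len(char_ids)
    # z_m: concatenation of the k_chr embeddings centered on character m
    Z = np.stack([R[m:m + k_chr].reshape(-1) for m in range(M)])
    local = Z @ W0.T + b0                     # (M, cl0_u) local features per window
    return local.max(axis=0)                  # max over windows -> (cl0_u,)

# Toy setup: 94 characters plus one padding index, random weights.
rng = np.random.default_rng(0)
d_chr, k_chr, cl0_u, n_chars = 5, 3, 10, 95
W_chr = rng.normal(size=(d_chr, n_chars))
W0 = rng.normal(size=(cl0_u, d_chr * k_chr))
b0 = np.zeros(cl0_u)
pad_id = n_chars - 1

r_short = char_level_embedding([7, 2], W_chr, W0, b0, k_chr, pad_id)
r_long = char_level_embedding([7, 2, 9, 4, 1, 30], W_chr, W0, b0, k_chr, pad_id)
```

Words of two and six characters both yield a vector of size cl0_u, which is the point of the max operation: the output size does not depend on the word length.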
2.2 Sentence-Level Representation and Scoring
Given a sentence $x$ with $N$ words $\{w_1, w_2, ..., w_N\}$, which have been converted to joint word-level
and character-level embeddings $\{u_1, u_2, ..., u_N\}$, the next step in CharSCNN consists in extracting a
sentence-level representation $r^{sent}_x$. Methods to extract a sentence-wide feature set must deal with two
main problems: sentences have different sizes; and important information can appear at any position in
the sentence. We tackle these problems by using a convolutional layer to compute the sentence-wide
feature vector $r^{sent}$. This second convolutional layer in our neural network architecture works in a very
similar way to the one used to extract character-level features for words. This layer produces local
features around each word in the sentence and then combines them using a max operation to create a
fixed-sized feature vector for the sentence.
The second convolutional layer applies a matrix-vector operation to each window of $k^{wrd}$
successive embeddings in the sequence $\{u_1, u_2, ..., u_N\}$. Let us define the vector $z_n \in \mathbb{R}^{(d^{wrd} + cl_u^0) k^{wrd}}$ as
the concatenation of a sequence of $k^{wrd}$ embeddings, centered on the $n$-th word²:
$$z_n = \left( u_{n-(k^{wrd}-1)/2}, ..., u_{n+(k^{wrd}-1)/2} \right)^T$$
² We use a special padding token for the words with indices outside of the sentence boundaries.
The convolutional layer computes the $j$-th element of the vector $r^{sent} \in \mathbb{R}^{cl_u^1}$ as follows:
$$[r^{sent}]_j = \max_{1<n<N} \left[ W^1 z_n + b^1 \right]_j \tag{4}$$
where $W^1 \in \mathbb{R}^{cl_u^1 \times (d^{wrd} + cl_u^0) k^{wrd}}$ is the weight matrix of the convolutional layer. The same matrix is
used to extract local features around each word window of the given sentence. Using the max over
all word windows of the sentence, we extract a “global” fixed-sized feature vector for the sentence.
Matrix $W^1$ and vector $b^1$ are parameters to be learned. The number of convolutional units $cl_u^1$ (which
corresponds to the size of the sentence-level feature vector), and the size of the word context window
$k^{wrd}$ are hyper-parameters to be chosen by the user.
Finally, the vector $r^{sent}_x$, the “global” feature vector of sentence $x$, is processed by two usual neural
network layers, which extract one more level of representation and compute a score for each sentiment
label $\tau \in T$:
$$s(x) = W^3 h(W^2 r^{sent}_x + b^2) + b^3 \tag{5}$$
where matrices $W^2 \in \mathbb{R}^{hl_u \times cl_u^1}$ and $W^3 \in \mathbb{R}^{|T| \times hl_u}$, and vectors $b^2 \in \mathbb{R}^{hl_u}$ and $b^3 \in \mathbb{R}^{|T|}$ are parameters
to be learned. The transfer function $h(.)$ is the hyperbolic tangent. The number of hidden units $hl_u$ is a
hyper-parameter to be chosen by the user.
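To make the data flow of Equations (4) and (5) concrete, the sketch below maps a sequence of joint embeddings to one score per label. It is a hedged NumPy illustration with toy sizes and random weights; the function names and the zero-valued padding embedding are assumptions of the sketch, not details of our implementation.

```python
import numpy as np

def sentence_representation(U, W1, b1, k_wrd, pad_vec):
    """Equation (4): max over the local features of every word window.

    U: (N, d) array whose n-th row is u_n; pad_vec: (d,) padding embedding.
    """
    half = (k_wrd - 1) // 2
    pad = np.tile(pad_vec, (half, 1))
    padded = np.vstack([pad, U, pad])                  # (N + 2*half, d)
    Z = np.stack([padded[n:n + k_wrd].reshape(-1) for n in range(len(U))])
    return (Z @ W1.T + b1).max(axis=0)                 # (cl1_u,)

def score_labels(r_sent, W2, b2, W3, b3):
    """Equation (5): tanh hidden layer followed by a linear output layer."""
    return W3 @ np.tanh(W2 @ r_sent + b2) + b3         # (|T|,)

# Toy sizes for illustration only.
rng = np.random.default_rng(0)
d, k_wrd, cl1_u, hl_u, n_labels = 6, 5, 8, 7, 5
W1, b1 = rng.normal(size=(cl1_u, d * k_wrd)), np.zeros(cl1_u)
W2, b2 = rng.normal(size=(hl_u, cl1_u)), np.zeros(hl_u)
W3, b3 = rng.normal(size=(n_labels, hl_u)), np.zeros(n_labels)
pad_vec = np.zeros(d)  # assumption: the padding token is embedded as zeros

U_short = rng.normal(size=(3, d))    # a 3-word sentence
U_long = rng.normal(size=(12, d))    # a 12-word sentence
scores_short = score_labels(
    sentence_representation(U_short, W1, b1, k_wrd, pad_vec), W2, b2, W3, b3)
scores_long = score_labels(
    sentence_representation(U_long, W1, b1, k_wrd, pad_vec), W2, b2, W3, b3)
```

Sentences of different lengths go through the same weights and produce score vectors of the same size, one entry per sentiment label.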
2.3 Network Training
Our network is trained by minimizing a negative log-likelihood over the training set $D$. Given a sentence $x$,
the network with parameter set $\theta$ computes a score $s_\theta(x)_\tau$ for each sentiment label $\tau \in T$. In order to
transform these scores into a conditional probability distribution of labels given the sentence and the set
of network parameters $\theta$, we apply a softmax operation over the scores of all labels $\tau \in T$:
$$p(\tau | x, \theta) = \frac{e^{s_\theta(x)_\tau}}{\sum_{i \in T} e^{s_\theta(x)_i}} \tag{6}$$
Taking the log, we arrive at the following conditional log-probability:
$$\log p(\tau | x, \theta) = s_\theta(x)_\tau - \log \left( \sum_{i \in T} e^{s_\theta(x)_i} \right) \tag{7}$$
We use stochastic gradient descent (SGD) to minimize the negative log-likelihood with respect to $\theta$:
$$\theta \mapsto \sum_{(x,y) \in D} -\log p(y | x, \theta) \tag{8}$$
where $(x, y)$ corresponds to a sentence in the training corpus $D$ and $y$ represents its respective label.
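Equations (6) to (8) can be checked numerically with a few lines of NumPy. The sketch below is only an illustration: the score values are hypothetical, and the subtraction of the maximum score is a standard numerical-stability device that leaves the softmax unchanged, not something prescribed by the equations.

```python
import numpy as np

def log_softmax(scores):
    """Equation (7) for every label at once, computed stably."""
    shifted = scores - scores.max()  # constant shift leaves softmax unchanged
    return shifted - np.log(np.exp(shifted).sum())

def negative_log_likelihood(score_rows, labels):
    """Equation (8): sum of -log p(y | x) over (scores, label) pairs."""
    return -sum(log_softmax(s)[y] for s, y in zip(score_rows, labels))

scores = np.array([2.0, 0.5, -1.0])  # hypothetical scores for 3 labels
log_p = log_softmax(scores)
loss = negative_log_likelihood([scores], [0])
```

Exponentiating log_p gives the distribution of Equation (6), and the loss for a single example is simply the negated log-probability of its gold label.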
The backpropagation algorithm is a natural choice to efficiently compute gradients of network architectures
such as the one proposed in this work (LeCun et al., 1998; Collobert, 2011). In order to perform
our experiments, we implement the proposed CharSCNN architecture using the Theano library (Bergstra
et al., 2010). Theano is a versatile Python library that allows the efficient definition, optimization, and
evaluation of mathematical expressions involving multi-dimensional arrays. We use Theano's automatic
differentiation capabilities in order to implement the backpropagation algorithm.
3 Related Work
There are a few works on neural network architectures for sentiment analysis. In (Socher et al., 2011),
the authors proposed a semi-supervised approach based on recursive autoencoders for predicting sentiment
distributions. The method learns vector space representations for multi-word phrases and exploits
the recursive nature of sentences. Socher et al. (2012) propose a matrix-vector recursive neural
network model for semantic compositionality, which has the ability to learn compositional vector
representations for phrases and sentences of arbitrary length. The vector captures the inherent meaning
of the constituent, while the matrix captures how the meaning of neighboring words and phrases is
changed. In (Socher et al., 2013b) the authors propose the Recursive Neural Tensor Network (RNTN)
architecture, which represents a phrase through word vectors and a parse tree and then computes vectors
for higher nodes in the tree using the same tensor-based composition function. Our approach differs from
these previous works because it uses a feed-forward neural network instead of a recursive one. Moreover,
it does not need any input about the syntactic structure of the sentence.
Regarding convolutional networks for NLP tasks, in (Collobert et al., 2011), the authors use a convolutional
network for the semantic role labeling task with the goal of avoiding excessive task-specific feature
engineering. In (Collobert, 2011), the author uses a similar network architecture for syntactic parsing.
CharSCNN is related to these works because they also apply convolutional layers to extract sentence-level
features. The main difference in our neural network architecture is the addition of one convolutional
layer to extract character features.
In terms of using intra-word information in neural network architectures for NLP tasks, Alexandrescu
and Kirchhoff (2006) present a factored neural language model where each word is represented as a vector of
features such as stems, morphological tags and cases, and a single embedding matrix is used to look
up all of these features. In (Luong et al., 2013), the authors use a recursive neural network (RNN) to
explicitly model the morphological structures of words and learn morphologically-aware embeddings.
Lazaridou et al. (2013) use compositional distributional semantic models, originally
designed to learn meanings of phrases, to derive representations for complex words, in which the base
unit is the morpheme. In (Chrupala, 2013), the author proposes a simple recurrent network (SRN) to learn
continuous vector representations for sequences of characters, and uses them as features in a conditional
random field classifier to solve a character-level text segmentation and labeling task. The main advantage
of our approach to extract character-level features is its flexibility. The convolutional layer allows the
extraction of relevant features from any part of the word and does not need handcrafted inputs like stems
and morpheme lists (dos Santos and Zadrozny, 2014).
4 Experimental Setup and Results
4.1 Sentiment Analysis Datasets
We apply CharSCNN to two different corpora from two different domains: movie reviews and Twitter
posts. The movie review dataset used is the recently proposed Stanford Sentiment Treebank (SSTb)
(Socher et al., 2013b), which includes fine-grained sentiment labels for 215,154 phrases in the parse
trees of 11,855 sentences. In our experiments we focus on sentiment prediction of complete sentences.
However, we show the impact of training with sentences and phrases instead of only sentences.
The second labeled corpus we use is the Stanford Twitter Sentiment corpus (STS) introduced by
Go et al. (2009). The original training set contains 1.6 million tweets that were automatically labeled as
positive/negative using emoticons as noisy labels. The test set was manually annotated by Go et al. (2009).
In our experiments, to speed up the training process we use only a sample of the training data consisting
of 80K (5%) randomly selected tweets. We also construct a development set by randomly selecting 16K
tweets from Go et al.’s training set. In Table 1, we present additional details about the two corpora.
Dataset   Set     # sentences / tweets   # classes
SSTb      Train   8544                   5
SSTb      Dev     1101                   5
SSTb      Test    2210                   5
STS       Train   80K                    2
STS       Dev     16K                    2
STS       Test    498                    3
Table 1: Sentiment Analysis datasets.
4.2 Unsupervised Learning of Word-Level Embeddings
Word-level embeddings play a very important role in the CharSCNN architecture. They are meant to
capture syntactic and semantic information, which is very important to sentiment analysis. Recent
work has shown that large improvements in terms of model accuracy can be obtained by performing
unsupervised pre-training of word embeddings (Collobert et al., 2011; Luong et al., 2013; Zheng et al.,
2013; Socher et al., 2013a). In our experiments, we perform unsupervised learning of word-level
embeddings using the word2vec tool³, which implements the continuous bag-of-words and skip-gram
architectures for computing vector representations of words (Mikolov et al., 2013).
We use the December 2013 snapshot of the English Wikipedia corpus as a source of unlabeled data.
The Wikipedia corpus has been processed using the following steps: (1) removal of paragraphs that are
not in English; (2) replacement of non-western characters with a special character; (3) tokenization of the
text using the tokenizer available with the Stanford POS Tagger (Manning, 2011); and (4) removal of
sentences that are less than 20 characters long (including white spaces) or have fewer than 5 tokens. As
in (Collobert et al., 2011) and (Luong et al., 2013), we lowercase all words and substitute each numerical
digit by a 0 (e.g., 1967 becomes 0000). The resulting clean corpus contains about 1.75 billion tokens.
When running the word2vec tool, we require that a word occur at least 10 times in order to be included
in the vocabulary, which resulted in a vocabulary of 870,214 entries. To train our word-level embeddings
we use word2vec's skip-gram method with a context window of size 9. The training time for the English
corpus is around 1h10min using 12 threads on an Intel® Xeon® E5-2643 3.30GHz machine.
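The token normalization and the sentence filter of step (4) are simple enough to sketch with the Python standard library. The helper names below are hypothetical, and whitespace splitting stands in for the Stanford tokenizer, which this sketch does not reproduce.

```python
import re

def normalize_token(token):
    """Lowercase a token and replace every ASCII digit with 0."""
    return re.sub(r"[0-9]", "0", token.lower())

def keep_sentence(sentence, tokens):
    """Step (4): keep sentences with at least 20 characters and 5 tokens."""
    return len(sentence) >= 20 and len(tokens) >= 5

sentence = "The film was released in 1967 ."
tokens = sentence.split()  # stand-in for the Stanford tokenizer
normalized = [normalize_token(t) for t in tokens]
```

Here the token "1967" becomes "0000" and "The" becomes "the", matching the example in the text.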
In our experiments, we do not perform unsupervised pre-training of character-level embeddings, which
are initialized by randomly sampling each value from a uniform distribution $U(-r, r)$, where
$r = \sqrt{\frac{6}{|V^{chr}| + d^{chr}}}$. There are 94 different characters in the SSTb corpus and 453 different characters in
the STS corpus. Since the two character vocabularies are relatively small, it has been possible to learn
reliable character-level embeddings using the labeled training corpora. The raw (not lowercased) words
are used to construct the character vocabularies, which allows the network to capture relevant information
about capitalization.
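A minimal NumPy sketch of this initialization is given below. The function name is hypothetical, the character embedding dimension of 5 is the value reported in Table 2, and whether the padding character is counted in the vocabulary size is an assumption left open here.

```python
import numpy as np

def init_char_embeddings(n_chars, d_chr, rng):
    """Sample a (d_chr, n_chars) matrix from U(-r, r), r = sqrt(6 / (n_chars + d_chr))."""
    r = np.sqrt(6.0 / (n_chars + d_chr))
    return rng.uniform(-r, r, size=(d_chr, n_chars)), r

rng = np.random.default_rng(0)
W_sstb, r_sstb = init_char_embeddings(94, 5, rng)   # SSTb: 94 characters
W_sts, r_sts = init_char_embeddings(453, 5, rng)    # STS: 453 characters
```

With these vocabulary sizes the bound r is roughly 0.246 for SSTb and roughly 0.114 for STS, so the larger Twitter character vocabulary starts from smaller initial values.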
4.3 Model Setup
We use the development sets to tune the neural network hyper-parameters. Many different combinations
of hyper-parameters can give similarly good results. We spent more time tuning the learning rate than
tuning other parameters, since it is the hyper-parameter that has the largest impact on the prediction
performance. The only two parameters with different values for the two datasets are the learning rate
and the number of units in the convolutional layer that extracts sentence features. This provides some
indication on the robustness of our approach to multiple domains. For both datasets, the number of
training epochs varies between five and ten. In Table 2, we show the selected hyper-parameter values for
the two labeled datasets.
Parameter   Parameter Name                    SSTb   STS
d^wrd       Word-Level Embeddings dimension   30     30
k^wrd       Word Context window               5      5
d^chr       Char. Embeddings dimension        5      5
k^chr       Char. Context window              3      3
cl_u^0      Char. Convolution Units           10     50
cl_u^1      Word Convolution Units            300    300
hl_u        Hidden Units                      300    300
λ           Learning Rate                     0.02   0.01
Table 2: Neural Network Hyper-Parameters
³ https://code.google.com/p/word2vec/
In order to assess the effectiveness of the proposed character-level representation of words, we compare
the proposed architecture CharSCNN with an architecture that uses only word embeddings. In
our experiments, SCNN represents a network which is fed with word representations only, i.e., for each
word $w_n$ its embedding is $u_n = r^{wrd}$. For SCNN, we use the same NN hyper-parameter values (when
applicable) shown in Table 2.
4.4 Results for SSTb Corpus
In Table 3, we present the results of CharSCNN and SCNN for different versions of the SSTb corpus. Note
that the SSTb corpus is a sentiment treebank, hence it contains sentiment annotations for all phrases in all
sentences in the corpus. In our experiments, we check whether using examples that are single phrases, in
addition to complete sentences, can provide useful information for training the proposed NN. However,
in our experiments the test set always includes only complete sentences. In Table 3, the column Phrases
indicates whether all phrases (yes) or only complete sentences (no) in the corpus are used for training.
The Fine-Grained column contains prediction results for the case where 5 sentiment classes (labels) are
used (very negative, negative, neutral, positive, very positive). The Positive/Negative column presents
prediction results for the case of binary classification of sentences, i.e., the neutral class is removed, the
two negative classes are merged, and so are the two positive classes.
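The mapping from the five-class to the binary setting can be written as a short Python function. The integer encoding of the labels (0 for very negative up to 4 for very positive) and the function name are assumptions made only for this illustration.

```python
def to_binary(examples):
    """Map 5-class examples to binary ones.

    examples: list of (sentence, label) pairs, label in 0..4 (assumed encoding:
    0 = very negative, 1 = negative, 2 = neutral, 3 = positive, 4 = very positive).
    Neutral examples are dropped; the rest become 0 (negative) or 1 (positive).
    """
    binary = []
    for sentence, label in examples:
        if label == 2:  # neutral class is removed
            continue
        binary.append((sentence, 0 if label < 2 else 1))
    return binary

# Hypothetical toy examples, one per fine-grained class.
data = [("a", 0), ("b", 1), ("c", 2), ("d", 3), ("e", 4)]
binary_data = to_binary(data)
```

Because neutral sentences are dropped, the binary setting is evaluated on fewer sentences than the fine-grained one.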
Model                            Phrases   Fine-Grained   Positive/Negative
CharSCNN                         yes       48.3           85.7
SCNN                             yes       48.3           85.5
CharSCNN                         no        43.5           82.3
SCNN                             no        43.5           82.0
RNTN (Socher et al., 2013b)      yes       45.7           85.4
MV-RNN (Socher et al., 2013b)    yes       44.4           82.9
RNN (Socher et al., 2013b)       yes       43.2           82.4
NB (Socher et al., 2013b)        yes       41.0           81.8
SVM (Socher et al., 2013b)       yes       40.7           79.4
Table 3: Accuracy of different models for fine-grained (5-class) and binary predictions using SSTb.
In Table 3, we can note that CharSCNN and SCNN have very similar results in both fine-grained and binary
sentiment prediction. These results suggest that the character-level information is not very helpful
for sentiment prediction in the SSTb corpus. Regarding the use of phrases in the training set, we can note
that, even without explicitly using the syntactic tree information when performing prediction, CharSCNN
and SCNN benefit from the presence of phrases as training examples. This result is aligned with Socher
et al.’s (2013b) suggestion that information from sentiment-labeled phrases improves the accuracy of other
classification algorithms such as support vector machines (SVM) and naive Bayes (NB). We believe
that using phrases as training examples allows the classifier to learn more complex phenomena, since
sentiment-labeled phrases give information on how words (phrases) combine to form the sentiment
of phrases (sentences). However, it is necessary to perform more detailed experiments to confirm this
conjecture.
Regarding the fine-grained sentiment prediction, our approach provides an absolute accuracy improvement
of 2.6 over the RNTN approach proposed by Socher et al. (2013b), which is the previous best
reported result for SSTb. CharSCNN, SCNN and Socher et al.’s RNTN have similar accuracy performance
for binary sentiment prediction. Compared to RNTN, our method has the advantage of not needing the
output of a syntactic parser when performing sentiment prediction. For comparison, in Table
3 we also report Socher et al.’s (2013b) results for sentiment classifiers trained with recursive neural
networks (RNN), matrix-vector RNN (MV-RNN), NB, and SVM algorithms.
Initializing word embeddings using unsupervised pre-training gives an absolute accuracy increase of
around 1.5 when compared to randomly initializing the vectors. The Theano-based implementation of
CharSCNN takes around 10 min. to complete one training epoch for the SSTb corpus with all phrases
and five classes. In our experiments, we use 4 threads on an Intel® Xeon® E5-2643 3.30GHz machine.
4.5 Results for STS Corpus
In Table 4, we present the results of CharSCNN and SCNN for sentiment prediction using the STS corpus.
As expected, character-level information has a greater impact on Twitter data. Using unsupervised
pre-training, CharSCNN provides an absolute accuracy improvement of 1.2 over SCNN. Additionally,
initializing word-embeddings using unsupervised pre-training gives an absolute accuracy increase of
around 4.5 when compared to randomly initializing the word-embeddings.
In Table 4, we also compare CharSCNN performance with other approaches proposed in the literature.
In (Speriosu et al., 2011), a label propagation (LProp) approach is proposed, while Go et al. (2009)
use maximum entropy (MaxEnt), NB and SVM-based classifiers. CharSCNN outperforms the previous
approaches in terms of prediction accuracy. As far as we know, 86.4 is the best prediction accuracy
reported so far for the STS corpus.
Model                           Accuracy (unsup. pre-training)   Accuracy (random word embeddings)
CharSCNN                        86.4                             81.9
SCNN                            85.2                             82.2
LProp (Speriosu et al., 2011)   84.7
MaxEnt (Go et al., 2009)        83.0
NB (Go et al., 2009)            82.7
SVM (Go et al., 2009)           82.2
Table 4: Accuracy of different models for binary predictions (positive/negative) using STS Corpus.
4.6 Sentence-level features
In Figures 2 and 3 we present the behavior of CharSCNN regarding the sentence-level features extracted
for two cases of negation, which are correctly predicted by CharSCNN. We choose these cases because
negation is an important issue in sentiment analysis. Moreover, the same sentences are also used as
illustrative examples in (Socher et al., 2013b). Note that in the convolutional layer, 300 features are first
extracted for each word. Then the max operator selects the 300 features which have the largest values
among the words to construct the sentence-level feature set $r^{sent}$. Figure 2 shows a positive sentence
(left) and its negation (right). We can observe that in both versions of the sentence, the extracted features
concentrate mainly around the main topic, “film”, and the part of the phrase that indicates sentiment
(“liked” and “did ’nt like”). Note in the left chart that the word “liked” has a big impact on the set of
extracted features. On the other hand, in the right chart, we can see that the impact of the word “like” is
reduced because of the negation “did ’nt”, which is responsible for a large part of the extracted features.
In Figure 3 a similar behavior can be observed. While the very negative expression “incredibly dull”
is responsible for 69% of the features extracted from the sentence on the left, its negation “definitely
not dull”, which is somewhat more positive, is responsible for 77% of the features extracted from the
sentence in the chart on the right. These examples indicate CharSCNN’s robustness in handling negation, as
well as its ability to capture information that is important to sentiment prediction.
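The per-word counts plotted in Figures 2 and 3 can be obtained by recording, for each of the convolutional features, which word window attains the maximum. The sketch below illustrates this with NumPy on a tiny hand-made matrix; the function name and the numbers are hypothetical, and ties are resolved in favor of the earliest word.

```python
import numpy as np

def features_selected_per_word(local_features):
    """Count, for each word, how many features take their maximum there.

    local_features: (N, cl1_u) array with one row of convolution outputs
    per word window. Returns an integer array of length N.
    """
    winners = local_features.argmax(axis=0)  # winning word index per feature
    return np.bincount(winners, minlength=local_features.shape[0])

# Hypothetical local features for a 3-word sentence and 4 features.
local = np.array([[0.1, 0.9, 0.2, 0.0],
                  [0.8, 0.3, 0.7, 0.5],
                  [0.2, 0.1, 0.1, 0.4]])
counts = features_selected_per_word(local)
```

In this toy case the second word wins three of the four features, the first word wins one, and the third word wins none; dividing the counts by the number of features gives percentages like those quoted above.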
5 Conclusions
In this work we present a new deep neural network architecture that jointly uses character-level,
word-level and sentence-level representations to perform sentiment analysis. The main contributions of the
paper are: (1) the idea of using convolutional neural networks to extract features from the character level
up to the sentence level; (2) the demonstration that a feed-forward neural network architecture can be as
effective as RNTN (Socher et al., 2013b) for sentiment analysis of sentences; (3) the definition of new
state-of-the-art results for SSTb and STS corpora.
Figure 2: Number of local features selected at each word when forming the sentence-level representation.
In this example, we have a positive sentence (left) and its negation (right).
[Plot not reproduced; the x-axis of the left panel lists the words “I liked every single minute of this film .”]
Figure 3: Number of local features selected at each word when forming the sentence-level representation.
In this example, we have a negative sentence (left) and its negation (right).
[Plot not reproduced; the x-axis of the left panel lists the words “It ’s just incredibly dull”.]
As future work, we would like to analyze in more detail the role of character-level representations
for sentiment analysis of tweets. Additionally, we would like to check the impact of performing the
unsupervised pre-training step using texts from the specific domain at hand.
References
Andrei Alexandrescu and Katrin Kirchhoff. 2006. Factored neural language models. In Proceedings of the Human
Language Technology Conference of the NAACL, pages 1–4, New York City, USA, June.
Luciano Barbosa and Junlan Feng. 2010. Robust sentiment detection on twitter from biased and noisy data. In
Proceedings of the 23rd International Conference on Computational Linguistics, pages 36–44.
James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins,
Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression
compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).
Grzegorz Chrupala. 2013. Text segmentation with character-level text embeddings. In Proceedings of the ICML
workshop on Deep Learning for Audio, Speech and Language Processing.
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing
(almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.
R. Collobert. 2011. Deep learning for efficient discriminative parsing. In Proceedings of the Fourteenth
International Conference on Artificial Intelligence and Statistics (AISTATS), pages 224–232.
Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for
part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning, JMLR: W&CP
volume 32, Beijing, China.
Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision.
Technical report, Stanford University.
Angeliki Lazaridou, Marco Marelli, Roberto Zamparelli, and Marco Baroni. 2013. Compositionally derived
representations of morphologically complex words in distributional semantics. In Proceedings of the 51st Annual
Meeting of the Association for Computational Linguistics (ACL), pages 1517–1526.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document
recognition. In Proceedings of the IEEE, pages 2278–2324.
Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with
recursive neural networks for morphology. In Proceedings of the Conference on Computational Natural Language
Learning, Sofia, Bulgaria.
Christopher D. Manning. 2011. Part-of-speech tagging from 97% to 100%: Is it time for some linguistics? In
Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing,
CICLing’11, pages 171–189.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in
vector space. In Proceedings of Workshop at International Conference on Learning Representations.
Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter, and Theresa Wilson. 2013.
SemEval-2013 task 2: Sentiment analysis in Twitter. In Second Joint Conference on Lexical and Computational
Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation
(SemEval 2013), pages 312–320, Atlanta, Georgia, USA, June. Association for Computational Linguistics.
Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011.
Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing, pages 151–161.
Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality
through recursive matrix-vector spaces. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing, pages 1201–1211.
Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013a. Parsing with compositional
vector grammars. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher
Potts. 2013b. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.
Michael Speriosu, Nikita Sudan, Sid Upadhyay, and Jason Baldridge. 2011. Twitter polarity classification with
label propagation over lexical links and the follower graph. In Proceedings of the First Workshop on Unsupervised
Learning in NLP, EMNLP, pages 53–63.
A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. 1989. Phoneme recognition using time-delay
neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(3):328–339.
Xiaoqing Zheng, Hanyang Chen, and Tianyu Xu. 2013. Deep learning for Chinese word segmentation and POS
tagging. In Proceedings of the Conference on Empirical Methods in NLP, pages 647–657.