Conference Paper

Presenting A Sentiment Analysis System Using Deep Learning Models On Persian Texts (In Persian)

Authors:

Abstract

Sentiment analysis, or opinion mining, is a growing field within text mining that is used to identify and extract people's opinions and emotions toward entities, issues, events, or topics. A great deal of research has been done to improve sentiment analysis systems, ranging from simple linear machine learning models to more complex deep neural networks. Recently, deep learning has achieved great success in sentiment analysis and is considered the state of the art across various languages. Nevertheless, despite recent scientific progress, sentiment analysis systems still need to improve the accuracy of their output. Moreover, the Persian language poses many challenges due to its complex structure, varied dialects, lack of sufficiently large datasets, and shortage of precise text-processing tools. The main purpose of this thesis is therefore to present a sentiment analysis system based on deep learning models and to compare common machine learning approaches with deep learning on Persian texts. To this end, a pre-processing step was applied to the input corpus (SentiPers). After this step, it was observed that the amount of available data was insufficient, especially for the deep learning models, so three different techniques were proposed for balancing and augmenting the data. Because word embeddings are central to sentiment analysis systems, the system uses TF-IDF for the baseline models, and both pre-trained and network-trained embeddings for the deep learning models. Classification was performed with Naïve Bayes, SVM, and SGD in the baseline models, and with CNN and BI-LSTM in the deep learning models. The results show the impact of dataset enlargement and balancing on the performance of the deep learning models, which achieved accuracy equal to or better than the baselines when the proposed data augmentation techniques were used.

















javad.pourmostafa@gmail.com
parsa.abbasi1996@gmail.com
mirroshandel@gmail.com







  











   
         

(Liu, 2012)
Footnotes 1–6: text mining; sentiment analysis; deep learning; machine learning; natural language processing; opinion mining



""
""
""""(Liu, 2015)










(Rojas-Barahona & Maria, 2016)


 











Footnotes 7–8: user-generated content; Natural Language Processing (NLP)
Pang






(Pang, et al., 2002)







NBSVM
(Wang & Manning, 2012)

PersianClues, LDASA



  
(Shams, et al., 2012)





(Basiri, et al., 2014)




(LeCun, et al., 2015)
one-hot

(Maas, et al., 2011)









(Dashtipour, et al., 2018)

Footnotes 10–21: Naive Bayes (NB); maximum entropy; Support Vector Machine (SVM); Wan; Mang; bigram; Latent Dirichlet Allocation (LDA); logistic regression; machine translation; Convolutional Neural Network (CNN); autoencoder; Multilayer Perceptron (MLP)




SentiPers (Hosseini, et al., 2018)


























SentiPers




Footnotes 22–25: dataset; https://www.digikala.com; interface; https://github.com/JoyeBright/Sentiment-Analysis




(Xie, et al., 2017)




              




(Fadaee, et al., 2017)











Footnotes 26–27: data noising; Google Translate
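The augmentation techniques cited above (data noising in the sense of Xie et al., 2017, alongside back-translation via Google Translate) can be illustrated with a minimal word-level noising sketch. The function name, probability values, and blank token below are illustrative assumptions, not the paper's exact scheme:

```python
import random

def noise_sentence(tokens, vocab, blank_prob=0.1, replace_prob=0.1, seed=None):
    """Word-level data noising: each token is independently blanked out
    or swapped for a random vocabulary word, yielding a new synthetic
    training sentence from an existing labeled one."""
    rng = random.Random(seed)
    noised = []
    for tok in tokens:
        r = rng.random()
        if r < blank_prob:
            noised.append("_")                 # blank token
        elif r < blank_prob + replace_prob:
            noised.append(rng.choice(vocab))   # unigram replacement
        else:
            noised.append(tok)                 # keep the original word
    return noised
```

Applied repeatedly with different seeds, this yields several noised variants per original sentence, which is one simple way to enlarge and balance a small corpus.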





































Footnote 28: multinomial classification














 




tf-idf



(Ramos, 2003)






N(Sugathadasa, et al., 2018)N=2

tf-idf

tf-idf
Footnotes 29–32: binary classification; term frequency–inverse document frequency; embedding; https://github.com/sobhe/hazm




(Thorsten, 1997)

(Prasetijo, et al., 2017); (Li & Li, 2013)






[Table: baseline results for SVM, SGD, and NB; values not recoverable]



















[Table: baseline results for SVM, SGD, and NB; values not recoverable]




















(Day, 2016)
(Vateekul & Koomsubha, 2016)




Footnote: Stochastic Gradient Descent (SGD)
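The three baseline classifiers named above (NB, SVM, SGD) over tf-idf features can be sketched with scikit-learn; the toy data, labels, and pipeline layout are illustrative assumptions, not the paper's exact configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labeled sentences standing in for SentiPers (1 = positive, 0 = negative).
texts = ["great phone", "awful screen", "great battery", "awful camera"]
labels = [1, 0, 1, 0]

baselines = {
    "NB": MultinomialNB(),
    "SVM": LinearSVC(),
    "SGD": SGDClassifier(random_state=0),
}

# Each baseline shares the same tf-idf front end (unigrams + bigrams).
fitted = {
    name: make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf).fit(texts, labels)
    for name, clf in baselines.items()
}
```

With a real corpus one would of course evaluate on a held-out split rather than the training sentences.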

G. E. Hinton
Footnote 35: Recursive Neural Networks (RNN)





(Collobert, et al., 2011)
(Rojas-Barahona & Maria, 2016)




(Collobert, et al., 2011)







   












(Turian, et al., 2010)




Footnotes 36–40: Deep Belief Networks (DBN); Long Short-Term Memory (LSTM); semantic; grammatical; word feature




Keras




   

FastText





""""""
""
""""""

FastText




FastText
Footnotes 41–42: pre-trained word embedding; https://fasttext.cc/docs/en/crawl-vectors

Keras
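One common way to feed pre-trained FastText vectors into a Keras Embedding layer is to build an embedding matrix indexed by the tokenizer's word indices. The sketch below uses a tiny hypothetical vector table and a 4-dimensional toy size in place of the 300-dimensional Persian FastText vectors; the helper name and word index are assumptions:

```python
import numpy as np

dim = 4  # 300 in the real FastText setting
# Hypothetical pre-trained vectors standing in for the FastText file.
pretrained = {
    "good": np.array([0.1, 0.2, 0.3, 0.4]),
    "bad":  np.array([-0.1, -0.2, -0.3, -0.4]),
}

def build_embedding_matrix(word_index, vectors, dim):
    """Rows of this matrix initialize an Embedding layer's weights;
    index 0 is reserved for padding, and out-of-vocabulary words
    remain at zero."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

word_index = {"good": 1, "bad": 2, "phone": 3}  # toy tokenizer output
emb = build_embedding_matrix(word_index, pretrained, dim)
```

The alternative discussed in the text is to skip the pre-trained weights and let a Keras Embedding layer learn the vectors from scratch during training.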







(Srivastava, et al., 2014)







BI- LSTM

FastText


CNN

Keras

Footnotes: dropout; overfitting; Bidirectional Long Short-Term Memory (BI-LSTM)
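A BI-LSTM classifier with dropout regularization (Srivastava et al., 2014) of the kind described here can be sketched in Keras; the vocabulary size, sequence length, and layer widths below are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, seq_len, emb_dim = 5000, 50, 100  # illustrative sizes

model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, emb_dim),      # could load FastText weights
    layers.Bidirectional(layers.LSTM(64)),      # reads the sequence both ways
    layers.Dropout(0.5),                        # dropout against overfitting
    layers.Dense(1, activation="sigmoid"),      # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Forward pass on dummy padded index sequences to check the output shape.
preds = model.predict(np.zeros((2, seq_len), dtype="int32"), verbose=0)
```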






(Kim, 2014)






[Table: results for CNN (FastText), CNN (Keras), LSTM (FastText), LSTM (Keras); values not recoverable]
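The CNN variants referenced above follow the sentence-classification architecture of Kim (2014): parallel convolutions over several N-gram window widths, max-pooled and concatenated. A minimal Keras sketch, with illustrative sizes and an embedding that could either load pre-trained FastText weights or be learned from scratch:

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, seq_len, emb_dim = 5000, 50, 100  # illustrative sizes

inputs = layers.Input(shape=(seq_len,))
x = layers.Embedding(vocab_size, emb_dim)(inputs)

# Parallel convolutions over 3-, 4-, and 5-word windows, each max-pooled,
# as in the Kim (2014) architecture.
branches = [
    layers.GlobalMaxPooling1D()(layers.Conv1D(64, k, activation="relu")(x))
    for k in (3, 4, 5)
]
x = layers.Concatenate()(branches)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Forward pass on dummy padded index sequences to check the output shape.
preds = model.predict(np.zeros((2, seq_len), dtype="int32"), verbose=0)
```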























[Table: results for CNN (FastText), CNN (Keras), LSTM (FastText), LSTM (Keras); values not recoverable]


























Kim









SVM


FastTextBI-LSTMCNN
CNNKerasBI-LSTMFastText














Basiri, M. E., Nilchi, A. R. N. & Ghassem-aghaee, N., 2014. A Framework for Sentiment Analysis in Persian.
Collobert, R. et al., 2011. Natural Language Processing (Almost) from Scratch. The Journal of Machine Learning Research, Volume 12, pp. 2493-2537.
Dashtipour, K. et al., 2018. Exploiting Deep Learning for Persian Sentiment Analysis. s.l., s.n.
Day, M., 2016. Deep Learning for Financial Sentiment Analysis on Finance News Providers.
Fadaee, M., Bisazza, A. & Monz, C., 2017. Data Augmentation for Low-Resource Neural Machine Translation. arXiv.
Hosseini, P. et al., 2018. SentiPers: A Sentiment Analysis Corpus for Persian. arXiv.
Kim, Y., 2014. Convolutional Neural Networks for Sentence Classification. Doha, Qatar, s.n.
LeCun, Y., Bengio, Y. & Hinton, G., 2015. Deep learning. Nature, Volume 521, pp. 436-444.
Liu, B., 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies, pp. 1-167.
Liu, B., 2015. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. s.l.: Cambridge University Press.
Li, Y.-M. & Li, T.-Y., 2013. Deriving market intelligence from microblogs. Decision Support Systems, 55(1), pp. 206-217.
Maas, A. L. et al., 2011. Learning Word Vectors for Sentiment Analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142-150.
Pang, B., Lee, L. & Vaithyanathan, S., 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of EMNLP, pp. 79-86.
Prasetijo, A. B. et al., 2017. Hoax detection system on Indonesian news sites based on text classification using SVM and SGD. s.l., s.n., pp. 45-49.
Ramos, J., 2003. Using TF-IDF to Determine Word Relevance in Document Queries. arXiv.
Rojas-Barahona, L. M., 2016. Deep learning for sentiment analysis. Language and Linguistics Compass.
Shams, M., Shakery, A. & Faili, H., 2012. A non-parametric LDA-based induction method for sentiment analysis. Shiraz, Iran, s.n.
Srivastava, N. et al., 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, Volume 15, pp. 1929-1958.
Sugathadasa, K., Ayesha, B., de Silva, N. & Perera, A., 2018. Legal Document Retrieval using Document Vector Embeddings and Deep Learning. arXiv.
Thorsten, J., 1997. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. San Francisco, CA, USA, Morgan Kaufmann Publishers Inc., pp. 143-151.
Turian, J., Ratinov, L. & Bengio, Y., 2010. Word representations: a simple and general method for semi-supervised learning. Stroudsburg, s.n.
Vateekul, P. & Koomsubha, T., 2016. A Study of Sentiment Analysis Using Deep Learning Techniques on Thai Twitter Data. Khon Kaen, s.n.
Wang, S. & Manning, C. D., 2012. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, Volume 2, pp. 90-94.
Xie, Z. et al., 2017. Data Noising as Smoothing in Neural Network Language Models. ICLR.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Sentiment Analysis (SA) is a major field of study in natural language processing, computational linguistics and information retrieval. Interest in SA has been constantly growing in both academia and industry over the recent years. Moreover, there is an increasing need for generating appropriate resources and datasets in particular for low resource languages including Persian. These datasets play an important role in designing and developing appropriate opinion mining platforms using supervised, semi-supervised or unsupervised methods. In this paper, we outline the entire process of developing a manually annotated sentiment corpus, SentiPers, which covers formal and informal written contemporary Persian. To the best of our knowledge, SentiPers is a unique sentiment corpus with such a rich annotation in three different levels including document-level, sentence-level, and entity/aspect-level for Persian. The corpus contains more than 26,000 sentences of users' opinions from digital product domain and benefits from special characteristics such as quantifying the positiveness or negativity of an opinion through assigning a number within a specific range to any given sentence. Furthermore, we present statistics on various components of our corpus as well as studying the inter-annotator agreement among the annotators. Finally, some of the challenges that we faced during the annotation process will be discussed as well.
Conference Paper
Full-text available
A deliberate falsehood intentionally fabricated to appear as the truth, or often called as hoax (hocus to trick) has been increasing at an alarming rate. This situation may cause restlessness/anxiety and panic in society. Even though hoaxes have no effect on threats, however, new perceptions can be spread that they can affect both the social and political conditions. Imagery blown from hoaxes can bring negative effects and intervene state policies that may decrease the economy. An early detection on hoaxes helps the Government to reduce and even eliminate the spread. There are some system that filter hoaxes based on title and also from voting processes from searching processes in a search engine. This research develops Indonesian hoax filter based on text vector representation based on Term Frequency and Document Frequency as well as classification techniques. There are several classification techniques and for this research, Support Vector Machine and Stochastic Gradient Descent are chosen. Support Vector Machine divides a word vector using linear function and Stochastic Gradient Descent divides a word vector using non-linear function. SVM and SGD are chosen because the characteristic of text classification includes multidimensional matrixes. Each word in news articles can be modeled as feature and with Linear SVC and SGD, the feature of word vector can be reduced into two dimensions and can be separated using linear and non-linear lines. The highest accuracy obtained from SGD classifier using modified-huber is 86% over 100 hoax and 100 non-hoax websites which are randomly chosen outside dataset which are used in the training process.
Article
Full-text available
The quality of a Neural Machine Translation system depends substantially on the availability of sizable parallel corpora. For low-resource language pairs this is not the case, resulting in poor translation quality. Inspired by work in computer vision, we propose a novel data augmentation approach that targets low-frequency words by generating new sentence pairs containing rare words in new, synthetically created contexts. Experimental results on simulated low-resource settings show that our method improves translation quality by up to 2.9 BLEU points over the baseline and up to 3.2 BLEU over back-translation.
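The core idea of rare-word-targeted augmentation can be illustrated with a deliberately simplified sketch: find low-frequency words in the corpus and substitute them into existing sentences to create new contexts. The actual method uses a language model to pick plausible substitution positions; the naive frequency-based swap below is only an illustration, and the toy corpus is invented.

```python
# Highly simplified sketch of rare-word-targeted data augmentation:
# swap a frequent word in an existing sentence for a rare one, creating
# a new synthetic context for the rare word. The real method chooses
# substitution positions with a language model; this toy version does not.
import random
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
    "a heron stood by the pond",
]
freq = Counter(w for s in corpus for w in s.split())
rare = [w for w, c in freq.items() if c == 1]    # words seen only once
common = [w for w, c in freq.items() if c >= 3]  # frequent words

def augment(sentence, rng):
    words = sentence.split()
    idxs = [i for i, w in enumerate(words) if w in common]
    if not idxs or not rare:
        return sentence
    i = rng.choice(idxs)          # pick a frequent-word position
    words[i] = rng.choice(rare)   # substitute a rare word there
    return " ".join(words)

print(augment("the cat sat on the mat", random.Random(0)))
```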
Article
Full-text available
Data noising is an effective technique for regularizing neural network models. While noising is widely adopted in application domains such as vision and speech, commonly used noising primitives have not been developed for discrete sequence-level settings such as language modeling. In this paper, we derive a connection between input noising in neural network language models and smoothing in n-gram models. Using this connection, we draw upon ideas from smoothing to develop effective noising schemes. We demonstrate performance gains when applying the proposed schemes to language modeling and machine translation. Finally, we provide empirical analysis validating the relationship between noising and smoothing.
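One of the noising primitives this line of work proposes, unigram noising, can be sketched directly: with some probability gamma, each token is replaced by a sample from the corpus unigram distribution, mirroring the interpolation with a unigram model used in n-gram smoothing. The corpus and gamma value below are illustrative.

```python
# Sketch of unigram noising: with probability gamma, replace each token
# with a draw from the unigram distribution, analogous to interpolation
# smoothing in n-gram language models. Toy counts for illustration only.
import random
from collections import Counter

def unigram_noise(tokens, unigram_counts, gamma, rng):
    vocab = list(unigram_counts)
    weights = [unigram_counts[w] for w in vocab]
    out = []
    for tok in tokens:
        if rng.random() < gamma:
            out.append(rng.choices(vocab, weights=weights)[0])
        else:
            out.append(tok)
    return out

counts = Counter("the quick brown fox jumps over the lazy dog".split())
rng = random.Random(0)
print(unigram_noise("the fox jumps".split(), counts, gamma=0.3, rng=rng))
```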
Chapter
Domain-specific information retrieval has been a prominent and ongoing research topic in natural language processing. Many researchers have applied different techniques to overcome technical and domain specificity and to provide mature models for various domains of interest. The main bottleneck in these studies is the heavy involvement of domain experts, which makes the entire process time-consuming and cumbersome. In this study, we developed three novel models, which are compared against a gold standard generated from the available online repositories, specifically for the legal domain. The three models use vector-space representations of the legal domain, where document vectors are generated by two different mechanisms and by an ensemble of the two. This study covers the research carried out in representing legal case documents in different vector spaces while incorporating semantic word measures and natural language processing techniques. The ensemble model built in this study shows a significantly higher accuracy level, which demonstrates the need to incorporate domain-specific semantic similarity measures into the information retrieval process. The study also shows the impact of varying the distribution of word-similarity measures against varying document-vector dimensions, which can lead to improvements in legal information retrieval.
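One common document-vector mechanism of the kind described above is to average the word embeddings of a document and compare documents by cosine similarity. The sketch below uses tiny random vectors as stand-ins for trained embeddings; the vocabulary and dimensions are invented for illustration.

```python
# Minimal sketch of one document-vector mechanism: a document is the mean
# of its word embeddings, and documents are compared by cosine similarity.
# Random 8-dimensional vectors stand in for trained embeddings here.
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=8) for w in
         ["court", "verdict", "appeal", "contract", "pond", "heron"]}

def doc_vector(doc):
    vecs = [vocab[w] for w in doc.split() if w in vocab]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

legal = doc_vector("court verdict appeal")
same = doc_vector("appeal verdict")        # shares words with `legal`
other = doc_vector("pond heron")           # unrelated vocabulary
print(cosine(legal, same), cosine(legal, other))
```

An ensemble model in this spirit would combine such a representation with a second mechanism (e.g. weighted or similarity-adjusted vectors) before retrieval.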
Chapter
The rise of social media is enabling people to freely express their opinions about products and services. The aim of sentiment analysis is to automatically determine subject’s sentiment (e.g., positive, negative, or neutral) towards a particular aspect such as topic, product, movie, news etc. Deep learning has recently emerged as a powerful machine learning technique to tackle a growing demand of accurate sentiment analysis. However, limited work has been conducted to apply deep learning algorithms to languages other than English, such as Persian. In this work, two deep learning models (deep autoencoders and deep convolutional neural networks (CNNs)) are developed and applied to a novel Persian movie reviews dataset. The proposed deep learning models are analyzed and compared with the state-of-the-art shallow multilayer perceptron (MLP) based machine learning model. Simulation results demonstrate the enhanced performance of deep learning over state-of-the-art MLP.
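The CNN building block used for text in models like the one above slides a filter over windows of word embeddings and max-pools the activations over time. The NumPy sketch below shows only that untrained forward pass, with invented shapes and random weights, not the paper's trained sentiment model.

```python
# Toy forward pass of a text CNN feature extractor: convolve filters over
# word-embedding windows, apply ReLU, then max-pool over time.
# Random weights and shapes are illustrative only; nothing is trained.
import numpy as np

rng = np.random.default_rng(1)
seq_len, emb_dim, n_filters, width = 7, 5, 4, 3
X = rng.normal(size=(seq_len, emb_dim))          # embedded sentence
W = rng.normal(size=(n_filters, width, emb_dim)) # convolution filters
b = np.zeros(n_filters)

def conv_maxpool(X, W, b):
    n_windows = X.shape[0] - W.shape[1] + 1
    feats = np.empty((n_windows, W.shape[0]))
    for i in range(n_windows):
        window = X[i:i + W.shape[1]]             # (width, emb_dim)
        # contract each filter with the window, then ReLU
        feats[i] = np.maximum(0, np.tensordot(W, window, axes=2) + b)
    return feats.max(axis=0)                     # max over time

print(conv_maxpool(X, W, b).shape)               # one value per filter
```

The pooled feature vector would then feed a dense classification layer in a full sentiment model.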
Article
Research and industry are becoming more and more interested in automatically determining the polarity of public opinion on a specific subject. The advent of social networks has opened access to massive volumes of blogs, recommendations, and reviews. The challenge is to extract the polarity from these data, which is the task of opinion mining or sentiment analysis. The specific difficulties inherent in this task include issues related to subjective interpretation and linguistic phenomena that affect the polarity of words. Recently, deep learning has become a popular method for addressing this task. However, different approaches have been proposed in the literature. This article provides an overview of deep learning for sentiment analysis in order to place these approaches in context.
Book
Sentiment analysis is the computational study of people's opinions, sentiments, emotions, and attitudes. This fascinating problem is increasingly important in business and society. It offers numerous research challenges but promises insight useful to anyone interested in opinion analysis and social media analysis. This book gives a comprehensive introduction to the topic from a primarily natural-language-processing point of view to help readers understand the underlying structure of the problem and the language constructs that are commonly used to express opinions and sentiments. It covers all core areas of sentiment analysis, includes many emerging themes, such as debate analysis, intention mining, and fake-opinion detection, and presents computational methods to analyze and summarize opinions. It will be a valuable resource for researchers and practitioners in natural language processing, computer science, management sciences, and the social sciences.