Table 4 - uploaded by Ramy Baly
Sentiment Distribution 

Source publication
Article
Sentiment analysis in Arabic is challenging due to the complex morphology of the language. The task becomes more challenging when considering Twitter data that contain significant amounts of noise such as the use of Arabizi, code-switching and different dialects that vary significantly across the Arab world, the use of non-textual objects to expr...

Context in source publication

Context 1
... Table 4 shows how sentiment is distributed across tweets of each country. Sentiment in Egyptian tweets is normally distributed, with most tweets being neutral and very few having high sentiment intensities. ...

Similar publications

Preprint
In this paper, we present Arap-Tweet, which is a large-scale and multi-dialectal corpus of Tweets from 11 regions and 16 countries in the Arab world representing the major Arabic dialectal varieties. To build this corpus, we collected data from Twitter and we provided a team of experienced annotators with annotation guidelines that they used to ann...

Citations

... The machine learning methods are more accurate than the other methods when it comes to binary classification. A deep learning framework proposed by [17] identified the polarity of tweets in a 5-scale classification that spans from extremely positive to extremely negative. They collected a total of about 470 thousand tweets from twelve Arab nations in four regions (North Africa, Egypt, the Levant and the Arab Gulf). ...
Preprint
Social networks are popular for advertising, idea sharing, and opinion formation. Due to COVID-19, coronavirus information disseminated on social media affects people's lives directly. Individuals sometimes managed it well, but it often hampered daily activities. As a result, analyzing people's feelings is important. Sentiment analysis identifies opinions or sentiments from text. In this paper, we present an effective model that leverages the benefits of Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) to categorize Arabic tweets using a stacked ensemble learning model. First, the tweets are represented as vectors using a word embedding model, then the text features are extracted by CNN, and finally the context information of the text is acquired by BiLSTM. Aravec, FastText, and ArWordVec are employed separately to assess the impact of the word embedding on our model. We also compare the proposed method to various deep learning models: CNN, LSTM, and BiLSTM. Experiments are performed on three different Arabic datasets related to COVID-19 and vaccines. Empirical findings show that the proposed model outperformed the other models by achieving F-measures of 76.76%, 87.%, and 80.5% on the SenWave, AraCOVID19-SSD, and ArCovidVac datasets, respectively.
... Sentiment analysis (S.A.) plays an important role in many real-world applications. S.A. aids business intelligence in evaluating consumer reviews about a product [1], helps in decision making for stock market prediction [2], and supports the classification of different dialects in different languages, as in Arabic [3]. S.A. systems also support spam identification on social media by automatically identifying spam terms across many forums, emails, and blogs [4], as well as opinion summarization and public opinion analysis [5,6]. ...
Article
Multilabel emotion classification is a high priority because it mimics real-life scenarios in which people display a variety of emotions. The text could express a collection of emotions such as happiness, love, and optimism, or sadness, anger, and pessimism. In this framework, the Arabic tweets data provided by SemEval 2018-Task1, E-c subtask have been first preprocessed through different normalization steps, including stemming, stop-word removal, and removal of special characters and digits. An emotion lexicon has been built to replace the emotions with their meaning related to emotion classes. A pre-trained word embedding model, Aravec, has been used for the feature extraction process because word embedding performed better in this task than other features such as the N-gram model. In the classification process of our framework, different machine learning techniques have been implemented, including Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), K Nearest Neighbor (KNN), Ensemble Random Forest (RF), and Ensemble Extra Tree. The best performance was achieved using MLP, whereas SVM performed best among the traditional machine learning techniques (KNN, RF, and Extra Tree). Extra Tree achieved a multilabel Jaccard accuracy of 26.2%, KNN 37.5%, RF 29.1%, and SVM 46.3%. The MLP neural network model achieved an accuracy of 48%. The proposed framework has been compared with different previous machine learning models built for this task; the results obtained by the proposed framework outperform other previous models in most cases.
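The multilabel Jaccard accuracy reported in this abstract can be illustrated with a minimal sketch. It assumes the usual definition (the per-example ratio of shared labels to all labels, averaged over examples); the function name `jaccard_accuracy`, the toy label sets, and the treatment of examples with no labels at all are assumptions for illustration, not details taken from the paper.

```python
from typing import Iterable, Set


def jaccard_accuracy(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> float:
    """Mean over examples of |gold ∩ pred| / |gold ∪ pred|."""
    gold, pred = list(gold), list(pred)
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    total = 0.0
    for g, p in zip(gold, pred):
        union = g | p
        # Assumption: an example with no gold and no predicted labels
        # counts as a perfect match.
        total += 1.0 if not union else len(g & p) / len(union)
    return total / len(gold)


# Hypothetical emotion labels for three tweets.
gold = [{"joy", "love"}, {"anger"}, set()]
pred = [{"joy"}, {"anger", "sadness"}, set()]
score = jaccard_accuracy(gold, pred)  # (1/2 + 1/2 + 1) / 3
```

Partial overlaps earn partial credit under this metric, which is why multilabel scores tend to look low next to single-label accuracy figures.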
... Recently, as for Dialect Identification, researchers and developers started using deep learning networks for Sentiment Analysis with word embeddings and pretrained language models. A CNN feature extractor and transformation network was proposed in (Soumeur et al., 2018) to determine the sentiment of Algerian users' comments on various Facebook brand pages of companies in Algeria, while (Baly et al., 2017) present an LSTM network with pre-trained word embeddings to build a 5-scale Sentiment Analysis model for 4 Arabic dialects. A combination of word and document embeddings in addition to a set of semantic features were used in (Abdullah et al., 2018) for Arabic tweets. ...
Conference Paper
The usage of social media platforms has resulted in the proliferation of work on Arabic Natural Language Processing (ANLP), including the development of resources. There is also an increased interest in processing Arabic dialects and a number of models and algorithms have been utilized for the purpose of Dialectal Arabic Natural Language Processing (DANLP). In this paper, we conduct a comparison study between some of the most well-known and most commonly used methods in NLP in order to test their performance on different corpora and two NLP tasks: Dialect Identification and Sentiment Analysis. In particular, we compare three general classes of models: a) traditional Machine Learning models with features, b) classic Deep Learning architectures (LSTMs) with pre-trained word embeddings and lastly c) different Bidirectional Encoder Representations from Transformers (BERT) models such as Multilingual-BERT, Ara-BERT, and Twitter-Arabic-BERT. The results of the comparison show that using feature-based classification can still compete with BERT models in these dialectal Arabic contexts. Transformer models can outperform traditional Machine Learning approaches, depending on the type of text they have been trained on, in contrast to classic Deep Learning models like LSTMs, which do not perform well on the tasks.
... Some public datasets consist of positive and negative classes such as the Large-Scale Arabic Book Review [27] and Ar-Twitter, proposed by [28]. The rest of the available datasets consist of more classes, such as [29], which proposed four classes, and ArsenTb, which employs five classes [10,30,31]. ...
... They used the English dataset translated into Arabic, carried out the classification using RCNN, and achieved 94% prediction accuracy. Ref. [30] implemented LSTM on a small corpus with five classes in two Arabic dialects: Emirati and Egyptian. They achieved accuracies of 70% on Egyptian dialects and 63.7% on Emirati dialects. ...
Article
Arabic is one of the official languages recognized by the United Nations (UN) and is widely used in the Middle East, and parts of Asia, Africa, and other countries. Social media activity currently dominates the textual communication on the Internet and potentially represents people’s views about specific issues. Opinion mining is an important task for understanding public opinion polarity towards an issue. Understanding public opinion leads to better decisions in many fields, such as public services and business. Language background plays a vital role in understanding opinion polarity. Variation is not only due to the vocabulary but also cultural background. The sentence is a time series signal; therefore, sequence gives a significant correlation to the meaning of the text. A recurrent neural network (RNN) is a variant of deep learning where the sequence is considered. Long short-term memory (LSTM) is an implementation of RNN with a particular gate to keep or ignore specific word signals during a sequence of inputs. Text is unstructured data, and it cannot be processed further by a machine unless an algorithm transforms the representation into a readable machine learning format as a vector of numerical values. Transformation algorithms range from the Term Frequency–Inverse Document Frequency (TF-IDF) transform to advanced word embedding. Word embedding methods include GloVe, word2vec, BERT, and fastText. This research experimented with those algorithms to perform vector transformation of the Arabic text dataset. This study implements and compares the GloVe and fastText word embedding algorithms and long short-term memory (LSTM) implemented in single-, double-, and triple-layer architectures. Finally, this research compares their accuracy for opinion mining on an Arabic dataset. It evaluates the proposed algorithm with the ASAD dataset of 55,000 annotated tweets in three classes.
The dataset was augmented to achieve equal proportions of positive, negative, and neutral classes. According to the evaluation results, the triple-layer LSTM with fastText word embedding achieved the best testing accuracy, at 90.9%, surpassing all other experimental scenarios.
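As a rough illustration of the simplest transformation this abstract mentions, the sketch below turns tokenized texts into TF-IDF vectors. It assumes one common variant (term frequency normalized by document length, unsmoothed inverse document frequency with a natural logarithm); the function name `tf_idf` and the toy tokens are hypothetical, and the study itself may have used a different weighting or a library implementation.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Map each tokenized document to a sparse {term: weight} vector."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        vectors.append({
            # tf = count / document length, idf = ln(N / df)
            term: (count / length) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical pre-tokenized texts.
docs = [
    ["service", "good"],
    ["service", "bad"],
    ["service", "good", "good"],
]
vectors = tf_idf(docs)
```

A term that occurs in every document ("service" here) receives weight zero, which is the property that lets TF-IDF play down uninformative words before a classifier sees the vectors.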
... A Multi-Dialect Arabic Sentiment Twitter Dataset (MD-ArSenTD) was proposed to analyze tweets collected from Egypt and the United Arab Emirates (UAE) using different deep learning models [25]. In [26], a massive number of tweets (6M) for Arabic sentiment analysis were collected and labeled using emoji sentiment lexicons. ...
Article
In the field of sentiment analysis, most research has conducted experiments on datasets collected from Twitter for a specific language. Few datasets have been collected for detecting sentiments expressed in Arabic tweets. Moreover, very few of these datasets are suitable for recent research directions such as target-dependent sentiment analysis and open-domain targeted sentiment analysis. There is therefore a dire need for reliable datasets that are specifically acquired for open-domain targeted sentiment analysis in the Arabic language. In this paper, we introduce AT-ODTSA, a dataset of Arabic Tweets for Open-Domain Targeted Sentiment Analysis, which includes Arabic tweets along with labels that specify targets (topics) and sentiments (opinions) expressed in the collected tweets. To the best of our knowledge, our work presents the first dataset that is manually annotated for Arabic open-domain targeted sentiment analysis. We also present a detailed statistical analysis of the dataset. The AT-ODTSA dataset is suitable for training numerous machine learning models, such as deep learning-based models.
... Baly et al. (Baly et al. 2017b) provided the first multidialectal Arabic sentiment Twitter dataset (MD-ArSenTD). It contains annotated tweets (for both sentiment and dialect) that were collected from 12 Arab countries from the Gulf, Levant, and North Africa. ...
Article
Over the last decade, the amount of Arabic content created on websites and social media has grown significantly. Opinions are shared openly and freely on social media and thus provide a rich source for trend analyses, which are accomplished by conventional methods of language interpretation, such as sentiment analysis. Due to its accuracy in studying unstructured data, deep learning has been increasingly used to test opinions. Recurrent neural networks (RNNs) are a promising approach in textual analysis and exhibit large morphological variations. In total, 193 studies used RNNs in English-language sentiment analysis, and 24 studies used RNNs in Arabic-language sentiment analysis. Those studies varied in the areas they address, the functionality and weaknesses of the models, and the number and scale of the available datasets for different dialects. Such variations are worthy of attention and monitoring; thus, this paper presents a systematic examination of the literature to label, evaluate, and identify state-of-the-art studies using RNNs for Arabic sentiment analysis.
... However, the Arabic language is one of the most popular languages among internet users, resulting in a growing interest in the research area of Arabic SA and resources [11], [12]. The analysis of Arabic SA in various dialects, as well as the enhancement of Arabic ABSA, is addressed in [13], [14]. Developing extensive and comprehensive Arabic lexicons is critical for the advancement of the discipline; articles [15], [16], and [17] focused on building Arabic lexicons. ...
Article
Sentiment analysis (SA) or opinion mining extracts and analyses subjective information from various sources such as the web, social media, and other sources to determine people's opinions using natural language processing (NLP), computational linguistics, and text analysis. This analyzed information gives the public's feelings or attitudes about specific items, persons, or ideas and identifies the information's contextual polarity. This systematic review gives a clear image of recent work in SA; it studies the papers published in the SA field between 2016 and 2020 using the ScienceDirect and Springer databases. Furthermore, it explains the various approaches employed and the various uses of SA systems. In ScienceDirect, 99 publications meet our research requirements, whereas in Springer, 57 papers meet the same conditions, for a total of 156 papers reviewed and assessed in this systematic review. Techniques, performance, language, and the domain have been analyzed.
... In [16], the authors proposed a deep learning model based on Long short-term memory (LSTM) architecture to identify the sentiments of documents written in Egyptian and Emirati dialects. To train this model, the authors collected and annotated a corpus of 470k tweets. ...
Conference Paper
In this article, we tackle the issue of sentiment analysis in three Maghrebi dialects used in social networks. More precisely, we are interested in analysing sentiments in Algerian, Moroccan and Tunisian corpora. To do this, we automatically built three lexicons of sentiments, one for each dialect. Each lexicon is composed of words with their polarities; a dialect word could be written in Arabic or in Latin script. These lexicons may include French or English words as well as words in Arabic dialect and standard Arabic. The semantic orientation of a word represented by an embedding vector is determined automatically by calculating its distance to several embedding seed words. The embedding vectors are trained on three large corpora collected from YouTube. The proposed approach is evaluated using a few existing annotated corpora in Tunisian and Moroccan dialects. For the Algerian dialect, in addition to a small corpus we found in the literature, we collected and annotated one composed of 10k comments extracted from YouTube. This corpus represents a valuable resource which is made freely available.
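The seed-word idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity as the closeness measure, orientation defined as mean similarity to positive seeds minus mean similarity to negative seeds, and hand-made 2-D toy vectors. The names `cosine`, `orientation`, and `embeddings` are hypothetical, and the paper's actual distance measure and seed lists are not given here.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def orientation(word: Vector, pos_seeds: Sequence[Vector],
                neg_seeds: Sequence[Vector]) -> float:
    """Positive score: closer to positive seeds; negative: closer to negative seeds."""
    pos = sum(cosine(word, s) for s in pos_seeds) / len(pos_seeds)
    neg = sum(cosine(word, s) for s in neg_seeds) / len(neg_seeds)
    return pos - neg


# Toy 2-D embeddings (assumed values; real vectors would come from a trained model).
embeddings: Dict[str, Vector] = {
    "good": (1.0, 0.1), "great": (0.9, 0.2),
    "bad": (-1.0, 0.1), "awful": (-0.9, 0.3),
    "nice": (0.8, 0.3), "ugly": (-0.7, 0.2),
}
pos_seeds = [embeddings["good"], embeddings["great"]]
neg_seeds = [embeddings["bad"], embeddings["awful"]]

nice_score = orientation(embeddings["nice"], pos_seeds, neg_seeds)
ugly_score = orientation(embeddings["ugly"], pos_seeds, neg_seeds)
```

Thresholding such a score is one way to assign a polarity to every word in the vocabulary and so build a lexicon without manual annotation.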
... The corpus is manually annotated for sentiment and labeled with four labels for sentiment: positive, negative, neutral and mixed. Another corpus created based on Twitter is the Multi-Dialect Arabic Sentiment Twitter Dataset (MD-ArSenTD) [31] which is a multidialect Arabic corpus collected from tweets from 12 Arab countries (KW, SA, QA, UAE, Jordan, Lebanon, Palestine, Syria, Algeria, Morocco, Tunisia, Egypt) and annotated for sentiment and dialect. The Twitter4J API [32] was used to collect 470K tweets posted from 3/1/2017 to 4/30/2017. ...
Article
Due to the rapid developments in technology and the sudden expansion of social media use, Dialect Arabic has become an important source of data that needs to be addressed when building Arabic corpora. In this paper, thirty-three Arabic corpora are surveyed to show that despite all of the developments in the literature, Saudi dialect (SD) corpora still need further expansion. This paper contributes to the literature on SD corpora by creating the largest Saudi corpus – the King Saud University Saudi Corpus (KSUSC) – with +1B total words, including +119M SD words. The KSUSC not only is the newest and largest SD corpus but is also diverse, covering 26 domains in text collected from five different sources. This paper also contributes to the literature by developing a new incremental preprocessing system that is used to create relevant lexicons that are then used to clean and normalize the collected data. This incremental system is scalable and can be adapted for different resources and dialects. Moreover, the collection process for building the KSUSC is discussed in detail, and the challenges in collecting SD text with respect to each platform are highlighted. By the end of this paper, different design criteria are proposed and used with the KSUSC to conclude that the resulting corpus can be of great benefit to researchers who are interested in integrating the corpus with their own work or using its resulting lexicons with Saudi-based NLP tasks.
... It is a fast and excellent tool to build NLP models and generate live predictions [6]. LSTM with word embeddings is used to perform the sentiment classification, as this classifier outperforms the traditional techniques in text classification [7]. The classifier performed well with embeddings, especially when dealing with the sentiment classification of Arabic dialects [8]. ...
Article
At a time when research in the field of sentiment analysis tends to study advanced topics in languages such as English, other languages such as Arabic still suffer from basic problems and challenges, most notably the availability of large corpora. Furthermore, manual annotation is time-consuming and difficult when the corpus is too large. This paper presents a semi-supervised self-learning technique to extend an Arabic sentiment annotated corpus with unlabeled data, named AraSenCorpus. We use a neural network to train a set of models on a manually labeled dataset containing 15,000 tweets. We used these models to extend the corpus to a large Arabic sentiment corpus called “AraSenCorpus”. AraSenCorpus contains 4.5 million tweets and covers both modern standard Arabic and some of the Arabic dialects. The long short-term memory (LSTM) deep learning classifier is used to train and test the final corpus. We evaluate our proposed framework on two external benchmark datasets to ensure the improvement of the Arabic sentiment classification. The experimental results show that our corpus outperforms the existing state-of-the-art systems.
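The general shape of a semi-supervised self-learning (self-training) loop like the one this abstract describes might look like the sketch below. Everything specific in it is an assumption for illustration: the 1-D features, the nearest-centroid classifier standing in for the paper's neural network models, the confidence formula, and the names `self_train` and `fit_centroids`.

```python
from typing import Callable, Dict, List, Tuple

Labeled = List[Tuple[float, str]]
Predictor = Callable[[float], Tuple[str, float]]


def fit_centroids(labeled: Labeled) -> Predictor:
    """Toy stand-in classifier: one centroid per class on 1-D features."""
    sums: Dict[str, Tuple[float, int]] = {}
    for x, y in labeled:
        total, n = sums.get(y, (0.0, 0))
        sums[y] = (total + x, n + 1)
    centroids = {y: total / n for y, (total, n) in sums.items()}

    def predict(x: float) -> Tuple[str, float]:
        dists = {y: abs(x - c) for y, c in centroids.items()}
        label = min(dists, key=dists.get)
        total = sum(dists.values())
        # Assumed confidence: 1 minus the nearest distance's share of all distances.
        confidence = 1.0 - dists[label] / total if total else 1.0
        return label, confidence

    return predict


def self_train(labeled: Labeled, unlabeled: List[float],
               fit: Callable[[Labeled], Predictor],
               threshold: float = 0.8, max_rounds: int = 5):
    """Repeatedly move confidently predicted examples into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        predict = fit(labeled)
        remaining, added = [], 0
        for x in pool:
            label, confidence = predict(x)
            if confidence >= threshold:
                labeled.append((x, label))  # accept the pseudo-label
                added += 1
            else:
                remaining.append(x)         # keep for a later round
        pool = remaining
        if added == 0 or not pool:
            break
    return labeled, pool


seed = [(-1.0, "neg"), (1.0, "pos")]
grown, leftover = self_train(seed, [-0.9, 0.8, 0.05], fit_centroids)
```

In this toy run the two clear-cut points receive pseudo-labels while the ambiguous point near zero stays unlabeled, which shows how a confidence threshold limits the noise added to the corpus.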