Conference Paper

Opinion Mining on US Airline Twitter Data Using Machine Learning Techniques

Authors:
To read the full-text of this research, you can request a copy directly from the author.

... Abdelrahman I. Saad [4] proposed a machine learning model to categorize Twitter posts into positive, negative and neutral categories. He implemented the model on a dataset consisting of tweets about six distinct airlines in the US. ...
Conference Paper
Full-text available
Nowadays, many people express their opinions on various topics using social networking sites. Twitter has become a popular social networking site where people can express their opinions concisely, and so it has become a great source for opinion mining. In this research, the goal was to train and build a model that can automatically and accurately categorize the opinion of customer tweet reviews about popular cell phone brands. We used the Python TextBlob library to obtain the polarity values of all the tweet reviews in the dataset. We also used Support Vector Machine (SVM), Naïve Bayes, Logistic Regression, Decision Tree and Random Forest algorithms, each with Bag of Words and TF-IDF vectorizers separately, to train and build the model that categorizes the customer tweet reviews into five opinion categories: Strongly Negative, Weakly Negative, Neutral, Weakly Positive and Strongly Positive. We observed that the SVM and Logistic Regression algorithms outperformed the other algorithms with 88% accuracy using the Bag of Words vectorizer, while the SVM algorithm outperformed the other algorithms with 87% accuracy using the TF-IDF vectorizer.
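The step from a polarity value to one of the five opinion categories can be sketched as below. This is a minimal illustration, not the authors' method: the abstract does not give the cut-off values, so the thresholds (0 and ±0.5) and the function name `polarity_to_category` are assumptions. With TextBlob installed, the input score would typically come from `TextBlob(text).sentiment.polarity`, which lies in [-1.0, 1.0].

```python
def polarity_to_category(polarity: float) -> str:
    """Map a polarity score in [-1.0, 1.0] to one of five opinion categories.

    The cut-offs used here (0.0 and +/-0.5) are illustrative assumptions;
    the abstract does not state the thresholds actually used.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1.0, 1.0]")
    if polarity <= -0.5:
        return "Strongly Negative"
    if polarity < 0.0:
        return "Weakly Negative"
    if polarity == 0.0:
        return "Neutral"
    if polarity < 0.5:
        return "Weakly Positive"
    return "Strongly Positive"
```

Labels produced this way could then serve as training targets for the classifiers named in the abstract.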
... The research work [10] used six machine learning models for the sentiment analysis of the US Airlines twitter data. Preprocessing steps including stop word removal, punctuation removal, case folding, and stemming were performed. ...
Article
Full-text available
Due to the rapid development of technology, social media has become increasingly common in daily life. Social media is a platform for people to express their feelings, feedback, and opinions. Sentiment analysis determines whether the sentiment of a text is positive, negative, neutral or any other personal feeling. Sentiment analysis is prominent in business and politics, where it strongly influences strategic decision making. The challenges of sentiment analysis are attributable to the lexical diversity, imbalanced datasets and long-distance dependencies of the texts. In view of this, a data augmentation technique with GloVe word embedding is leveraged to synthesize more lexically diverse samples by similar word vector replacements. The data augmentation also oversamples the minority classes to mitigate the imbalanced dataset problem. Apart from that, existing sentiment analysis mostly leverages sequence models to encode the long-distance dependencies. Nevertheless, sequence models require a longer execution time as the processing is done sequentially. On the other hand, Transformer models require less computation time with parallelized processing. To that end, this paper proposes a hybrid deep learning method that combines the strengths of the sequence model and the Transformer model while suppressing the limitations of the sequence model. Specifically, the proposed model integrates the Robustly optimized BERT approach (RoBERTa) and Long Short-Term Memory (LSTM) for sentiment analysis. The Robustly optimized BERT approach maps the words into a compact meaningful word embedding space while the Long Short-Term Memory model captures the long-distance contextual semantics effectively.
The experimental results demonstrate that the proposed hybrid model outshines the state-of-the-art methods by achieving F1-scores of 93%, 91%, and 90% on IMDb dataset, Twitter US Airline Sentiment dataset, and Sentiment140 dataset, respectively.
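The augmentation idea described in this abstract, replacing words with their nearest neighbours in an embedding space and oversampling the minority classes, can be sketched roughly as follows. This is a toy illustration under stated assumptions: the embedding table is passed in as a plain dictionary rather than loaded from GloVe, and the function names and the `replace_prob` parameter are hypothetical, not taken from the paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_word(word, vectors):
    """Return the most similar other word, or the word itself if it is out of vocabulary."""
    if word not in vectors:
        return word
    candidates = [w for w in vectors if w != word]
    if not candidates:
        return word
    return max(candidates, key=lambda w: cosine(vectors[word], vectors[w]))


def augment(tokens, vectors, rng, replace_prob=0.3):
    """Replace each token with its nearest neighbour with probability replace_prob."""
    return [nearest_word(t, vectors) if rng.random() < replace_prob else t
            for t in tokens]


def oversample_minority(samples, vectors, seed=0):
    """samples: list of (tokens, label) pairs.

    Appends augmented copies until every class is as large as the largest one.
    """
    rng = random.Random(seed)
    by_label = {}
    for tokens, label in samples:
        by_label.setdefault(label, []).append(tokens)
    target = max(len(group) for group in by_label.values())
    balanced = list(samples)
    for label, group in by_label.items():
        for i in range(target - len(group)):
            balanced.append((augment(group[i % len(group)], vectors, rng), label))
    return balanced
```

In practice the `vectors` table would be filled from pre-trained embeddings, and a similarity threshold might be added so that only close neighbours are used as replacements.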
... Owing to the need for and growth of big data technologies in the last decade, the compilation and deployment of tweets have become easier. Twitter is a far more reliable source of knowledge, as people post their actual ideas and feedback, which helps research [2]. Once the airline tweets have been collected, obsolete details are removed. ...
Conference Paper
The airline industry has evolved quite dynamically over the last two decades. Airline firms use traditional customer feedback types that are very routine and time-intensive. Sentiment analysis may be a crucial approach to the analysis of input in order to minimize the problem. Twitter data is a valuable source for gathering user tweets and analyzing viewpoints. This paper proposes a novel deep learning model that effectively combines different word embeddings with deep learning methods to evaluate a dataset made up of tweets for six major US airlines and perform multi-class sentiment analysis. System selections integrate these features with different deep-learning approaches for term embedding and classify sentimental documents. This methodology starts with raw DNN data extraction and tweet-cleaning pre-processing methods for CNN. The test set product is a positive/negative/neutral tweet interpretation with a 3-class data set and data set precision assessment. Finally, we compare the findings obtained from the models presented by various researchers and show that our model is more reliable than the previous frameworks.
Chapter
With the increasing power of the Internet, businesses get a huge amount of customer feedback through their business websites, social media pages, business listings, etc. The majority of businesses do not know how to use this information to improve themselves. However, unstructured feedback on Facebook/Instagram/Twitter is where the volume lies. The problem is that this feedback is unstructured and there is no aggregated sentiment that we may conclude from it. To analyze this unstructured customer feedback at scale, machine learning is used. In this work we present a survey of various machine learning techniques that have been used in the past eight years for analysis of tweets/comments related to the airline industry.
Keywords: Sentiment analysis, Natural language processing (NLP), Naïve Bayes, Logistic regression, Deep learning, CNN, LSTM
Preprint
Full-text available
Social media is a major part of human life in the current era. People post their regular activities, self-indulgent feelings, and real-life experiences on various platforms such as Twitter, Instagram, Facebook, YouTube, etc. For social media surveillance, Twitter is considered to be the most widely used platform (about 64%). Twitter data is a valuable source for gathering tweets and analyzing user perspectives. Along with other industries, the airline industry also wants to stay up to date and keep its sectors alive in the current scenario. Airlines use traditional types of customer feedback that are very common and require a great deal of time. To minimize these problems, sentiment analysis is considered to be a mandatory approach. After the pandemic, as travel resumes and flights are finally taking off, the airline industry is doing its best to keep in touch with its customers more than ever. The dynamic evolution of the airline industry over the last decade is commendable. Millions of people share their experiences with different airline companies every day: happy customers post their pictures with clouds and staff, while angry customers complain about bad service and difficulties they faced, such as missing baggage, delayed flights, changes in boarding schedules, an IT system failure, etc. This kind of real-time feedback not only helps passengers decide which flight to choose but also helps the management team and staff of airlines to analyze the situation and take immediate action to improve their services for a better passenger experience. In the research paper, a hybrid model composed of Machine Learning algorithms, including the Random Forest and Logistic Regression classifiers, named HMRFLR, is proposed to analyze the tweets about airlines in the US and categorize the posts as positive, neutral, or negative.
To reveal the current level of customer satisfaction towards the airlines, sentiment analysis is undertaken. This hybrid model achieves a better accuracy score of 88.16%, whereas the individual accuracy scores of Logistic Regression and Random Forest are 79.1% and 76.87%, respectively.
Article
Sentiment analysis (SA) is a widely used contextual mining technique for extracting useful and subjective information from text-based data. It draws on Natural Language Processing (NLP), text analysis, biometrics, and computational linguistics to identify, analyse, and extract responses, states, or emotions from the data. The feature analysis technique plays a significant role in the development and improvement of an SA model. Recently, GloVe and Word2vec embedding models have been widely used for feature extraction. However, they overlook sentimental and contextual information of the text and need a large corpus of text data for training and generating exact vectors. These techniques generate vectors for just those words that are included in their vocabulary and ignore Out of Vocabulary (OOV) words, which can lead to information loss. Another challenge for the classification of sentiments is the lack of readily available annotated data. Sometimes, there is a contradiction between a review and its label that may cause misclassification. The aim of this paper is to propose a generalized SA model that can handle noisy data, OOV words, and sentimental and contextual loss in review data. In this research, an effective Bi-directional Encoder Representation from Transformers (BERT) based Convolution Bi-directional Recurrent Neural Network (CBRNN) model is proposed for exploring the syntactic and semantic information along with the sentimental and contextual analysis of the data. Initially, zero-shot classification is used for labelling the reviews by calculating their polarity scores. After that, a pre-trained BERT model is employed to obtain sentence-level semantics and contextual features from that data and generate embeddings. The obtained contextual embedded vectors are then passed to the neural network, comprising dilated convolution and Bi-LSTM.
The proposed model uses dilated convolution instead of classical convolution to extract local and global contextual semantic features from the embedded data. Bi-directional Long Short-Term Memory (Bi-LSTM) is used for the entire sequencing of the sentences. The CBRNN model is evaluated across four diverse domain text datasets based on accuracy, precision, recall, f1-score and AUC values. Thus, CBRNN can be efficiently used for performing SA tasks on social media reviews, without any information loss.
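To illustrate why dilated convolution can cover a wider context than classical convolution with the same kernel size, here is a minimal one-dimensional sketch in plain Python. It is not the CBRNN layer itself: the function name `dilated_conv1d` is hypothetical, there is no padding, bias, or multiple channels, and the operation is written in the cross-correlation form commonly used by deep learning libraries.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (unpadded) 1-D dilated convolution in cross-correlation form.

    With dilation d, consecutive kernel taps read inputs that are d positions
    apart, so a kernel of size k spans (k - 1) * d + 1 input positions.
    """
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    span = (len(kernel) - 1) * dilation + 1
    return [
        sum(kernel[k] * signal[start + k * dilation] for k in range(len(kernel)))
        for start in range(len(signal) - span + 1)
    ]
```

With a kernel of size 3, a dilation of 1 reads three adjacent positions, whereas a dilation of 2 reads positions spread over a window of five, which is the sense in which dilation widens the receptive field without adding weights.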
Article
Full-text available
The study of public opinion can provide us with valuable information. The analysis of sentiment on social networks, such as Twitter or Facebook, has become a powerful means of learning about the users’ opinions and has a wide range of applications. However, the efficiency and accuracy of sentiment analysis is being hindered by the challenges encountered in natural language processing (NLP). In recent years, it has been demonstrated that deep learning models are a promising solution to the challenges of NLP. This paper reviews the latest studies that have employed deep learning to solve sentiment analysis problems, such as sentiment polarity. Models using term frequency-inverse document frequency (TF-IDF) and word embedding have been applied to a series of datasets. Finally, a comparative study has been conducted on the experimental results obtained for the different models and input features.
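As a reminder of what the TF-IDF input feature computes, the following is a small sketch using only the standard library. It uses the textbook definition (term frequency as count divided by document length, inverse document frequency as log(N / df)); this is an assumption for illustration, since the review does not specify a variant, and libraries such as scikit-learn apply smoothed and normalized versions by default. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that appears in every document receives a weight of zero under this definition, which is why frequent but uninformative words contribute little to the feature vector.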
Article
Full-text available
The use of data from social networks such as Twitter has increased during the last few years to improve political campaigns, quality of products and services, sentiment analysis, etc. Tweet classification based on user sentiments is a collaborative and important task for many organizations. This paper proposes a voting classifier (VC) to help sentiment analysis for such organizations. The VC is based on logistic regression (LR) and stochastic gradient descent classifier (SGDC) and uses a soft voting mechanism to make the final prediction. Tweets were classified into positive, negative and neutral classes based on the sentiments they contain. In addition, a variety of machine learning classifiers were evaluated using accuracy, precision, recall and F1 score as the performance metrics. The impact of feature extraction techniques, including term frequency (TF), term frequency-inverse document frequency (TF-IDF), and word2vec, on classification accuracy was investigated as well. Moreover, the performance of a deep long short-term memory (LSTM) network was analyzed on the selected dataset. The results show that the proposed VC performs better than the other classifiers. The VC achieves an accuracy of 0.789 and 0.791 with TF and TF-IDF feature extraction, respectively. The results demonstrate that ensemble classifiers achieve higher accuracy than non-ensemble classifiers. Experiments further showed that the performance of machine learning classifiers is better when TF-IDF is used as the feature extraction method. Word2vec feature extraction performs worse than TF and TF-IDF feature extraction. The LSTM achieves a lower accuracy than the machine learning classifiers.
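The soft voting mechanism mentioned in this abstract averages the class probabilities predicted by each base model and picks the class with the highest average. The sketch below shows only that combination step; the probability values are made up for illustration, equal model weights are assumed, and the function name `soft_vote` is hypothetical.

```python
def soft_vote(probabilities_per_model, classes):
    """Average per-class probabilities across models.

    probabilities_per_model: one probability list per model, ordered like `classes`.
    Returns (predicted label, averaged probabilities).
    """
    n_models = len(probabilities_per_model)
    averaged = [
        sum(probs[i] for probs in probabilities_per_model) / n_models
        for i in range(len(classes))
    ]
    best = max(range(len(classes)), key=averaged.__getitem__)
    return classes[best], averaged
```

For example, if one model outputs [0.5, 0.3, 0.2] and the other [0.2, 0.2, 0.6] over (negative, neutral, positive), the two models disagree on the top class, but the averaged probabilities favour positive, so soft voting resolves the disagreement using the models' confidence.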
Article
Full-text available
Cancer is one of the leading causes of death in the world. Adenosine is a molecule found in all human cells; by coupling with G protein, it turns into an adenosine receptor. The adenosine receptor is an important target for cancer therapy. Adenosine stops the growth of malignant tumor cells such as lymphoma, melanoma and prostate carcinoma. Adenosine is activated by interacting with drugs to stop tumor cells from spreading and to cure cancer. This research aims to predict drugs and potential drug candidates that interact with adenosine receptors. We built a machine learning model using three different classification techniques: Random Forest (RF), Decision Tree (DT) and Support Vector Machine (SVM), then chose the best technique after comparing the results. Unlike other studies, we used the drug side effect integrated into the drug fingerprint as a feature to train our model to classify drugs as interacting or non-interacting with adenosine receptors. We ranked the drugs that interact with adenosine receptors based on drug side effects to find the most preferred drug (least side effects) among several drugs, which helps in drug design. Most existing datasets contain drugs, targets and the interactions between them, neglecting drug side effects. We formed a new dataset that includes drug side effects. The new dataset is composed of 400 drugs, 794 targets and 3990 drug side effects. Since the dataset was imbalanced, we applied the Synthetic Minority Oversampling Technique (SMOTE). After conducting experiments, RF achieved the best classification performance with an accuracy of 75.09%.
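The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration, not the configuration used in the study: the function name `smote`, the default `k`, and the seed are assumptions, and in practice a maintained implementation such as `imblearn.over_sampling.SMOTE` from the imbalanced-learn package would normally be used.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class feature vectors.

    Each synthetic point lies on the segment between a randomly chosen minority
    point and one of its k nearest minority neighbours (squared Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    if k < 1:
        raise ValueError("k must be >= 1")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class rather than duplicating existing rows.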
Article
Full-text available
Nowadays, many applications that use large data have been developed due to the existence of the Internet of Things. These applications are translated into different languages and require automated text classification (ATC). The ATC process depends on the content of one or more predefined classes. However, this process is problematic for the Arabic translation of the data. This study aims to solve this issue by investigating the performances of three classification algorithms, namely, k-nearest neighbor (KNN), decision tree (DT), and naïve Bayes (NB) classifiers, on Saudi Press Agency datasets. Results showed that the NB algorithm outperformed DT and KNN algorithms in terms of precision, recall, and F1. In future works, a new algorithm that can improve the handling of the ATC problem will be developed.
Article
Recommendation systems (RSs) have garnered immense interest for applications in e-commerce and digital media. Traditional approaches in RSs include collaborative filtering (CF) and content-based filtering (CBF), though these approaches have certain limitations, such as the necessity of prior user history and habits for performing the task of recommendation. To minimize the effect of such limitations, this article proposes a hybrid RS for movies that leverages the best of the concepts used in CF and CBF along with sentiment analysis of tweets from microblogging sites. The purpose of using movie tweets is to understand the current trends, public sentiment, and user response to the movie. Experiments conducted on the public database have yielded promising results.
Conference Paper
Drug discovery is an important step before drug development. Drug discovery is the process of identifying and testing a drug before medical use. Drugs are used to cure diseases by interacting with the target, which is the protein in the human cells. Many resources (cost and time) are wasted on lab experiments to discover drugs and their applications. Yet machine learning has enhanced the process of drug discovery and the prediction of drug-target interaction, which has helped in predicting new drugs and finding more applications for old drugs. Predicting drug-target interaction starts by studying the nature of drugs and their properties. Most existing datasets cover drugs, targets and their interactions. We compiled our dataset to include side effects as a drug feature. The dataset contains 400 drugs, 794 targets and 3990 side effects. In this study, a machine-learning model is implemented using three different classifiers: Decision Tree, Random Forest (RF) and K-Nearest Neighbors (K-NN). Drug fingerprint and side effect were used as input features to train our model. Three different experiments were conducted using fingerprint, side effect, and both fingerprint and side effect. Results showed improvement in prediction when integrating both drug fingerprint and side effect. K-NN scored the best results in the three experiments with an average accuracy of 94.69%.
Article
Social media is the main advertising and analysis tool in today’s world. Twitter is a microblogging service usually used as an instant communication platform. It contains a rich amount of data in semi-structured format. The capacity to provide information in real time has stimulated many companies to use this service to understand their consumers. In this work, a method was adopted to fetch data in real time directly from Twitter, and sentiment analysis was performed on the streamed data. To provide scalability and reduce the cost of analytics, the work was implemented using the Apache open source platform, Hadoop. Further, the sentiment analysis was conducted based on the dictionary model.
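A dictionary-based (lexicon) approach of the kind mentioned here scores a text by summing the polarity of the words found in a sentiment dictionary. The sketch below shows the idea in a few lines; the tiny `LEXICON`, its word scores, and the function name are hypothetical stand-ins, not the dictionary or Hadoop pipeline used in the work.

```python
import re

# Tiny illustrative lexicon (hypothetical); real systems use far larger dictionaries.
LEXICON = {"great": 1, "good": 1, "love": 1, "bad": -1, "delayed": -1, "lost": -1}


def dictionary_sentiment(text, lexicon=LEXICON):
    """Label text as positive, negative, or neutral from the summed word scores."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Because each tweet is scored independently, this kind of function maps naturally onto a distributed map step, which is one reason dictionary models are a common fit for streaming and batch platforms.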
Chapter
Social networks represent an emerging and challenging sector where the natural language expressions of people can be easily reported through short but meaningful text messages. Key information that can be grasped from social environments relates to the polarity of text messages (i.e., positive, negative, or neutral). In this chapter we present a literature review regarding polarity classification in social networks, distinguishing between supervised, unsupervised, and semisupervised machine learning models. In particular, the most recent advancements of the state of the art are presented, focusing on the real nature of the messages that are actually provided in an informal and networked environment.