Conference Paper

A Critical Analysis of Twitter Data for Movie Reviews Through ‘Random Forest’ Approach

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

Sentiment analysis makes it possible to understand how users respond to movies through their feedback. Here the analysis is based on movie reviews, which can be collected from many sources. Twitter is one of the most widely used online social media and microblogging services. Because of its popularity, Twitter has become a useful resource for collecting sentiments through its API or other data mining techniques. Our work evaluates machine learning algorithms (Random Forest, bagging, SVM and Naïve Bayes) in R for gathering public opinion, for example opinion about the movie ‘Civil War’. We use ‘Random Forest’ to show its better performance in the analysis of movie reviews.
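
Any of the classifiers named in the abstract needs tweets turned into numeric features first. The abstract does not describe that step, so the following is only a minimal sketch of one common choice, a bag-of-words count vector. It is written in Python rather than the R used in the paper, and the function names and example tweets are hypothetical.

```python
import re
from collections import Counter

def tokenize(tweet):
    """Lowercase a tweet and drop URLs, @mentions and the '#' of hashtags."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"@\w+", " ", text)           # remove user mentions
    text = text.replace("#", " ")               # keep hashtag words, drop '#'
    return re.findall(r"[a-z']+", text)

def build_vocabulary(tweets):
    """Sorted list of every distinct token seen in the tweets."""
    return sorted({tok for t in tweets for tok in tokenize(t)})

def vectorize(tweet, vocabulary):
    """Term-count vector of a tweet over a fixed vocabulary."""
    counts = Counter(tokenize(tweet))
    return [counts[word] for word in vocabulary]

# Made-up example tweets (not from the paper's data set).
tweets = [
    "Loved #CivilWar, great action! http://example.com",
    "@friend Civil War was boring and too long",
]
vocab = build_vocabulary(tweets)
vectors = [vectorize(t, vocab) for t in tweets]
```

Each tweet becomes one row of counts over the shared vocabulary, which is the kind of input a Random Forest, SVM or Naïve Bayes classifier can be trained on.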

... On the Internet, the reviews by the customers and buyers play a major role in assessing the quality of a product or service. Several research papers have been published on prediction of box office success of movies [6][7][8][9][20][21][22][23][24][25][26][27][28][29]. In this paper, the author presents a 2 layered back-propagation neural network model with 23 numeric inputs (BPNN-N23) developed for prediction of the success status among 3 success classes. ...
... Dubey & Agarwal (2017) [8] collected 306 tweets on the reviews of the movie "Civil War" on Twitter. They considered two classes viz. ...
Article
Full-text available
Forecasting the commercial success of motion pictures remains challenging for producers, critics and other industry leaders in the changing world of web and online media. In this study, the author explores a back-propagation neural network model with 23 numeric inputs (BPNN-N23) for classification of Bollywood movies released during the years 2014 through 2017. The proposed model classifies movies into three classes, namely “HIT”, “AVERAGE” and “FLOP”. Common procedures such as data filtering, data cleaning and data normalization were followed before feeding the data to the neural network. A comparison with benchmark models and published works shows that the proposed model performs comparably under the assumed Indian empirical settings. This research reveals the extent of the effects and roles of the considered factors as well as the proposed model in predicting the fate of a Bollywood movie in India.
... Thus, it has been extensively used in recent research on text classification (e.g., [23]), as well as in the context of cyber security [1], [15], [16]. On the other hand, Random Forest is also found to perform well on text classification, e.g., on sentiment analysis [24] and clinical text classification [25]. ...
... RF is also known as an ensemble method of random decision trees (Vens, 2013), combining the predictions of the individual trees. Previous research such as (Kamanksha & Sanjay, 2018) found that RF was the better classifier in their analysis. Thus, the results in this study show that RF is a good classifier, either on its own or combined as a base classifier for other ensemble methods. ...
Conference Paper
Full-text available
The aim of this paper is to investigate the effects of combining feature selection and ensemble classifiers on prediction performance in multiclass imbalanced data learning. This research uses Malaysian medicinal leaf image shape data and three other large benchmark data sets, on which six ensemble methods from the Weka machine learning tool were selected to perform the classification task. These ensemble methods are AdaboostM1, Bagging, Decorate, END, MultiboostAB, and Rotation Forest. In addition, five base classifiers were used (Naïve Bayes, SMO, J48, Random Forest, and Random Tree) in order to examine the performance of the ensemble methods. Two feature selection approaches were implemented: filter-based (CfsSubsetEval, ConsistencySubsetEval and FilteredSubsetEval) and wrapper-based (WrapperSubsetEval). The results of the experiments show that, although accuracy is not much improved, with fewer attributes the classifiers achieve similar or slightly better accuracy with less processing time. For knowledge management, the findings provide important insight into which algorithm is suitable for decision making when dealing with high-dimensional and large data.
... RF is also known as an ensemble method of random decision trees [14], combining the predictions of the individual trees. Previous research such as [15,16] found that RF was the better classifier in their analysis. Thus, the results of RF in this study further support that it is a strong classifier, even when combined as a base classifier for other ensemble methods. ...
Chapter
Full-text available
The aim of this paper is to investigate the effects of combining various sampling and ensemble classifiers on the prediction performance in addressing the multiclass imbalance data learning. This research uses data obtained from the Malaysian medicinal leaf images shape data and three other large benchmark datasets in which seven ensemble methods from Weka machine learning tool were selected to perform the classification task. These ensemble methods include the AdaboostM1, Bagging, Decorate, END, MultiboostAB, RotationForest, and stacking methods. In addition to that, five base classifiers were used; Naïve Bayes, SMO, J48, Random Forest, and Random Tree in order to examine the performance of the ensemble methods. Two methods of combining the sampling and ensemble classifiers were used which are called the Resample with ensemble classifier and SMOTE with ensemble classifier. The results obtained from the experiments show that there is actually no single configuration that is “one design that fits all”. However, it is proven that when using the sampling and ensemble classifier which is coupled with Random Forest, the prediction performance of the classification task can be improved on the multiclass imbalance dataset.
Conference Paper
This study used sentiment analysis to identify the polarity of tweets in order to address higher education issues in Saudi Arabia and to track sentiments over time. Shaqra University was selected as a sample of the Saudi higher education system; thus, this study detects social media content regarding Shaqra University and applies sentiment analysis techniques to classify the sentiment and subject of the posts. The study applies this technique to social media content in the Arabic language, which is more difficult than languages such as English, and shows its application to uncover previously unknown issues for a target organization. Unlike previous studies, this study did not begin with known concerns but rather discovered unknown concerns. It applies a structured vocabulary model to a language that is typically difficult to structure programmatically and uses this technique to structure knowledge found in social media content. For this purpose, we use our previously built lexicon, SaudiSentiPlus, a Saudi dialect sentiment lexicon that was developed in the author's prior study and contains 7,139 words. SaudiSentiPlus and its algorithms have been used in this study to investigate and track sentiments toward Shaqra University’s performance over the seven periods. This study created its own dataset from Twitter. The dataset contains tweets regarding sentiments toward Shaqra University collected using the Twitter API for seven periods, starting from 1–2016 to 7–2019. A total of 49,270 Arabic tweets were crawled and indexed. This study proposes a new application for sentiment analysis on social media. This shows that sentiment analysis can play an important role in developing higher education. By using sentiment analysis, the study identified 18 university-related issues. Half of them accounted for 66% of the weight of the university’s issues.
Article
Full-text available
Due to the sheer volume of opinion-rich web resources such as discussion forums, review sites, blogs and news corpora available in digital form, much current research focuses on the area of sentiment analysis. Researchers aim to develop systems that can identify and classify opinion or sentiment as represented in electronic text. An accurate method for predicting sentiments could enable us to extract opinions from the internet and predict online customers' preferences, which could prove valuable for economic or marketing research. So far, a few different problems predominate in this research community, namely sentiment classification, feature-based classification and handling negations. This paper presents a survey covering the techniques and methods in sentiment analysis and the challenges that appear in the field.
Chapter
Movies have become a significant part of life for today's generation. In this chapter, the authors apply data mining and ML techniques (random forest regression, decision tree regression, and support vector regression) to predict the success of movies on the basis of IMDb ratings and data retrieved from comments on social media platforms. Based on these techniques, the chapter develops a model that predicts a movie's success before its release and thereby decreases risk. Twitter sentiment analysis is used to retrieve data from Twitter, the polarity and subjectivity of each movie are calculated from the user reviews, and machine learning algorithms are applied to the retrieved data to predict the IMDb rating. A predictive model is developed using three algorithms: decision tree regression, SVR, and random forest regression. The chapter compares the results of the three techniques to obtain movie success predictions at a reasonable accuracy.
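The chapter's abstract does not say how polarity and subjectivity are computed. A minimal sketch of one common approach, averaging scores from a word lexicon, is given here for illustration only; the lexicon entries, score values and function names are all hypothetical.

```python
# Hypothetical mini-lexicon: word -> (polarity in [-1, 1], subjectivity in [0, 1]).
LEXICON = {
    "great": (0.8, 0.75),
    "good": (0.7, 0.6),
    "boring": (-1.0, 1.0),
    "bad": (-0.7, 0.67),
}

def score(text):
    """Average polarity and subjectivity of the lexicon words found in text."""
    hits = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    if not hits:
        return 0.0, 0.0          # no opinion words found: neutral, objective
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity

def movie_features(tweets):
    """Mean polarity and subjectivity over all tweets about one movie."""
    if not tweets:
        return 0.0, 0.0
    scores = [score(t) for t in tweets]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(s for _, s in scores) / n
```

The two averages returned by `movie_features` could then serve as input features for a regression model that predicts a rating.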
Article
Over the past decade humans have experienced exponential growth in the use of online resources, in particular social media and microblogging websites such as Facebook, Twitter and YouTube, and also mobile applications such as WhatsApp, Line, etc. Many companies have identified these resources as a rich mine of marketing knowledge. This knowledge provides valuable feedback which allows them to further develop the next generation of their product. In this paper, sentiment analysis of a product is performed by extracting tweets about that product and classifying the tweets as expressing positive or negative sentiment. The authors propose a hybrid approach which first uses unsupervised learning, in the form of K-means clustering, to cluster the tweets and then applies supervised learning methods such as Decision Trees and Support Vector Machines for classification.
Article
Sentiment analysis, which addresses the computational treatment of opinion, sentiment, and subjectivity in text, has received considerable attention in recent years. In contrast to the traditional coarse-grained sentiment analysis tasks, such as document-level sentiment classification, we are interested in the fine-grained aspect-based sentiment analysis that aims to identify aspects that users comment on and these aspects’ polarities. Aspect-based sentiment analysis relies heavily on syntactic features. However, the reviews that this task focuses on are natural and spontaneous, thus posing a challenge to syntactic parsers. In this paper, we address this problem by proposing a framework of adding a sentiment sentence compression (Sent_Comp) step before performing the aspect-based sentiment analysis. Different from the previous sentence compression model for common news sentences, Sent_Comp seeks to remove the sentiment-unnecessary information for sentiment analysis, thereby compressing a complicated sentiment sentence into one that is shorter and easier to parse. We apply a discriminative conditional random field model, with certain special features, to automatically compress sentiment sentences. Using the Chinese corpora of four product domains, Sent_Comp significantly improves the performance of the aspect-based sentiment analysis. The features proposed for Sent_Comp, especially the potential semantic features, are useful for sentiment sentence compression.
Conference Paper
With the rapid development of E-commerce, more and more online reviews for products and services are created, which form an important source of information for both sellers and customers. Research on sentiment and opinion mining for online review analysis has attracted increasingly more attention because such study helps leverage information from online reviews for potential economic impact. In this paper, we apply sentiment analysis and machine learning methods to study the relationship between the online reviews for a movie and the movie's box office revenue performance. We show that a simplified version of the sentiment-aware autoregressive model proposed in [5] can produce very good accuracy for predicting the box office sale using online review data. Our simplified version considers only positive and negative sentiments, and uses a very simple set of features with 14 affective key words for representing the sentiments in a review. In this way we obtain a simpler model which could be more efficient to train and use. Experiments indicate that the autoregressive model using both review sentiment data and the previous days' sale data results in higher accuracy than just using previous sale data alone. In addition, we create a classification model using Naïve Bayes Classifier for predicting the trend of the box office revenue from the review sentiment data.
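The abstract does not state the model's equation. One plausible form of an autoregressive model restricted to positive and negative sentiment is sketched below; every symbol in it is an assumption made for illustration and is not taken from the paper or from [5].

```latex
x_t = \sum_{i=1}^{p} \phi_i \, x_{t-i}
    + \sum_{j=1}^{q} \left( \rho_j^{+} \, s_{t-j}^{+} + \rho_j^{-} \, s_{t-j}^{-} \right)
    + \epsilon_t
```

Here $x_t$ stands for the box office revenue on day $t$, $s_{t-j}^{+}$ and $s_{t-j}^{-}$ for the positive and negative sentiment scores of the reviews posted $j$ days earlier, $\phi_i$ and $\rho_j^{\pm}$ for fitted coefficients, and $\epsilon_t$ for an error term. Dropping the second sum leaves a plain autoregressive model on previous sales only, which corresponds to the baseline the abstract compares against.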
Conference Paper
Sentiment analysis deals with identifying and classifying opinions or sentiments expressed in source text. Social media generates a vast amount of sentiment-rich data in the form of tweets, status updates, blog posts, etc. Sentiment analysis of this user-generated data is very useful for knowing the opinion of the crowd. Twitter sentiment analysis is difficult compared to general sentiment analysis due to the presence of slang words and misspellings. The maximum number of characters allowed in a tweet is 140. The knowledge-based approach and the machine learning approach are the two strategies used for analyzing sentiments from text. In this paper, we analyze Twitter posts about electronic products such as mobiles and laptops using a machine learning approach. By doing sentiment analysis in a specific domain, it is possible to identify the effect of domain information on sentiment classification. We present a new feature vector for classifying tweets as positive or negative and for extracting people's opinions about products.
Conference Paper
Mining is used to help people extract valuable information from large amounts of data. Sentiment analysis focuses on analyzing and understanding emotions from text patterns. It identifies the opinion or attitude that a person has towards a topic or an object, and it seeks to identify the viewpoint underlying a text span. Sentiment analysis is useful in social media monitoring to automatically characterize the overall feeling or mood of consumers, as reflected in social media, toward a specific brand or company and to determine whether it is viewed positively or negatively on the web. This form of analysis has been widely adopted in customer relationship management, especially in the context of complaint management. Document-level sentiment classification automates the task of classifying a single-topic textual review as expressing a positive or negative sentiment. Analyzing sentiment in multi-theme documents is therefore very difficult, and classification accuracy is lower. Document-level classification approximates the sentiment using a bag of words with the Support Vector Machine (SVM) algorithm. In the proposed work, a new algorithm called the Sentiment Fuzzy Classification algorithm, which uses part-of-speech tags, is used to improve classification accuracy on a benchmark movie reviews dataset.
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
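The two sources of randomness described in this abstract can be illustrated with a small sketch. It assumes bootstrap sampling as the per-tree random vector and Gini impurity as the split criterion, which are common choices but are not stated in the abstract; the function names, parameters and toy data are hypothetical, and class labels must not be tuples because tuples mark internal nodes.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Best (feature, threshold) among the candidate features, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in {r[f] for r in rows}:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, labels, n_try, rng, depth=0, max_depth=5):
    """Grow one tree, drawing n_try random candidate features at every node."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), n_try))
    if split is None:
        return majority
    f, t = split
    def subtree(keep):
        idx = [i for i, r in enumerate(rows) if keep(r[f])]
        return grow_tree([rows[i] for i in idx], [labels[i] for i in idx],
                         n_try, rng, depth + 1, max_depth)
    return (f, t, subtree(lambda v: v <= t), subtree(lambda v: v > t))

def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def grow_forest(rows, labels, n_trees=25, n_try=1, seed=0):
    """Train each tree on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], n_try, rng))
    return forest

def predict(forest, row):
    """Majority vote over the individual trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Toy data with two made-up numeric features per example.
rows = [[1, 0], [2, 1], [0, 2], [1, 1], [8, 9], [9, 7], [7, 8], [9, 9]]
labels = ["neg"] * 4 + ["pos"] * 4
forest = grow_forest(rows, labels)
```

Calling `predict(forest, [0, 0])` or `predict(forest, [9, 9])` returns the majority vote of the 25 trees. Raising `n_try` makes the trees individually stronger but more correlated, which is the trade-off between strength and correlation that the abstract refers to.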
Article
Posting reviews online has become an increasingly popular way for people to express opinions and sentiments toward the products bought or services received. Analyzing the large volume of online reviews available would produce useful actionable knowledge that could be of economic value to vendors and other interested parties. In this paper, we conduct a case study in the movie domain, and tackle the problem of mining reviews for predicting product sales performance. Our analysis shows that both the sentiments expressed in the reviews and the quality of the reviews have a significant impact on the future sales performance of the products in question. For the sentiment factor, we propose Sentiment PLSA (S-PLSA), in which a review is considered as a document generated by a number of hidden sentiment factors, in order to capture the complex nature of sentiments. Training an S-PLSA model enables us to obtain a succinct summary of the sentiment information embedded in the reviews. Based on S-PLSA, we propose ARSA, an Autoregressive Sentiment-Aware model for sales prediction. We then seek to further improve the accuracy of prediction by considering the quality factor, with a focus on predicting the quality of a review in the absence of user-supplied indicators, and present ARSQA, an Autoregressive Sentiment and Quality Aware model, to utilize sentiments and quality for predicting product sales performance. Extensive experiments conducted on a large movie data set confirm the effectiveness of the proposed approach.