
Comparison of Naïve Bayes, Support Vector Machine, Decision Trees and Random Forest on Sentiment Analysis

Márcio Guia1, Rodrigo Rocha Silva2,3 a and Jorge Bernardino1,2 b
1Polytechnic of Coimbra ISEC, Rua Pedro Nunes, Quinta da Nora, 3030-199 Coimbra, Portugal
2CISUC Centre of Informatics and Systems of University of Coimbra, Pinhal de Marrocos, 3030-290 Coimbra, Portugal
3FATEC Mogi das Cruzes, São Paulo Technological College, 08773-600 Mogi das Cruzes, Brazil
Keywords: Data Mining, Sentiment Analysis, Text Classification, Naïve Bayes, Support Vector Machine, Random
Forest, Decision Trees.
Abstract: Every day, we deal with a lot of information on the Internet. This information can originate from many different sources, such as online review sites and social networks. Amid this messy data arises the opportunity to understand the subjective opinion expressed in a text, in particular its polarity. Sentiment Analysis and Text Classification help to extract valuable information from data and to assign a text to one or more target categories according to its content. This paper proposes a comparison between four of the most popular Text Classification algorithms - Naive Bayes, Support Vector Machine, Decision Trees and Random Forest - based on the Amazon Unlocked mobile phone reviews dataset. Moreover, we also study the impact of some attributes (Brand and Price) on the polarity of the review. Our results demonstrate that the Support Vector Machine is the most complete algorithm of this study, achieving the highest values in all the metrics: accuracy, precision, recall, and F1 score.
1 INTRODUCTION

Text Mining is the process of extracting valuable information from a text (Mouthami, Devi and Bhaskaran, 2013). One of the many applications of Text Mining is Sentiment Analysis, which is the process used to determine the opinion or the emotion that a person expresses about an item or topic (Mouthami, Devi and Bhaskaran, 2013).
With the growth of the Internet, especially social networks, people can easily express their opinion about any topic in a few seconds, and valuable information can be extracted from this, not only about the person who wrote it but also about a particular item or topic.
There are three categories of Sentiment classification: Machine Learning, Lexicon-Based, and a hybrid that combines Machine Learning and Lexicon-Based
(Ahmad, Aftab and Muhammad, 2017). In the literature, Machine Learning approaches to Sentiment extraction are among the most discussed, and for this reason, in this paper we propose a comparison between four of the most popular Machine Learning algorithms: Naive Bayes (Kononenko, 1993),
Support Vector Machine (Cortes and Vapnik, 1995),
Decision Trees (Quinlan, 1986) and Random Forest
(Ho, 1995). In order to evaluate these classifiers, we use the Amazon Reviews: Unlocked Mobile Phones dataset, and our focus is the polarity of a review text, which can be Negative or Positive.
The main contributions of this work are the following:
- Compare Naive Bayes, Support Vector Machine, Decision Trees and Random Forest on polarity review classification, based on Accuracy, Precision, Recall, and F1 score;
- Compare different variants of each studied classifier;
- Evaluate the impact of the Brand and Price of the mobile phones on the final polarity review.
The rest of this paper is organized as follows. Section 2 presents related work. Section 3 describes the experimental approach. Section 4 presents the results and discussion. Finally, Section 5 concludes the paper and presents future work.
Guia, M., Silva, R. and Bernardino, J. Comparison of Naïve Bayes, Support Vector Machine, Decision Trees and Random Forest on Sentiment Analysis. DOI: 10.5220/0008364105250531. In Proceedings of the 11th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2019), pages 525-531. ISBN: 978-989-758-382-7. Copyright © 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved.
2 RELATED WORK

Sentiment Analysis has been used by many authors to classify documents, especially with machine learning approaches. However, research usually focuses on just one of the most popular machine learning algorithms, such as the Support Vector Machine, Naïve Bayes or Random Forest classifier.
(Moe et al., 2018) compare Naïve Bayes with Support Vector Machine on document classification. The authors conclude that the Support Vector Machine is more accurate than the Naïve Bayes classifier.
(Xu, Li and Zheng, 2017) argue that although the Multinomial Naïve Bayes classifier is commonly used in Text Classification with good results, it is not a fully Bayesian classifier. Therefore, the authors propose a Bayesian Multinomial Naïve Bayes classifier, and the results show that the new approach has similar performance when compared to the classic Multinomial Naïve Bayes classifier.
(Manikandan and Sivakumar, 2018) provide an overview of the most popular machine learning algorithms for document classification. The authors present the advantages and main applications of each algorithm. However, this paper does not provide any practical study of the algorithms and does not compare them.
(Rodrigues, Silva and Bernardino, 2018) propose a new ontology to deal with social event classification. Instead of labelling an event with just one category, the authors propose a classification based on tags; an event can have more than one tag, and this approach can more successfully match the interests of a user. For the classification tests, the authors use the Random Forest classifier, which achieves good results. However, the authors used just one algorithm for the classification.
(Parmar, Bhanderi and Shah, 2014) study the Random Forest classifier on Sentiment Analysis. The authors propose an approach that tunes the hyperparameters, such as the number of trees used to construct the decision forest, the number of features to select at random, and the depth of each tree. They conclude that with optimized hyperparameters the Random Forest classifier can achieve better results. In (Text Mining Amazon Mobile Phone Reviews: Interesting Insights, no date), the authors of the dataset that we use in this paper provide a statistical study of the relationships between the attributes of the dataset, and they also extract the sentiments present in the reviews. In our paper, we extend the statistical study done initially by the authors of the dataset by studying the impact of brand and price on the polarity of the reviews.
The main difference between these works and ours is that we do not focus on just one machine learning algorithm. We propose a comparison between four algorithms: Naïve Bayes, Support Vector Machine, Decision Trees and Random Forest. Besides, none of these works studies the impact of the attributes of the dataset on the classification of documents.
3 EXPERIMENTAL APPROACH

This section presents the experimental approach used for the classification task; Fig. 1 displays the overall architecture. The proposed architecture consists of five parts. The first one deals with cleaning the dataset, described in Section 3.1. After cleaning the dataset, we apply Pre-Processing and Text Transformation; these steps are described in Section 3.2. In Section 3.3 we describe the classification process. The evaluation process and the comparison of results are described in Section 4.
Figure 1: Overview of our approach.
3.1 Dataset
The dataset that we use for this study (Amazon Reviews: Unlocked Mobile Phones | Kaggle, 2016) consists of 400,000 reviews of unlocked mobile phones sold on Amazon and contains attributes such as Brand (string), Price (real number), Rating (integer number) and Review text (string).
KDIR 2019 - 11th International Conference on Knowledge Discovery and Information Retrieval
For the classification task, we only select the Rating and Review text attributes. Rating is a numerical value from 1 to 5, and the Review text is a string which contains the opinion of the user. Before using the dataset, we apply a few steps to get better results. These steps are described as follows:
1. Assign Rating values 1 and 2 to Negative;
2. Assign Rating values 4 and 5 to Positive;
3. Remove all the instances that contain a Rating value equal to 3.
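As an illustration, the three steps above can be sketched in Python (the variable names and the toy rows below are ours, not from the paper):

```python
# Map star ratings to polarity labels: 1-2 -> Negative, 4-5 -> Positive,
# and drop neutral reviews (rating == 3), as described above.
def label_polarity(rows):
    labelled = []
    for rating, text in rows:
        if rating in (1, 2):
            labelled.append(("Negative", text))
        elif rating in (4, 5):
            labelled.append(("Positive", text))
        # rating == 3 is discarded
    return labelled

reviews = [(5, "great phone"), (3, "it is ok"), (1, "stopped working")]
print(label_polarity(reviews))
# -> [('Positive', 'great phone'), ('Negative', 'stopped working')]
```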
3.2 Pre-processing and Text Transformation
In order to improve the results of the four algorithms that we study in this paper, it is necessary to perform some pre-processing steps, which make it possible to reduce the data dimension without affecting the classification task (Eler et al., 2018). The first step is to convert all
the instances of the dataset into lowercase. Next, we
remove some noisy formatting like HTML Tags and
Punctuation. Tokenization, removal of stop words
and stemming are described as follows:
- Tokenization: is the process that splits strings and
text into small pieces called tokens (Mouthami, Devi
and Bhaskaran, 2013). This process is widely used
and popular in pre-processing tasks.
- Removal of Stop Words: A stop word is a commonly used word that appears frequently in any document. These words are usually articles and prepositions. Examples of these terms are “the”, “is”, “are”, “I” and “of” (Eler et al., 2018). Hence, we can say that these terms do not add meaning to a sentence, and for this reason, we can remove them from the text before performing the classification task. For this study, we use a list of common words of the English language which includes about 150 words.
- Stemming: is the process that reduces a word to its base or root form. For example, the words “swimmer” and “swimming” are transformed into “swim” after the stemming process. In this study, we use the Porter Stemmer because it is one of the most popular English rule-based stemmers (Jasmeet and Gupta, 2016) and, compared with the Lovins Stemmer, it is a lighter stemmer. Moreover, it produces the best output compared to other stemmers (Ganesh Jivani, 2011).
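The whole pre-processing chain described above (lowercasing, noise removal, tokenization, stop-word removal, stemming) can be sketched as follows. Note that the stop-word list here is a small sample of the ~150-word list mentioned above, and `porter_like_stem` is a crude suffix-stripping stand-in for the real Porter Stemmer, used only for illustration:

```python
import re

STOP_WORDS = {"the", "is", "are", "i", "of", "a", "and"}  # sample of the full list

def porter_like_stem(token):
    # Crude suffix stripping, standing in for the Porter Stemmer.
    for suffix in ("ming", "mer", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower()                   # 1. lowercase
    text = re.sub(r"<[^>]+>", " ", text)  # 2. drop HTML tags
    text = re.sub(r"[^\w\s]", " ", text)  # 3. drop punctuation
    tokens = text.split()                 # 4. tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 5. stop-word removal
    return [porter_like_stem(t) for t in tokens]         # 6. stemming

print(preprocess("The swimmer and <b>swimming</b> are fun!"))
# -> ['swim', 'swim', 'fun']
```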
- Text Transformation: Machine learning algorithms do not work with text features, so we need to convert text into numerical features. To do that, we use TF-IDF (Term Frequency-Inverse Document Frequency). This algorithm assigns to each word of a sentence a weight based on the TF and the IDF (Yang and Salton, 1973).
The TF (term frequency) of a word is defined as the number of times that the word appears in a document.
The IDF (inverse document frequency) of a term is defined as how important a term is (Salton and Buckley, 1988) (Yang and Salton, 1973).
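A minimal sketch of the TF-IDF weighting described above, using raw term counts for TF and an unsmoothed logarithmic IDF (library implementations, e.g. scikit-learn's, use slightly different smoothing):

```python
import math

def tf_idf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n = len(docs)
    df = {}  # document frequency: in how many documents each term appears
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in set(doc):
            tf = doc.count(term)          # raw term frequency
            idf = math.log(n / df[term])  # rarer terms score higher
            w[term] = tf * idf
        weights.append(w)
    return weights

docs = [["great", "phone"], ["bad", "phone"], ["great", "battery"]]
w = tf_idf(docs)
# "bad" appears in only 1 of 3 documents, "phone" in 2 of 3,
# so "bad" gets a higher weight than "phone" in document 1.
```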
3.3 Classification Process
After cleaning the dataset and apply pre-processing
and text transformation steps, we split the data into
training and test. The percentage used for training is
80% and the remaining 20% are used for test. It is
necessary to feed the classification algorithms, so the
train data will be used for training the classifiers and
the test data will be used to evaluate them. The four
classifiers that we use are described in the following:
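The 80/20 split can be sketched as follows (the fixed seed is our assumption, added for reproducibility; the paper does not state one):

```python
import random

def train_test_split(data, train_ratio=0.8, seed=42):
    # Shuffle deterministically, then cut at the 80% mark.
    data = list(data)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_ratio)
    return data[:cut], data[cut:]

samples = [("Positive", f"review {i}") for i in range(100)]
train, test = train_test_split(samples)
print(len(train), len(test))  # -> 80 20
```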
- Random Forest: is defined as a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, ...}, where the {Θk} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x. When a large number of trees is generated, each one of them votes for a class, and the winner is the class with the most votes (Breiman, 2001). For this study, we evaluate the Random Forest classifier with different numbers of trees used to construct the decision forest; in particular, we test the classifier with 50, 100, 200 and 400 trees.
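The majority-vote aggregation that Random Forest performs over its trees can be illustrated in isolation; the per-tree predictions below are hypothetical, standing in for the votes of fitted trees:

```python
from collections import Counter

def forest_predict(tree_predictions):
    # Each fitted tree casts one vote; the class with the most votes wins.
    return Counter(tree_predictions).most_common(1)[0][0]

votes = ["Positive", "Positive", "Negative", "Positive", "Negative"]
print(forest_predict(votes))  # -> Positive
```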
- Naive Bayes: is a probabilistic machine learning classifier based on Bayes' Theorem with an assumption of independence among predictors; in other words, this algorithm considers that the presence of a feature in a class is independent of any other feature (Ahmad, Aftab and Muhammad, 2017). For this study, we evaluate two variants: Multinomial and Bernoulli.
- Support Vector Machine: is a supervised learning
model which can achieve good results in text
categorization. Basically, this classifier locates the best possible boundaries to separate positive from negative training samples (Ahmad, Aftab and Muhammad, 2017). For this study, we evaluate two
distinct kernel models for Support Vector Machine:
RBF and Linear (Minzenmayer et al., 2014) .
- Decision Trees: is an algorithm that uses trees to predict the outcome of an instance. Essentially, a test node computes an outcome based on the attribute values of an instance, where each possible outcome is associated with one of the subtrees. The process of classifying an instance starts at the root node of the tree. If the root node is a test, the outcome for the instance determines which subtree the process continues in, until a leaf node is encountered; in this situation, the label of the leaf node gives the predicted class of the instance (Quinlan, 1996).
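The overall comparison pipeline can be sketched with scikit-learn, assuming it is installed; the toy corpus below stands in for the Amazon reviews, and the hyperparameters shown are illustrative, not the paper's exact settings:

```python
# Minimal sketch: TF-IDF features fed to the four classifiers compared in this paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

texts = ["great phone", "works perfectly", "terrible battery",
         "stopped working", "love it", "waste of money"]
labels = ["Positive", "Positive", "Negative", "Negative", "Positive", "Negative"]

X = TfidfVectorizer().fit_transform(texts)
classifiers = {
    "Naive Bayes": MultinomialNB(),
    "SVM (linear)": SVC(kernel="linear"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X, labels)  # in the paper, fitting uses the 80% training split
    print(name, clf.predict(X[:1])[0])
```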
4 RESULTS AND DISCUSSION

We use the Amazon Reviews: Unlocked Mobile Phones dataset (Amazon Reviews: Unlocked Mobile Phones | Kaggle, 2016) and split it into 80% for training and 20% for testing. As mentioned before, we provide a comparison between four algorithms and also offer a statistical analysis of the impact of the brand and the price on the final polarity review. These experiments are described as follows:
4.1 Algorithms Classification
In order to evaluate the results of the four algorithms, we use four of the most popular measures: Accuracy, Precision, Recall, and F1 score. These four metrics are explained in the following:
- Accuracy: is the most popular measure and also very easy to understand because it is a simple ratio between the number of instances correctly predicted and the total number of instances used in the observation; in other words, accuracy gives the percentage of correctly predicted instances (Mouthami, Devi and Bhaskaran, 2013).
- Precision: is a measure that provides, for each class, the ratio between correctly predicted positive instances and the total number of instances predicted as positive (Mouthami, Devi and Bhaskaran, 2013).
- Recall: is a measure that provides, for each class, the ratio between the true positive instances predicted and the sum of true positives and false negatives in the observation (Mouthami, Devi and Bhaskaran, 2013).
- F1 score: is the weighted average of Precision and Recall (Mouthami, Devi and Bhaskaran, 2013); it is considered perfect at 1.0, and the worst possible value is 0.0, so a good F1 score means that we have low false positives and low false negatives.
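The four measures can be computed from the confusion counts as follows (a minimal sketch for the binary Positive/Negative case, treating Positive as the positive class):

```python
def binary_metrics(y_true, y_pred, positive="Positive"):
    # Confusion counts for the positive class.
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = ["Positive", "Positive", "Negative", "Negative"]
y_pred = ["Positive", "Negative", "Negative", "Positive"]
print(binary_metrics(y_true, y_pred))  # -> (0.5, 0.5, 0.5, 0.5)
```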
4.2 Naive Bayes
Table 1 shows the results of applying Naive Bayes to the dataset. The first experiment for the Naive Bayes classifier was with the Multinomial variant. The results show that the classifier obtains an Accuracy of 0.83, which means that 83% of the time the polarity of a review was correctly predicted. Precision and Recall obtain similar values, 0.84 and 0.83 respectively, and the F1 score is 0.80. The second experiment was with the Bernoulli variant, and the results show an improvement of 2% in Accuracy and Recall and 4% in F1 score.
In conclusion, the two variants of Naive Bayes can both achieve good results in Sentiment Analysis, especially the Bernoulli variant. However, the Naive Bayes classifier obtains modest results when compared to Random Forest and especially to the Support Vector Machine.
Table 1: Results for the measures of applying Naïve Bayes to the dataset.
4.3 Random Forest
Table 2 shows the results of applying Random Forest to the dataset. When the number of estimators was 50, the classifier obtained 0.87 for Accuracy, Precision, Recall and F1 score, which can be considered a good result given the small number of estimators. When the number of estimators was 100, the results show an increase of 1% in Accuracy and Recall, while Precision and F1 score kept the same values. The results of the third experiment, with 200 estimators, show that Precision reaches 0.88, which is 0.01 higher than in the experiment with 100 estimators. Finally, in the last experiment, the number of estimators was 400, and the results show that with this higher number of estimators all the measures remain equal to the experiment with 200 estimators.
In conclusion, the results for the Random Forest classifier show that this algorithm can achieve high values for all the measures even when the number of estimators is low, which means that Random Forest can be used successfully in text classification tasks. It is also possible to conclude that as the number of estimators increases, Precision, Recall and Accuracy also increase. However, the best result of Random Forest was obtained with 200 estimators; increasing the number of estimators further did not achieve better results.
Table 2: Results for the measures of applying Random Forest to the dataset.
4.4 Support Vector Machine
Table 3 shows the results of applying the Support Vector Machine to the dataset. As mentioned before, we use two kernel models to evaluate the Support Vector Machine. The first experiment shows that with the Linear kernel the classifier obtains 0.89 for Accuracy, Precision, Recall and F1 score, which means that 89% of the time the classifier correctly predicted the polarity of a review. The second experiment shows that with the RBF kernel the results obtained are significantly lower than with the Linear kernel: Accuracy and Recall decrease by 16%, Precision drastically decreases by 36%, and the F1 score decreases by 28%.
In conclusion, the Support Vector Machine with the Linear kernel achieves the best results of this study and proves to be one of the best algorithms for Sentiment Analysis. However, the poor results of the Support Vector Machine with the RBF kernel demonstrate that the latter is not a good classifier for Sentiment Analysis.
Table 3: Results for the measures of applying the Support Vector Machine to the dataset.
4.5 Decision Trees
Table 4 shows the results of applying Decision Trees to the dataset. The results show that the Decision Trees classifier obtains the same value (0.82) for all four measures: Accuracy, Precision, Recall, and F1 score. These results are similar to those of Multinomial Naive Bayes, and we can conclude that Naive Bayes and Decision Trees achieve similar values in the Sentiment Analysis task, which can be explained by the lower complexity of these two algorithms when compared to Random Forest and the Support Vector Machine.
Table 4: Results for the measures of applying Decision Trees to the dataset.
4.6 Impact of Brand and Price
In this study, we also make a statistical comparison of the impact of two attributes (brand and price) on the final polarity review. For the brand, we study the most popular phone brands present in the dataset, and for the price we provide an overview of all the prices present in the dataset.
4.6.1 Brand
Table 5 shows the impact of the brand on the polarity review. After analyzing these results, we conclude that the impact of most brands is similar, in a range of 77% to 79% positive reviews. However, two brands stand out from the rest. The first one is BlackBerry, with only 74.3% positive reviews. The second one is ZTE, which has the best results with 82.9% positive reviews. We think that the significant difference in the percentage of positive reviews between BlackBerry and ZTE could be explained by a BlackBerry phone model that is prone to problems or does not match customer expectations, while the high results of ZTE can be explained by the smaller number of models present in the dataset.
Table 5: Results for the impact of the brand on polarity review.
4.6.2 Price
Table 6 shows the impact of the price on the polarity review. After analyzing these results, we conclude that there is a significant difference between the range of less than 100 dollars (73.2% of positive reviews) and the range of 1000 to 1500 dollars (84.3% of positive reviews). It is also possible to conclude that as the price range increases, the percentage of positive reviews also increases, reaching its maximum in the range of 1000 to 1500 dollars; after that, the percentage of positive reviews falls by one percentage point to 83.3%. These results can be explained by the quality of the phones: products with a lower price may have less quality than products with a high price, which have more features and also more quality. Hence, it is expected that as the price increases, the percentage of positive reviews also increases.
Table 6: Results for the impact of price on polarity review (price ranges from less than 100 dollars to above 2000 dollars).
5 CONCLUSIONS AND FUTURE WORK

In this paper, we analyzed four of the most popular machine learning algorithms for Sentiment Analysis, based on four measures: Accuracy, Precision, Recall, and F1 score. We found that the Support Vector Machine classifier is not only the most accurate in this study but also the most complete classifier, with high values for all the measures. Our results show that Random Forest is also a classifier to take into account, as it can achieve high values for all the measures, being just slightly worse than the Support Vector Machine classifier.
This study also provides a statistical study of the impact of brand and price on the polarity review and concludes with some interesting facts about each of these attributes. For the brand, we have an overview of the impact of each brand on the polarity review and concluded that ZTE is the brand with the most positive reviews, with 82.9%, as opposed to BlackBerry with only 74.3%. For the price, we can conclude that as the price increases, the percentage of positive reviews also increases, reaching a maximum in the range of 1000 to 1500 dollars; after that, the percentage of positive reviews falls from 84.3% to 83.3%.
As future work, we plan to continue the study of
other algorithms that are usually applied to Sentiment
Analysis and evaluate them with the measures that we
used in this study. We also plan to propose an
architecture to improve the results of each one of the
four algorithms that we evaluated and compared in
this study.
REFERENCES

Ahmad, M., Aftab, S. and Muhammad, S. S. (2017) ‘Machine Learning Techniques for Sentiment Analysis: A Review’, International Journal of Multidisciplinary Sciences and Engineering, 8(3), p. 27.
Amazon Reviews: Unlocked Mobile Phones | Kaggle (no
date). Available at:
phones (Accessed: 21 March 2019).
Breiman, L. (2001) ‘Random Forests’, Machine Learning, 45(1), pp. 5–32.
Cortes, C. and Vapnik, V. (1995) ‘Support-vector networks’, Machine Learning, 20(3), pp. 273–297.
Eler, M. D. et al. (2018) ‘Analysis of Document Pre-Processing Effects in Text and Opinion Mining’, Information. doi: 10.3390/info9040100.
Ganesh Jivani, A. (2011) ‘A Comparative Study of Stemming Algorithms’, International Journal of Computer Technology and Applications, 2(6), pp. 1930–1938.
Ho, T. K. (1995) ‘Random decision forests’, in Proceedings of 3rd International Conference on Document Analysis and Recognition, pp. 278–282, vol. 1.
Jasmeet, S. and Gupta, V. (2016) ‘Text Stemming: Approaches’, ACM Computing Surveys, 69(3), pp. 633
Kononenko, I. (1993) ‘Successive Naive Bayesian
Classifier.’, Informatica (Slovenia), 17(2).
Manikandan, R. and Sivakumar, D. R. (2018) ‘Machine learning algorithms for text-documents classification: A review’, International Journal of Academic Research and Development, 3(2), pp. 384–389.
Minzenmayer, R. R. et al. (2014) ‘Evaluating unsupervised and supervised image classification methods for mapping cotton root rot’, Precision Agriculture, 16(2), pp. 201–215. doi: 10.1007/s11119-014-9370-9.
Moe, Z. H. et al. (2018) ‘Comparison Of Naive Bayes And Support Vector Machine Classifiers On Document Classification’, in 2018 IEEE 7th Global Conference on Consumer Electronics (GCCE), pp. 466–467.
Mouthami, K., Devi, K. N. and Bhaskaran, V. M. (2013) ‘Sentiment analysis and classification based on textual reviews’, 2013 International Conference on Information Communication and Embedded Systems, ICICES 2013. IEEE, pp. 271–276.
Parmar, H., Bhanderi, S. and Shah, G. (2014) Sentiment
Mining of Movie Reviews using Random Forest with
Tuned Hyperparameters.
Quinlan, J. R. (1996) ‘Learning decision tree classifiers’, ACM Computing Surveys (CSUR), 28(1).
Quinlan, J. R. (1986) ‘Induction of Decision Trees’, Machine Learning, 1(1), pp. 81–106.
Rodrigues, M., Silva, R. R. and Bernardino, J. (2018) ‘Linking Open Descriptions of Social Events (LODSE): A new ontology for social event classification’, Information (Switzerland), 9(7).
Salton, G. and Buckley, C. (1988) ‘Term-weighting approaches in automatic text retrieval’, Information Processing & Management, 24(5), pp. 513–523. doi: 10.1016/0306-4573(88)90021-0.
Text Mining Amazon Mobile Phone Reviews: Interesting
Insights (no date). Available at: https://www.
phone-reviews-interesting-insights.html (Accessed: 19
May 2019).
Xu, S., Li, Y. and Zheng, W. (2017) Bayesian Multinomial Naïve Bayes Classifier to Text Classification.
Yang, C. S. and Salton, G. (1973) ‘On the specification of term values in automatic indexing’, Journal of Documentation, 29(4), pp. 351–372.
Comparison of Naïve Bayes, Support Vector Machine, Decision Trees and Random Forest on Sentiment Analysis
... Pre-trained Turkish BERT: BERT (Bidirectional Encoder Representations from Transformers) is a groundbreaking model in deep learning-based NLP (Natural Language Processing) tasks [61]. BERT dissects the meaning of a word by looking at the context on both the right and the left of the word. ...
... Lastly, we employed Random Forest, an ensemble learning method known for its high accuracy. Random Forest combines multiple Decision Trees to improve the overall classification performance, showing exceptional capability in handling complex classification tasks [61]. ...
Full-text available
The use of the sentiment analysis technique, which aims to extract emotions and thoughts from texts, has become a remarkable research topic today, where the importance of human-robot interaction is gradually increasing. In this study, a new hybrid sentiment analysis model is proposed using machine learning algorithms to increase emotional performance for Turkish question and answer systems. In this context, as a first, we apply text preprocessing steps to the Turkish question-answer-emotion dataset. Subsequently, we convert the preprocessed question and answer texts into text vector form using Pretrained Turkish BERT Model and two different word representation methods, TF-IDF and word2vec. Additionally, we incorporate pre-determined polarity vectors containing the positive and negative scores of words into the question-answer text vector. As a result of this study, we propose a new hybrid sentiment analysis model. We separate vectorized and expanded question-answer text vectors into training and testing data and train and test them with machine learning algorithms. By employing this previously unused method in Turkish question-answering systems, we achieve an accuracy value of up to 91.05% in sentiment analysis. Consequently, this study contributes to making human-robot interactions in Turkish more realistic and sensitive.
... Moreover, aspect-context extracted data frame never used in general ML classifiers for evaluation [19]- [21]. The three classifiers, namely Support Vector Machine (SVM), Naive Bayes (NB) and Random Forest (RF), were deployed for extended evaluation on extracted aspect-context features [21], [22]. ...
... The extracted features by deploying the deep learning-based model, i.e., BACIA, were extracted and then general ML-based classifiers were used It turned out that ML-based classifiers outperformed. The three classifiers, namely Support Vector Machine (SVM), Naive Bayes (NB) and Random Forest (RF), were deployed [22]. All three models performed better than the deep learning-based classifier. ...
Full-text available
Aspect-context sentiment classification aims to classify the sentiments about an aspect that corresponds to its context. Typically, machine learning models considers the aspect and context separately. They do not execute the aspect and context in parallel. To model the contexts and aspects separately, most of the methods with attention mechanisms typically employ the Long Short Term Memory network approach. Attention mechanisms, on the other hand, take this into account and compute the parallel sequencing of the aspects-context. The interactive attention mechanism extracts features of a specific aspect regarding its context in the sequence, which means aspects are considered when generating context sequence representations. However, when determining the relationship between words in a sentence, the interactive attention mechanism does not consider semantic dependency information. Moreover, the attention mechanisms did not capture the polysemous words. Normally conventional embedding models, such as GloVe word vectors, have been used . In this study, transformers are embedded into the attention mechanism approaches to overcome the semantic relationship problem. For this reason, the BERT pre-train language model is used to capture the relationship among the words in a sentence. The interactive attention mechanism is then applied to the model’s distribution of that word. The final sequence-to-sequence representation in terms of context and aspect is used into general machine learning classifiers for aspect-level sentiment classification. The proposed model was evaluated on the two datasets, i.e., Restaurant and Laptop review. The proposed approach has state-of-the-art results with all attention mechanisms and attained significantly better performance than the existing ones.
... Using the attributes of the current node, then it can predict the next node. Based on forecast criteria, the Decision Tree's variance is determined [20]. The following is the general form of a Decision Tree in equation (4): ...
Full-text available
Differences in human facial structures, especially those recorded in a digital image, can be used as an automatic gender comparison tool. This research utilizes machine learning using the support vector machine (SVM) algorithm to perform gender identification based on human facial images. The transfer learning technique using the Inception-v3 model is combined with the SVM algorithm to produce six models that implement polynomial, radial basis function (RBF), and sigmoid kernel functions. The results obtained are models with excellent performance, as seen from the lowest values of accuracy = 0.852, precision = 0.856, recall = 0.852, and the highest values of 0.957, 0.957, and 0.957. This combination also produces a model with excellent reliability, where the probability of overfitting or underfitting obtained is below 1%.
... Marcio Guia et al. [3] compared classification result of SVM, Decision Tree, and Random Forest on 400 reviews of cellphone on Amazon with TF-IDF weighting. Of the three classifier, they claimed SVM linear kernel have highest matrix score (accuracy, precision, recall, and F1-score). ...
Full-text available
The Kanjuruhan disaster on 1 October 2022, gained the peoples attention. People share their thoughts on social media. Their posts contain a variety of perspectives. Sentiment analysis is possible to use on a dataset of people's posts. This final project applies the supervised learning Support Vector Machine (SVM) method with feature expansion using Word2Vec and TF-IDF as weighting. Three SVM kernels—rbf, linear, and polynomial—are applied. Three split data techniques and two different types of training data are used to train each kernel. Training data with oversampling and training data without oversampling are the two types of training data. The best result gained from using rbf kernel, split ratio 70:30, and oversampling. From it, oversampling trained model have relatively stable in every split rasio and kernel without having significant difference.
... This was followed by the Random Forest method, which achieved the highest accuracy at 88%, then Naïve Bayes at 85%, and finally Decision Tree at 82%. [11] The last related work is a previous study comparing Naïve Bayes, Random Forest, and Support Vector Machine on a dataset of 1,629 opinions about the Ruang Guru application on the Google Play Store, using pre-processing methods. After the whole process is completed, the resulting dataset is rendered as a word cloud, so the words that appear most frequently in comments on this Qasir application can be identified. ...
Qasir is a free Android-based Point-of-Sale (POS) application available on the Google Play Store. With so many POS applications available, users have become more selective in choosing which one to use. One aspect that can influence this choice is the opinions expressed about the application. An opinion is information obtained after using the application and can contain criticism as well as suggestions, from which users can infer how other users experience the application. Besides being useful to users, opinions, when processed well, yield information the development team can use for evaluation. Data Mining can be used to analyse the data and discover relationships within it. This study uses the Support Vector Machine and Random Forest methods; since each method has its own strengths and weaknesses, their accuracies are compared. The result is that Support Vector Machine achieved the highest accuracy at 80.63%, while Random Forest achieved 80.21%.
... Training data is used to train the model, whereas testing data is used to assess the model's performance. In this research, the data was split 80% for training and 20% for testing, a proportion chosen by many previous researchers [23,24,25]. After that, the classification process was carried out. ...
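The 80/20 hold-out split mentioned above can be sketched in a few lines of plain Python (a hypothetical shuffled split; libraries typically also support stratification so both partitions keep the class proportions):

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it by the given ratio."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

reviews = [f"review {i}" for i in range(100)]
train, test = train_test_split(reviews, test_ratio=0.2)
# 80 training examples, 20 test examples, no overlap
```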
Many individuals now trade online using trading software. Binomo is one of Indonesia's most popular trading platforms, partly because several influencers made promises to Binomo customers; since many customers were deceived, the case became widely known. This study examines how Indonesians felt about the Binomo application after the case went viral, using sentiment analysis, as no previous sentiment-analysis research had discussed the Binomo case. The data was collected using Netlytic, a cloud-based text and social-network analyzer capable of identifying conversations on social media sites such as Twitter. Sentiment analysis of Binomo trading tweets with the Multi-Perspective Question Answering lexicon was performed using the KNIME tool, but the accuracy of the resulting sentiment analysis was low. The Support Vector Machine technique was therefore also applied: the Term Frequency-Inverse Document Frequency method was used for feature extraction, while the chi-square approach was used to identify features considered useful for the classification process and to exclude features irrelevant to the target class. The obtained accuracy is 86%. The study proposes that words from the algorithm's outputs can be used to improve the quality of lexicon-based sentiment analysis: adding the positive and negative terms produced by the algorithm to the lexicon increased the accuracy of sentiment analysis with the new vocabulary from 58.984% to 71.146%.
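The chi-square feature selection described above scores each term by how unevenly it is distributed across the target classes. A minimal sketch for a single term against a binary class, using the standard 2x2 contingency-table formulation (the counts below are hypothetical):

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table:
               class+   class-
    term+        a        b
    term-        c        d
    """
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# hypothetical counts: a word appearing mostly in negative tweets
score = chi_square_2x2(a=2, b=40, c=48, d=10)
# a term distributed evenly across classes scores 0 and would be dropped
even = chi_square_2x2(25, 25, 25, 25)
```

Terms with high scores are kept for the classifier; terms whose distribution is independent of the class score near zero and are excluded.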
... In the early days, most research used hand-crafted features (such as bag-of-words (Wang et al., 2017), dictionaries (Abel and Lantow, 2019), n-grams (Suzuki et al., 2019), etc.) for text representation, and then paired them with traditional machine learning methods (such as Naive Bayes (Gan et al., 2021), decision trees (Guia et al., 2019), and support vector machines (Santucci et al., 2020)) for sentiment classification. However, such feature-based methods cannot capture contextual connections, which limits the classification performance of the classifier. ...
Aspect sentiment classification is an important branch of sentiment classification that has gained increasing attention recently. Existing aspect sentiment classification methods typically use different network branches to encode context and aspect words separately, and then use an attention mechanism to capture their associations. This attention-based approach cannot completely ignore the contexts unrelated to the current aspect words, which introduces noise. In this paper, a gated filtering network based on BERT is proposed as a solution to this issue. We employ BERT to encode the text semantics of contexts and of sentence pairs consisting of context and aspect words, and to extract lexical features as well as associative features of context and aspect words. Based on this, we design a gating module that, unlike the attention mechanism, uses the association features to precisely filter out irrelevant contexts. Additionally, because BERT has a very large number of parameters, it tends to overfit during training; to combat this problem effectively, we developed a loss function with a threshold. We carried out extensive experiments on three benchmark datasets to verify the performance of the proposed model. The experimental results show that the method improves accuracy by 0.5%, 1.39% and 2.57% on the Laptop, Restaurant and Twitter datasets respectively, and Macro-F1 by 1.564%, 2.36% and 4.144% respectively, compared to the recent RA-CNN (BERT), proving that our method is effective at improving aspect sentiment classification in comparison with other state-of-the-art sentiment classification methods.
... Classification algorithms are used for text classification. Three algorithms are commonly used in sentiment analysis: Decision Tree, Naïve Bayes, and Support Vector Machine [39][40][41][42]. 2.7.1 Decision Tree. A Decision Tree model consists of nodes and branches; nodes represent attributes. ...
... The confusion matrix is shown in Table 1. Accuracy, precision, and recall are computed using Equations 6-8, while the resulting model is evaluated by computing the F1-score using Equation 9 [17]. A good model has an F1-score close to 1 [18]. ...
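The metrics referenced above (the snippet's Equations 6-9) follow directly from the confusion-matrix counts; a minimal sketch with hypothetical counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# hypothetical counts: 80 true positives, 10 false positives,
# 20 false negatives, 90 true negatives
acc, p, r, f1 = classification_metrics(tp=80, fp=10, fn=20, tn=90)
```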
K-Nearest Neighbor Algorithm Implementation in sentiment analysis towards the Merdeka Belajar Kampus Merdeka (MBKM) Program. Merdeka Belajar Kampus Merdeka (MBKM) is a program that supports students in improving their skills through direct experience in the work environment, preparing them for competition and a future career. The MBKM program has been implemented by Indonesia's Ministry of Education, Culture, Research, and Technology (Kemendikbudristek) since 2020. Every policy needs to be evaluated; a simple evaluation can be done through sentiment analysis to determine public responses to the MBKM program, the results of which serve as suggestions for program improvement. Sentiment analysis was done by applying Natural Language Processing (NLP) techniques to data crawled from Twitter, which was then classified using the K-NN algorithm. Based on the results, the overall sentiment is neutral, illustrating that people are only partially interested in the MBKM program policy. The accuracy of the K-NN classification model is 95%, with an F1-score of 0.96, for a split of 80% training data and 20% test data. Keywords: MBKM, NLP, K-NN, F1-Score
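The K-NN classification step described above reduces to a distance sort plus a majority vote among the k nearest neighbors; a minimal sketch (the 2-D "embeddings" below are hypothetical stand-ins for real document vectors):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest (Euclidean) training points."""
    ranked = sorted(train, key=lambda point: math.dist(point[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# hypothetical 2-D document vectors labelled by sentiment
train = [((0.9, 0.8), "positive"), ((0.8, 0.9), "positive"),
         ((0.1, 0.2), "negative"), ((0.2, 0.1), "negative"),
         ((0.15, 0.15), "negative")]
label = knn_predict(train, (0.85, 0.85), k=3)
```

With k=3 the query near the positive cluster is outvoted 2-to-1 in favor of "positive"; a query near the negative cluster flips the vote.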
The digital era has brought a number of significant changes to the world of communications. Although technological evolution has allowed the creation of new social event platforms to disclose events, it is still difficult to know what is happening around a given location. Currently, a large number of social events are created and promoted on social networks. With the massive quantity of information created in these systems, finding an event is challenging because the data is sometimes ambiguous or incomplete; one of the main challenges in social event classification is the incompleteness and ambiguity of user-created metadata. This paper presents a new ontology, named LODSE (Linking Open Descriptions of Social Events), based on the LODE (Linking Open Descriptions of Events) ontology, to describe the domain model of social events. The aim of this ontology is to create a data model that defines the most important properties for describing a social event and improves the classification of events. The proposed data model is used in an experimental evaluation to compare both ontologies on social event classification. The evaluation, using a dataset based on real data from a popular social network, demonstrated that the data model based on the LODSE ontology brings several benefits to the classification of events: the results show an increase in correctly classified events as well as a gain in execution time, compared with the data model based on the LODE ontology.
Typically, textual information is available as unstructured data, which require processing so that data mining algorithms can handle such data; this processing is known as the pre-processing step in the overall text mining process. This paper aims at analyzing the strong impact that the pre-processing step has on most mining tasks. Therefore, we propose a methodology to vary distinct combinations of pre-processing steps and to analyze which pre-processing combination allows high precision. In order to show different combinations of pre-processing methods, experiments were performed by comparing some combinations such as stemming, term weighting, term elimination based on low frequency cut and stop words elimination. These combinations were applied in text and opinion mining tasks, from which correct classification rates were computed to highlight the strong impact of the pre-processing combinations. Additionally, we provide graphical representations from each pre-processing combination to show how visual approaches are useful to show the processing effects on document similarities and group formation (i.e., cohesion and separation).
Social media platforms and microblogging websites are rich sources of user-generated data. Through these resources, users from all over the world express and share their opinions on a variety of subjects. Manually analyzing such a huge amount of user-generated data is impossible, so an effective and intelligent technique is needed to determine the polarity of this textual data. Multiple tools and techniques are available today for automatic sentiment classification. Mostly, three approaches are used for this purpose: lexicon-based techniques, machine-learning-based techniques, and hybrid techniques (which combine the two). The machine learning approach is effective and reliable for opinion mining and sentiment classification, and many variants and extensions of machine learning techniques and tools are available today. The purpose of this study is to explore the different machine learning techniques, to identify their importance, and to raise interest in this research area.
Text classification is the task of assigning predefined classes to free-text documents, and it can provide conceptual views of document collections. The multinomial naïve Bayes (NB) classifier is one NB variant and is often used as a baseline in text classification; however, it is not fully Bayesian. This study proposes a fully Bayesian version of the NB classifier. Experimental results on the 20 Newsgroups dataset show that the Bayesian multinomial NB classifier with suitable Dirichlet hyper-parameters performs similarly to the multinomial NB classifier.
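The multinomial NB baseline discussed above, with a symmetric Dirichlet prior acting as add-alpha smoothing (alpha = 1 gives Laplace smoothing), can be sketched as follows; the toy corpus is hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, alpha=1.0):
    """Multinomial NB; the Dirichlet hyper-parameter alpha smooths word counts."""
    class_words = defaultdict(list)
    for text, label in docs:
        class_words[label].extend(text.split())
    vocab = {w for words in class_words.values() for w in words}
    n_docs = len(docs)
    model = {}
    for label, words in class_words.items():
        counts, total = Counter(words), len(words)
        log_prior = math.log(sum(1 for _, l in docs if l == label) / n_docs)
        log_lik = {w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
                   for w in vocab}
        model[label] = (log_prior, log_lik)
    return model, vocab

def predict_nb(model, vocab, text):
    """Pick the class maximizing log prior + sum of log likelihoods."""
    scores = {label: prior + sum(lik[w] for w in text.split() if w in vocab)
              for label, (prior, lik) in model.items()}
    return max(scores, key=scores.get)

docs = [("good great excellent", "pos"), ("great phone", "pos"),
        ("bad awful", "neg"), ("awful terrible phone", "neg")]
model, vocab = train_multinomial_nb(docs)
```

Smoothing keeps unseen class-word pairs from driving a log likelihood to minus infinity, which is exactly where the choice of Dirichlet hyper-parameters matters.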
Stemming is a pre-processing step in Text Mining applications as well as a very common requirement of Natural Language Processing functions; in fact, it is very important in most Information Retrieval systems. The main purpose of stemming is to reduce the different grammatical forms of a word (noun, adjective, verb, adverb, etc.) to its root form; that is, to reduce inflectional and sometimes derivationally related forms of a word to a common base form. In this paper we discuss different methods of stemming and compare them in terms of usage, advantages, and limitations. The basic difference between stemming and lemmatization is also discussed.
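A naive suffix-stripping stemmer illustrates the idea, and its pitfalls: note that "connections" below is not conflated with the other forms, which is why real stemmers such as Porter's apply ordered rules guarded by conditions on the remaining stem.

```python
def simple_stem(word):
    """Naive suffix stripping (illustration only, not a real stemmer):
    drop the first matching suffix, provided a stem of >= 3 letters remains."""
    for suffix in ("ization", "ational", "fulness", "ness", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [simple_stem(w)
         for w in ["connected", "connecting", "connections", "connect"]]
# "connections" only loses its final "s", so it maps to "connection",
# not "connect" -- a conflation failure a rule-ordered stemmer avoids
```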
Cotton root rot, caused by the soilborne fungus Phymatotrichopsis omnivora, is one of the most destructive plant diseases occurring throughout the southwestern United States. This disease has plagued the cotton industry for over a century, but effective practices for its control are still lacking. Recent research has shown that a commercial fungicide, flutriafol, has potential for the control of cotton root rot. To effectively and economically control this disease, it is necessary to identify infected areas within fields so that site-specific technology can be used to apply fungicide only to the infected areas. The objectives of this study were to evaluate unsupervised classification applied to multispectral imagery, unsupervised classification applied to the normalized difference vegetation index (NDVI), and six supervised classification techniques, including minimum distance, Mahalanobis distance, maximum likelihood, spectral angle mapper (SAM), neural net and support vector machine (SVM), for mapping cotton root rot from airborne multispectral imagery. Two cotton fields with a history of root rot infection in Texas, USA were selected for this study. Airborne imagery with blue, green, red and near-infrared bands was taken from the fields shortly before harvest, when infected areas were fully expressed, in 2011. The four-band images were classified into infected and non-infected zones using the eight classification methods. Classification agreement index values for infected area estimation between any two methods ranged from 0.90 to 1.00 for both fields, indicating a high degree of agreement among the eight methods. Accuracy assessment showed that all eight methods accurately identified root rot-infected areas, with overall accuracy values from 94.0 to 96.5 % for Field 1 and 93.0 to 95.0 % for Field 2.
All eight methods appear to be equally effective and accurate for detection of cotton root rot for site-specific management of this disease, though the NDVI-based classification, minimum distance and SAM can be easily implemented without the need for complex image processing capability. These methods can be used by cotton producers and crop consultants to develop prescription maps for effective and economical control of cotton root rot.
The classification and clustering of e-documents, online news, blogs, e-mails and digital libraries need text mining, machine learning and natural language processing techniques to get meaningful knowledge. This paper provides a review of the principles, advantages and applications of document classification, Document clustering and text mining, focusing on the existing literature.
The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
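The "non-linear mapping to a very high-dimension feature space" in the abstract above is usually realized implicitly through a kernel function. A sketch showing that the degree-2 polynomial kernel equals an ordinary inner product under an explicit feature map, so the high-dimensional space is never materialized:

```python
import math

def poly_kernel(x, y, degree=2, c=1.0):
    """Polynomial kernel: (x . y + c) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + c) ** degree

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (with sqrt(2) scaling
    so that phi(x) . phi(y) matches the kernel exactly)."""
    return (x[0] ** 2, x[1] ** 2,
            math.sqrt(2) * x[0] * x[1],
            math.sqrt(2) * x[0], math.sqrt(2) * x[1], 1.0)

x, y = (1.0, 2.0), (3.0, 0.5)
k = poly_kernel(x, y)                                     # kernel value
k_explicit = sum(a * b for a, b in zip(phi(x), phi(y)))   # same value via phi
```

This identity is what lets the SVM build a linear decision surface in the 6-dimensional (or far larger) feature space while only ever computing 2-dimensional dot products.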
The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
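ID3, described above, selects each splitting attribute by information gain, i.e. the reduction in label entropy achieved by the split; a minimal sketch on hypothetical toy data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, labels):
    """Entropy reduction from splitting `rows` on attribute index `attr`."""
    n = len(rows)
    value_counts = Counter(row[attr] for row in rows)
    remainder = 0.0
    for value, count in value_counts.items():
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += (count / n) * entropy(subset)
    return entropy(labels) - remainder

# hypothetical toy data: (weather, windy) -> play
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["yes", "yes", "no", "no"]
gain_weather = information_gain(rows, 0, labels)   # weather predicts perfectly
gain_windy = information_gain(rows, 1, labels)     # windy tells us nothing
```

ID3 would split on the attribute with the larger gain (here, weather) and recurse on each branch, which is exactly the greedy top-down synthesis the paper summarizes.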