An Opinion Mining Task in Turkish Language: A Model for Assigning Opinions in Turkish Blogs to the Polarities *

Journalism and Mass Communication, ISSN 2160-6579
March 2013, Vol. 3, No. 3, 179-198
Çiğdem Aytekin
Marmara University, Istanbul, Turkey
Global changes have taken place at breakneck speed in many fields with the Web 2.0 era, which can be described
as the new Internet trend. Web pages, which were once static structures, have become dynamic pages created by
users, and in this respect they can be said to have been democratized. Social media, which took shape in this era,
provide a large data flow for businesses and present new opportunities for creating effective strategies. Today’s
Internet contains many blogs that include customer opinions about the products/services that customers own.
This environment, which in a way globalizes customer opinions, is a new medium worth examining because it
increases business-customer interaction and because of its carrier nature; it provides businesses with text data
that can be analyzed in the field of Customer Relationship Management. Thus, businesses should follow blogs to
see how the products/services they provide are received by customers, and should regard blogs as an important
medium on which they can conduct effective analyses. For this purpose, the study proposes a model that assigns
opinions in Turkish blogs to polarities. Opinion mining methods are used in the model, and, to obtain a general
view of products/services, a methodology is devised that assigns the text-based opinion data in Turkish blogs to
the positive and negative poles. The success of the model’s polarity assignment is evaluated with the precision
measure.
Keywords: opinion mining, text classification, sentiment classification, semantic orientation, positive/negative
polarity
Acknowledgments: The author would like to thank Sepandar Kamvar for permission to use the color codes of the
sentiment words on the We Feel Fine web site.
Çiğdem Aytekin, Ph.D., Faculty of Engineering, Marmara University.
Introduction
Today, social networks are also called the new generation media. They give a daily increasing number of users
the chance to gather, and interactive environments are thus created around messages noteworthy enough to
share. In this way, the cooperation of large masses and joint production become possible, and new tendencies are
given a chance. Thanks to these environments, users have moved from being “content consumers” to being
“content creators”. That is, while users used to share their views on a product/service with people they know and
those close to them by word of mouth, they have now started to use these environments as well. These
environments, which in a way globalize customer opinions, are analysis tools for businesses, given today’s
customer-focused structuring. By analyzing the data they acquire from such environments, businesses are
required to develop suitable strategies
David Publishing
AN OPINION MINING TASK IN TURKISH LANGUAGE
180
in accordance with customer choices and to update these strategies constantly.
On the other hand, these environments, which include customer opinions on products/services, contain text
data, and these data are not structured in the way that data stored in a database are. Today, significant
information is obtained by analyzing structured data with data mining techniques. Yet the unstructured form of
text data makes knowledge extraction on different topics harder. At the same time, the amount of useful text data
increases constantly, and in many cases the knowledge that can be extracted from these data is needed.
Examining these data manually is very hard and, in most cases, impossible. The need to analyze text data and
the desire to reduce the time spent on manual work led to developments in this direction, and thus the concept of
text mining emerged. With text mining, text data are given structure; after this refinement they can be analyzed
with data mining techniques, and valuable knowledge can be obtained. In this way, hidden and unknown
meanings in the text data can be derived. When the text data in question are opinion statements, the tasks of
opinion mining, an application field of text mining, come to the fore.
In general, opinion mining is the process of automatically obtaining knowledge from text data that state
opinions. In this context, as stated in Wikipedia (Sentiment Analysis, 2013), opinion mining draws on a broad
area of Computational Linguistics, Text Mining, and Natural Language Processing. When the explanations of
opinion mining in the literature are examined, it can be seen that different authors express similar views.
CHEN, ZHU, Kifer, and Lee (2010) based opinion mining on the question “What do people think about …?”.
According to them, this question becomes more important on web platforms where people state their opinions.
As people share their positive and negative opinions about products in these environments, they in effect come
to represent the general opinion about the product. These consumer experience data can be analyzed with
opinion mining methods, and the consumption behaviors of new customers can be affected by these opinions.
Conrad and Schilder (2007) evaluated customer opinions on products/services from the perspective of both
individuals and businesses. According to them, individual prospective consumers may want to gain a general
perspective on a product or service they want to purchase. This need also applies to businesses: they too may
wish to gain a general view of how the service they provide or the product they offer is received.
LIU (2013) also approached the topic from two angles, just as Conrad and Schilder (2007) did. For him,
consumer experiences are vital for making well-considered decisions at the individual level. With opinion
mining applications, a general perspective on a product or service can be obtained without reading many
comments. Similarly, opinion mining is important for businesses as well. For example, a business must know
how its product or service is perceived by consumers. Moreover, it should also know how rival products are
perceived. It can then carry out the necessary evaluations of the product or service and develop innovations
accordingly.
Falcon (2010) described opinion mining as “a technology whose validity was proved”; he emphasized that it
uses statistical models and software, and stated that it had been widely used as a listening, analysis, and
relationship tool in defence, public administration, and marketing.
The work that can be done with opinion mining can be evaluated in terms of opinion mining tasks. These
tasks are described below.
Opinion Mining Tasks
When studies of opinion mining tasks in the literature are examined, different authors appear to make their
evaluations within different categories, yet there are some common points. The differences stem mostly from
the purpose of each study. Some opinion mining tasks are given below:
(1) Determining the positive-negative (PN) pole of the text. This is one of the most commonly used tasks
of opinion mining. The task tries to determine whether user opinions fall into the positive or the negative pole.
For example, it may seek an answer to the question “Is the general view of users on our products/services
positive or negative?”. Esuli and Sebastiani (2006), LIU (2007a), and Abbasi, CHEN, and Salem (2008) are
examples of authors who take up this task.
(2) Determining the intensity of the PN pole of the text. In this task, the text-based opinion data that were
assigned to the PN poles are ranked on certain levels. For example, it may seek an answer to the question “How
many of the opinions that approach our products/services positively/negatively are weakly positive/negative,
how many are moderately positive/negative, and how many are strongly positive/negative?”. Esuli and
Sebastiani (2006) are an example of authors who take up this task.
(3) Feature-based mining. This task tries to determine whether users’ findings about the features of the
products/services they own are positive or negative. For example, it may seek an answer to the question “What
do the users think about the Y feature of our X product/service?”. Leneve (2010), Verbeke and Eynde (2013),
and QIU, LIU, BU, and CHEN (2010) are examples of authors who take up this task.
(4) Comparative sentence and relation extraction. This task tries to extract comparative relations.
Differences are determined by comparing one object with another. For example, an evaluation that one brand is
superior to another can be extracted from the sentence “X brand car is cheaper than Y brand car”. Leneve
(2010), LIU (2007b), and Verbeke and Eynde (2013) are examples of authors who take up this task.
(5) Treating an issue as a classification problem. In this task, an issue is raised and a classification
regarding this issue is attempted. For example, a person who abandons a brand can be classified on the basis of
the sentence “I will never an x brand product”. Verbeke and Eynde (2013) are an example of authors who take
up this task.
On the other hand, determining the subjectivity or objectivity of a text is another opinion mining task found
in the literature. Yet it has not received as much attention as the others. This is because, by the nature of
opinion mining, the text data are statements of opinion and are therefore inherently subjective.
Since this study is about determining the positive/negative pole of a Turkish opinion, this classification task
is examined in more detail below within the framework of the literature.
Literature Review
Classifying text data that state opinions is the opinion mining task for which the largest number of
applications has been developed, and most of these classifications are based on the positive/negative pole
distinction. Yet opinion mining faces many difficulties. An important portion of these concerns spelling, as
discussed in the fourth part of the text, “A Model for Assigning Opinions in Turkish Blogs to the Polarities”:
when users write comments stating their opinions on web platforms, they generally use colloquial language,
because these environments are informal by nature. Stavrianou and Chauchat (2008) also emphasized this
problem and defined it as the difference between “usage regarding the colloquial language” and “usage in an
online newspaper”. Similarly, Pang and Lee (2008) stated that opinion text behaves differently from the text
handled in classic text mining applications.
When the literature on opinion classification is examined, it is seen that supervised, unsupervised, and
semi-supervised learning techniques have all been studied. For example, Turney (2002) used unsupervised
learning techniques. In that study, he focused on adjective- and adverb-based phrases and classified user
reviews as “recommended” or “not recommended”. The method has an average accuracy of 74%.
Another example is the study of Devitt and Ahmad (2007) in the field of financial news. Here, in addition
to determining the pole of a news text, the study also tried to establish the intensity of the pole. As mentioned
in the second part of the text, “Opinion Mining Tasks”, this is a different opinion mining task. To this end, a
scale was devised in the study to measure how positive or negative a text is. The scale consists of seven levels,
from very positive to very negative. The suggested method has an accuracy of 46% in distinguishing positive
from negative.
Dave, Lawrence, and Pennock (2003) proposed various techniques for classifying positive/negative product
reviews, selecting appropriate features and metrics and combining the scores obtained from the training set. The
study compares text mining algorithms such as Naive Bayes and Support Vector Machines; the highest accuracy
of the method is 85.3%.
Esuli and Sebastiani (2005) suggested a semi-supervised learning method to determine the positive/negative
orientation of terms. According to them, the definitions of similarly oriented terms tend to be similar. The
method, which makes use of lexical relations such as synonymy, antonymy, and hypernymy, is based on
WordNet. The accuracy of the method is given separately for three classifiers: 83%, 87%, and 88%.
Lastly, Kamps, Marx, Mokken, and De Rijke (2004), in a study that used WordNet to measure the semantic
orientation of adjectives, suggested interpreting topics on scales such as positive/negative, ugly/beautiful,
powerful/weak, excellent/poor, and active/passive. The accuracy of the method is 67.18%.
This study addresses the opinion classification task of opinion mining and proposes “a model which will
assign the customer post/comments on Turkish Blogs to the poles positively or negatively”. Here, the accuracy
is 72.71%, which is discussed in the fourth part of the text, “A Model for Assigning Opinions in Turkish Blogs
to the Polarities”.
A Model for Assigning Opinions in Turkish Blogs to the Polarities
The new medium used in this study is the blog. Blogs include user opinions that express sincere comments
and sentiments on a specific subject. Moreover, since both the blogger and the reader take part in this medium
voluntarily, clear and transparent user views, which are difficult to obtain in other ways, can be reached more
easily. Blogs, which bring together individuals with similar interests, include millions of comments, and their
number is increasing day by day. Problems, expectations, approaches, and complaints can all be found in this
medium. Blogs are considered to be a “media that is developed by the consumers”. For this very reason, the
body of text in blogs includes crucial data. Therefore, enterprises should follow blogs in order to learn how the
products and services they offer are received by customers, and they should see blogs as an important medium
through which they can perform effective analysis.
In addition, the comments in this medium are crucial sources of feedback since they are written by the users
themselves, whereas feedback is limited in the traditional mass media. The information gathered from this
medium offers a solution to this limitation, and enterprises will be able to increase efficiency by developing
strategies for their decision processes. Furthermore, while the data in an enterprise’s database are available only
to that enterprise, blogs are open to anyone. Positive or negative opinions can therefore be seen by anyone who
reads the blog. For this reason, enterprises have to follow and analyze blogs in order to protect their reputations.
Accordingly, this study proposes a model that assigns the text-based opinion data in Turkish blogs to
positive and negative polarities in order to present a general view of products and services. The model is a
semi-supervised learning model. The training set consists of English words that state sentiments. In order to
calculate each word’s probability of belonging to the positive or negative polarity, the color codes assigned to
the words were used, and the correlation between color and word meaning was established through an iterative
test-and-inspection process.
In order to perform the polarity assignment automatically, a program called “Opinion Polarity Detection”
was developed. The program was written in Visual Basic on the Microsoft .NET Framework, with Microsoft
SQL Server 2005 as the database. It classifies the text data according to the Naive Bayes algorithm, which is
simple to use and produces efficient results. For spell-checking, the error correction module of the Microsoft
Word text editor was used. In addition, the length of posts and comments in terms of word count was analyzed,
and its effect on the success of the polarity assignment was studied.
Preparation of Data
The data in the model consist of customer opinions gathered from Turkish blogs. These text data can be a
blogger’s posts regarding readers’ comments, or they can be readers’ comments on the relevant post.
A test database was developed specifically for this study. This database is a sample comprising 350 positive
and 350 negative posts/comments. It includes posts/comments about products and services under the heading
“white goods, built-in products, electronics, small home appliances, and heaters-air conditioners”. The most
important reason for selecting this heading is that these products give rise to posts and comments about services
as well as about the products themselves. In addition, using this heading makes it possible to include posts and
comments about both domestic and foreign products and services. The Google Blog Search engine was used in
Turkish to search for posts/comments, and using brand names as keywords was found to be the easiest way of
finding text data about the enterprise in question. Instead of searching for posts about one single brand, the
search was diversified by covering as many brands as possible. In addition, the following points were observed
in the posts/comments collected:
(1) Both long and short entries were included in the posts/comments. The average word count was 86.66 for
the negative posts/comments and 29.66 for the positive ones;
(2) Data collection included entries that could be evaluated semantically as belonging to the positive or the
negative pole;
(3) Posts/comments that included positive opinions about one property of the product or service and
negative opinions about other properties were not included. A post/comment had to fall semantically into only
one of the positive or negative poles with respect to all of its features;
(4) Some of the collected posts/comments might fall into the positive/negative polarity for only a single
property, while others might fall into the related polarity for more than one property. The posts/comments were
not restricted in this respect;
(5) The data in the posts/comments might be texts about the properties of the products and services, or texts
about the enterprise’s approach. The data were not differentiated in this respect;
(6) The negative posts/comments were text data including complaints that express discontent and
dissatisfaction, while the positive posts/comments were text data including recommendations that express
contentment and satisfaction;
(7) The posts/comments were not differentiated by time, such as day, month, or year.
In order to ease collecting the posts/comments and transferring them to the program, a form was developed
in the PHP language and uploaded to the Web after a domain name, “data collection form”, was obtained. The
posts/comments found through the search engine on the basis of the points mentioned above were entered into
this form. The text data that were manually determined to be in the positive/negative polarity were copied into
this form together with their Uniform Resource Locator (URL) addresses, and a database was thus developed.
Setting up of the Model
XU and CHENG (2011), who have contributed greatly to the literature on opinion mining, related the words
“opinion” and “sentiment” as follows: “An opinion contains often sentiment words which can be classified into
polarities such as positive, negative, and neutral” (p. 11). On this basis, it was assumed that sentiment words
could be used as the training set in polarity determination, and, as a result of the research performed, it was
decided to use the sentiment indicator words on the website called We Feel Fine. In this way, a semi-supervised
learning model was developed.
(1) English training set for semi-supervised learning.
The training set consists of 2,178 sentiment indicator words that can be found at
http://www.wefeelfine.org/data/files/feelings.txt. The methodology of the website can be summarized as
follows (the explanations are limited to the information given at http://www.wefeelfine.org/methodology.html).
Data collection: The We Feel Fine website includes a “collection engine” that automatically collects human
feelings from online sources such as LiveJournal, MSN Spaces, MySpace, Blogger, Flickr, and Technorati. Its
purpose is to find the statements “I feel” or “I am feeling” in blog posts. These statements are saved into a
database and checked to see whether they include valid feelings. Valid feelings are adjective/adverb-based
words that were compiled manually and determined beforehand; there are 5,000 of them. If a statement in the
database includes even a single one of these 5,000 words, it is saved into the main database along with the name
of the blogger. By extracting the blogger’s name from the URL address, the blogger’s profile information can be
accessed via the blogging companies. In this way, information about the blogger’s age, gender, city, and
country, and even the weather conditions at the time of the post, is saved into this database. This process is
repeated every 10 minutes.
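The filtering step described above can be sketched as follows. This is a minimal illustration in Python, not the actual code of We Feel Fine: the function name `extract_feelings`, the regular expression, and the three-word stand-in for the predetermined list of valid feelings are all assumptions made for the example.

```python
import re

# Hypothetical stand-in for the manually compiled list of valid feelings.
VALID_FEELINGS = {"happy", "sad", "guilty"}

# Matches the trigger phrases "I feel" and "I am feeling" (case-insensitive).
TRIGGER = re.compile(r"\bI (?:feel|am feeling)\b", re.IGNORECASE)


def extract_feelings(sentence):
    """Return the valid feelings in a sentence, or [] if it has no trigger phrase."""
    if not TRIGGER.search(sentence):
        return []
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w in VALID_FEELINGS]
```

A sentence would be kept only when the returned list is non-empty, which mirrors the rule that a statement must contain at least one valid feeling word.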
Statistical measurements: The We Feel Fine database can be queried by users through the balls shown in the application.
When the application starts, balls scatter around the screen, and when one of these balls is clicked, a statement
including “I feel” or “I am feeling” appears. This screen is shown in Figure 1. Filters can also be applied by
gender, age, weather conditions, location, date, and feeling.
Colors of feeling: The color equivalents of feelings are as follows: bright yellow represents happy, positive
feelings; dark blue represents sorrowful, negative feelings; green represents calmness; and red represents anger.
Figure 1. We Feel Fine balls on the opening screen. Source: Retrieved February 28, 2013, from
http://www.wefeelfine.org/wefeelfine_pc.html.
(2) Using the Sentiment Dictionary in Turkish generated from the English training set.
The full list of the 2,178 valid sentiments, with their frequencies and assigned hexadecimal color codes, is
available at the We Feel Fine URL address given in the note to Table 1. A part of this list is presented in Table 1.
It was stated above that the words forming the sentiment list are basically adverbs and adjectives. Of the
2,178 words in this list, the 1,655 words that are used as adjectives and/or adverbs in Turkish were taken, and in
total 4,744 Turkish words/word groups were obtained, because a word can have more than one counterpart as
an adjective and/or adverb in Turkish. This dictionary of 4,744 lines will be referred to as the “Sentiment
Dictionary in Turkish” from now on. It should be noted, however, that the meanings of some English words
cannot be given with a single Turkish word. For example, the word “best” is translated as “en iyi”, “en uygun”,
… with the superlative marker “en”. For this reason, the Sentiment Dictionary in Turkish consists of translated
words or word groups.
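The one-to-many structure of such a dictionary can be sketched in Python as follows. This is only an illustration under stated assumptions: the color codes for “good” and “bad” are those listed in Table 1, while the Turkish counterparts chosen here, the function name `build_turkish_dictionary`, and the row layout are hypothetical and are not taken from the study’s dictionary.

```python
# English sentiment words with their We Feel Fine color codes (values from Table 1).
ENGLISH_COLORS = {"good": "FFF700", "bad": "07548A"}

# Illustrative translations (assumed): one English word may have several Turkish entries.
TRANSLATIONS = {
    "good": ["iyi", "güzel"],
    "bad": ["kötü", "fena"],
}


def build_turkish_dictionary(translations, colors):
    """Flatten the mapping into one (Turkish entry, English word, color) row per counterpart."""
    rows = []
    for english, turkish_entries in translations.items():
        for entry in turkish_entries:
            rows.append((entry, english, colors[english]))
    return rows
```

Each Turkish entry inherits the color code of its English source word, which is how two English words can expand into four dictionary lines in this small example.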
Table 1
Some of the Sentiment Words in English.
Word | Frequency | Assigned hexa color code
better | 128,155 | FFA401
bad | 93,390 | 07548A
good | 76,610 | FFF700
right | 40,683 | E97802
guilty | 31,591 | 004E6F
sick | 27,706 | 2E9127
same | 25,389 | 017E94
sorry | 23,779 | 00696F
well | 22,428 | E6C637
down | 20,847 | 18213E
alone | 17,988 | 595884
happy | 17,849 | FF7F00
Note. Retrieved February 28, 2013, from http://www.wefeelfine.org/data/files/feelings.txt.
Adjectives (“Adjectives”, 2007) come before nouns and qualify or determine them in different ways.
Adjectives are divided into the following groups in terms of function and meaning:
(1) Qualificative adjectives;
(2) Determinative adjectives;
(3) Demonstrative adjectives;
(4) Numeral adjectives: cardinal numeral adjectives, ordinal numeral adjectives, fractional numeral
adjectives, distributive numeral adjectives, and group number adjectives;
(5) Indefinite adjectives;
(6) Interrogative adjectives.
The adjectives in the Sentiment Dictionary in Turkish can be classified by type as follows: most of them are
qualificative adjectives (“sakin”, “berrak”, “iyi”, “sözlü”, “rizikolu”, … ), a small number are indefinite
adjectives (“bütün”, “başka”, … ), two are ordinal numeral adjectives (“birinci”, “ikinci”), and one is a group
number adjective (“ikiz”). Due to the nature of text mining, cardinal numeral adjectives (“bir”, “iki”, … ) were
not included; they were removed from the English dictionary of 2,178 words.
In terms of structure, adjectives (“Adjectives”, 2007) are divided into five groups: absolute adjectives,
derived adjectives, compound adjectives, enhanced adjectives, and adjectives as word/word groups. “Kırmızı”,
“iri”, “ufak”, … are examples of absolute adjectives in the Sentiment Dictionary in Turkish; “ağlayan”,
“oynayan”, “evsiz”, “çarpıklık”, and “yuvarlak” are examples of derived adjectives; and “cana yakın”,
“yurtsever”, “vatanperver”, “kısa boylu”, “ustalık”, and “evsiz barksız” are examples of compound adjectives.
At this point, it is useful to explain the derived adjective type further. If a suffix is added to a verb stem,
the resulting derived adjective is called a verbal adjective. Kurt (1998) defines the verbal adjective as follows:
words that resemble verbs since they indicate action, and that are considered adjectives since they form an
adjective clause by qualifying a noun, are called verbal adjectives. Verbal adjectives are derived with the suffixes
“-en, -esi, -mez, -(a)r, -dik, -ecek, and -miş”, which are added to verb stems: giden çocuk, yıkılası şehir, pişmez
et, bakar kör, çalmadık kapı, gelecek yıl, sıkılmış limon, etc. Some of the words in the English training set
carry the suffixes “-ing” and “-ed”. These words/word groups, which correspond to verbal adjectives in Turkish,
were translated by adding the suffixes “-en, -miş” to the ends of the corresponding verbs. Examples of English
words translated in this way are given in Table 2.
Table 2
Examples of Verbal Adjectives in English Training Set
English word | Turkish verbal adjective
crying | ağlayan
playing | oynayan
wasting | israf edilen
growing | artan
changing | değişen
turning | dönen
asking | soran
pushing | iten
acting | hareket eden
loved | sevilen
accomplished | başarılmış
used | alışılmış
overwhelmed | bunalmış
stressed | baskı yapılmış
trapped | set çekilmiş
However, the training set contains some English words that cannot be translated as verbal adjectives with the
suffixes “-en, -miş” even though they carry the suffixes “-ing/-ed”: beginning-başlangıç, tired-yorgun, etc.
Adverbs, on the other hand, are words that affect the meanings of verbs, gerundials, adjectives, or other
adverbs in various ways (place, time, manner, amount, and interrogation), and that determine and grade them.
Most adverbs can also be used as adjectives or nouns. However, while an adjective comes before a noun and
qualifies or determines it, an adverb does not come before a noun (“Adverbs”, 2007). Examples of adverbs in
the Sentiment Dictionary in Turkish are presented in Table 3 by adverb kind.
Table 3
Adverb Examples From the Sentiment Dictionary in Turkish According to Adverb Kinds
Adverbs of manner | Adverbs of time | Adverbs of place | Adverbs of amount
eğri | öğleden sonra | ileri | pek
çocukça | evvela | ileri doğru | çok
az yükle | sabah | geri | fazla
tıklım tıklım | geç | geride | azıcık
abartarak | sürekli | ağı | çok az
övünerek | ilkin | aşağıya | sık
gül | hemen | yukarıya | seyrek
hiç | | üst kata |
elbette | | |
ne olursa olsun | | |
belki | | |
(3) Determining the polarity probability of the words using the assigned color codes.
As stated at the beginning of this section, the model established to determine polarities is a semi-supervised
learning model. In supervised learning, texts whose classes are known are used to train the system. In other
words, the system is trained to find the class of new texts automatically, based on the given texts. The vector
of the text whose class is to be found is processed with text mining algorithms and assigned to the class to
which it is related. In this study, however, there are no polarity-labeled texts with which to train the system.
Nevertheless, there are sentiment word/word groups associated with the colors that will be used to determine
the polarity. For this reason, the model was considered to be a “semi-supervised learning model”. The
probability of each word/word group in the Sentiment Dictionary in Turkish being in the positive/negative
polarity was determined from the color to which it was assigned, and these probabilities were used in the
Naive Bayes Algorithm.
To determine the probabilities, the fill colors of the word/word groups were first formed from the hexa color
codes given in the English Training Set. For this purpose, the RGB color codes corresponding to the hexa codes
were used: the hexa codes were transformed into RGB color codes using conversion programs so that the fill
colors could be formed.
RGB is a frequently used color space named after the first letters of the words “red-green-blue”
(“kırmızı-yeşil-mavi” in Turkish). Based on light, the codes of all colors in nature are specified with
reference to these three basic colors. When each color is mixed at 100%, white is obtained; when each is mixed
at 0%, black is obtained (RGB Color Space, 2013). Take the FFFF4F hexa code as an example. Here, the first FF
represents red, the second FF represents green, and 4F represents blue. When the FFFF4F hexa code is
transformed into an RGB code, the value 255 is obtained for red, 255 for green, and 79 for blue. In an RGB
code, each of these three colors has a value between 0 and 255.
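The conversion described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the conversion program used in the study; the function name `hex_to_rgb` is an assumption introduced here.

```python
def hex_to_rgb(hex_code: str) -> tuple[int, int, int]:
    """Split a six-digit hexa code into red, green, and blue values (0-255)."""
    code = hex_code.lstrip("#")
    if len(code) != 6:
        raise ValueError(f"expected six hexadecimal digits, got {hex_code!r}")
    # Each pair of hexadecimal digits encodes one channel: RR, GG, BB.
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_rgb("FFFF4F"))  # (255, 255, 79), the values quoted in the text
```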
The fill colors for all the word/word groups in the Sentiment Dictionary in Turkish were obtained in the
Microsoft Excel program using the RGB codes. The sentiment words updated with fill colors will be referred to
as the “Sentiment Database in Turkish” from now on. In the database, there are 108 different hexa codes.
In the next step, the HSL (hue-saturation-lightness) color codes of the word/word groups in the Sentiment
Database in Turkish were obtained. Conversion programs were again used to obtain the HSL codes from the hexa
or RGB codes.
HSL is a coding system that has a separate value for each of the hue, saturation, and lightness features of a
color. Hue takes a value between 0° and 360° and determines the place of the color on the wheel; it starts
with red at 0° and ends with red at 360°. Saturation, stated as a percentage, shows the dullness or brightness
of the color. One hundred percent is the maximum value. As the saturation decreases, the color approaches grey,
and at 0% the grey color scale takes effect. The lightness feature defines the lightness and darkness of the
color: a high L value means more white, while a low L value means more black
(How to Calculate a Complementary Colour (Inc. RGB/HSL Conversion)). Table 4 presents a part of the Sentiment
Database in Turkish, including fill colors and HSL codes. These HSL codes are used to calculate the probability
of each word/word group being in the positive/negative polarity.
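As a rough illustration of the hexa-to-HSL step, the sketch below uses Python's standard `colorsys` module. It is not the conversion program used in the study, and the function name `hex_to_hsl` is an assumption. Depending on how a given tool rounds, individual values may differ by one degree or one percentage point from those in Table 4.

```python
import colorsys


def hex_to_hsl(hex_code: str) -> tuple[int, int, int]:
    """Return (H in degrees, S in percent, L in percent) for a hexa code."""
    code = hex_code.lstrip("#")
    r, g, b = (int(code[i:i + 2], 16) / 255 for i in (0, 2, 4))
    # colorsys returns hue, lightness, saturation (note the order), each in 0..1.
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    return round(h * 360), round(s * 100), round(l * 100)


print(hex_to_hsl("FFFF4F"))  # (60, 100, 65), the starting point color
```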
The hypothesis regarding the calculation of the distance between colors: The color belonging to the
word/word group which has the most positive value in terms of meaning is determined as the “starting point
color”. The distance between the colors belonging to other word/word groups and the starting point color is
calculated. Therefore, while the word/word group belonging to the starting point color is in positive polarity
with 100% probability, other colors and the word/word groups corresponding to these colors will gradually
become distant from this point and their probability for being at positive polarity will decrease.
Table 4
A Part of the Sentiment Database in Turkish With Fill Colors and HSL Codes
Sentiment words Frequencies Hexa code Fill color H S L
en iyi 6,433 FFFF4F 60 100 65
hoş 76,610 FFF700 58 100 50
katkısız 3,098 FFD93C 48 100 62
büyük 17,058 FFD801 51 100 50
gururlu 3,604 FFB300 42 100 50
dayanıklı 4,128 FFB200 42 100 50
daha çok 128,155 FFA401 39 100 50
geçer 4,615 FF7F1C 26 100 55
bahtiyar 17,849 FF7F00 30 100 50
etkili 5,296 FF6600 24 100 50
cana yakın 3,884 FF4B02 17 100 50
acılı 3,283 FF1A00 6 100 50
kısmetli 5,716 FED73C 48 99 62
cümbüş 1,223 FEA740 33 99 62
başarılmış 5,312 FEA53F 32 99 62
bütün 8,151 FE992D 31 99 59
belirli 3,826 FE9901 36 99 50
The word/word group with the most positive meaning in the Sentiment Database in Turkish was determined to be
the Turkish equivalent of “best” (“en iyi”), and its fill color, given by the RGB code 255-255-79, was selected
as the starting point color. This color was “bright yellow”, which coincided with the fact that the positive
sentiments on the We Feel Fine website were bright yellow. The HSL code of the starting point color was
60-100-65. The value of H (60) indicated that the color was yellow; the value of S (100) indicated that it was
the brightest color, that is, it lay on the outermost circle of the wheel; and the value of L (65) indicated
that the color was lighter than the pure color at 50%, that is, shifted toward white. Figure 2 presents the
place of the starting point color in the color wheel, where H indicates 60° and S indicates 100%. The
L value will be used later.
Once the place of the starting point color in the color wheel is determined, the distance between the other
colors, whose HSL values are known, and this point can be calculated. The calculation has two steps:
Calculating the length of the third edge of the triangle using H and S values;
Calculating the length of the hypotenuse of the right angled triangle using the length of the third edge and
L value.
For example, consider calculating the distance between the color belonging to the word “ayrı (different)” and
the starting point color.
Calculating the length of the third edge of the triangle using H and S values: For this word, H = 358 and S = 81.
In order to calculate the distance between the two points, a triangle is formed by connecting the points with
the central point of the wheel and with each other. Two edge lengths of the triangle are known and are equal to
the S values of the two HSL codes. In addition, the angle between these edges can be calculated from the
H values of the HSL codes. Once
two edge lengths of the triangle and the angle between these two edges are known, the third edge length can be
calculated using the Cosine Law.
Figure 2. Place of the starting point color in the color wheel.
Calculating the length of the hypotenuse of the right angled triangle using the length of the third edge and
L value: In the first step, calculation was made using the H and S values of the two points in HSL color code
but L values were not included in the calculation. However, L values carry colors to a different point in vertical
plane.
The middle point of the height of the cylinder corresponds to an L value of 50% and represents the pure
color. An L value of 100% turns the color white, while an L value of 0% turns it black. On the other hand,
when the H and S values are 0, only the L value matters, and at this point the grey color scale from white to
black starts. In the grey color scale, the R, G, and B values are equal. For example, the HSL code of the grey
corresponding to the values R = 150, G = 150, and B = 150 is 0°, 0%, and 59% (150/255 ≈ 0.59). The grey color
scale can be seen in Figure 3 with its values of lightness and dullness.
Figure 3. Grey color scale, from dullness (0%) through pure color (50%) to lightness (100%).
For example, the L value of the word “ayrı” is 55, while the L value of the starting point color is 65. The L
value of the starting point color (65) is the maximum value in the Sentiment Database in Turkish. Therefore,
the wheel belonging to the L = 65 value will always lie above the wheels of the other word/word groups. In this
case, a perpendicular can be dropped from the starting point color to any wheel that has a different L value;
its length is the difference between the two L values. The third edge length calculated using the H and S
values in the first step forms the other edge of the right-angled triangle. Therefore, the length of the
hypotenuse can be calculated, and the distance between the word “ayrı” and the starting point color can be found.
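The two-step calculation can be sketched as follows. This is a minimal reading of the description above, assuming that the H, S, and L values are used directly as degrees and lengths; the function name `color_distance` is introduced here for illustration only.

```python
import math

START_HSL = (60, 100, 65)  # starting point color (hexa code FFFF4F) from the text


def color_distance(hsl, start=START_HSL):
    """Distance between an HSL color and the starting point color."""
    h1, s1, l1 = start
    h2, s2, l2 = hsl
    # Step 1: Cosine Law on the wheel. The two known edges are the S values,
    # and the angle between them is the difference of the H values.
    angle = math.radians(abs(h1 - h2))
    third_edge_sq = s1 ** 2 + s2 ** 2 - 2 * s1 * s2 * math.cos(angle)
    # Step 2: right-angled triangle whose legs are the third edge and the
    # difference of the L values; the hypotenuse is the distance sought.
    return math.sqrt(max(third_edge_sq, 0.0) + (l1 - l2) ** 2)


# The word "ayrı" has H = 358, S = 81, L = 55.
print(round(color_distance((358, 81, 55)), 2))
```

Because the cosine is symmetric, the 298° difference between H = 358 and H = 60 gives the same result as the 62° angle measured the short way around the wheel.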
Now, for the 4,744 word/word groups, the probability of each being in the positive polarity can be calculated
using the above calculations. In the database, the probability of being in the positive polarity was taken as
the basis, since subtracting it from 100 gives the probability of being in the negative polarity. Firstly, the
Sentiment Database in Turkish was sorted by distance to the starting point color, from the nearest to the
farthest. At this point, it was assumed that the word/word group belonging to the starting point color would be
in the positive polarity with a probability of 100% and that the word/word group in the last line of the
database would be in the positive polarity with a probability of 1% (a probability of 0% would make the product
in the algorithm calculation 0, since zero is the absorbing element of multiplication, so the probability was
taken to be 1%). In order to find the other probabilities, a coefficient had to be obtained according to these
two values.
When the color scale in the database sorted by distance was examined, it was found to coincide greatly with
the explanation quoted at the beginning of this section: “Happy positive feelings are bright yellow. Sad
negative feelings are dark blue. Angry feelings are red. Calm feelings are green”. In other words, the opposite
polarities were ordered between the colors “bright yellow” and “dark blue”. Therefore, the probabilities could
be ordered between 100% and 1%.
The most appropriate values for the probabilities of the 4,744 word/word groups being in the positive class,
within the interval between 100% and 1%, were obtained with the coefficient 0.5159. When the distance to the
starting point color was multiplied by this coefficient and the result was subtracted from 100, the values
99.9% and 1.01% were found, and these were the values closest to the interval between 100% and 1%. In this way,
106 unique probability values were calculated. The probability values obtained were divided by 100 to give
values less than 1, and these values were used in the algorithm. As a result, it was found that 1,745 of the
4,744 word/word groups had a probability of more than 50% of being in the positive polarity and 2,999 had a
probability of less than 50%. Table 5 shows a part of the final version of the Sentiment Database in Turkish,
which includes the probability of being in the positive polarity.
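A short sketch of the probability step, under the assumption that the distance is computed from the integer HSL codes of Table 4 as described earlier. The function names are introduced here for illustration. With these assumptions the sketch reproduces the Table 5 values for “hoş” and “gururlu”.

```python
import math

COEFFICIENT = 0.5159       # coefficient reported in the text
START_HSL = (60, 100, 65)  # starting point color


def distance_to_start(hsl, start=START_HSL):
    """Cosine Law on H and S, then the hypotenuse with the L difference."""
    h1, s1, l1 = start
    h2, s2, l2 = hsl
    angle = math.radians(abs(h1 - h2))
    third_edge_sq = s1 ** 2 + s2 ** 2 - 2 * s1 * s2 * math.cos(angle)
    return math.sqrt(max(third_edge_sq, 0.0) + (l1 - l2) ** 2)


def positive_probability(hsl):
    """Probability (in percent) of being in the positive polarity."""
    return 100 - COEFFICIENT * distance_to_start(hsl)


def negative_probability(hsl):
    """The negative probability is the complement of the positive one."""
    return 100 - positive_probability(hsl)


# HSL codes taken from Table 4.
for word, hsl in {"hoş": (58, 100, 50), "gururlu": (42, 100, 50)}.items():
    p = positive_probability(hsl)
    # The algorithm uses the value divided by 100.
    print(word, round(p, 4), round(p / 100, 6))
```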
(4) Assigning to polarities with Naive Bayes Algorithm.
Classification is a text mining task based on prediction. Through prediction, a text is assigned to a class
that has been determined in line with its properties. The procedure of assigning a text to the positive or
negative polarity is also examined within the framework of text classification in text mining. Many algorithms
have been developed for classification (Naive Bayes, support vector machine, Rocchio, decision trees). In this
study, the Naive Bayes Algorithm, which is simple to use and produces efficient results, was used for the
classification of the text data. This algorithm, which is based on calculating the effect of each criterion on
the result as a probability, is among the methods most preferred for class determination in cases that involve
two outcomes (positive/negative). The algorithm works according to this principle:
P(text-positive) = P(text-negative) = 1/2;
Probability of the text being in the positive polarity = ½ × the product of the positive-polarity
probabilities of every word/word group that is present both in the text and in the Sentiment Database in
Turkish;
Probability of the text being in the negative polarity = ½ × the product of the negative-polarity
probabilities of every word/word group that is present both in the text and in the Sentiment Database in
Turkish.
Table 5
A Part of the Sentiment Database in Turkish including the Probability for Being in the Positive Polarity
Sentiment words   Frequencies   Hexa code   Probability for being in the positive polarity (%)
en iyi            6,433         FFFF4F      99.99
hoş               76,610        FFF700      92.05474706
kısmetli          5,716         FED73C      89.1455142
katkısız          3,098         FFD93C      89.10427035
büyük             17,058        FFD801      88.80089123
gururlu           3,604         FFB300      82.09990792
dayanıklı         4,128         FFB200      82.09990792
daha çok          128,155       FFA401      79.66678813
belirli           3,826         FE9901      77.28985031
cümbüş            1,223         FEA740      75.97838179
başarılmış        5,312         FEA53F      75.11009634
bütün             8,151         FE992D      74.10444177
bahtiyar          17,849        FF7F00      72.19642672
The greater of these two values indicates the class of the text. The steps of the procedure of assigning to
polarities with the Naive Bayes Algorithm can be summarized in this way:
Since text mining practices require working only on text, the post/comment data are first cleaned of
punctuation marks and numbers, and all data are written in lower case letters;
The cleaned texts are spell checked. This spell check can be performed using various programs. For
example, the error correction module in the Microsoft Word text editor can be used;
Each word in the post/comment that will be analyzed is compared to the word/word groups recorded in the
Sentiment Database in Turkish and bit weighting is performed: the value 1 is given to the words that are found
and the value 0 to the words that are not. Repetitions in the text are not considered (the algorithms in
which the repetition numbers are used work with different principles. For this reason, these algorithms were
not used, although the repetition numbers were available in the English training set). At this point, it is
useful to refer to the task of finding stems: some words in the posts/comments may not be found in the
Sentiment Database in Turkish because they carry inflectional suffixes. This disadvantage could be resolved
through an application directed at finding the stems of the words. However, since the words in the Sentiment
Database in Turkish are adjective- and adverb-based, there is no need to find their stems. When adjectives
qualify or determine a noun, that is, when they are used as adjectives, they cannot take the noun inflectional
suffixes (case, possessive, and plural suffixes) that they can take when they are used alone, that is,
when they are considered as nouns (“Adjectives”, 2007). Adverbs, on the other hand, are non-finite words.
They do not take noun inflectional suffixes (case, possessive, plural suffixes, etc.). However, the adverbs
that can be used as nouns can take these suffixes when they are used as nouns (“Adverbs”, 2007). For this reason,
an application for finding stems was not developed;
In order to determine the polarity of the weighted text vector, the probabilities of the “1” cases are taken
into consideration and two values are calculated. These values indicate the probabilities of being in the
positive and negative polarities. As a result of the comparison, the post/comment is assigned to the polarity
whose value is greater.
In addition, the length of the posts/comments in terms of word number was examined and the effect of
length on the success of polarity assignment was investigated in the study.
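The steps above can be sketched in Python as follows. This is a simplified illustration, not the program developed in the study. The dictionary `SENTIMENT_DB` holds a few rounded values from Table 5 plus one hypothetical negative entry (“kötü”), and the helper names are assumptions. Spell checking is omitted, and Python's `lower()` does not apply Turkish-specific casing rules (for example, “I” becomes “i” rather than “ı”), which a real implementation would need to handle.

```python
import re

# Positive-polarity probabilities as fractions (Table 5 values divided by 100).
SENTIMENT_DB = {
    "hoş": 0.9205,
    "büyük": 0.8880,
    "dayanıklı": 0.8210,
    "daha çok": 0.7967,
    "kötü": 0.10,  # hypothetical entry, not taken from the paper
}


def clean(text):
    """Lower-case the text and remove punctuation marks and numbers."""
    text = re.sub(r"[^\w\s]|\d", " ", text.lower())
    return " ".join(text.split())


def classify(text, db=SENTIMENT_DB):
    """Return (polarity, positive value, negative value), or None if no entry matches."""
    padded = f" {clean(text)} "
    # Bit weighting: an entry counts once if present, however often it repeats.
    found = [entry for entry in db if f" {entry} " in padded]
    if not found:
        return None  # the text cannot be classified
    pos = neg = 0.5  # P(text-positive) = P(text-negative) = 1/2
    for entry in found:
        pos *= db[entry]
        neg *= 1.0 - db[entry]
    return ("positive" if pos > neg else "negative"), pos, neg


print(classify("Bu ürün hoş ve dayanıklı!"))
print(classify("masa sandalye"))
```

For long texts the products become very small (compare the values in the order of 1E-10 that the program reports), so an implementation might prefer to sum logarithms instead; the comparison between the two values is unaffected by that choice.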
Using the Model
In order to use this model, which assigns the posts/comments to the positive or negative polarity, a program
that automatically performs these procedures was developed. For this program, called “Opinion Polarity
Detection”, the Visual Basic language was used in the Microsoft .NET Framework environment and Microsoft SQL
Server 2005 was used as the database. Post/comment entry in the program interface is performed through these
steps:
Posts/comments can be entered manually or copied;
The text, if required, can be spell checked. In this case, the program contacts the error correction
module of the Microsoft Word text editor and offers alternatives regarding the spell check;
Through this procedure, the text can be assigned to the positive or negative polarity with the help of the
commands operating in the background; if none of its words is included in the Sentiment Database in
Turkish, the text cannot be classified.
Findings and Evaluation
(1) Findings Obtained with the Positive and Negative Precision Measures.
In order to test the success of the established model, a test database comprising 350 positive and 350
negative samples was formed. A sample of the test database can be seen in Table 6.
Table 6
A Sample of the Test Database
Product/Service (Brand) View URL Post/Comment
X Positive http://adimsoyadim.blogspot.com/ I am a positive comment
Y Negative http://adimsoyadim.blogcu.com/ I am a negative comment
The posts/comments obtained in this way were divided into two classes, those in the positive polarity and
those in the negative polarity, for test purposes, and their contents were copied to the files named
“pozitifyorum.txt” and “negatifyorum.txt”. For the automatic processing of the posts/comments in the two text
files by the program, related combo options were developed. In this way, the posts/comments whose polarities
are known can be read from the related text files and the whole analysis can be performed. Depending on
the hardware properties of the computer used, the program can test one word in approximately 0.002277
seconds. The results of both analyses are given in Table 7.
In this study, the model’s success of assigning to the related polarity was evaluated with Precision
Measure. Precision Measure is among the frequently used methods in measuring text classification activities
and can be stated as follows:
Precision = Number of the Posts/Comments classified as True/Total Number of Posts/Comments
Table 7
Results of the Assignment to the Related Polarity Obtained With the Test Database
Polarity   Total number of   Number of posts/comments   Number of posts/comments   Number of posts/comments
           posts/comments    classified as true         classified as false        that cannot be classified
Positive   350               253                        93                         4
Negative   350               256                        90                         4
Positive precision measure = 0.7228 = 72.28%/Negative precision measure = 0.7314 = 73.14%.
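The figures above follow directly from the counts in Table 7, as the short check below shows. The dictionary layout and function name are illustrative assumptions. Note that 253/350 is 0.72286, so the 72.28% reported for the positive polarity appears to be truncated rather than rounded.

```python
# Counts taken from Table 7.
RESULTS = {
    "positive": {"total": 350, "true": 253, "false": 93, "unclassified": 4},
    "negative": {"total": 350, "true": 256, "false": 90, "unclassified": 4},
}


def precision(row):
    """Posts/comments classified as true divided by the total number."""
    return row["true"] / row["total"]


for polarity, row in RESULTS.items():
    print(polarity, f"{precision(row):.4f}")

# Overall value across all 700 posts/comments.
overall = sum(r["true"] for r in RESULTS.values()) / sum(r["total"] for r in RESULTS.values())
print("overall", f"{overall:.4f}")
```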
At this point, the model's method of assigning the positive/negative pole was compared with the Turney
method, which can be considered close to it with respect to certain criteria, and their similarities and
differences were set out (Turney, 2002).
Similarities:
Both methods try to determine the positive/negative pole of the text and are examined within the context
of the opinion mining task;
Both methods are used in classifying the customer comments as “suggested ones/positive” or
“non-suggested ones/negative”;
In both methods, the comments were picked randomly from the domains in which they were found;
Both methods focus on the semantic orientation of the adjective/adverb based words.
Differences:
Turney Method is based upon the English language, yet contrary to this, the model mentioned in the study
is based on Turkish language;
Turney Method functions according to the unsupervised learning basis, the model mentioned in the study
functions accordingly with the semi-supervised learning basis;
In the Turney Method, the semantic orientation of a phrase is calculated as the mutual information between
the given phrase and the word “excellent” minus the mutual information between the given phrase and the word
“poor”. In the model mentioned in the study, words are ordered according to the distance between their color
codes, and from this ordering they receive their probability of being at the positive/negative pole;
Turney Method’s number of sample comments is 410; study model’s number of sample comments is 700;
The database of the Turney Method comes from four different domains (automobiles, banks, movies, and
travel destinations) taken from http://www.epinions.com. The comments on this site present
the objective views of real persons. In the test database of the study model, the comments come from many
different blogs. The comments in the blogs likewise present the objective views of real persons. Here, the
comment contents belong to the products/services titled “white goods, built-in products, electronics, small
home appliances and heaters-air conditioners”;
Since the comments in the Turney method come from four different domains, accuracy values were
calculated separately for each domain. These values are, respectively, 84%, 80%, 65.83%, and 70.53%. The
overall accuracy value across all 410 comments is 74.39%. The average accuracy value of the study model is
72.71%.
(2) Descriptive Statistics Regarding the Word Number.
In this section, descriptive statistics regarding the word number of the posts/comments are presented
and the effect of word number on the success of polarity assignment is investigated. In order to perform this
research, a module that counts the words of the posts/comments was added to the program. The data regarding
the column headings such as “post/comment”, “positive value”, “negative value”, “result”, and “word number”
can be listed using the database button in the interface. In Table 8, a sample of the test database accessed with
the database button can be seen. The columns regarding the results and word numbers for each polarity were
copied to Microsoft Excel and the evaluations below were made.
Table 8
A Sample of the Test Database Accessed With the Database Button
Post/Comment Positive value Negative value Result Word number
Comment 1 3.41284223E-10 4.48293103037529E-07 Negative 185
Comment 2 0.008944502 3.07589510232533E-05 Positive 30
Comment 3 0.00147905317 6.00719609260089E-05 Positive 51
Comment 4 0.00393425 0.0212485840062161 Negative 22
While the average word number of the negative posts/comments in the test database is 86.55, the average
word number of the positive entries is 29.66. From this, it can be inferred that satisfaction is expressed
with shorter entries, while dissatisfaction and complaints are expressed with longer entries.
For the positive posts/comments, whose average word number is 29.66, the average word number of the
entries classified as true is 27.45, while that of the entries classified as false is 36.15. From this, it
might be stated for the positive polarity that the entries below the average word number tend to be evaluated
as true, while the entries above the average word number tend to be evaluated as false.
For the negative posts/comments, whose average word number is 86.55, the average word number of the
entries classified as true is 93.41, while that of the entries classified as false is 70.41. From this, it
might be stated for the negative polarity that the entries below the average word number tend to be evaluated
as false, while the entries above the average word number tend to be evaluated as true.
Regarding the standard deviations of the word numbers:
For the positive polarity, the standard deviation of the entries classified as true is 22.04 and the standard
deviation of the entries classified as false is 36.30. The standard deviation of the 350 entries in this group
is 26.82. From this, it might be argued for the positive polarity that the entries with the smaller standard
deviation, that is, with a more balanced distribution in terms of length, are evaluated as true.
For the negative polarity, the standard deviation of the entries classified as true is 143.46 and the standard
deviation of the entries classified as false is 55.88. The standard deviation of the 350 entries in this group
is 126.79. From this, it might be stated for the negative polarity that the entries with the greater standard
deviation, that is, with a more unbalanced distribution (including extreme values), are evaluated as true. As a
matter of fact, the word numbers in this group lie between extreme points such as 2 and 1,774.
(3) Some Factors Causing Incorrect Results.
The difficulties encountered in the opinion mining field are mostly related to spell check. The program
performs the spell check using Microsoft Word 2007 and offers alternatives where any exist. The spelling
can generally be corrected with the related alternative. However, Microsoft Word does not perform a
semantic spell check, which is another text mining task. The cases in which words are not underlined
although they are misspelled can be given as an example. For example, the word “sakın” in the data collection
form might be copied as “sakin” by mistake (probably due to Turkish character problems). Since “sakin” is
also a Turkish word, it will not be underlined, and it will be accepted as correctly spelled until the user
notices and corrects it manually. If such a word, which is assumed to be spelled right, is included in the
Sentiment Database in Turkish, the program will find this misspelled word and give it a probability value.
This, in turn, will affect the classification and might cause a wrong evaluation of the polarities.
On the other hand, the deletion of the numbers and punctuation marks in the text, which the nature of text
mining requires, might cause other difficulties. For example, the statements “1.” or “2.” (“first”, “second”)
in the text are deleted in the first step, even though these two adjectives are included in the Sentiment
Database in Turkish in their written-out form. Had these statements been written with alphabetic characters
instead of numbers and periods, they would have been found in the Sentiment Database in Turkish.
The apostrophe (’) is frequently used for adding suffixes to the brand names in the posts/comments.
However, since this character is also considered to be a punctuation mark, it can be deleted in
the first step. For example, “x’tan çok memnunum” might be replaced by “x tan çok memnunum”. In this case,
the suffix “tan” will be considered to be a separate word and will be searched for in the Sentiment
Database in Turkish. The word “tan” (twilight), meaning “seher, alacakaranlık”, is included in the database.
Therefore, it will be analyzed and might cause a wrong evaluation of the polarities. The program was
configured to prevent this error: when a word and its suffix are separated only by an apostrophe, the suffix
is joined to the word, giving “xtan”. In this way, the evaluation of the suffix as a separate word is prevented.
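One possible way to implement this rule is sketched below with Python's `re` module. It is an assumption about how such a cleaning step could be written, not the program's actual code; the function name `clean_text` and the choice to treat both the straight and the typographic apostrophe alike are illustrative.

```python
import re

APOSTROPHES = "'’"  # straight and typographic apostrophes


def clean_text(text):
    """Join apostrophe-separated suffixes, then drop punctuation and numbers."""
    # An apostrophe standing between two word characters is removed without
    # leaving a space, so "x’tan" becomes "xtan" rather than "x tan".
    text = re.sub(rf"(?<=\w)[{APOSTROPHES}](?=\w)", "", text)
    # Every remaining punctuation mark or digit is replaced by a space.
    text = re.sub(r"[^\w\s]|\d", " ", text)
    return " ".join(text.lower().split())


print(clean_text("x’tan çok memnunum"))  # xtan çok memnunum
```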
The developers of the We Feel Fine website, Kamvar and Harris, state that “positive feelings tend to not
co-occur often with negative feelings, with a few exceptions” (Kamvar & Harris, 2011). The following
word/word groups, with their probabilities of being in the positive class, can be given as examples of these
exceptional words: obsessed (92.05%), damaged (75.11%), seedy (75.11%), failing (74.93%), wild (74.51%),
cruel (74.1%), unhealthy (74.04%), rebel (72.20%), and unappreciated (69.49%). The Turkish equivalents
of these words might cause the texts in which they are included to be assigned to the wrong polarities.
Conclusions
In today’s world, there are many blogs on the internet that keep customer opinions on products and
services and the number of these blogs increases day by day. These blogs, by their conveying nature, offer text
data that will benefit the enterprises in terms of customer orientation in the area of Customer Relationship
Management. Besides, while the data on the enterprise’s database are only accessible by the enterprise, these
blogs are open to anyone. Therefore, positive or negative opinions can be seen by anyone who reads that blog.
For this reason, enterprises have to follow and analyze these blogs in order to protect their reputations. In this
way, enterprises can develop related strategies regarding the decision processes and ensure the increase in
efficiency.
In this regard, this study uses the Opinion Mining Methods and proposes a model that will assign the
text-based opinion data in Turkish blogs to positive and negative polarities in order to present a general view on
products and services. In order to use this model that will assign the posts/comments to the polarities, a
program that will automatically perform these procedures was developed.
The model’s success of polarity assignment was evaluated with Precision Measure. The Positive Precision
Measure was calculated to be 72.28% and the Negative Precision Measure was calculated to be 73.14%. These
obtained results support the validity of the methodology and give hope for the model’s future. A test database
that will be developed by taking more samples will provide for the possible predictions and controls and better
results will be obtained. It is my wish that this study will contribute to the beginning of the “Opinion Mining
Studies in Turkish Language”.
As for the difficulties encountered at the test level, it might be stated that most of them were in the
area of spell check. As a matter of fact, a semantic spell check is required, which is another text mining
task. On the other hand, the deletion of the numbers and punctuation marks in the text, which the nature of
text mining requires, might cause other difficulties and might lead to wrong polarity evaluations. Besides, the
developers of the We Feel Fine website, whose sentiment words were used in the training set, state that
positive feelings tend to not co-occur often with negative feelings, although with a few exceptions.
As a matter of fact, these exceptional words might cause the texts in which they are included to be assigned
to the wrong polarities.
This classification model, which might be considered as “Listening to the Social Media” in a sense, is also
important in terms of showing the new dimension added to the enterprise-customer interaction. For this reason,
enterprises have to pay attention to the voice of the masses in order to develop their strategies in the related
area.
References
Abbasi, A., CHEN, H., & Salem, A. (2008). Sentiment analysis in multiple languages: Feature selection for opinion classification
in web forums (ACM Trans.). Information Systems, 26(3).
Adjectives. (2007). Source website for Turkish language and literature. Retrieved February 28, 2013, from
http://www.turkceciler.com/sozcuk_turleri/sifatlar.html
Adverbs. (2007). Source Website for Turkish Language and Literature. Retrieved February 28, 2013, from
http://www.turkceciler.com/sozcuk_turleri/zarflar.html
CHEN, B., ZHU, L. L., Kifer, D., & Lee, D. (2010). What is an opinion about? Exploring political standpoints using opinion
scoring model (In Proceedings of The Twenty-Fourth AAAI Conference On Artificial Intelligence (AAAI), Atlanta, GA., July
11-15, 2010, p. 1).
Conrad, J., & Schilder, F. (2007). Opinion mining in legal blogs. In Proceedings of the International Conference on Artificial
Intelligence and Law (ICAIL). New York, N. Y., USA: ACM, 2007, pp. 231-236.
Dave, K., Lawrence, S., & Pennock, D. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of
product reviews. In WWW ’03: Proceedings of the 12th International Conference on World Wide Web. New York, N. Y.,
USA: ACM, 2003, pp. 519-528.
Devitt, A., & Ahmad, K. (2007). Sentiment polarity identification in financial news: A cohesion-based approach. In Proceedings
of the 45th Annual Meeting of the Association of Computational Linguistics, Association for Computational Linguistics.
Prague, Czech Republic, June 2007, pp. 984-991.
LIU, B. (2013). Opinion mining. Retrieved February 28, 2013, from http://www.cs.uic.edu/~liub/fbs/opinion-mining.pdf
Esuli, A., & Sebastiani, F. S. (2005). Determining the semantic orientation of terms through gloss analysis. In Proceedings of
CIKM-05, 14th ACM International Conference on Information and Knowledge Management. Bremen, DE., 2005, pp.
617-624.
Esuli, A., & Sebastiani, F. S. (2006). A publicly available lexical resource for opinion mining. In Proceedings of Language
Resources and Evaluation (LREC). Italy, Genoa, May 24-26, 2006, pp. 417-422.
Falcon, J. (2010, August 19). Opinion mining in ediscovery. Retrieved February 28, 2013, from
http://jadefalconit.com/opinion-mining/opinion-mining-in-ediscovery
How to Calculate a Complementary Colour (Inc. RGB/HSL Conversion). Retrieved February 28, 2013, from
http://serennu.com/colour/rgbtohsl.php
Kamvar, S., & Harris, J. (2011). We feel fine and searching the emotional web. In WSDM ’11: Proceedings of the Fourth ACM
International Conference on Web Search and Data Mining. Hong Kong, February 9-12, 2011, pp. 117-126.
Kamps, J., Marx, M., Mokken, R. J., & De Rijke, M. (2004). Using WordNet to measure semantic orientation of adjectives. In
Proceedings of LREC-04, 4th International Conference on Language Resources and Evaluation. Volume IV, Lisbon, PT,
2004, pp. 1115-1118.
Kurt, H. (1998). Grammar for elementary school 7th grade. İstanbul, Morpa Kültür Yayınları.
Levene, M. (2010). An introduction to search engines and web navigation (2nd ed.). New Jersey, Wiley Publisher.
LIU, B. (2007a). Web data mining. Chicago, Springer.
LIU, B. (2007b). From web content mining to natural language processing. ACL-2007 Tutorial, Prague. Retrieved February 28,
2013, from http://www.cs.uic.edu/~liub/acl-07-tutorial-wcm-to-nlp.pdf
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2).
QIU, G., LIU, B., BU, J. J., & CHEN, C. (2010). Opinion word expansion and target extraction through double propagation.
Computational Linguistics, 1(1), 1-18.
RGB Color Space. (2013, February 13). Wikipedia. Retrieved February 28, 2013, from
http://tr.wikipedia.org/wiki/RGB_renk_uzay%C4%B1
Sentiment Analysis. (2013, February 28). Wikipedia. Retrieved from http://en.wikipedia.org/wiki/sentiment_analysis
Stavrianou, A., & Chauchat, J. H. (2008). Opinion mining issues and agreement identification in forum texts. In FODOP 2008,
Atelier Fouille des Données d’Opinions. Fontainebleau, France, May 27, 2008, pp. 51-58.
Turney, P. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL ’02). Morristown, N. J., USA,
July 2002, pp. 417-424.
Verbeke, M., & Eynde, W. (2013). Opinion mining. Retrieved February 28, 2013, from
http://people.cs.kuleuven.be/~bettina.berendt/webmining10/l3.pdf
XU, F. Y., & CHENG, X. W. (2011, January 19). Opinion mining. Retrieved June 7, 2011, from
http://ebookbrowse.com/opinion-mining-2011-lecture-pdf-d90656735