Conference Paper

Semi-supervised Acquisition of Croatian Sentiment Lexicon


Abstract

Sentiment analysis aims to recognize subjectivity expressed in natural language texts. Subjectivity analysis tries to answer whether a text unit is subjective or objective, while polarity analysis determines whether a subjective text is positive or negative. The sentiment of sentences and documents is often determined using some sort of sentiment lexicon. In this paper we present three different semi-supervised methods for the automated acquisition of a sentiment lexicon that do not depend on pre-existing language resources: latent semantic analysis, graph-based propagation, and topic modelling. The methods are language-independent and corpus-based, and hence especially suitable for languages with very scarce resources. We use the presented methods to acquire a sentiment lexicon for the Croatian language. The performance of the methods was evaluated on the task of determining both subjectivity and polarity (the subjectivity + polarity task) and on the task of determining the polarity of subjective words (the polarity-only task). The results indicate that the methods are especially suitable for the polarity-only task.
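
Since the full text is not available on this page, the following is only a minimal sketch of the seed-based scoring pattern that the three methods in the abstract share: a candidate word is scored by how much closer it lies to a small positive seed set than to a negative one in some corpus-derived vector space (an LSA space, a graph neighbourhood, or a topic distribution). The function and the scoring rule are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def polarity_score(word_vec, pos_seed_vecs, neg_seed_vecs):
    """> 0 suggests positive polarity, < 0 negative, near 0 objective/neutral."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    pos = np.mean([cos(word_vec, s) for s in pos_seed_vecs])
    neg = np.mean([cos(word_vec, s) for s in neg_seed_vecs])
    return pos - neg

# Toy usage with random stand-in vectors.
rng = np.random.default_rng(0)
print(polarity_score(rng.random(50), [rng.random(50)], [rng.random(50)]))
```
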


... Even though most research on sentiment lexicon acquisition and lexicon-based sentiment classification deals with English, there has been some work on Slavic languages as well, including Macedonian (Jovanoski et al., 2015), Croatian (Glavaš et al., 2012b), Slovene (Fišer et al., 2016), and Serbian (Mladenović et al., 2016). While we follow the work of Glavaš et al. (2012b), who focused on the task of semi-supervised lexicon acquisition, we turn our attention to evaluating the so-obtained lexicons on the task of sentiment classification. ...
... We acquired a domain-specific lexicon of unigrams, bigrams, and trigrams (henceforth: n-grams) using a semi-supervised graph-based method. We follow previous work (Hatzivassiloglou and McKeown, 1997; Turney and Littman, 2003; Glavaš et al., 2012b) and employ bootstrapping, which amounts to manually labeling a small set of seed words whose labels are then propagated across the graph. For this, we use a random walk algorithm. ...
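
A minimal sketch of the random-walk propagation this excerpt describes, assuming a word co-occurrence graph whose nodes include the seed words and using networkx's personalized PageRank (the excerpt does not name the exact algorithm parameters):

```python
import networkx as nx

def propagate(graph, pos_seeds, neg_seeds, alpha=0.85):
    """Words are nodes, co-occurrence strengths are edge weights; the walk
    restarts at the seed words, so rank mass spreads out from them."""
    pos_rank = nx.pagerank(graph, alpha=alpha,
                           personalization={w: 1.0 for w in pos_seeds})
    neg_rank = nx.pagerank(graph, alpha=alpha,
                           personalization={w: 1.0 for w in neg_seeds})
    # Positive difference -> positive sentiment, negative -> negative.
    return {w: pos_rank.get(w, 0.0) - neg_rank.get(w, 0.0) for w in graph}
```
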
... being within the interquartile range) of the original corpus (∼3.8M sentences). Having the set of "average sentences", we used the Croatian gold standard sentiment lexicon created by Glavaš et al. (2012), translated it to Serbian with a rule-based Croatian-Serbian translator (Klubička et al., 2016), combined both lexicons, extracted unique entries with a single sentiment affinity, and used them as seed words for sampling sentences for manual annotation. The final pool of seed words contains 381 positive and 239 negative words (neutral words are excluded). ...
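
An illustrative sketch of the seed-pool construction described above (the input format is an assumption): merge the lexicons and keep only words assigned a single, non-neutral sentiment affinity.

```python
def build_seed_pool(lexicons):
    """lexicons: list of dicts {word: 'pos' | 'neg' | 'neut'}."""
    affinities = {}
    for lex in lexicons:
        for word, label in lex.items():
            affinities.setdefault(word, set()).add(label)
    # Keep words with exactly one affinity, excluding neutral ones.
    return {w for w, labs in affinities.items()
            if len(labs) == 1 and labs != {"neut"}}

hr = {"dobro": "pos", "loše": "neg", "stol": "neut"}   # toy Croatian entries
sr_translated = {"dobro": "pos", "loše": "neg"}
print(build_seed_pool([hr, sr_translated]))             # -> {'dobro', 'loše'}
```
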
Preprint
Full-text available
Expression of sentiment in parliamentary debates is deemed to be significantly different from that on social media or in product reviews. This paper adds to an emerging body of research on parliamentary debates with a dataset of sentences annotated for the detection of sentiment polarity in political discourse. We sample the sentences for annotation from the proceedings of three Southeast European parliaments: Croatia, Bosnia-Herzegovina, and Serbia. A six-level schema is applied to the data with the aim of training a classification model for the detection of sentiment in parliamentary proceedings. Krippendorff's alpha, measuring the inter-annotator agreement, ranges from 0.6 for the six-level annotation schema to 0.75 for the three-level schema and 0.83 for the two-level schema. Our initial experiments on the dataset show that transformer models perform significantly better than those using a simpler architecture. Furthermore, regardless of the similarity of the three languages, we observe differences in performance across languages. Parliament-specific training and evaluation shows that the main reason for the differing performance between parliaments seems to be the differing complexity of the automatic classification task, which is not observable in annotator performance. Language distance does not seem to play a role in either annotator or automatic classification performance. We release the dataset and the best-performing model under permissive licences.
... Our approach was based on a sentiment lexicon trained on a Croatian corpus (which corresponds very closely to Serbian); it consists of two lists of words, each containing approximately 37,000 lemmas (dictionary base forms of words), ranked by their positivity and negativity. The ranks were created automatically from small positive and negative seed sets and co-occurrence frequencies, using the PageRank algorithm (Glavaš et al. 2012). In order to use the lexicons, we lemmatised the letters, so the actual counting was based on a comparison between lemmas. ...
Article
Full-text available
We studied prewar public discourse by analysing the origin, content and sentiment of more than 4,000 letters written by people from all walks of life and published in the Belgrade broadsheet Politika loyal to the regime of Slobodan Milošević during the three years directly preceding the Yugoslav wars. Our analysis combined lexicon-based tools of automated topic and sentiment analysis with data on the sociodemographic characteristics of the letter writers and their localities. The results of our analysis show the importance of the politicisation of a history of violence in shaping public discourse in the run-up to war.
... Another problem is the lack of lexical resources for sentiment and emotions in the Croatian language. Glavaš and co-workers [10] developed a Croatian sentiment lexicon called CroSentiLex, which consists of positive and negative lists of words ranked with PageRank scores. However, there is no lexicon available for the analysis of emotions in the Croatian language. ...
Conference Paper
Full-text available
The research aims to identify topics and sentiments related to the COVID-19 pandemic in Croatian online news media. For the analysis, we used news related to the COVID-19 pandemic from the Croatian portal Tportal.hr published from 1 January 2020 to 19 February 2021. Topic modelling was conducted using the LDA method, while dominant emotions and sentiments related to the extracted topics were identified with the National Research Council Canada (NRC) word-emotion lexicon, created originally for English and translated into Croatian, among other languages. We believe that the results of this research will enable a better understanding of the crisis communication in the Croatian media related to the COVID-19 pandemic.
... Motivational messages written by the robot Matko at each level are selected from a subset of positively labelled lemmas in the lexicon CroSentiLex [17]. The minimum level of positive sentiment of the selected words is 0.68. ...
Article
Full-text available
One of the main drawbacks of delivering new teaching lessons in e-learning systems is the lack of motivation for using those systems. This paper analyses which elements of computer games for learning mathematics have a beneficial effect on intrinsic motivation and give students continuous feedback in order to improve the learning process. While the control group has access to the basic version of the educational computer game, the experimental group uses the version enriched with additional motivational elements which include enhanced graphics for indulging in the game, messages of support while playing the game, and the possibility to compare results with fellow peers in terms of trophies and medals won.
... Sentiment lexicons of various sizes, quality and development methods exist for most Slavic languages; examples are lexicons for Bulgarian (Kapukaranov and Nakov 2015), Croatian (Glavaš et al. 2012), Czech (Veselovská 2013), Macedonian (Jovanoski et al. 2015), Polish (Wawer 2012) and Slovak (Okruhlica 2013). ...
Chapter
This paper deals with automatic two-class document-level sentiment classification. We retrieved textual documents with political, business, economic and financial content from five Slovenian web media. By annotating a sample of 10,427 documents, we obtained a labelled corpus in the Slovenian language. Five classifiers were evaluated on this corpus: multinomial naïve Bayes, support vector machines, random forest, k-nearest neighbour and naïve Bayes, of which the first three were also used in the assessment of the pre-processing options. Among the selected classifiers, multinomial naïve Bayes outperforms the naïve Bayes, k-nearest neighbour, random forest and support vector machines classifiers in terms of classification accuracy. The best selection of pre-processing options achieves more than 95% classification accuracy with multinomial naïve Bayes and more than 85% with the support vector machines and random forest classifiers.
... [4] Available at http://www.wordle.net [5] Available at http://bib.irb.hr [6] Available at http://scrapy.org [7] Available at http://www.postgresql.org [8] PostgreSQL uses dictionaries to eliminate words that should not be considered in a search (so-called stop words), and to normalize words so that different derived forms of the same word will match (lexemes). ...
Article
Full-text available
The social web has become a major repository of social and behavioral data that is of exceptional interest to the social science and humanities research community. Computer science has only recently developed various technologies and techniques that allow for harvesting, organizing and analyzing such data and provide knowledge and insights into the structure and behavior of people online. Some of these techniques include social web mining, conceptual and social network analysis and modeling, tag clouds, topic maps, folksonomies, complex network visualizations, modeling of processes on networks, agent-based models of social network emergence, speech recognition, computer vision, natural language processing, opinion mining and sentiment analysis, recommender systems, user profiling and semantic wikis. All of these techniques are briefly introduced, example studies are given, and ideas as well as possible research directions in the field of political attitudes and mentalities are outlined. Finally, challenges for future studies are discussed.
Article
Full-text available
The COVID-crisis has made significant changes in the educational process of many countries, including the need for new management decisions that would solve the complex problem of accelerating the development of online resources for distance learning. Management, particularly in education, is valuable when it is able to combine both general and specific goals, especially when it comes to a specific educational process for training future TV and radio journalists, advertisers and PR-managers, screenwriters and directors, sound directors, TV presenters, film and cameramen. The peculiarity of these professions is the combination of both creative and technological components of production and placement of professional audio-video content, i.e. content produced by TV and radio companies, film or TV studios, advertising agencies, and aimed at a mass audience. One of the basic priorities in training such specialists is, first of all, the practice, which is based on the planned implementation of educational audiovisual projects and the ability to put them into effect in certain circumstances, including the COVID-crisis caused by the COVID-19 virus. Therefore, the aim of the article is to hypothesize how to build a productive distance educational strategy in the conditions of the COVID-crisis, which specifically affected the quality of practical training of specialists in the field of audiovisual media and arts in Ukraine.
Article
Full-text available
How do politicians in postwar societies talk about the war past? How do they discursively represent vulnerable social groups created by the ended conflict? Does the nature of this representation depend on the politicians' ideology or their record of combat service? We answer these questions by pairing natural language processing tools and a large corpus of parliamentary debates with an extensive dataset of biographical information including detailed records of war service for all members of parliament during two recent terms in Croatia. We demonstrate not only that veteran politicians talk about war differently than their non-veteran counterparts, but also that the sentiment of war-related political discourse is highly dependent on the speakers' exposure to combat and ideological orientation. These results improve our understanding of the representational role played by combat veterans, as well as of the link between descriptive and substantive representation of vulnerable groups in postwar societies.
Article
In this study, we introduce Slovene web-crawled news corpora with sentiment annotation on three levels of granularity: sentence, paragraph and document levels. We describe the methodology and tools that were required for their construction. The corpora contain more than 250,000 documents with political, business, economic and financial content from five Slovene media resources on the web. More than 10,000 of them were manually annotated as negative, neutral or positive. All corpora are publicly available under a Creative Commons copyright license. We used the annotated documents to construct a Slovene sentiment lexicon, which is the first of its kind for Slovene, and to assess the sentiment classification approaches used. The constructed corpora were also utilised to monitor within-the-document sentiment dynamics, its changes over time and relations with news topics. We show that sentiment is, on average, more explicit at the beginning of documents, and it loses sharpness towards the end of documents.
Article
Full-text available
In this paper, we explore the utility of attitude types for improving question answering (QA) on both web-based discussions and news data. We present a set of attitude types developed with an eye toward QA and show that they can be reliably annotated. Using the attitude annotations, we develop automatic classifiers for recognizing two main types of attitudes: sentiment and arguing. Finally, we exploit information about the attitude types of questions and answers for improving opinion QA with promising results.
Conference Paper
Full-text available
We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
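
As an illustration of the model described above, a minimal LDA run using the gensim library (the library choice is an assumption for this sketch; the paper describes the model and its inference, not any particular implementation):

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized corpus; in practice this would be a large document collection.
docs = [["sentiment", "lexicon", "croatian"],
        ["topic", "model", "inference"],
        ["sentiment", "polarity", "word"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())   # per-topic word distributions
```
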
Conference Paper
Full-text available
We develop an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation (LDA). Online LDA is based on online stochastic optimization with a natural gradient step, which we show converges to a local optimum of the VB objective function. It can handily analyze massive document collections, including those arriving in a stream. We study the performance of online LDA in several ways, including by fitting a 100-topic topic model to 3.3M articles from Wikipedia in a single pass. We demonstrate that online LDA finds topic models as good or better than those found with batch VB, and in a fraction of the time.
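
gensim's LdaModel also supports the streaming regime described here; a minimal sketch (again, the library choice is an assumption):

```python
from gensim import corpora
from gensim.models import LdaModel

chunk1 = [["topic", "model", "stream"], ["online", "variational", "bayes"]]
dictionary = corpora.Dictionary(chunk1)

lda = LdaModel(id2word=dictionary, num_topics=2)   # no batch corpus at init
lda.update([dictionary.doc2bow(d) for d in chunk1])

# Later chunks arrive in a stream and are folded in incrementally.
chunk2 = [["online", "topic", "stream"]]
lda.update([dictionary.doc2bow(d) for d in chunk2])
```
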
Conference Paper
Full-text available
Lexical features are key to many approaches to sentiment analysis and opinion detection. A variety of representations have been used, including single words, multi-word Ngrams, phrases, and lexico-syntactic patterns. In this paper, we use a subsumption hierarchy to formally define different types of lexical features and their relationship to one another, both in terms of representational coverage and performance. We use the subsumption hierarchy in two ways: (1) as an analytic tool to automatically identify complex features that outperform simpler features, and (2) to reduce a feature set by removing unnecessary features. We show that reducing the feature set improves performance on three opinion classification tasks, especially when combined with traditional feature selection.
Conference Paper
Full-text available
This paper presents an application of PageRank, a random-walk model originally devised for ranking Web search results, to ranking WordNet synsets in terms of how strongly they possess a given semantic property. The semantic properties we use for exemplifying the approach are positivity and negativity, two properties of central importance in sentiment analysis. The idea derives from the observation that WordNet may be seen as a graph in which synsets are connected through the binary relation "a term belonging to synset sk occurs in the gloss of synset si", and on the hypothesis that this relation may be viewed as a transmitter of such semantic properties. The data for this relation can be obtained from eXtended WordNet, a publicly available sense-disambiguated version of WordNet. We argue that this relation is structurally akin to the relation between hyperlinked Web pages, and thus lends itself to PageRank analysis. We report experimental results supporting our intuitions.
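
A rough approximation of this setup, assuming NLTK's WordNet data and networkx: connect synset s_i to s_k when a lemma of s_k appears in the gloss of s_i (undisambiguated here, unlike the eXtended WordNet used in the paper), then run personalized PageRank from positive seeds.

```python
import networkx as nx
from nltk.corpus import wordnet as wn   # assumes NLTK WordNet data installed

G = nx.DiGraph()
synsets = list(wn.all_synsets("a"))[:2000]          # small slice, for speed only
by_lemma = {l.name().lower(): s.name() for s in synsets for l in s.lemmas()}
for s in synsets:
    for token in s.definition().lower().split():
        if token in by_lemma:                       # crude, undisambiguated link
            G.add_edge(s.name(), by_lemma[token])

pos_seeds = {n for n in G if n.split(".")[0] in {"good", "nice", "excellent"}}
if pos_seeds:
    positivity = nx.pagerank(G, personalization={n: 1.0 for n in pos_seeds})
```
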
Article
Full-text available
We present a lexicon-based approach to extracting sentiment from text. The Semantic Orientation CALculator (SO-CAL) uses dictionaries of words annotated with their semantic orientation (polarity and strength), and incorporates intensification and negation. SO-CAL is applied to the polarity classification task, the process of assigning a positive or negative label to a text that captures the text's opinion towards its main subject matter. We show that SO-CAL's performance is consistent across domains and in completely unseen data. Additionally, we describe the process of dictionary creation, and our use of Mechanical Turk to check dictionaries for consistency and reliability.
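
A toy SO-CAL-style calculator to make the mechanism concrete (the vocabulary, weights and the shift value are invented for illustration, not the real SO-CAL dictionaries): dictionary words carry polarity with strength, intensifiers scale the next polar word, and negation shifts its score toward the opposite pole.

```python
LEXICON = {"good": 3, "excellent": 5, "bad": -3, "horrid": -5}
INTENSIFIERS = {"very": 1.25, "slightly": 0.50}
NEGATORS = {"not", "never"}

def so_cal_score(tokens):
    score, scale, negated = 0.0, 1.0, False
    for tok in tokens:
        if tok in NEGATORS:
            negated = True
        elif tok in INTENSIFIERS:
            scale *= INTENSIFIERS[tok]
        elif tok in LEXICON:
            s = LEXICON[tok] * scale
            if negated:
                s -= 4 if s > 0 else -4   # shift negation rather than flip
            score += s
            scale, negated = 1.0, False   # modifiers apply to one word only
    return score

print(so_cal_score("not very good".split()))   # -> -0.25
```
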
Article
Full-text available
The evaluative character of a word is called its semantic orientation. Positive semantic orientation indicates praise (e.g., "honest", "intrepid") and negative semantic orientation indicates criticism (e.g., "disturbing", "superfluous"). Semantic orientation varies in both direction (positive or negative) and degree (mild to strong). An automated system for measuring semantic orientation would have application in text classification, text filtering, tracking opinions in online discussions, analysis of survey responses, and automated chat systems (chatbots). This article introduces a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words. Two instances of this approach are evaluated, based on two different statistical measures of word association: pointwise mutual information (PMI) and latent semantic analysis (LSA). The method is experimentally tested with 3,596 words (including adjectives, adverbs, nouns, and verbs) that have been manually labeled positive (1,614 words) and negative (1,982 words). The method attains an accuracy of 82.8%.
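
The PMI instance of this method can be sketched directly from co-occurrence counts (corpus and counting are assumed given): SO(w) is the summed PMI of w with the positive paradigm words minus the summed PMI with the negative ones.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information; zero co-occurrence mapped to 0.0
    as a smoothing choice for this sketch."""
    if count_xy == 0:
        return 0.0
    return math.log2((count_xy * total) / (count_x * count_y))

def so_pmi(word, pos_words, neg_words, cooc, counts, total):
    """cooc: {(w1, w2): co-occurrence count}; counts: {w: count}."""
    so = sum(pmi(cooc.get((word, p), 0), counts[word], counts[p], total)
             for p in pos_words)
    so -= sum(pmi(cooc.get((word, n), 0), counts[word], counts[n], total)
              for n in neg_words)
    return so
```
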
Conference Paper
Full-text available
It is a common practice that merchants selling products on the Web ask their customers to review the products and associated services. As e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. For a popular product, the number of reviews can be in hundreds. This makes it difficult for a potential customer to read them in order to make a decision on whether to buy the product. In this project, we aim to summarize all the customer reviews of a product. This summarization task is different from traditional text summarization because we are only interested in the specific features of the product that customers have opinions on and also whether the opinions are positive or negative. We do not summarize the reviews by selecting or rewriting a subset of the original sentences from the reviews to capture their main points as in the classic text summarization. In this paper, we only focus on mining opinion/product features that the reviewers have commented on. A number of techniques are presented to mine such features. Our experimental results show that these techniques are highly effective.
Article
Full-text available
Current WordNet-based measures of distance or similarity focus almost exclusively on WordNet's taxonomic relations. This effectively restricts their applicability to the syntactic categories of noun and verb. We investigate a graph-theoretic model of WordNet's most important relation---synonymy---and propose measures that determine the semantic orientation of adjectives for three factors of subjective meaning. Evaluation against human judgments shows the effectiveness of the resulting measures.
Article
Latent Semantic Analysis is a statistical approach introduced to improve information retrieval. It consists in reducing the dimensionality of the information retrieval problem in order to overcome the difficulties associated with synonymy and polysemy. Beyond its applications in information retrieval (filtering, multilingual retrieval), it has also been used in cognitive science for modelling human memory. A number of developments have concerned probabilistic models and computational aspects.
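
The core of LSA is a truncated SVD of the term-document matrix; a minimal sketch (with a random stand-in matrix) in which rows of U scaled by the singular values give low-dimensional word vectors where synonyms tend to move closer:

```python
import numpy as np

X = np.random.rand(100, 40)             # stand-in term-document count matrix
k = 10                                   # reduced dimensionality

U, S, Vt = np.linalg.svd(X, full_matrices=False)
word_vectors = U[:, :k] * S[:k]          # one k-dim vector per term (row)
doc_vectors = Vt[:k, :].T * S[:k]        # one k-dim vector per document
```
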
Chapter
WordNet is a large electronic lexical database for English (Miller 1995, Fellbaum 1998a). It originated in 1986 at Princeton University where it continues to be developed and maintained. George A. Miller, a psycholinguist, was inspired by experiments in Artificial Intelligence that tried to understand human semantic memory (e.g., Collins and Quillian 1969). Given the fact that speakers possess knowledge about tens of thousands of words and the concepts expressed by these words, it seemed reasonable to assume efficient and economic storage and access mechanisms for words and concepts. The Collins and Quillian model proposed a hierarchical structure of concepts, where more specific concepts inherit information from their superordinate, more general concepts; only knowledge particular to more specific concepts needs to be stored with such concepts. Thus, it took subjects longer to confirm a statement like "canaries have feathers" than the statement "birds have feathers" since, presumably, the property "has-feathers" is stored with the concept bird and not redundantly with the concept for each kind of bird.
Article
Due to natural language morphology, words can take on various morphological forms. Morphological normalisation – often used in information retrieval and text mining systems – conflates morphological variants of a word to a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. This approach is in between stemming and lemmatisation, and is suitable for morphological normalisation of inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of acquiring automatically an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.
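
A minimal sketch of lexicon-based inflectional normalisation as described: an inflectional lexicon maps each surface form to a representative form, and out-of-lexicon tokens fall back to the lowercased surface form (the toy Croatian entries are for illustration only).

```python
# Toy inflectional lexicon: surface form -> representative form.
LEXICON = {"knjige": "knjiga", "knjigama": "knjiga",   # 'book' inflections
           "gradovi": "grad", "gradovima": "grad"}     # 'city' inflections

def normalise(tokens):
    return [LEXICON.get(t.lower(), t.lower()) for t in tokens]

print(normalise(["Knjige", "gradovima"]))   # -> ['knjiga', 'grad']
```
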
Article
The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.
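
A compact power-iteration implementation of the rating scheme described: rank is repeatedly redistributed along links, mixed with a uniform "random surfer" jump taken with probability 1 - d (the adjacency matrix and damping factor here are illustrative).

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-9):
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    out[out == 0] = 1.0                  # dangling pages: avoid division by zero
    M = (adj / out).T                    # column-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    while True:
        r_new = (1 - d) / n + d * (M @ r)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

A = np.array([[0, 1, 1], [1, 0, 0], [0, 1, 0]], dtype=float)
print(pagerank(A))
```
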
Conference Paper
Many of the tasks required for semantic tagging of phrases and texts rely on a list of words annotated with some semantic features. We present a method for extracting sentiment-bearing adjectives from WordNet using the Sentiment Tag Extraction Program (STEP). We did 58 STEP runs on unique non-intersecting seed lists drawn from a manually annotated list of positive and negative adjectives and evaluated the results against other manually annotated lists. The 58 runs were then collapsed into a single set of 7,813 unique words. For each word we computed a Net Overlap Score by subtracting the total number of runs assigning this word a negative sentiment from the total of the runs that consider it positive. We demonstrate that the Net Overlap Score can be used as a measure of the word's degree of membership in the fuzzy category of sentiment: the core adjectives, which had the highest Net Overlap Scores, were identified most accurately both by STEP and by human annotators, while the words on the periphery of the category had the lowest scores and were associated with low rates of inter-annotator agreement.
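
The Net Overlap Score as defined above reduces to a simple count over the runs; a small sketch (the input format is an assumption):

```python
from collections import Counter

def net_overlap(run_labels):
    """run_labels: iterable of dicts {word: 'pos' | 'neg'}, one per run.
    Score = #runs labelling the word positive - #runs labelling it negative."""
    score = Counter()
    for labels in run_labels:
        for word, lab in labels.items():
            score[word] += 1 if lab == "pos" else -1
    return score

runs = [{"happy": "pos"}, {"happy": "pos", "odd": "neg"}, {"odd": "neg"}]
print(net_overlap(runs))   # happy: 2, odd: -2
```
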
Article
Sentiment analysis is concerned with the automatic extraction of sentiment-related information from text. Although most sentiment analysis addresses commercial tasks, such as extracting opinions from product reviews, there is increasing interest in the affective dimension of the social web, and Twitter in particular. Most sentiment analysis algorithms are not ideally suited to this task because they exploit indirect indicators of sentiment that can reflect genre or topic instead. Hence, such algorithms used to process social web texts can identify spurious sentiment patterns caused by topics rather than affective phenomena. This article assesses an improved version of the algorithm SentiStrength for sentiment strength detection across the social web that primarily uses direct indications of sentiment. The results from six diverse social web data sets (MySpace, Twitter, YouTube, Digg, RunnersWorld, BBCForums) indicate that SentiStrength 2 is successful in the sense of performing better than a baseline approach for all data sets in both supervised and unsupervised cases. SentiStrength is not always better than machine-learning approaches that exploit indirect indicators of sentiment, however, and is particularly weaker for positive sentiment in news-related discussions. Overall, the results suggest that, even unsupervised, SentiStrength is robust enough to be applied to a wide variety of different social web contexts.
Article
The contextual polarity of the phrase in which a particular instance of a word appears may be quite different from the word's prior polarity. Positive words are used in phrases expressing negative sentiments, or vice versa. Also, quite often words that are positive or negative out of context are neutral in context, meaning they are not even being used to express a sentiment. The goal of this work is to automatically distinguish between prior and contextual polarity, with a focus on understanding which features are important for this task. Because an important aspect of the problem is identifying when polar terms are being used in neutral contexts, features for distinguishing between neutral and polar instances are evaluated, as well as features for distinguishing between positive and negative contextual polarity. The evaluation includes assessing the performance of features across multiple machine learning algorithms. For all learning algorithms except one, the combination of all features together gives the best performance. Another facet of the evaluation considers how the presence of neutral instances affects the performance of features for distinguishing between positive and negative polarity. These experiments show that the presence of neutral instances greatly degrades the performance of these features, and that perhaps the best way to improve performance across all polarity classes is to improve the system's ability to identify when an instance is neutral.
Article
We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. However, the three machine learning methods we employed (Naive Bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. We conclude by examining factors that make the sentiment classification problem more challenging.
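
Reproducing this setup in spirit (scikit-learn is assumed here, not used in the original work): a bag-of-words representation feeding one of the classifiers the paper evaluates, multinomial naive Bayes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for movie reviews and their sentiment labels.
reviews = ["a moving, brilliant film", "tedious and poorly acted",
           "wonderful performances", "a dull, lifeless script"]
labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(reviews, labels)
print(clf.predict(["brilliant performances"]))   # -> ['pos']
```
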
Article
We identify and validate from a large corpus constraints from conjunctions on the positive or negative semantic orientation of the conjoined adjectives. A log-linear regression model uses these constraints to predict whether conjoined adjectives are of same or different orientations, achieving 82% accuracy in this task when each conjunction is considered independently.
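
A sketch of the idea only (features and data are invented): a log-linear model, here logistic regression over the conjunction type, predicting whether two conjoined adjectives share orientation; "and" tends to join same-orientation adjectives, "but" different ones.

```python
from sklearn.linear_model import LogisticRegression

# Each row: [is_and, is_but]; label: 1 = same orientation, 0 = different.
X = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [1, 0]]
y = [1, 1, 1, 0, 0, 1]

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[0, 1]]))   # P(same | "but") should be low
```
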