Recognition of emotions, valence and arousal in large-scale multi-domain text reviews


Recognition of emotions, valence and arousal
in large-scale multi-domain text reviews
Jan Kocoń, Arkadiusz Janz, Piotr Miłkowski, Monika Riegel, Małgorzata Wierzba,
Artur Marchewka, Agnieszka Czoska‡, Damian Grimling‡,
Barbara Konat‡¡, Konrad Juszczyk‡¡, Katarzyna Klessa¡, Maciej Piasecki
Wroclaw University of Science and Technology
Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
{jan.kocon, arkadiusz.janz, piotr.milkowski, maciej.piasecki}
Laboratory of Brain Imaging, Nencki Institute of Experimental Biology of Polish Academy of Sciences
Ludwika Pasteura 3, 02-093 Warszawa, Poland
{a.marchewka, m.riegel, m.wierzba}
‡W3A.PL Sp. z o.o.
Piątkowska 110A/1, 60-649 Poznań, Poland
{agnieszka, damian, barbara, konrad}
¡Adam Mickiewicz University, Faculty of Modern Languages and Literatures
al. Niepodległości 4, 61-874 Poznań, Poland
Abstract
In this article, we present a novel multidomain dataset of Polish text reviews. The data were annotated as part of a large study involving
over 20,000 participants. A total of 7,000 texts were described with metadata, each text received about 25 annotations concerning
polarity, arousal and eight basic emotions, marked on a multilevel scale. We present a preliminary approach to data labelling based on
the distribution of manual annotations and to the classification of labelled data using logistic regression and bi-directional long short-term
memory recurrent neural networks.
1. Introduction
Emotions are a crucial part of natural human commu-
nication, conveyed by both what we say and how we say it.
In this study, we focus on emotions attributed by Polish na-
tive speakers to written Polish texts. The results presented
in this paper combine machine learning with an empirical
approach to language and emotions expressed verbally.
Introduction of machine learning (ML) to the area of
text mining resulted in the rapid growth of the field in re-
cent years. However, an automatic emotion recognition
with Machine Learning remains a challenging task due to
the scarcity of high quality and large scale data sources.
Numerous approaches have been attempted to annotate words for their polarity and emotions in various languages (Riegel et al., 2015). Such datasets, however, are limited in size, typically consisting of several thousand words, while full lexicons are known to be much larger (the largest dictionary of English, the Oxford English Dictionary, contains around 600,000 words in its online edition). The limited size of the available annotated affective word lists constrains their usage in natural language processing.
In emotion research, words are usually characterised according to two dominant theoretical approaches to the nature of emotion: the dimensional account and the categorical account. According to the first account, proposed in (Russell and Mehrabian, 1977), each emotion state can be represented by its location in a multidimensional space, with valence or polarity (negativity/positivity) and arousal (low/high) explaining most of the observed variance. In the
competing account, several basic or elementary emotion
states are distinguished, with more complex, subtle emo-
tion states emerging as their combination. To categorise
emotions, semantic concepts drawn from natural language
are used, as corresponding to particular behavioural or
physiological response patterns. The concept of basic
emotions itself has been interpreted in various ways, and
thus different theories posit different numbers of categories
of emotion, with (Ekman, 1992) and (Plutchik, 1982) gain-
ing most recognition in the scientific community.
On the other hand, the most popular approach in natural
language processing, but also in applied usages of emotion
annotation is sentiment analysis which takes into account
only polarity (negativity/positivity). It is understandable
since the emotion annotation of textual data faces difficul-
ties in the two conventional approaches to annotation. In
the first approach, a small number of trained annotators (usually 2 to 5) is engaged, and the differences between their individual opinions, amplified by the multiple choices available (most commonly 6 or 8 emotions), may lead to poor inter-annotator agreement (Hripcsak and Rothschild, 2005). The other approach, based on crowd annotations on platforms such as Amazon Mechanical Turk (Paolacci and Chandler, 2014), leads to a similar problem: the inter-labeller variability of annotations is high, because such platforms are open to users of different nationalities, while only native speakers of a given language can distinguish the subtleties of emotional connotations.
In this study, we applied an approach that proved useful
in previous experiments (Riegel et al., 2015). Thus, our an-
notation schema follows the account of Russell and Mehrabian, as well as those proposed by Ekman or Plutchik. Fi-
nally, by combining simple annotation schema with crowd
annotation, we were able to effectively acquire a large
amount of data, while at the same time preserving the high
quality of the data. Sentiment analysis enhanced with eight
basic emotions leads to new possibilities of studying peo-
ple’s attitudes towards brands, products and their features,
political views, movie or music choices or financial deci-
sions, including stock exchange activity. Moreover, com-
paring the results of meaning and text ranking leads to
a better understanding of text processing, especially con-
structing the emotional meaning of texts by readers.
2. Data annotation
To create the Sentimenti database, a total of over 20,000 unique respondents (with an approximately equal number of male and female participants) were sampled from the Polish population (sex, age, native language, place of residence,
education level, marital status, employment status, politi-
cal beliefs and income were controlled, among other fac-
tors). To collect the data, a combined approach of different
methodologies was used, namely: Computer Assisted Per-
sonal Interview (CAPI) and Computer Assisted Web Inter-
view (CAWI).
The annotation schema was based on procedures most
widely used in previous studies aiming to create the first
datasets of Polish words annotated in terms of emotion
(NAWL, (Riegel et al., 2015); NAWL BE, (Wierzba et al., 2015); plWordNet-emo (Zaśko-Zielińska et al., 2015; Janz et al., 2017)). Thus, we collected extensive annotations of
valence (polarity), arousal, as well as eight emotion cate-
gories: joy, sadness, trust, disgust, fear, anger, surprise and anticipation.
The total number of over 30,000 word meanings from
Polish WordNet (Piasecki et al., 2009) was annotated, with
each meaning ranked at least 50 times on each scale. The
selection of word meanings was based on the results of
the plWordNet-emo (Zaśko-Zielińska et al., 2015) project,
in which linguists annotated over 87K lexical units with
over 178K annotations containing information about emo-
tions, valence (polarity) and valuations (statistics from
May 2019). At the time when the selection was made (July
2017) 84K annotations were covering 54K word meanings
and 41K synsets. We observed that 27% of all annotations
(23K) were not neutral. The number of synsets having lex-
ical units with polarity different than neutral was 9K. We
have adopted the following assumptions for the selection:
• word meanings that we know are not neutral are more important,
• the polarity sign of the synset is the polarity sign of the word meanings within the synset (valid in 96% of cases),
• the maximum number of selected word meanings from the same synset is 3,
• the degree of synsets (treated as nodes in the plWordNet graph) which are sources of selected word meanings should be in the range [3,6].
Word meanings were presented to respondents as colloca-
tions manually prepared by linguists.
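The selection assumptions above can be sketched as a simple filter. The data layout below (a synset dictionary with a polarity label, a graph degree and a list of member word meanings) is hypothetical and only illustrates the rules; it is not the project's actual data format:

```python
def select_word_meanings(synsets, max_per_synset=3, degree_range=(3, 6)):
    """Illustrative filter for the selection assumptions (hypothetical data layout)."""
    lo, hi = degree_range
    selected = []
    for synset in synsets:
        # prefer non-neutral synsets whose node degree in the wordnet graph is in [lo, hi]
        if synset["polarity"] == "neutral" or not (lo <= synset["degree"] <= hi):
            continue
        # take at most max_per_synset word meanings from the same synset
        selected.extend(synset["meanings"][:max_per_synset])
    return selected
```

A synset with four non-neutral meanings thus contributes only its first three, and neutral or weakly connected synsets are skipped entirely.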
Moreover, in a follow-up study, a total number of over
7,000 texts (short phrases or paragraphs of text) were an-
notated in the same way, with each text assessed at least
25 times on each scale. Before attempting the assess-
ment task, subjects were instructed to rank word mean-
ings rather than words, as well as encouraged to indicate
their immediate, spontaneous reactions. Participants had
unlimited time to complete the task, and they were able to
quit the assessment session at any time and resume their
work later on. The final collection of texts for emotive an-
notation was acquired from Web reviews of two distinct
domains: medicine (2,000 reviews) and hotels (2,000 reviews). Due to the scarcity of neutral reviews in these data sources, we decided to acquire yet another sample from potentially neutral Web sources thematically consistent with the selected domains, i.e. medical information sources (500 paragraphs) and hotel industry news (500 paragraphs). The phrases for annotation were extracted using
lexico-semantic-syntactic patterns (LSS) manually created
by linguists to capture one of the four effects affecting sen-
timent: increase, decrease, transition, drift. Most of these
phrases belong to previously mentioned thematic domains.
The source for the remaining phrases were Polish WordNet
glosses and usage examples (Piasecki et al., 2009).
3. Data transformation
We decided to carry out the recognition of specific di-
mensions as a classification task. Eight basic emotions
were annotated by respondents on an integer scale from the range [0,4], and the same scale was also used for the arousal dimension. For the valence dimension, an integer scale from the range [−3,3] was proposed to obtain a clearer gradation of effect size. We divided the valence scores into
two groups: positive (valence_p) and negative (valence_n).
This division results from the fact that there were texts
that received scores from both polarities. We wanted to
keep that distribution (see Algorithm 1). For the rest of
dimensions, we assigned the average value of all scores
(normalised to the range [0,1]) to the text.
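The labelling described above can be sketched in a few lines of Python. The function names are ours, and the scores are assumed to be plain integer lists:

```python
def valence_pn(scores, m=3):
    # Algorithm 1: split the valence scores of one review into averaged
    # positive and negative parts, each normalised by |V| * m
    p = sum(v for v in scores if v > 0)
    n = sum(-v for v in scores if v < 0)
    total = len(scores) * m
    return p / total, n / total

def dimension_score(scores, scale_max=4):
    # remaining dimensions: mean of all scores, normalised to [0, 1]
    return sum(scores) / (len(scores) * scale_max)
```

For a review with valence scores [3, -3, 2, 0], both the positive and the negative component are non-zero, which is exactly the bipolar distribution the split is meant to preserve.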
3.1. Scores distribution
As a part of this study, a collection of 7004 texts was
annotated. To investigate the underlying empirical distri-
bution of emotive scores we analysed our data concerning
each dimension separately. We performed two statistical
tests to verify the multimodality of scores distribution in
our sample for each dimension.

Algorithm 1 Estimating the average value of positive and negative valence for a single review.
Require: V: list of all valence scores;
  m = 3: the maximum absolute value of polarity;
Ensure: pair (p, n), where p is the average positive valence and n is the average negative valence.
1: (p, n) = (0, 0)
2: for v ∈ V do
3:   if v < 0 then n = n + |v| else p = p + v
4: return (p ÷ (|V| · m), n ÷ (|V| · m))

The main purpose of this analysis was to identify whether there exists a specific decision
boundary splitting our data into distinct clusters, to sep-
arate the examples sharing the same property (e.g. posi-
tive texts) from the examples that do not share this prop-
erty (e.g. non-positive texts). The first test was Hartigans' dip test, which uses the maximum difference, over all averaged scores, between the empirical distribution function and the unimodal distribution function that minimises this maximum difference (Hartigan et al., 1985). The null hypothesis is unimodality, with a multimodal alternative. The second one is Silverman's mode estimation test
which uses kernel density estimation methods to examine
the number of modes in a sample (Silverman, 1981). If the
null hypothesis of unimodality (k= 1) was rejected, we
also tested if there are two modes (k= 2) or more (Neville
and Brownstein, 2018). We used the locmodes R package to apply statistical testing (Ameijeiras-Alonso et al., 2016) with Hartigans' and Silverman's tests on our annotation data. For all dimensions we could not reject the null hypothesis of bimodality, and only in 2 cases (arousal, disgust) could we not reject the null hypothesis of unimodality by the results of both tests (see Table 1).
Dimension SI_mod1 SI_mod2 HH
valence_n 0.000 0.812 0.000
valence_p 0.000 0.460 0.000
arousal 0.340 0.606 0.118
joy 0.000 0.842 0.000
sadness 0.000 0.424 0.000
fear 0.892 0.674 0.032
disgust 0.784 0.500 0.178
surprise 0.288 0.360 0.000
anticipation 0.522 0.321 0.000
trust 0.034 0.736 0.226
anger 0.000 0.630 0.000
Table 1: p–values for Silverman’s test with k= 1
(SI_mod1), k= 2 (SI_mod2) and Hartigans’ dip test (HH).
The distributions of averaged scores for all texts are
presented in Figure 1. We decided to partition all scores for
each dimension into two clusters using k-means cluster-
ing (Hartigan and Wong, 1979). Clusters are represented
in Figure 1 with different colours. We assign a label (cor-
responding to the dimension) if the score for the dimension
is higher than the threshold determined by k-means. Each
review may be described with multiple labels.
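The per-dimension thresholding can be sketched with a minimal one-dimensional 2-means, where the midpoint between the two final centroids serves as the labelling threshold (our own toy implementation, not the exact clustering code used in the study):

```python
def two_means_threshold(values, iters=100):
    # 1-D k-means with k = 2, initialised at the extremes;
    # the midpoint between the converged centroids is the threshold
    c1, c2 = min(values), max(values)
    for _ in range(iters):
        low = [v for v in values if abs(v - c1) <= abs(v - c2)]
        high = [v for v in values if abs(v - c1) > abs(v - c2)]
        if not low or not high:
            break
        c1, c2 = sum(low) / len(low), sum(high) / len(high)
    return (c1 + c2) / 2
```

A text then receives the label for a dimension whenever its averaged score exceeds this threshold, so one review can carry several labels at once.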
Figure 1: Distribution of avg. scores for all dimensions.
4. Experiments
In our experimental part, we decided to use a popular baseline model based on the fastText algorithm (Bojanowski
et al., 2017; Joulin et al., 2017) as a reference method for
the evaluation. FastText’s supervised models were used in
many NLP tasks, especially in the area of sentiment analy-
sis, e.g. for hate speech detection (Badjatiya et al., 2017),
emotion and sarcasm recognition (Felbo et al., 2017) or
aspect-based sentiment analysis in social media (Wojatzki
et al., 2017). The unsupervised fastText models were
also used to prepare word embeddings of Polish (see Sec-
tion 4.1.). In our experiments, we used supervised fastText
models as a simple multi-label text classifier for sentiment
and emotion recognition. We used a one-versus-all cross-entropy loss and 250 training epochs, with KGR10 pre-trained word vectors (Kocoń and Gawor, 2019), described in Section 4.1., for all evaluation cases.
In recent years deep neural networks have begun
to dominate natural language processing (NLP) field.
The most popular solutions incorporate bidirectional long
short-term memory neural networks (henceforth BiL-
STM). BiLSTM-based approaches were mainly applied in
the information extraction area, e.g. in the task of proper
names recognition, where the models are often combined
with conditional random fields (CRF) to impose additional
constraints on sequences of tags as presented in (Habibi
et al., 2017).
LSTM networks have proved to be very effective in
sentiment analysis, especially for the task of polarity de-
tection (Wang et al., 2016; Baziotis et al., 2017; Ma
et al., 2018). In this study, we decided to adopt the
multi-labelled BiLSTM networks and expand our research
to the more challenging task of emotion detection. As
an input for the BiLSTM networks we used pre-trained fastText embeddings trained on the KGR10 corpus (Kocoń and Gawor, 2019). The parameters used for the training procedure were as follows: MAX_WORDS=128 (94% of reviews have 128 words or fewer) and HIDDEN_UNITS=1024.
4.1. Word embeddings
The most popular text representations in recent ma-
chine learning solutions are based on word embeddings.
Dense vector space representations follow the distribu-
tional hypothesis that the words with similar meaning tend
to appear in similar contexts. Word embeddings capture
the similarity between words and are often used as an in-
put for the first layer of deep learning models. Contin-
uous Bag-of-Words (CBOW) and Skip-gram (SG) models
are the most common methods proposed to generate dis-
tributed representations of words embedded in a continu-
ous vector space (Mikolov et al., 2013).
With the progress of machine learning methods, it is
possible to train such models on larger data sets, and these
models often outperform the simple ones. It is possible
to use a set of text documents containing even billions of
words as training data. Both architectures (CBOW and SG)
describe how the neural network learns the vector repre-
sentations for each word. In CBOW architecture the task
is predicting the word given its context, and in SG the task
is predicting the context given the word.
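The difference between the two objectives is visible in how training pairs are generated. The toy sketch below produces (target, context) pairs for the Skip-gram objective; it is only an illustration, not fastText's actual implementation, which additionally uses subword n-grams:

```python
def skipgram_pairs(tokens, window=2):
    # Skip-gram: predict each context word from the target word.
    # CBOW would instead group the whole context to predict the target.
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs
```

Each pair becomes one training example for the embedding network, so frequent co-occurrence in similar windows pulls word vectors together.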
Numerous methods have been developed to prepare
vector space representations of words, phrases, sentences
or even full texts. The quality of vector space models
depends on the quality and the size of the training cor-
pora used to prepare the embeddings. Hence, there is a
strong need for proper evaluation metrics, both intrinsic
and extrinsic (task-based evaluation), to evaluate the qual-
ity of vector space representations including word embed-
dings (Schnabel et al., 2015), (Piasecki et al., 2018). Pre-
trained word embeddings built on various corpora are al-
ready available for many languages, including the most representative group of models built for the English language (Kutuzov et al., 2017).
In (Kocoń and Gawor, 2019) we introduced multiple variants of word embeddings for Polish built on the KGR10 corpora. We used the implementations of the CBOW and Skip-gram methods provided with the fastText tool (Bojanowski et al., 2017). These models are available under an open license in the CLARIN-PL project repository. With these embeddings, we obtained favourable results in two NLP tasks: recognition of temporal expressions (Kocoń and Gawor, 2019) and recognition of named entities (Marcińczuk et al., 2018). For this reason, the same model of word embeddings, EC1 (Kocoń and Gawor, 2019), was used in this work.
4.2. Evaluation procedure
We prepared three evaluation scenarios to test the per-
formance of fastText and BiLSTM baseline models. The
most straightforward scenario is a single domain setting
(SD) where the classifier is trained and tested on the data
representing the same thematic domain. In a more realistic
scenario, the thematic domain of training data differs from
the application domain. This means that there may exist
a discrepancy between feature spaces of training and test-
ing data which leads to a significant decrease of classifier’s
performance in the application domain. To test the clas-
sifier’s ability to bridge the gap between source and target
domains we propose a second evaluation scenario called 1-
Domain-Out (DO). This scenario is closely related to the
task of unsupervised domain adaptation (UDA), where we
focus on transferring the knowledge from labelled training
data to unlabelled testing data. The last evaluation scenario
is a multidomain setting where we merge all available la-
belled data representing different thematic domains into a
single training dataset (MD).
• Single Domain, SD – train/dev/test sets are from the same domain (3 settings, metric: F1-score).
• 1-Domain-Out, DO – train/dev sets are from two domains, the test set is from the third domain (3 settings, metric: F1-score).
• Mixed Domains, MD – train/dev/test sets are randomly selected from all domains (1 setting, metrics: precision, recall, F1-score, AUC_ROC).
We prepared seven evaluation settings with a different
domain-based split of the initial set of texts. The final di-
vision is presented in Table 2.
Type Setting Train Dev Test SUM
SD Hotels 2504 313 313 3130
SD Medicine 2352 293 293 2938
SD Other 750 93 93 936
DO Hotels-Other 3660 406 - 4066
DO Hotels-Medicine 5462 606 - 6068
DO Medicine-Other 3487 387 - 3874
MD All 5604 700 700 7004
Table 2: The number of texts in the evaluation settings.
To tune our baseline methods we decided to use a dev
set. We calculated the optimal decision threshold for each
dimension using receiver operating characteristic (ROC)
curve, taking the threshold which produces the point on
ROC closest to (FPR,TPR) = (0,1).
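This threshold selection can be sketched as follows (a minimal implementation of our own; a real pipeline would typically use a library ROC routine):

```python
def roc_optimal_threshold(scores, labels):
    # choose the decision threshold whose ROC point (FPR, TPR)
    # lies closest to the ideal corner (0, 1)
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_d = None, float("inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        d = (fp / neg) ** 2 + (1 - tp / pos) ** 2  # squared distance to (0, 1)
        if d < best_d:
            best_t, best_d = t, d
    return best_t
```

The threshold is tuned once per dimension on the dev set and then applied unchanged to the test set.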
5. Results
Table 3 shows the results for SD evaluation. There are
11 results for each of the 3 domains. BiLSTM classifier
outperformed FastText in 27 out of 33 cases. Table 4 shows
the results for DO evaluation. Here BiLSTM classifier pro-
vided better quality for 31 out of 33 cases. The last MD
evaluation results are in Table 5 (P, R, F1-score) and Fig-
ure 2 (ROC). BiLSTM outperformed FastText in 31 out of
36 cases (Table 5). ROC_AUC is the same for both clas-
sifiers in 4 cases (2 of them are micro and macro-average
ROC). For the rest of the curves, BiLSTM outperformed
FastText in 7 out of 9 cases. The most interesting phe-
nomenon can be observed in Table 4 where the differences
are the greatest. This may indicate that the deep neural network was able to capture domain-independent features (pivots), which is an important ability for domain adaptation.
Figure 2: ROC curves for FastText and BiLSTM classifiers.
6. Conclusions
In this preliminary study, we focused on basic neu-
ral language models to prepare and evaluate baseline ap-
proaches to recognise emotions, valence and arousal in
multi-domain textual reviews. Further plans include the
evaluation of hybrid approaches combining machine learn-
ing approaches and lexico-syntactic rules augmented with
semantic analysis of word meanings. We also plan to au-
tomatically expand the annotations of word meanings to
the rest of lexical units within plWordNet using the propagation methods presented in (Kocoń et al., 2018a; Kocoń et al., 2018b). We intend to test other promising methods later, such as Google BERT (Devlin et al., 2018), OpenAI GPT-2 (Radford et al., 2019) and domain dictionary construction methods utilising WordNet (Kocoń and Marcińczuk, 2016).
Automatic emotion annotation has both scientific and
applied value. Modern business is interested in the opin-
ions, emotions and values associated with brands and prod-
ucts. Retailers and merchants collect vast amounts of customer feedback and rumours, both gathered in-store and posted online. What is more, public relations departments monitor the impact of their campaigns and need to know whether it was positive and touching for customers. In this context,
the results of monitoring opinions, reactions, and emotions
present great value, because they fuel decisions and be-
haviour (Tversky and Kahneman, 1989). However, most
of the existing solutions are still limited to manual annota-
tion and simplified methods of analysis.
The large database built in the Sentimenti project cov-
ers a wide range of Polish vocabulary and introduces an
extensive emotive annotation of word meanings in terms
of their polarity, basic emotions and affective arousal. The
results of such research can be used in several applications
– media monitoring, chatbots, stock prices forecasting,
search engine optimisation for advertisements and other
types of content. In the last decades, the development of
Internet services gave us an unprecedented amount of data,
resulting in the big data revolution (Kitchin, 2014). This
also includes the textual data coming directly from social
media and other sources.
We also provide a preliminary overview of ML meth-
ods for automatic analysis of people’s opinions in terms
of expressed emotions and their attitudes. Since the par-
ticipants of our CAPI and CAWI studies represent a wide
cross-section of population we can adapt our methods to
specific target groups of people. This introduces the much
needed human aspect to artificial intelligence and machine
learning in natural language processing.
7. Data availability
Due to the commercial nature of the Sentimenti project,
it is planned to make 10% of the project data available
soon. The data will be published at
We will consider making more data accessible in the future.

Acknowledgements

Co-financed by the Polish Ministry of Education and Science, CLARIN-PL Project and by the National Centre for Research and Development, Poland, grant no POIR.01.01.01-00-0472/16 – Sentimenti (http://w3a.
8. References
Ameijeiras-Alonso, Jose, Rosa M Crujeiras, and Alberto
Rodríguez-Casal, 2016. Mode testing, critical band-
width and excess mass. TEST:1–20.
1. Hotels FastText 90.53 88.43 66.67 89.08 62.63 77.91 83.41 86.04 88.33 65.81 81.86
BiLSTM 89.74 89.54 67.66 86.84 46.62 82.11 80.83 88.46 89.54 63.53 82.76
2. Medicine FastText 75.37 56.18 61.54 75.00 62.00 75.49 74.14 64.32 59.09 45.90 73.20
BiLSTM 82.18 82.40 65.31 84.15 64.38 80.31 82.47 86.33 85.23 83.04 74.04
3. Other FastText 66.67 66.67 62.34 62.86 48.57 51.52 45.28 77.27 48.15 45.28 46.51
BiLSTM 80.52 75.95 65.17 80.49 33.90 64.71 70.37 79.52 65.52 68.66 62.75
Table 3: F1-scores for Single Domain evaluation. (Train, Dev, Test) sets for settings are the same as in Table 2, rows 1-3.
4. Hotels-Other
vs Medicine
FastText 61.44 72.79 63.08 61.73 59.03 58.10 65.54 75.27 71.97 71.33 63.20
BiLSTM 74.56 76.61 66.00 71.25 62.62 70.32 67.52 80.40 73.97 74.03 69.80
5. Hotels-Medicine
vs Other
FastText 61.05 39.29 37.50 65.96 20.51 45.95 42.42 25.45 05.71 17.65 48.65
BiLSTM 73.17 56.34 35.29 75.00 51.52 60.53 56.67 61.90 43.48 57.69 48.39
6. Medicine-Other
vs Hotels
FastText 73.93 78.26 35.18 71.86 56.32 73.25 73.45 72.96 76.60 50.96 71.21
BiLSTM 88.89 87.07 51.88 87.07 62.07 84.76 82.79 86.14 87.14 63.44 82.57
Table 4: F1-scores for 1-Domain-Out evaluation. (Train/Dev, Test) sets (see Table 2) for these settings are: 4. (Hotels-
Other.Train/Dev, Medicine.Test), 5. (Hotels-Medicine.Train/Dev, Other.Test), 6. (Medicine-Other.Train/Dev, Hotels.Test).
Dim. FastText (P R F1) BiLSTM (P R F1)
Valence_p 73.41 77.41 75.36 77.61 84.10 80.72
Valence_n 75.79 87.00 81.01 81.31 89.53 85.22
Arousal 67.48 69.16 68.31 67.09 66.04 66.56
Joy 70.61 81.14 75.51 77.51 84.65 80.92
Surprise 65.07 64.31 64.69 67.67 59.88 63.54
Anticip. 72.28 77.66 74.78 79.66 81.91 80.77
Trust 65.32 79.02 71.52 73.91 82.93 78.16
Sadness 81.73 82.55 82.14 83.88 85.57 84.72
Anger 80.92 78.52 79.70 82.03 89.63 85.66
Fear 69.20 77.78 73.24 68.84 81.20 74.51
Disgust 66.80 77.73 71.85 71.71 84.09 77.41
Avg. 71.69 77.48 74.38 75.57 80.87 78.02
Table 5: Precision, recall and F1-score for Mixed Domains evaluation.
Badjatiya, Pinkesh, Shashank Gupta, Manish Gupta, and
Vasudeva Varma, 2017. Deep learning for hate speech
detection in tweets. In Proceedings of the 26th Inter-
national Conference on World Wide Web Companion.
International World Wide Web Conferences Steering
Baziotis, Christos, Nikos Pelekis, and Christos Doulk-
eridis, 2017. Datastories at semeval-2017 task 4: Deep
lstm with attention for message-level and topic-based
sentiment analysis. In Proceedings of the 11th Inter-
national Workshop on Semantic Evaluation (SemEval-
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and
Tomas Mikolov, 2017. Enriching word vectors with
subword information. Transactions of the Association
for Computational Linguistics, 5:135–146.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova, 2018. Bert: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv
preprint arXiv:1810.04805.
Ekman, Paul, 1992. An argument for basic emotions.
Cognition and Emotion, 6(3-4):169–200.
Felbo, Bjarke, Alan Mislove, Anders Søgaard, Iyad Rah-
wan, and Sune Lehmann, 2017. Using millions of emoji
occurrences to learn any-domain representations for de-
tecting sentiment, emotion and sarcasm. In Proceedings
of the 2017 Conference on Empirical Methods in Natu-
ral Language Processing.
Habibi, Maryam, Leon Weber, Mariana Neves, David Luis
Wiegandt, and Ulf Leser, 2017. Deep learning with
word embeddings improves biomedical named entity
recognition. Bioinformatics, 33(14):i37–i48.
Hartigan, John A, Pamela M Hartigan, et al., 1985. The dip
test of unimodality. The annals of Statistics, 13(1):70–
Hartigan, John A and Manchek A Wong, 1979. Algorithm
as 136: A k-means clustering algorithm. Journal of the
Royal Statistical Society. Series C (Applied Statistics),
Hripcsak, George and Adam S. Rothschild, 2005. Techni-
cal Brief: Agreement, the F-Measure, and Reliability in
Information Retrieval. JAMIA, 12(3):296–298.
Janz, Arkadiusz, Jan Kocoń, Maciej Piasecki, and Monika Zaśko-Zielińska, 2017. plWordNet as a Basis for Large Emotive Lexicons of Polish. In LTC'17 8th Language and Technology Conference. Poznań, Poland: Fundacja Uniwersytetu im. Adama Mickiewicza w Poznaniu.
Joulin, Armand, Edouard Grave, Piotr Bojanowski, and
Tomas Mikolov, 2017. Bag of tricks for efficient text
classification. In Proceedings of the 15th Conference
of the European Chapter of the Association for Compu-
tational Linguistics: Volume 2, Short Papers. Valencia,
Spain: Association for Computational Linguistics.
Kitchin, Rob, 2014. The data revolution: Big data,
open data, data infrastructures and their consequences.
Kocoń, Jan, Arkadiusz Janz, and Maciej Piasecki, 2018a.
Classifier-based Polarity Propagation in a Wordnet. In
Proceedings of the 11th International Conference on
Language Resources and Evaluation (LREC’18).
Kocoń, Jan, Arkadiusz Janz, and Maciej Piasecki, 2018b.
Context-sensitive Sentiment Propagation in WordNet.
In Proceedings of the 9th International Global Wordnet
Conference (GWC’18).
Kocoń, Jan and Michał Gawor, 2019. Evaluating
KGR10 Polish word embeddings in the recognition
of temporal expressions using BiLSTM-CRF. CoRR,
Kocoń, Jan and Michał Marcińczuk, 2016. Generating of
Events Dictionaries from Polish WordNet for the Recog-
nition of Events in Polish Documents. In Text, Speech
and Dialogue, Proceedings of the 19th International
Conference TSD 2016, volume 9924 of Lecture Notes in
Artificial Intelligence. Brno, Czech Republic: Springer.
Kutuzov, Andrei, Murhaf Fares, Stephan Oepen, and Erik
Velldal, 2017. Word vectors, reuse, and replicability:
Towards a community repository of large-text resources.
In Proceedings of the 58th Conference on Simulation
and Modelling. Linköping University Electronic Press.
Ma, Yukun, Haiyun Peng, and Erik Cambria, 2018. Tar-
geted aspect-based sentiment analysis via embedding
commonsense knowledge into an attentive lstm. In
Thirty-Second AAAI Conference on Artificial Intelli-
Marcińczuk, Michał, Jan Kocoń, and Michał Gawor, 2018.
Recognition of Named Entities for Polish-Comparison
of Deep Learning and Conditional Random Fields Ap-
proaches. In Proceedings of PolEval 2018 Workshop.
Warsaw, Poland: Institute of Computer Science, Polish
Academy of Sciences.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Cor-
rado, and Jeff Dean, 2013. Distributed representations
of words and phrases and their compositionality. In Ad-
vances in neural information processing systems.
Neville, Zachariah and Naomi C Brownstein, 2018.
Macros to conduct tests of multimodality in SAS.
Journal of Statistical Computation and Simulation,
Paolacci, Gabriele and Jesse Chandler, 2014. Inside the
turk: Understanding mechanical turk as a participant
pool. Current Directions in Psychological Science,
Piasecki, Maciej, Bernd Broda, and Stanislaw Szpakow-
icz, 2009. A wordnet from the ground up. Oficyna
Wydawnicza Politechniki Wrocławskiej Wrocław.
Piasecki, Maciej, Gabriela Czachor, Arkadiusz Janz, Do-
minik Kaszewski, and Paweł K˛edzia, 2018. Wordnet-
based Evaluation of Large Distributional Models for
Polish. In Proceedings of the 9th Global WordNet Con-
ference (GWC 2018).
Plutchik, Robert, 1982. A psychoevolutionary theory of
emotions. Social Science Information, 21(4-5):529–
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever, 2019. Language
models are unsupervised multitask learners. OpenAI
Riegel, Monika, Małgorzata Wierzba, Marek Wypych,
Łukasz Żurawski, Katarzyna Jednoróg, Anna
Grabowska, and Artur Marchewka, 2015. Nencki
Affective Word List (NAWL): the cultural adaptation of
the Berlin Affective Word List–Reloaded (BAWL-R) for
Polish. Behavior Research Methods, 47(4):1222–1236.
Russell, James A. and Albert Mehrabian, 1977. Evidence for a three-factor theory of emotions. Journal of Research in Personality, 11(3):273–294.
Schnabel, Tobias, Igor Labutov, David M. Mimno, and Thorsten Joachims, 2015. Evaluation methods for unsupervised word embeddings. In EMNLP.
Silverman, Bernard W., 1981. Using kernel density estimates to investigate multimodality. Journal of the Royal Statistical Society: Series B (Methodological).
Tversky, Amos and Daniel Kahneman, 1989. Rational choice and the framing of decisions. In Multiple Criteria Decision Making and Risk Analysis Using Microcomputers. Springer, pages 81–126.
Wang, Yequan, Minlie Huang, Li Zhao, et al., 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
Wierzba, M., M. Riegel, M. Wypych, K. Jednoróg, P. Turnau, A. Grabowska, and A. Marchewka, 2015. Basic emotions in the Nencki Affective Word List (NAWL BE): New method of classifying emotional stimuli. PLoS ONE, 10(7).
Wojatzki, Michael, Eugen Ruppert, Sarah Holschneider, Torsten Zesch, and Chris Biemann, 2017. GermEval 2017: Shared task on aspect-based sentiment in social media customer feedback. In Proceedings of the GermEval 2017 Workshop.
Zaśko-Zielińska, Monika, Maciej Piasecki, and Stan Szpakowicz, 2015. A large wordnet-based sentiment lexicon for Polish. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2015).