Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 980–991, Hong Kong, China, November 3–4, 2019. ©2019 Association for Computational Linguistics
Multi-Level Sentiment Analysis of PolEmo 2.0:
Extended Corpus of Multi-Domain Consumer Reviews
Jan Kocoń
Wrocław University of Science and Technology
Wrocław, Poland

Piotr Miłkowski
Wrocław University of Science and Technology
Wrocław, Poland

Monika Zaśko-Zielińska
University of Wrocław, Institute of Polish Studies
Wrocław, Poland
Abstract

In this article we present an extended version of PolEmo – a corpus of consumer reviews from 4 domains: medicine, hotels, products and school. The current version (PolEmo 2.0) contains 8,216 reviews comprising 57,466 sentences. Each text and sentence was manually annotated with sentiment in a 2+1 scheme, which gives a total of 197,046 annotations. We obtained a high value of Positive Specific Agreement: 0.91 for texts and 0.88 for sentences. PolEmo 2.0 is publicly available under a Creative Commons license. We explored recent deep learning approaches to sentiment recognition, such as Bi-directional Long Short-Term Memory (BiLSTM) and Bidirectional Encoder Representations from Transformers (BERT).
1 Introduction
In recent years, we have observed a growing interest in methods of effective sentiment analysis, especially in subjective, opinion-forming online texts. This trend is perfectly illustrated by Figure 1, which compares the popularity of two terms: customer feedback and sentiment analysis. A very dynamic growth has been observed since 2010, which correlates with the increase in scientific research in this area. Many studies focus on the perception of emotion and sentiment in text messages and, for example, their impact on election results (Ramteke et al., 2016), the prediction of future events (Zhang and Skiena, 2010) and security issues around the world (Subramaniyaswamy et al., 2017; Al-Rowaily et al., 2015). Automatic sentiment analysis systems have proven to be effective in analyzing many different types of text data, such as emails, blogs, news, tweets and books (Medhat et al., 2014).

Figure 1: Google Trends data showing interest over time for the search terms "customer feedback" and "sentiment analysis". On the vertical axis, 100 means the greatest search-term popularity.

The introduction of advanced computational techniques (machine learning, deep learning) in natural language processing has resulted in a significant increase in sentiment analysis techniques (Zhang et al., 2018). For some languages this increase is effectively limited by the lack of good-quality resources for this task, especially in the form of manually annotated corpora (Balahur and Turchi, 2012; Dashtipour et al., 2016).
Analysis of the existing language resources in the area of sentiment analysis shows that they largely concern the English language (Dashtipour et al., 2016). However, there is a clear growing interest in other languages, often much more complex than English (e.g. Slavic languages, with their loose syntax and rich inflection), and new resources become available for them, e.g. Slovene (Bučar et al., 2018), Czech (Habernal and Brychcín, 2013) or Russian (Rogers et al., 2018). Due to the small number of available corpora manually annotated with sentiment for the Polish language, we decided that the construction of the PolEmo resource would be a valuable contribution to the collection of publicly available resources for sentiment analysis and may in the future provide a basis for the creation of shared tasks in which the recognition of sentiment for the Polish language will also be included. Both for the construction of the corpus and for further research, we used the experience from the work on the manual annotation of the Polish WordNet – plWordNet 4.0 Emo (Janz et al., 2017; Kocoń et al., 2018a,b) – as a result of which the sentiment metadata of more than 55,000 lexical units were described.
The main objectives of the article are to present:
- the current state of resources related to the analysis of sentiment for the Polish language;
- the method of selecting data for the PolEmo 2.0 corpus, the annotation method, the annotation results and the analysis of annotation errors;
- the results of research related to the automatic analysis of sentiment, with particular emphasis on the importance of the text domain in this topic.
The key contributions of these studies include:
- a detailed description of the procedure of building PolEmo 2.0: a manually annotated corpus of consumer reviews from 4 domains (medicine, school, hotels, products) at 2 levels of sentiment granularity (document, sentence);
- a detailed analysis of the manual annotation with regard to frequently occurring errors;
- the development of methods based on deep learning (BiLSTM, BERT), adapted to the PolEmo 2.0 corpus, also using a sentiment lexicon generated from plWordNet 4.0 Emo;
- tests on sets prepared for the analysis of the quality of the methods: (1) evaluated on texts within a given domain, (2) evaluated on texts from various domains, (3) trained on texts that do not include a given domain and tested on that domain;
- a comparison of deep learning methods with classic methods (Logistic Regression), especially in the context of the ability to generalise the problem of recognising sentiment and providing a semantic representation that is as independent of the domain as possible;
- making the PolEmo 2.0 corpus available under an open license.
2 Related Work
There are several well-known resources annotated with sentiment for English, e.g. MPQA 3.0 (Deng and Wiebe, 2015), the Stanford Sentiment Treebank (Socher et al., 2013), Amazon Product Data (He and McAuley, 2016), the Pros and Cons Dataset (Ganapathibhotla and Liu, 2008), corpora developed within the Semantic Evaluation workshops (Nakov et al., 2016; Pontiki et al., 2016), SentiWordNet (Baccianella et al., 2010) and the Opinion Lexicon (Hu and Liu, 2004). There are also different approaches and tools used for multilingual sentiment analysis (Lo et al., 2017) which are based on transformations of the existing resources. In this section we focus on the resources prepared directly for Polish.
2.1 Polish Sentiment Corpora
There are corpora for the Polish language that can be used for automatic sentiment analysis. One of them is a corpus prepared for the sentiment recognition shared task within the PolEval2017 workshop (Wawer and Ogrodniczuk, 2017). The corpus contains 1,550 sentences annotated at the level of phrases determined by a dependency parser. The sentences came from consumer reviews and covered 3 domains: perfume, clothing and other. Each node of the dependency tree received one of three sentiment annotations: -1 (negative), 0 (neutral), 1 (positive). Most of the systems participating in the PolEval2017 competition used Tree LSTM adapted to dependency trees, including the best system, which reached an accuracy of 79% on this data.
Another resource is the HateSpeech corpus, containing 2,000 posts crawled from the public Polish web. These texts were annotated for hate speech. The annotator team reached an agreement score of Krippendorff's α = 0.6 (Krippendorff, 2018). An SVM model trained on a subset of 1,500 texts (containing equal amounts of hate speech and non-hate speech) obtained a precision of 0.8 (Troszyński and Wawer, 2017).
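Krippendorff's α, used above as the agreement score, is defined for nominal labels as 1 - D_o/D_e, computed from a coincidence matrix over annotator pairs. A minimal stdlib-only sketch (our own illustration, not the corpus authors' code):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of label tuples, one per annotated item, each holding
    the labels assigned by the (>= 2) annotators of that item.
    """
    coincidences = Counter()              # o_ck: coincidence matrix entries
    for labels in units:
        m = len(labels)                   # annotators for this unit
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    n_c = Counter()                       # marginal totals per category
    for (a, _b), w in coincidences.items():
        n_c[a] += w
    n = sum(n_c.values())                 # total number of pairable values
    # observed and expected disagreement (nominal delta: 1 iff c != k)
    d_o = sum(w for (a, b), w in coincidences.items() if a != b) / n
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n * (n - 1))
    return 1.0 - d_o / d_e

# Perfect agreement between two annotators yields alpha = 1.0
print(krippendorff_alpha_nominal([("hate", "hate"), ("ok", "ok")]))  # 1.0
```

Partial agreement lowers the score toward (and below) zero as disagreement approaches chance level.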
Another interesting resource is the Polish Corpus of Suicide Notes (PCSN) (Zaśko-Zielińska, 2013). The PCSN is one of very few such resources in the world. It includes 1,244 genuine SNs that have been scanned and manually transcribed. Each SN was linguistically annotated on several levels, including selected semantic and pragmatic phenomena (Zaśko-Zielińska, 2013). The annotation is stored in a TEI-based format (Marcińczuk et al., 2011) with a corrected version in a separate layer. The PCSN also includes a subcorpus of 334 counterfeited (elicited) SNs. They were created by volunteers who were asked to imitate a real SN for an imaginary person whose characteristics had been provided at the beginning of the experiment. Most volunteers were told that the notes written by them would be used to deceive a computer program. Due to the sensitive nature of the texts and the legal obligations of the author, the corpus is not publicly available. In the experiment described in (Piasecki et al., 2017) we collected 3,200 texts from the Internet as examples of non-letters. Using an SVM with a rich set of features, we obtained an F1-score of 90.06% in the task of distinguishing between genuine SNs, counterfeited SNs and non-letters.
2.2 Polish Sentiment Lexicons
One of the largest Polish sentiment lexical resources in terms of the number of annotations is plWordNet 4.0 Emo (Janz et al., 2017; Kocoń et al., 2018a). This dataset is available under the WordNet 3.0 license. It was built within the CLARIN-PL project (Piasecki, 2014). The manual annotation is done at the level of lexical units (Zaśko-Zielińska et al., 2015). The available values for polarity are: strong negative, weak negative, neutral, weak positive, strong positive and ambiguous. One annotator could assign only one of these values to a single lexical unit. There are more than 83,000 annotations covering more than 54,000 lexical units and 41,000 synsets (Kocoń et al., 2018b). About 22,000 of the polarity annotations are different from neutral, and these annotations cover 13,000 lexical units and 9,000 synsets (22% of all synsets containing annotated units). plWordNet 4.0 Emo is used in the research presented in this article as a knowledge base for the sentiment recognition task.
Another lexicon is the Nencki Affective Word List (NAWL) (Wierzba et al., 2015; Riegel et al., 2015). It is a database of Polish words suitable for studying various aspects of language and emotions. 2,902 Polish words from the NAWL were presented to 265 subjects, who were instructed to rate them according to the intensity of each of the five basic emotions: happiness, anger, sadness, fear and disgust.
The next resource is called the Polish Sentiment Dictionary (Wawer, 2012; Wawer and Rogozinska, 2012). It contains 3,704 words with sentiment scores computed using the supervised methods presented in (Wawer and Rogozinska, 2012).
Recently, a new resource has appeared in the Sentimenti project, containing a large database of annotated lexical units and annotated texts. Details are described in Section 2.3.
2.3 Sentimenti Project
This year, the first results of the Sentimenti project (Kocoń et al., 2019a) were published. The project was aimed at creating methods of analysing texts written on the Internet in terms of the emotions aroused in the recipients of the analysed content. A large database has been created, in which 30,000 lexical units from the plWordNet database (Piasecki et al., 2014) and 7,000 texts were annotated. Most of the texts were consumer reviews from the domains of hotels and medicine. The elements were annotated by 20,000 unique Polish respondents in a Computer Assisted Web Interview survey, and more than 50 marks were obtained for each element. Within each mark, the polarisation of the element, stimulation and the basic emotions aroused in the recipients are determined. The total number of manual annotations is 3,742,611 for texts and 19,141,041 for lexical units. The first results concerning the automatic recognition of polarity and emotions for this set are presented in (Kocoń et al., 2019a), and the propagation of this annotation with the use of Heterogeneous Structured Synset Embeddings is presented in (Kocoń et al., 2019b). Due to the commercial nature of the Sentimenti project, it is planned to make only 20% of the project data publicly available soon. The data will be published at the main project's site.
The Sentimenti project has interested both the scientific community and business. Within the CLARIN-PL project, we decided that in addition to a large annotated plWordNet lexicon, there should also be a large corpus annotated with sentiment, available under an open license. In the next part we present the work related to the preparation of PolEmo.
3 PolEmo Sentiment Corpus
3.1 Motivation
Linguistic research on sentiment recognition involves two approaches: (1) bottom-up, from the perspective of analysing the occurrence of emotional words, and (2) top-down, from the perspective of the entire document. The first approach is usually a consequence of the creation of a sentiment lexicon, e.g. the manual annotation of WordNet (Baccianella et al., 2010). The second results from the analysis of specific text content, in which we see that the sentiment of a word or phrase changes under the influence of the surrounding context (Taboada et al., 2008). This change may vary depending on the domain of the text.
A discourse perspective in sentiment analysis is an attempt to address the limitations of bottom-up methods (e.g. problems with negation, or a focus on adjectives). It used the findings of Rhetorical Structure Theory (Mann and Thompson, 1988). This approach bears in mind local and global orientation in the text, discourse structure and topicality (Taboada et al., 2008). It allows the researcher to extract the most important sentences from the text in the perspective of the entire discourse context: the nucleus-satellite method (Wang et al., 2012). The relevance of the sentences is evaluated in relation to the main topic, and the analysis omits some less important parts of the text.
There are interesting articles focused on domain-oriented sentiment analysis (Kanayama and Nasukawa, 2006), where a system is trained on labeled reviews from one source domain but is meant to be deployed on another (Glorot et al., 2011). The latter article describes research carried out on the Amazon Product Data (He and McAuley, 2016). The ratings were assigned to reviews by the authors of the reviews. Moreover, the ratings were applied to the entire text. Our idea was to obtain a set of reviews that would be rated by the recipients, not by the authors of the content. The annotation should take into account not only the level of the entire review, but also the level of the individual sentences of the review. Additionally, this dataset was supposed to be a multi-domain one, to evaluate potential knowledge transfer across domains.

ID  Name      Author   Subject
H   hotels    visitor  hotel
M   medicine  patient  doctor
S   school    student  teacher
P   products  buyer    product

Table 1: Each review is described by its domain ID and domain Name, with the given Source of the review, the Author type and the general Subject of the review.
3.2 Dataset
In the initial part of the work, presented in (Kocoń et al., 2019), we chose online customer reviews from the four domains presented in Table 1. At the beginning of our work we had only 1,000 texts for each of the following domains: school, products, medicine. In the case of product reviews, we also had metadata from the reviewer indicating how many stars they assigned to a specific review (from 1 to 5, where 5 means the most positive review). We used this information to select the reviews for the corpus, adding 200 reviews from each star category.
On the basis of a preliminary analysis of several dozen example opinions, we came to the conclusion that neutral examples are very difficult to find in the case of reviews. In the meantime, the corpus was extended by 8,000 texts from the category medicine and 17,000 texts from the category hotels, also with a uniform distribution over the star categories available in the source data (also 1 to 5). In order to capture the phenomenon of neutral text, we decided to add 2,000 new texts to each of the last two domains (medicine, hotels). These texts were fragments of articles from information portals on the hotel industry and health.
In Section 3.3 we present how the genre structure of a customer review affects the text's sentiment polarity. It is an enhancement of the discourse perspective in sentiment analysis.
3.3 Pilot Annotation
Our CLARIN-PL pilot study on sentiment analysis of customer reviews was conducted in 2018. The initial part of the analysis included 3,000 reviews. Each text was manually annotated by two annotators, a psychologist and a linguist, who worked according to the general guidelines. The annotation tool used for this task was Inforex (Marcińczuk et al., 2012; Marcińczuk and Oleksy, 2019) – a web-based system for text corpora management, annotation and analysis, available as an open source project. In the pilot project, we decided to deal with the sentiment annotation of the entire text. There was also an attempt to manually extract descriptions of particular aspects of the review. In both annotation cases we used the same tag system that is used in plWordNet Emo for lexical units: [+m] (strong positive), [+s] (weak positive), [-m] (strong negative), [-s] (weak negative), [amb] (ambiguous). We assumed that reviews are always characterised by a certain polarity, which is why we did not use the [0] (neutral) tag in the pilot annotation.
In the process of annotation we focused mainly on the strategic places of the text. In a consumer review these are the opening and closing sentences, i.e. the text frame. The opening sentences contain the general opinion of the author about the subject of the review, and the closing sentences contain the author's recommendation for the review's recipients. The annotators developed their first overall rating based on these two segments. Within the text, review authors changed their opinions only subtly. Regardless of the modification of the main opinion in the text, we did not use the [amb] tag when the frame of the text was clearly positive or negative. The polarity of the text frame was influenced not only by the lexical content, but also by non-verbal elements: emoticons or the multiplication of punctuation marks, e.g. exclamation marks.
The annotators were also asked to distinguish those parts of the text that are placed in one sentence but relate to different aspects (e.g. the teacher's appearance or teaching skills). This task turned out to be very difficult, especially in specifying, even with the help of guidelines, how to precisely mark the boundaries of a given aspect in the text. The Positive Specific Agreement (Hripcsak and Rothschild, 2005) between the annotators in the task of annotating the boundaries of aspects was below 0.15. The concept of annotation was radically changed, as presented in Section 3.4.
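Positive Specific Agreement for a single category is 2a / (2a + b + c), where a counts items both annotators assigned to that category and b, c count items only the first or only the second annotator did (Hripcsak and Rothschild, 2005). A minimal sketch of the metric (our own illustration, not the project's code):

```python
def positive_specific_agreement(ann1, ann2, category):
    """PSA for one category: 2a / (2a + b + c), where a is the number of
    items both annotators assigned to `category`, and b, c are the items
    only the first or only the second annotator assigned to it."""
    a = sum(1 for x, y in zip(ann1, ann2) if x == category and y == category)
    b = sum(1 for x, y in zip(ann1, ann2) if x == category and y != category)
    c = sum(1 for x, y in zip(ann1, ann2) if x != category and y == category)
    return 2 * a / (2 * a + b + c)

# Two annotators label three items; PSA for the "pos" category:
first = ["pos", "pos", "neg"]
second = ["pos", "pos", "pos"]
print(positive_specific_agreement(first, second, "pos"))  # 0.8
```

Unlike raw accuracy, PSA ignores items neither annotator assigned to the category, which matters for sparse span annotations such as aspect boundaries.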
3.4 PolEmo Annotation Guidelines
In the main stage of the project we decided to annotate the sentiment of the whole text (a meta level) and at the sentence level. We assumed that this strategy would allow us to establish an acceptable value of PSA, because the division of the text into sentences was determined by the MACA (Morphological Analysis Converter and Aggregator) tool (Radziszewski and Śniatowski, 2011), so the task was limited only to annotating the sentiment of each sentence. We followed the rule that the meta annotation results partially from the sentence annotations; however, the frame polarity is the main factor for the final meta annotation. We prepared the following annotation tags, regardless of whether the entire text or a sentence is annotated:
- SP – entirely positive;
- WP – generally positive, but there are some negative aspects within the review;
- 0 – neutral;
- WN – generally negative, but there are some positive aspects within the review;
- SN – entirely negative;
- AMB – there are both positive and negative aspects in the text that are balanced in terms of relevance.
This time we used the [0] (neutral) tag because in the main stage of the project we extended the corpus with the neutral texts presented in Section 3.2. Also, reviews that are not neutral often contain neutral sentences.
We tested the new guidelines on a subset of 50 documents, achieving a PSA of 80% for the meta level and 78% for the sentence level. In the second iteration of improving the annotation guidelines, the values were 87% (meta) and 85% (sentence). In the last iteration of improving the guidelines, the annotators reached a PSA of 90% (meta) and 87% (sentence).
3.5 PolEmo 2.0 Annotation Analysis
We decided to publish the first results of the research on the PolEmo 1.0 corpus when the number of annotated reviews reached 8,462 and the number of annotated sentences was 35,724 (Kocoń et al., 2019). Due to the fact that PolEmo 2.0 contains only those annotated elements that received 2 annotations from linguists and were agreed by the super-annotator, this time we provide 8,216 reviews and 57,466 sentences. In Section 5 we present Table 7 with the final distribution of annotations and Table 6 with the number of elements in each domain (evaluation data splits). In this section we focus on annotation agreement and annotation errors.
L  D  SP     WP     0      WN     SN     AMB    All
T  H  91.91  36.29  99.41  39.38  91.61  40.11  79.73
T  M  94.09  26.42  99.05  22.37  96.28  42.46  89.52
T  P  94.06  23.33  100.0  47.62  85.95  33.68  78.76
T  S  87.50  20.00  00.00  36.07  92.52  54.19  77.03
T  A  92.87  32.20  99.18  37.10  93.48  41.86  83.41
S  H  93.78  00.00  88.40  00.20  93.05  33.94  85.39
S  M  90.43  28.75  91.84  26.58  93.43  39.04  88.83
S  P  91.27  01.20  48.42  06.90  90.84  30.50  76.82
S  S  79.21  00.00  26.56  02.76  81.39  33.73  60.78
S  A  91.93  11.94  87.21  07.24  92.12  33.86  84.56

Table 2: Positive Specific Agreement per sentiment tag for annotations obtained at the level (L) of text (T) and sentence (S) for each domain (D): hotels (H), medicine (M), products (P), school (S) and all (A).
Table 2 presents the PSA values obtained at the level of text and sentence for all domains. The overall PSA value for texts is 83.41% and for sentences 84.56%. It is worth noting that for the domains to which we did not add neutral texts (products, school), there are practically no neutral annotations at the text level (see Table 7). The highest values are obtained for the most obvious categories (SP, SN and 0), regardless of the level of text description. For the remaining categories the PSA value is lower than 40.00% in most cases.
D  A/WP   SN/WN  SP/WP  A/WN   A/SN   A/SP   R      A/WP/WN
H  28.55  22.07  18.33  17.08  07.86  03.12  02.99  47.63
M  18.66  26.24  14.29  17.49  12.24  04.37  06.71  37.32
P  28.16  24.27  13.59  19.42  10.68  02.91  00.97  48.54
S  36.21  07.76  28.45  10.34  06.03  08.62  02.59  49.14
A  26.69  22.07  17.82  16.79  09.02  03.89  03.74  45.23

Table 3: Distribution (%) of disagreements between annotators at the text level. A – AMB tag; A/WP – one annotator assigned [AMB], the other [WP]. R is the rest of the rarely occurring combinations. A/WP/WN is the sum of A/WP, A/WN and WN/WP.
Table 3 presents the distribution of disagreements between annotators at the text level. The most common disagreement is within the pair of tags [AMB/WP]. Nearly half of the disagreements involve some pair of the AMB, WP and WN tags. This suggests that annotators, despite the guidelines, have difficulty judging the relevance of aspects regardless of the domain, or that it is a very subjective task.
D  SN/0   A/SN   A/0    A/WP   SP/0   A/SP   SP/WP  R      A/WP/WN
H  10.52  14.29  05.65  19.80  09.42  07.88  09.31  04.30  30.40
M  34.66  08.10  05.02  04.98  15.93  03.32  06.68  04.35  11.62
P  07.84  21.08  33.57  06.21  05.57  09.00  05.17  02.15  09.93
S  04.63  13.90  26.59  08.66  06.45  20.44  12.19  02.01  12.49
A  16.22  13.80  13.23  11.69  10.20  08.20  08.08  18.58  19.07

Table 4: Distribution (%) of disagreements between annotators at the sentence level. A – AMB tag; A/WP – one annotator assigned [AMB], the other [WP]. R is the rest of the rarely occurring combinations. A/WP/WN is the sum of A/WP, A/WN and WN/WP.
Table 4 presents the distribution of disagreements between annotators at the sentence level. The most common disagreement is within the pair of tags [SN/0]. This time the disagreements among the A/WP/WN tags amount to less than 20%. Most of the errors are related to neutral sentence marking. The analysis of specific cases and a discussion with the linguists showed that in the task of annotating sentences it is difficult to isolate a sentence from its context, and sometimes the annotation of the next sentence was a consequence of the sentiment of the previous sentence.
We found that it is difficult to decide on the relevance of the aspects, and without creating a hierarchy of aspect relevance for a given domain it will be hard to achieve better agreement for the WP/WN/AMB tags. Because mistakes often fall within these tags, we have combined them into one AMB tag. PolEmo 2.0 will also be available with the original tags, but research (Kocoń et al., 2019) has shown that machine learning methods achieve an F-score for the WP/WN/AMB classes no higher than the PSA. The evaluation data in this research has the WP/WN/AMB tags merged into one AMB tag. Table 5 presents the PSA values after the merging step. The total PSA increased from 83% to 91% for texts and from 85% to 88% for sentences.
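The merging step described above reduces the six-tag scheme to four labels; a trivial sketch of the mapping (our own illustration):

```python
# Merge the intermediate tags WP, WN and AMB into a single AMB label,
# leaving SP, SN and 0 untouched (the 6 -> 4 label reduction).
MERGE = {"SP": "SP", "WP": "AMB", "0": "0", "WN": "AMB", "SN": "SN", "AMB": "AMB"}

def merge_labels(tags):
    """Map a sequence of original PolEmo tags to the merged 4-label scheme."""
    return [MERGE[t] for t in tags]

print(merge_labels(["SP", "WP", "0", "WN", "SN", "AMB"]))
# ['SP', 'AMB', '0', 'AMB', 'SN', 'AMB']
```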
4 Multi-Level Sentiment Recognition
Recently, deep neural networks have shown relatively good performance among all available methods of processing such information (Glorot et al., 2011). The possibility of retrieving data from different sources such as social networks (Pak and Paroubek, 2010), publicly available discussion boards or
L  D  SP     0      AMB    SN     All
T  H  91.92  99.42  78.50  91.62  89.39
T  M  94.09  99.05  70.25  96.28  93.43
T  P  94.06  100.0  77.82  85.95  89.07
T  S  87.50  00.00  80.78  92.52  88.32
T  A  92.87  99.18  76.87  93.48  90.91
S  H  93.78  88.40  65.64  93.05  89.83
S  M  90.43  91.84  59.40  93.43  90.13
S  P  91.27  48.42  41.22  90.84  79.12
S  S  79.21  26.56  45.48  81.39  65.68
S  A  91.92  87.21  56.82  92.12  87.50

Table 5: Positive Specific Agreement for annotations with WP/WN/AMB merged into one AMB tag, obtained at the level (L) of text (T) and sentence (S) for each domain (D): hotels (H), medicine (M), products (P), school (S) and all (A).
marketing platforms, combined with proper annotations on a training data set, can provide not only a simple positive, negative or neutral classification but also lead to accurate fine-grained sentiment prediction (Guzman and Maalej, 2014).
We selected the same classifiers for the recognition tasks as in (Kocoń et al., 2019): (1) Logistic Regression as a fastText recognition model (Joulin et al., 2017) with KGR10 word embeddings (Kocoń and Gawor, 2018), providing a baseline for text classification; (2) BiLSTM (Zhou et al., 2016) in two variants: KGR10 embeddings as features only, and KGR10 embeddings extended with general polarity information from the sentiment dictionary described in (Kocoń et al., 2019); (3) BERT (Devlin et al., 2018) with an additional sequence classification layer.
We changed the BiLSTM and BERT architectures. In the case of BiLSTM, instead of a fixed input length we changed the model to work with text of any length. The input tensor shape is (None, 300) for the embedding-only variant (BiLSTM) and (None, 306) for the embedding+dictionary variant (BiLSTMd). We changed the shape of the second Gaussian noise layer to (None, 300)/(None, 306), respectively. The next layers remain the same, i.e. (1) a BiLSTM layer with 1024 hidden units, (2) a dropout layer (ratio 0.2). The last dense layer changed due to the reduction of sentiment labels from 6 to 4 by the label merging process described in Section 3.5. For BERT we used the same architecture as in (Kocoń et al., 2019) for whole texts, but we changed it for sentences. We reduced the maximum sequence length from 512 to 64 (which covers more than 99% of sentences) and we increased the batch size from 32 to 128.
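The composition of the 306-dimensional BiLSTMd input is not broken down in the text; assuming it is the 300-dimensional KGR10 word embedding concatenated with a 6-dimensional one-hot over the six plWordNet 4.0 Emo polarity values (an assumption on our part, since 306 = 300 + 6), a per-token feature builder could look like:

```python
# Hypothetical feature builder: a 300-dim word embedding concatenated with a
# 6-dim one-hot over the plWordNet 4.0 Emo polarity values. The 306 = 300 + 6
# decomposition is our assumption, not stated explicitly in the paper.
POLARITIES = ["strong_negative", "weak_negative", "neutral",
              "weak_positive", "strong_positive", "ambiguous"]

def token_features(embedding, polarity):
    """Concatenate a word embedding with a one-hot polarity vector."""
    assert len(embedding) == 300
    one_hot = [1.0 if p == polarity else 0.0 for p in POLARITIES]
    return list(embedding) + one_hot

features = token_features([0.0] * 300, "weak_positive")
print(len(features))  # 306
```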
5 Evaluation
As in (Kocoń et al., 2019a; Kocoń et al., 2019), we prepared three variants of evaluation of the sentiment classification methods:
- SD (Single Domain) – evaluation sets created using elements from the same domain;
- DO (Domain Out) – train/dev sets created using elements from 3 domains, test set from the remaining domain. This variant allows us to evaluate the ability of the classification method to capture domain-independent sentiment features;
- MD (Mixed Domains) – SD train/dev/test sets joined respectively. This variant allows us to examine the ability of the classifier to generalise the task of sentiment analysis to all available domains.
We use the abbreviations SDT, DOT and MDT for text evaluation types and SDS, DOS and MDS for sentence evaluation types. We also use domain prefixes (Hotels, Medicine, School, Products) as suffixes for the SD* and DO* variants; e.g. SDS-H is a Single Domain evaluation performed on Sentences within the Hotels domain, whereas DOT-M is a Domain-Out evaluation performed on Texts, trained on texts outside the Medicine domain and tested on texts from that domain.
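The three split schemes can be sketched as follows (an illustration with hypothetical per-domain splits, not the released data loader):

```python
# Build SD / DO / MD train-test splits from per-domain data.
# `data` maps a domain name to a (train, test) pair of example lists.
def make_splits(data, held_out):
    """Return (single-domain, domain-out, mixed-domain) splits.

    SD: train and test within `held_out` only.
    DO: train on all other domains, test on `held_out`.
    MD: train and test pooled over all domains.
    """
    sd = (data[held_out][0], data[held_out][1])
    do_train = [x for d, (tr, _te) in data.items() if d != held_out for x in tr]
    do = (do_train, data[held_out][1])
    md = ([x for tr, _te in data.values() for x in tr],
          [x for _tr, te in data.values() for x in te])
    return sd, do, md

data = {"hotels": (["h1", "h2"], ["h3"]), "school": (["s1"], ["s2"])}
sd, do, md = make_splits(data, "hotels")
print(do)  # (['s1'], ['h3'])
```

The dev sets are handled analogously; in the DO variant the held-out domain contributes no training or dev material at all.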
Table 6 shows the number of texts and sentences annotated by linguists for all evaluation types, with a division into the number of elements within the training, validation and test sets. The distribution of labels for each domain (both texts and sentences) is presented in Table 7.
6 Results
Table 8 presents the values of the F1-score for each label, the global F1-score, micro-AUC and macro-AUC for all evaluation types related to the texts. In the case of evaluation for a single domain for each label, fastText (using Logistic Regression) outperformed the other classifiers in 16 out of 28 distinguishable cases. The worst results are obtained for the ambiguous cases, but in 9 out of 13 cases the F1-score is higher than 0.5, and this result is much better than that obtained for the intermediate labels (weak positive and weak negative) presented in (Kocoń et al., 2019).

Type  Domain     Train  Dev   Test  SUM
SDT   Hotels     3165   396   395   3956
SDT   Medicine   2618   327   327   3272
SDT   Products   387    49    48    484
SDT   School     403    50    51    504
DOT   !Hotels    3408   427   -     3835
DOT   !Medicine  3955   496   -     4451
DOT   !Products  6186   774   -     6960
DOT   !School    6170   772   -     6942
MDT   All        6573   823   820   8216
SDS   Hotels     19881  2485  2485  24851
SDS   Medicine   18126  2265  2266  22657
SDS   Products   5942   743   742   7427
SDS   School     2025   253   253   2531
DOS   !Hotels    26093  3262  -     29355
DOS   !Medicine  27848  3481  -     31329
DOS   !Products  40032  5004  -     45036
DOS   !School    43949  5494  -     49443
MDS   All        45974  5745  5747  57466

Table 6: The number of texts/sentences for each evaluation type in the train/dev/test sets.

The BERT classifier performs much better (14 out of 28 cases) in the domain-out knowledge transfer evaluation (DOT). For this evaluation type fastText was better only 4 times. These observations are consistent with the results of (Kocoń et al., 2019a) for the valence dimensions.
7 Conclusions
BERT's performance falls below the expectations for
such an advanced method in the case of whole-text
classification. Looking at both tables (8 and
9), BERT's results are the best in 64 out of 182
label-specific cases. BiLSTM outperformed the other
methods in 48 cases. Adding an external senti-
ment dictionary helped in 40 label-specific cases.
Overall, BiLSTM performance is better in 88 out of
182 cases. BERT dominance (when distinguish-
ing between BiLSTM and BiLSTMd) is observed
in the DOT and all sentence cases. The MDT case is the
most promising in terms of the further use of the
recognition method in applications such as brand
monitoring or early crisis detection. The values of
the general F1, micro AUC and macro AUC are
the highest for the BiLSTM variants (see Tables 8 and 9).
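The label-specific and averaged scores discussed above follow the standard definitions: F1 is computed per class, the macro score averages the per-class values, and for single-label classification the micro-averaged F1 coincides with accuracy. A small sketch with hypothetical predictions over the four PolEmo labels (not actual system output):

```python
import numpy as np

# Per-label F1, macro F1 (mean over labels), and micro F1 (== accuracy
# for single-label classification). Predictions are illustrative only.
labels = ["SP", "AMB", "0", "SN"]
y_true = np.array([0, 0, 1, 2, 3, 3, 3, 1])
y_pred = np.array([0, 1, 1, 2, 3, 3, 0, 1])

def f1(c):
    tp = np.sum((y_pred == c) & (y_true == c))
    fp = np.sum((y_pred == c) & (y_true != c))
    fn = np.sum((y_pred != c) & (y_true == c))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

per_label = {lab: f1(c) for c, lab in enumerate(labels)}
macro_f1 = sum(per_label.values()) / len(labels)
micro_f1 = float(np.mean(y_true == y_pred))  # micro F1 == accuracy here
print(per_label, round(macro_f1, 3), micro_f1)
```

The gap between the micro and macro columns in Tables 8 and 9 reflects exactly this averaging difference: rare labels such as AMB pull the macro score down without affecting the micro score much.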
We published PolEmo 2.0 in the CLARIN-PL
DSpace repository under the Creative Commons
4.0 License. We also intend to test the contex-
tualized embeddings that we are currently building
using the ELMo deep word representation
method (Peters et al., 2018), with the use of the
large KGR10 corpus presented in (Kocoń
et al., 2019a). We also want to train the basic
BERT model with the use of KGR10 to investi-
gate whether it will improve the quality of senti-
ment recognition. It would also be very interesting to use
the propagation of sentiment annotation in Word-
Net (Kocoń et al., 2018a,b) to increase the cover-
age of the sentiment dictionary and to potentially
improve the recognition quality as well. This ob-
jective can also be pursued with other complex methods,
such as OpenAI GPT-2 (Radford et al., 2019), and
with domain dictionary construction methods utilis-
ing WordNet (Kocoń and Marcińczuk, 2016).

Type  Domain     SP     AMB      0     SN
SDT   Hotels    25.61  24.29  10.77  39.33
      Medicine  29.37  09.57  24.11  36.95
      Products  11.16  27.48  00.41  60.95
      School    51.39  38.29  00.00  10.32
      All       27.84  19.47  14.81  37.88
SDS   Hotels    29.55  12.26  17.05  41.15
      Medicine  23.18  06.26  39.48  31.08
      Products  24.61  19.86  09.36  46.17
      School    35.56  37.38  08.89  18.17
      All       26.67  11.98  24.54  36.81

Table 7: The distribution (%) of annotations in a
given domain for the following sets: SDT – single do-
main texts (100%=8216), SDS – single domain sentences
(100%=57466).
References
Khalid Al-Rowaily, Muhammad Abulaish, Nur Al-
Hasan Haldar, and Majed Al-Rubaian. 2015. Bisal –
a bilingual sentiment analysis lexicon to analyze
dark web forums for cyber security. Digital Inves-
tigation, 14:53–62.
Stefano Baccianella, Andrea Esuli, and Fabrizio Sebas-
tiani. 2010. Sentiwordnet 3.0: an enhanced lexical
resource for sentiment analysis and opinion mining.
In LREC, volume 10, pages 2200–2204.
Alexandra Balahur and Marco Turchi. 2012. Mul-
tilingual sentiment analysis using machine transla-
tion? In Proceedings of the 3rd workshop in com-
putational approaches to subjectivity and sentiment
analysis, pages 52–60. Association for Computa-
tional Linguistics.
Jože Bučar, Martin Žnidaršič, and Janez Povh. 2018.
Annotated news corpora and a lexicon for sentiment
analysis in slovene. Language Resources and Evaluation,
52(3):895–919.
T  C   SP     AMB    0      SN     F1     micro  macro

   1   83.58  55.56  98.80  85.47  80.25  94.28  73.96
   2   87.31  64.24  97.56  88.44  84.05  95.44  75.87
   3   84.69  67.39  96.30  89.97  84.05  96.71  76.77
   4   83.50  59.88  93.83  86.90  81.01  95.62  74.94

   1   82.83  36.84  98.65  81.48  83.18  95.62  73.35
   2   78.35  18.18  96.60  78.29  77.37  92.99  70.83
   3   75.13  15.87  94.67  76.19  74.31  91.92  70.13
   4   80.75  00.00  97.30  85.61  83.79  96.37  74.29

   1   40.00  54.55  00.00  85.29  75.00  93.09  63.65
   2   00.00  00.00  00.00  82.93  70.83  87.49  35.50
   3   00.00  08.70  00.00  67.65  50.00  77.82  44.43
   4   00.00  00.00  00.00  84.34  72.92  89.21  39.81

   1   81.36  66.67  00.00  50.00  74.00  84.27  59.85
   2   65.31  60.47  00.00  25.00  60.00  76.92  56.23
   3   72.73  57.89  00.00  28.57  64.00  76.12  53.97
   4   71.79  00.00  00.00  00.00  56.00  79.48  51.02

   1   77.63  41.77  90.48  80.85  73.16  90.39  71.30
   2   74.37  25.00  85.71  73.28  66.08  85.96  67.75
   3   82.52  52.69  86.42  82.14  76.46  92.77  73.17
   4   83.84  47.27  85.71  83.43  76.20  94.15  73.46

   1   76.40  20.00  81.89  78.26  74.01  89.54  66.99
   2   73.81  20.62  88.89  76.38  70.03  88.34  68.92
   3   73.14  23.08  88.41  78.33  72.48  91.71  70.94
   4   78.11  23.30  92.20  78.84  72.78  90.81  71.01

   1   50.00  57.14  00.00  78.69  68.75  90.27  72.90
   2   66.67  55.17  00.00  75.86  66.67  88.90  74.73
   3   50.00  64.29  00.00  85.25  75.00  93.76  72.04
   4   40.00  52.17  40.00  82.54  70.83  90.65  72.06

   1   72.73  59.26  00.00  33.33  60.00  76.97  60.24
   2   73.47  56.25  00.00  26.67  58.00  82.03  59.79
   3   78.43  23.08  00.00  26.67  50.00  76.92  58.62
   4   80.00  52.94  00.00  28.57  62.00  83.71  58.89

   1   82.20  53.64  95.73  84.06  80.37  93.69  73.61
   2   87.22  61.92  95.20  88.17  84.39  96.41  76.44
   3   84.33  55.63  94.37  86.61  81.71  95.19  75.36
   4   85.40  56.75  96.07  85.97  82.07  96.72  76.43

   1   84.42  54.44  98.80  84.37  79.49  93.46  73.62
   2   86.73  65.14  95.00  89.09  83.80  96.06  76.33
   3   85.00  58.33  96.30  86.80  81.27  95.24  75.44
   4   85.86  63.58  95.00  87.91  82.78  96.82  76.52

   1   81.82  30.00  96.60  83.27  82.57  95.21  73.30
   2   88.32  36.36  95.95  87.55  86.24  97.16  75.92
   3   84.38  32.14  96.55  88.12  84.10  95.77  74.95
   4   86.01  32.65  97.96  86.79  85.02  97.37  76.12

   1   50.00  72.73  00.00  91.18  83.33  93.54  74.56
   2   66.67  66.67  00.00  92.31  83.33  94.86  76.25
   3   33.33  53.85  00.00  87.10  72.92  92.23  73.35
   4   50.00  42.86  00.00  77.42  64.58  93.26  68.60

   1   77.78  66.67  00.00  57.14  70.00  85.85  62.86
   2   87.27  73.68  00.00  28.57  78.00  94.63  66.48
   3   87.27  82.35  00.00  25.00  78.00  93.53  66.70
   4   84.21  66.67  00.00  00.00  74.00  93.55  66.52

Table 8: F1-scores for text-oriented evaluation. Training sets
for evaluation types (T) are the same as in Table 6, rows 1–9.
Classifiers: (1) logistic regression (fastText); (2) BiLSTM on
word embeddings only; (3) BiLSTMd – word embeddings ex-
tended using a polarity dictionary; (4) BERT. Evaluation types
are explained in Section 5.
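Classifier (3), BiLSTMd, extends each word embedding with information from a polarity dictionary before the sequence is fed to the BiLSTM. One plausible way to implement that extension is sketched below; the dictionary entries and the zero embeddings are placeholders, not the actual PolEmo resources.

```python
import numpy as np

# Sketch of the input used by the BiLSTMd variant: each word embedding
# is extended with a polarity feature looked up in a sentiment dictionary.
polarity = {"great": 1.0, "awful": -1.0}   # +1 positive, -1 negative

def extend(tokens, embeddings):
    # one extra dimension per token: 0.0 for out-of-dictionary words
    feats = np.array([[polarity.get(t, 0.0)] for t in tokens])
    return np.hstack([embeddings, feats])

tokens = ["the", "room", "was", "awful"]
emb = np.zeros((len(tokens), 3))           # dummy 3-dim word embeddings
x = extend(tokens, emb)
print(x.shape)                             # prints (4, 4)
```

The extended vectors then replace the plain embeddings at the BiLSTM input layer, which is the only difference between variants (2) and (3) in the tables.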
T  C   SP     AMB    0      SN     F1     micro  macro

   1   71.98  40.00  64.49  75.90  68.21  83.48  64.44
   2   82.51  53.93  72.23  84.29  78.31  93.78  73.40
   3   81.69  51.41  71.21  84.21  77.99  93.43  73.03
   4   82.46  56.65  75.33  84.21  78.99  92.97  72.98

   1   67.58  25.90  73.33  64.06  66.18  82.41  61.67
   2   72.36  31.75  78.20  71.17  71.96  90.67  70.09
   3   74.49  29.13  79.62  72.58  73.33  91.18  70.39
   4   75.69  27.24  81.33  73.77  74.53  90.76  69.72

   1   62.22  35.34  33.93  73.19  60.78  80.13  59.96
   2   62.21  28.34  40.65  74.48  60.78  81.82  61.34
   3   66.67  31.46  36.36  73.94  61.32  83.05  62.51
   4   66.67  16.77  36.04  74.07  62.80  82.63  60.82

   1   59.34  58.37  34.29  42.50  54.55  77.34  59.64
   2   47.06  47.85  34.29  28.26  43.08  68.40  53.11
   3   45.16  51.61  35.56  26.97  43.87  73.38  56.71
   4   51.31  63.24  18.18  00.00  51.78  76.17  52.96

   1   61.49  26.94  46.98  62.32  54.53  74.29  57.88
   2   72.57  34.60  58.97  74.56  66.56  87.02  67.76
   3   72.76  42.29  60.50  74.80  67.81  87.89  68.21
   4   70.42  42.12  60.89  74.81  66.96  85.71  68.07

   1   48.58  21.18  56.83  55.56  50.33  71.50  55.83
   2   61.87  26.37  62.44  64.55  59.47  80.72  63.67
   3   58.68  24.77  63.00  63.00  58.41  80.83  63.51
   4   61.87  27.21  66.58  64.25  60.75  81.80  65.08

   1   54.21  23.77  28.92  58.81  47.04  69.03  53.20
   2   66.28  33.33  35.34  72.20  59.30  81.78  63.82
   3   66.47  30.61  31.50  72.05  58.36  81.15  62.98
   4   64.26  35.82  30.95  72.78  58.76  78.58  62.11

   1   38.52  42.05  34.92  30.30  37.15  59.92  52.56
   2   53.25  43.90  19.35  46.03  44.27  71.52  58.91
   3   58.82  47.50  23.73  41.79  46.64  71.10  61.07
   4   55.13  51.89  29.79  44.07  49.01  73.09  59.20

   1   66.17  32.36  63.05  66.73  61.27  79.33  61.45
   2   77.43  47.21  74.09  79.40  74.13  91.48  71.70
   3   77.10  45.88  74.30  78.73  73.70  91.52  71.83
   4   76.65  47.76  76.70  79.27  74.36  91.19  71.80

   1   72.09  33.13  61.42  72.88  65.43  81.43  62.66
   2   82.82  51.63  73.18  84.23  78.51  93.64  73.19
   3   81.73  54.51  72.68  84.77  78.59  93.80  73.53
   4   82.82  55.41  74.76  84.52  78.91  93.04  73.12

   1   63.02  23.12  68.42  61.87  61.37  79.79  60.19
   2   76.10  34.88  79.19  75.27  74.44  91.55  70.72
   3   75.27  35.29  79.60  72.51  73.42  91.21  70.72
   4   75.12  40.00  81.83  75.50  75.67  91.71  71.52

   1   56.89  31.85  31.75  63.39  52.16  73.92  56.03
   2   67.75  36.44  35.93  76.90  63.88  86.03  65.86
   3   70.65  35.34  40.00  77.89  65.23  87.23  67.14
   4   65.19  33.33  42.60  75.53  62.26  84.60  65.06

   1   52.17  48.68  26.67  41.44  46.25  69.03  54.72
   2   59.17  64.42  34.15  54.55  58.50  79.16  62.17
   3   61.71  50.81  30.43  52.00  52.96  78.05  62.10
   4   58.62  53.47  34.29  50.53  53.36  81.38  61.85

Table 9: F1-scores for sentence-oriented evaluation. Train-
ing sets for evaluation types (T) are the same as in Table 6,
rows 1–9. Classifiers: (1) logistic regression (fastText); (2)
BiLSTM on word embeddings only; (3) BiLSTMd – word
embeddings extended using a polarity dictionary; (4) BERT.
Evaluation types are explained in Section 5.
Kia Dashtipour, Soujanya Poria, Amir Hussain, Erik
Cambria, Ahmad YA Hawalah, Alexander Gelbukh,
and Qiang Zhou. 2016. Multilingual sentiment anal-
ysis: state of the art and independent comparison of
techniques. Cognitive computation, 8(4):757–771.
Lingjia Deng and Janyce Wiebe. 2015. Mpqa 3.0: An
entity/event-level sentiment corpus. In Proceedings
of the 2015 conference of the North American chap-
ter of the association for computational linguistics:
human language technologies, pages 1323–1328.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2018. Bert: Pre-training of deep
bidirectional transformers for language understand-
ing. arXiv preprint arXiv:1810.04805.
Murthy Ganapathibhotla and Bing Liu. 2008. Mining
opinions in comparative sentences. In Proceedings
of the 22nd International Conference on Computa-
tional Linguistics-Volume 1, pages 241–248. Asso-
ciation for Computational Linguistics.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio.
2011. Domain adaptation for large-scale sentiment
classification: A deep learning approach. In Pro-
ceedings of the 28th international conference on ma-
chine learning (ICML-11), pages 513–520.
Emitza Guzman and Walid Maalej. 2014. How do
users like this feature? a fine grained sentiment
analysis of app reviews. In 2014 IEEE 22nd inter-
national requirements engineering conference (RE),
pages 153–162. IEEE.
Ivan Habernal and Tomáš Brychcín. 2013. Unsuper-
vised improving of sentiment analysis using global
target context. In Proceedings of RANLP 2013. As-
sociation for Computational Linguistics.
Ruining He and Julian McAuley. 2016. Ups and
downs: Modeling the visual evolution of fashion
trends with one-class collaborative filtering. In
proceedings of the 25th international conference
on world wide web, pages 507–517. International
World Wide Web Conferences Steering Committee.
George Hripcsak and Adam S. Rothschild. 2005. Tech-
nical Brief: Agreement, the F-Measure, and Relia-
bility in Information Retrieval. JAMIA, 12(3):296–
Minqing Hu and Bing Liu. 2004. Mining and summa-
rizing customer reviews. In Proceedings of the tenth
ACM SIGKDD international conference on Knowl-
edge discovery and data mining, pages 168–177.
Arkadiusz Janz, Jan Kocoń, Maciej Piasecki, and
Monika Zaśko-Zielińska. 2017. plWordNet as a Basis
for Large Emotive Lexicons of Polish. In LTC’17
8th Language and Technology Conference, Poznań,
Poland. Fundacja Uniwersytetu im. Adama Mick-
iewicza w Poznaniu.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and
Tomas Mikolov. 2017. Bag of tricks for efficient
text classification. In Proceedings of the 15th Con-
ference of the European Chapter of the Association
for Computational Linguistics: Volume 2, Short Pa-
pers, pages 427–431. Association for Computational
Linguistics.
Hiroshi Kanayama and Tetsuya Nasukawa. 2006. Fully
automatic lexicon expansion for domain-oriented
sentiment analysis. In Proceedings of the 2006 con-
ference on empirical methods in natural language
processing, pages 355–363. Association for Compu-
tational Linguistics.
Jan Kocoń and Michał Gawor. 2018. Evaluating
KGR10 Polish word embeddings in the recognition
of temporal expressions using BiLSTM-CRF.
Schedae Informaticae, 27.

Jan Kocoń, Arkadiusz Janz, and Maciej Piasecki.
2018a. Classifier-based Polarity Propagation in a
Wordnet. In Proceedings of the 11th International
Conference on Language Resources and Evaluation
(LREC 2018).

Jan Kocoń, Arkadiusz Janz, and Maciej Piasecki.
2018b. Context-sensitive Sentiment Propagation in
WordNet. In Proceedings of the 9th International
Global Wordnet Conference (GWC’18).

Jan Kocoń, Monika Zaśko-Zielińska, and Piotr
Miłkowski. 2019. Multi-Level Analysis and Recog-
nition of the Text Sentiment on the Example of Con-
sumer Opinions. In Proceedings of the International
Conference Recent Advances in Natural Language
Processing, RANLP 2019.

Jan Kocoń, Arkadiusz Janz, Piotr Miłkowski, Monika
Riegel, Małgorzata Wierzba, Artur Marchewka, Ag-
nieszka Czoska, Damian Grimling, Barbara Konat,
Konrad Juszczyk, Katarzyna Klessa, and Maciej Pi-
asecki. 2019a. Recognition of emotions, polarity
and arousal in large-scale multi-domain text reviews.
In Zygmunt Vetulani and Patrick Paroubek, editors,
Human Language Technologies as a Challenge for
Computer Science and Linguistics, pages 274–280.
Wydawnictwo Nauka i Innowacje, Poznań, Poland.

Jan Kocoń, Arkadiusz Janz, Monika Riegel, Mał-
gorzata Wierzba, Artur Marchewka, Agnieszka
Czoska, Damian Grimling, Barbara Konat, Konrad
Juszczyk, Katarzyna Klessa, and Maciej Piasecki.
2019b. Propagation of emotions, arousal and po-
larity in WordNet using Heterogeneous Structured
Synset Embeddings. In Proceedings of the 10th In-
ternational Global Wordnet Conference (GWC’19),
Wrocław, Poland.

Jan Kocoń and Michał Marcińczuk. 2016. Generating
of Events Dictionaries from Polish WordNet for the
Recognition of Events in Polish Documents. In Text,
Speech and Dialogue, Proceedings of the 19th In-
ternational Conference TSD 2016, volume 9924 of
Lecture Notes in Artificial Intelligence, Brno, Czech
Republic. Springer.
Klaus Krippendorff. 2018. Content analysis: An intro-
duction to its methodology. Sage publications.
Siaw Ling Lo, Erik Cambria, Raymond Chiong, and
David Cornforth. 2017. Multilingual sentiment
analysis: from formal to informal and scarce re-
source languages. Artificial Intelligence Review,
William C Mann and Sandra A Thompson. 1988.
Rhetorical structure theory: Toward a functional the-
ory of text organization. Text-Interdisciplinary Jour-
nal for the Study of Discourse, 8(3):243–281.
Michał Marcińczuk, Monika Zaśko-Zielińska, and Ma-
ciej Piasecki. 2011. Structure annotation in the Pol-
ish corpus of suicide notes. In Text, Speech and Di-
alogue - 14th International Conference, TSD 2011,
Pilsen, Czech Republic, September 1-5, 2011. Pro-
ceedings, volume 6836 of Lecture Notes in Com-
puter Science, pages 419–426. Springer.
Michał Marcińczuk, Jan Kocoń, and Bartosz Broda.
2012. Inforex – a web-based tool for text corpus
management and semantic annotation. In Proceed-
ings of the Eight International Conference on Lan-
guage Resources and Evaluation (LREC’12), Istan-
bul, Turkey. European Language Resources Associ-
ation (ELRA).
Michał Marcińczuk and Marcin Oleksy. 2019. Inforex
– a Collaborative System for Text Corpora Annota-
tion and Analysis Goes Open. In Proceedings of the
International Conference Recent Advances in Natu-
ral Language Processing, RANLP 2019.
Walaa Medhat, Ahmed Hassan, and Hoda Korashy.
2014. Sentiment analysis algorithms and applica-
tions: A survey. Ain Shams engineering journal,
Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio
Sebastiani, and Veselin Stoyanov. 2016. Semeval-
2016 task 4: Sentiment analysis in twitter. In Pro-
ceedings of the 10th international workshop on se-
mantic evaluation (semeval-2016), pages 1–18.
Alexander Pak and Patrick Paroubek. 2010. Twitter as
a corpus for sentiment analysis and opinion mining.
In LREc, volume 10, pages 1320–1326.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep contextualized word rep-
resentations. In Proceedings of the 2018 Confer-
ence of the North American Chapter of the Associ-
ation for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers), pages
2227–2237, New Orleans, Louisiana. Association
for Computational Linguistics.
Maciej Piasecki. 2014. User-driven language technol-
ogy infrastructure—the case of clarin-pl. In Pro-
ceedings of the Ninth Language Technologies Con-
ference. Ljubljana, Slovenia.
Maciej Piasecki, Marek Maziarz, Stanisław Szpakow-
icz, and Ewa Rudnicka. 2014. PLWordNet as the
Cornerstone of a Toolkit of Lexico-semantic Re-
sources. In Proc. 7th International Global Wordnet
Conference, pages 304–312.
Maciej Piasecki, Ksenia Młynarczyk, and Jan Kocoń.
2017. Recognition of genuine Polish suicide
notes. In Proceedings of the International Confer-
ence Recent Advances in Natural Language Pro-
cessing, RANLP 2017, pages 583–591.
Maria Pontiki, Dimitris Galanis, Haris Papageor-
giou, Ion Androutsopoulos, Suresh Manandhar, AL-
Smadi Mohammad, Mahmoud Al-Ayyoub, Yanyan
Zhao, Bing Qin, Orphée De Clercq, et al. 2016.
Semeval-2016 task 5: Aspect based sentiment anal-
ysis. In Proceedings of the 10th international work-
shop on semantic evaluation (SemEval-2016), pages
Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2019. Language
models are unsupervised multitask learners. OpenAI
Blog, page 8.
Adam Radziszewski and Tomasz Śniatowski. 2011.
Maca — a configurable tool to integrate Polish mor-
phological data. In Proceedings of the Second In-
ternational Workshop on Free/Open-Source Rule-
Based Machine Translation.
Jyoti Ramteke, Samarth Shah, Darshan Godhia, and
Aadil Shaikh. 2016. Election result prediction us-
ing twitter sentiment analysis. In 2016 international
conference on inventive computation technologies
(ICICT), volume 1, pages 1–5. IEEE.
Monika Riegel, Małgorzata Wierzba, Marek Wypych,
Łukasz Żurawski, Katarzyna Jednoróg, Anna
Grabowska, and Artur Marchewka. 2015. Nencki
affective word list (nawl): The cultural adaptation of
the berlin affective word list–reloaded (bawl-r) for
polish. Behavior Research Methods, 47(4):1222–
Anna Rogers, Alexey Romanov, Anna Rumshisky,
Svitlana Volkova, Mikhail Gronas, and Alex Gribov.
2018. Rusentiment: An enriched sentiment analysis
dataset for social media in russian. In Proceedings
of the 27th International Conference on Computa-
tional Linguistics, pages 755–763.
Richard Socher, Alex Perelygin, Jean Wu, Jason
Chuang, Christopher D Manning, Andrew Ng, and
Christopher Potts. 2013. Recursive deep models
for semantic compositionality over a sentiment tree-
bank. In Proceedings of the 2013 conference on
empirical methods in natural language processing,
pages 1631–1642.
V Subramaniyaswamy, R Logesh, M Abejith, Sunil
Umasankar, and A Umamakeswari. 2017. Senti-
ment analysis of tweets for estimating criticality and
security of events. Journal of Organizational and
End User Computing (JOEUC), 29(4):51–71.
Maite Taboada, Kimberly Voll, and Julian Brooke.
2008. Extracting sentiment as a function of dis-
course structure and topicality. Simon Fraser Uni-
versity School of Computing Science Technical Report.
Marek Troszyński and Aleksandra Wawer. 2017. Czy
komputer rozpozna hejtera? wykorzystanie uczenia
maszynowego (ml) w jakościowej analizie danych.
Przegląd Socjologii Jakościowej, 13(2):62–80.
Fei Wang, Yunfang Wu, and Likun Qiu. 2012. Exploit-
ing discourse relations for sentiment analysis. Pro-
ceedings of COLING 2012: Posters, pages 1311–
Aleksander Wawer. 2012. Mining co-occurrence ma-
trices for so-pmi paradigm word candidates. In Pro-
ceedings of the Student Research Workshop at the
13th Conference of the European Chapter of the As-
sociation for Computational Linguistics, pages 74–
80. Association for Computational Linguistics.
Aleksander Wawer and Maciej Ogrodniczuk. 2017.
Results of the poleval 2017 competition: sentiment
analysis shared task. In 8th Language and Technol-
ogy Conference: Human Language Technologies as
a Challenge for Computer Science and Linguistics.
Aleksander Wawer and Dominika Rogozinska. 2012.
How much supervision? corpus-based lexeme sen-
timent estimation. In 2012 IEEE 12th International
Conference on Data Mining Workshops, pages 724–
730. IEEE.
Małgorzata Wierzba, Monika Riegel, Marek Wypych,
Katarzyna Jednoróg, Paweł Turnau, Anna
Grabowska, and Artur Marchewka. 2015. Ba-
sic emotions in the nencki affective word list (nawl
be): New method of classifying emotional stimuli.
PLoS One, 10(7):e0132305.
Monika Zaśko-Zielińska. 2013. Listy pożegnalne: w
poszukiwaniu lingwistycznych wyznaczników auten-
tyczności tekstu. Wydawnictwo Quaestio, Wrocław.

Monika Zaśko-Zielińska, Maciej Piasecki, and Stan
Szpakowicz. 2015. A large wordnet-based senti-
ment lexicon for polish. In Proceedings of the In-
ternational Conference Recent Advances in Natural
Language Processing, pages 721–730.
Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep
learning for sentiment analysis: A survey. Wiley In-
terdisciplinary Reviews: Data Mining and Knowl-
edge Discovery, 8(4):e1253.
Wenbin Zhang and Steven Skiena. 2010. Trading
strategies to exploit blog and news sentiment. In
Fourth international aAAI conference on weblogs
and social media.
Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen
Li, Hongwei Hao, and Bo Xu. 2016. Attention-
based bidirectional long short-term memory net-
works for relation classification. In Proceedings of
the 54th Annual Meeting of the Association for Com-
putational Linguistics (Volume 2: Short Papers),
volume 2, pages 207–212.
... There is a plethora of different models selected for training emotion detection. Some works have tested various models and stayed with SVMs ( [17]), some -due to data scarcity, for instance -prefer Bert ( [22,27]) and some stick to LSTM ( [20]) or BiLSTM-architectures with co-attention ( [27,28]) or sequence-based convolutional neural networks [29]. Following the opinion of [30] that "(...) HANs (Hierarchical Attention Networks) work on two different and complementary aspects: one reveals the hierarchical structure of documents, whereas the second one őxes two levels of attention, applied at a word and sentence level, to provide different importance to the content during the document representation" we őrst chose this type of neural networks, as described in [31], for our work. ...
... If we look at emotion extraction works based on online reviews, there are at least two works analyzing movie and hotel reviews ( [30,32]) or Yelp, Yahoo, and Amazon reviews ( [31]). As for studies of Polish opinion reviews, the PolEmo corpus of consumer reviews covers the domains of medicine (patients' opinions on doctors), hotels, products, and school (students' reviews of the lecturers) with about 8,216 reviews ( [28]). ...
... Another problem is that in the NLP literature, we have found very few works related to the detection of a crisis of reputation in its early stage (see the Related Work section). As we can see in Table 1, most works use a deőnition of a crisis tailored to their speciőc problem or task rather than a general one that would apply to the NLP őeld as a whole, despite the fact that many authors stress the need to coin it [28]. ...
Full-text available
From the perspective of marketing studies, a crisis of reputation is interpreted as such post factum, when measurable financial loss is induced. However, it is highly demanded to discover its signs as early as possible for risk management purposes. Here is where artificial intelligence finds its application since the start of the Facebook era. Emotion- or sentiment-classification algorithms, based on BiLSTM neural networks or transformer architectures, achieve very good F1 scores. Nevertheless, the scholarly literature offers very few approaches to the detection of reputational crises ante factum from an NLP point of view. At the same time, not every peak of mentions with negative sentiment equals a crisis of reputation by definition. There exist ample general sentiment classification tools dedicated to a specific social medium, e.g. Twitter, while reputational crises often expand over various Internet sources. However, they also tend to be highly unpredictable in the way they appear and spread online. Moreover, very few studies of their development have so far been conducted from the perspective of NLP tool-design. Therefore, in our work we try to answer the question: how can we track reputational crises fast and precisely in multiple communication channels, and what do current NLP methods offer with this respect? For this purpose we have: consulted Internet monitoring experts and, defined major crisis topics for three business domains, built and tested three different approaches to crisis detection (a HAN-based emotion detection model, heuristic crisis detection models for predefined risks, a statistical mention peak analysis tool with an ML-based summarization algorithm) to track most Internet sources and cover both explicit and implicit content and performed an analysis of 15 reputational crises in online sources in Poland. 
We offer a comparative analysis of NLP tools of qualitative semantic methods applied to a study of real-life reputational crises that appeared in Poland within the last two decades.
... Recently, most of the research in the field of sentiment analysis has relied on effective solutions based on deep neural networks. The current state-of-the-art applies Recurrent Neural Networks and Transformers, such as BiLSTM/CNN [10,11], BERT [11,12], or RoBERTa [13,14]. The concept of knowledge transfer between domains, document types, and user biases within social networks has been discussed in [15]. ...
... Recently, most of the research in the field of sentiment analysis has relied on effective solutions based on deep neural networks. The current state-of-the-art applies Recurrent Neural Networks and Transformers, such as BiLSTM/CNN [10,11], BERT [11,12], or RoBERTa [13,14]. The concept of knowledge transfer between domains, document types, and user biases within social networks has been discussed in [15]. ...
... We created a new kind of dataset for sentiment analysis tasks -PolEmo 2.0 [11]. Each sentence, as well as the entire document, is labeled with one of the four following sentiment classes: (1) P: positive; (2) 0 : neutral; (3) N : negative; (4) AMB : ambivalent, that is, there are positive and negative aspects in the text that are balanced in terms of relevance. ...
We carried out extensive experiments on the MultiEmo dataset for sentiment analysis with texts in eleven languages. Two adapted versions of the LaBSE deep architecture were confronted against the LASER model. That allowed us to conduct cross-language validation of these language agnostic methods. The achieved results proved that LaBSE embeddings with an additional attention layer within the biLSTM architecture commonly outperformed other methods.KeywordsCross-language NLPSentiment analysisLanguage-agnostic representationLaBSEBiLSTMLASEROpinion miningMultiEmo
... 5), it was exploited in the MultiEmo web service for language-agnostic sentiment analysis: https: // All results presented in this paper are downloadable: the MultiEmo dataset at 798 and source codes at [9,10], BERT [10,11], or RoBERTa [12,13]. The idea of knowledge transfer between domains, document types, and user biases in the context of social media was discussed in [14]. ...
... 5), it was exploited in the MultiEmo web service for language-agnostic sentiment analysis: https: // All results presented in this paper are downloadable: the MultiEmo dataset at 798 and source codes at [9,10], BERT [10,11], or RoBERTa [12,13]. The idea of knowledge transfer between domains, document types, and user biases in the context of social media was discussed in [14]. ...
... We created a new kind of dataset for sentiment analysis tasks -PolEmo 2.0 [10]. Each sentence as well as the entire document are labelled with one out of the four following sentiment classes: (1) P: positive; (2) 0 : neutral; (3) N : negative; (4) AMB: ambivalent, i.e., there are both positive and negative aspects in the text that are balanced in terms of relevance. ...
We developed and validated a language-agnostic method for sentiment analysis. Cross-language experiments carried out on the new MultiEmo dataset with texts in 11 languages proved that LaBSE embeddings with an additional attention layer implemented in the BiLSTM architecture outperformed other methods in most cases.KeywordsCross-language NLPSentiment analysisLanguage-agnostic representationLASERLaBSEBiLSTMOpinion miningMultiEmo
... Such language models contain millions of parameters but also require large computational resources. Hence, their simplified methods, e.g., BiLSTM [15,19], are often used in practice. We refer to both of these approaches as our baselines. ...
... PolEmo 2.0 dataset [12,15] is a sentiment analysis task benchmark dataset. It consists of more than 8,000 consumer reviews, containing more than 57,000 sentences. ...
... The benefits of additional knowledge bases are best seen in simple language models [17]. For this reason, fastText model for Polish language [14] and BiLSTM model [15] working on the basis of embeddings per token derived from it has been taken into consideration (Fig. 4). This approach allows to use the knowledge base at the level of each token. ...
We propose and test multiple neuro-symbolic methods for sentiment analysis. They combine deep neural networks – transformers and recurrent neural networks – with external knowledge bases. We show that for simple models, adding information from knowledge bases significantly improves the quality of sentiment prediction in most cases. For medium-sized sets, we obtain significant improvements over state-of-the-art transformer-based models using our proposed methods: Tailored KEPLER and Token Extension. We show that the cases with the improvement belong to the hard-to-learn set.KeywordsNeuro-symbolic sentiment analysisplWordNetKnowledge baseTransformersKEPLERHerBERTBiLSTMPolEmo 2.0
... AspectEmo Corpus is an extended version of a publicly available PolEmo 2.0 corpus [40] that was used in many projects on the use of different methods in sentiment analysis. ...
... The tags that have a lower PSA value are also generally less frequently picked by annotators, as shown in Table IV. Similar lower PSA results for ambiguous tags were obtained in similar studies on the analysis of sentiment at text and sentence levels [40]. ...
... Figure 4 shows a learning curve for AMB label. It follows the problem noticed in similar work on PolEmo 2.0 [40] -the ambiguous label is not only underrepresented but also hard to assign for human experts as well as for deep learning models. ...
Conference Paper
Aspect-based sentiment analysis (ABSA) is a text analysis method that categorizes data by aspects and identifies the sentiment assigned to each aspect. Aspect-based sentiment analysis can be used to analyze customer opinions by associating specific sentiments with different aspects of a product or service. Most of the work in this topic is thoroughly performed for English, but many low-resource languages still lack adequate annotated data to create automatic methods for the ABSA task. In this work, we present annotation guidelines for the ABSA task for Polish and preliminary annotation results in the form of the AspectEmo corpus, containing over 1.5k consumer reviews annotated with over 63k annotations. We present an agreement analysis on the resulting annotated corpus and preliminary results using transformer-based models trained on AspectEmo.
... As discrepancies between annotations existed -most of the mistakes happened for (WP/WN/AMB) tags, the authors decided to eliminate separate tags for weakly positive and weakly negative reviews, and merge those tags into one (AMB) tag. Furthermore, Kocoń et al. (2019) indicated that the majority of the errors were related to the identification of neutral (0) reviews. Hence, we decided to combine (0) and (AMB) tags together. ...
... This dataset(Kocoń et al., 2019) comprises around 8,200 online reviews related to education, products, medicine and hotel domains. The vast majority of PolEmo 2.0 reviews (around 85%) come from the medicine and hotel domains. ...
Full-text available
People express their opinions and views in different and often ambiguous ways, hence the meaning of their words is often not explicitly stated and frequently depends on the context. Therefore, it is difficult for machines to process and understand the information conveyed in human languages. This work addresses the problem of sentiment analysis (SA). We propose a simple yet comprehensive method which uses contextual embeddings and a self-attention mechanism to detect and classify sentiment. We perform experiments on reviews from different domains, as well as on languages from three different language families, including morphologically rich Polish and German. We show that our approach is on a par with state-of-the-art models or even outperforms them in several cases. Our work also demonstrates the superiority of models leveraging contextual embeddings. In sum, in this paper we make a step towards building a universal, multilingual sentiment classifier.
... Kocoń et al., 2019) is a dataset of online consumer reviews from four domains: medicine, hotels, products, and university. It consists of 8,216 reviews having 57,466 sentences. ...
Full-text available
The availability of compute and data to train larger and larger language models increases the demand for robust methods of benchmarking the true progress of LM training. Recent years witnessed significant progress in standardized benchmarking for English. Benchmarks such as GLUE, SuperGLUE, or KILT have become de facto standard tools to compare large language models. Following the trend to replicate GLUE for other languages, the KLEJ benchmark has been released for Polish. In this paper, we evaluate the progress in benchmarking for low-resourced languages. We note that only a handful of languages have such comprehensive benchmarks. We also note the gap in the number of tasks being evaluated by benchmarks for resource-rich English/Chinese and the rest of the world. In this paper, we introduce LEPISZCZE (the Polish word for glew, the Middle English predecessor of glue), a new, comprehensive benchmark for Polish NLP with a large variety of tasks and high-quality operationalization of the benchmark. We design LEPISZCZE with flexibility in mind. Including new models, datasets, and tasks is as simple as possible while still offering data versioning and model tracking. In the first run of the benchmark, we test 13 experiments (task and dataset pairs) based on the five most recent LMs for Polish. We use five datasets from the Polish benchmark and add eight novel datasets. As the paper's main contribution, apart from LEPISZCZE, we provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
... A psychologist and a linguist annotated the reviews, and another independent annotator resolved disagreements [23]. The released collection has already been split into test (820 samples), validation (823 samples), and training (6,573 samples) subsets. ...
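The split sizes quoted in this snippet sum exactly to the 8,216 reviews reported for PolEmo 2.0, which makes a handy sanity check when loading the data (the dictionary below just restates the published counts):

```python
# Published PolEmo 2.0 split sizes, restated from the snippet above
SPLITS = {"train": 6573, "validation": 823, "test": 820}

def check_totals(splits, expected_total=8216):
    """Verify that the per-split counts add up to the full corpus size."""
    total = sum(splits.values())
    assert total == expected_total, f"got {total}, expected {expected_total}"
    return total

if __name__ == "__main__":
    print(check_totals(SPLITS))  # 8216
```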
Conference Paper
Many publications prove that the creation of a multiobjective machine learning model is possible and reasonable. Moreover, we can see significant gains in expanding the knowledge domain, increasing prediction quality, and reducing the inference time. New developments in cross-lingual knowledge transfer open up a range of possibilities, particularly in working with low-resource languages. With a motivation to explore the latest subfields of natural language processing and their interactions, we decided to create a multi-task multilingual model for the following text classification tasks: functional style, domain, readability, and sentiment. The paper discusses the effectiveness of particular language-agnostic approaches to Polish and English and the effectiveness and validity of the multi-task model.
Conference Paper
Full-text available
Models are increasing in size and complexity in the hunt for SOTA. But what if a 2% increase in performance does not make a difference in a production use case? Maybe the benefits of a smaller, faster model outweigh those slight performance gains. Also, in multilingual tasks, equally good performance across languages matters more than SOTA results on a single one. We present the biggest unified multilingual collection of sentiment analysis datasets. We use it to assess 11 models on 80 high-quality sentiment datasets (out of 342 raw datasets collected) in 27 languages, and include results on internally annotated datasets. We deeply evaluate multiple setups, including fine-tuning transformer-based models, to measure performance. We compare results along numerous dimensions, addressing the imbalance in both language coverage and dataset sizes. Finally, we present some best practices for working with such a massive collection of datasets and models from a multilingual perspective.
Conference Paper
Full-text available
In this paper we present a novel method for emotive propagation in a wordnet based on a large emotive seed. We introduce a sense-level emotive lexicon annotated with polarity, arousal and emotions. The data were annotated as a part of a large study involving over 20,000 participants. A total of 30,000 lexical units in Polish WordNet were described with meta-data, each unit received about 50 annotations concerning polarity, arousal and 8 basic emotions, marked on a multilevel scale. We present a preliminary approach to propagating emotive metadata to unlabeled lexical units based on the distribution of manual annotations using logistic regression and description of mixed synset embeddings based on our Heterogeneous Structured Synset Embeddings.
Conference Paper
Full-text available
In this article, we present a novel multi-domain dataset of Polish text reviews, annotated with sentiment on different levels: sentences and the whole documents. The annotation was made by linguists in a 2+1 scheme (with inter-annotator agreement analysis). We present a preliminary approach to the classification of labelled data using logistic regression, bidirectional long short-term memory recurrent neural networks (BiLSTM) and bidirectional encoder representations from transformers (BERT).
Full-text available
The article introduces a new set of Polish word embeddings built using the KGR10 corpus, which contains more than 4 billion words. These embeddings are evaluated on the problem of recognizing temporal expressions (timexes) in Polish. We describe the process of creating the KGR10 corpus and a new approach to the recognition problem using a Bidirectional Long Short-Term Memory (BiLSTM) network with an additional CRF layer, where the specific embeddings are essential. We present experiments and the conclusions drawn from them.
Conference Paper
Full-text available
In this article, we present a novel multidomain dataset of Polish text reviews. The data were annotated as part of a large study involving over 20,000 participants. A total of 7,000 texts were described with metadata, each text received about 25 annotations concerning polarity, arousal and eight basic emotions, marked on a multilevel scale. We present a preliminary approach to data labelling based on the distribution of manual annotations and to the classification of labelled data using logistic regression and bi-directional long short-term memory recurrent neural networks.
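Aggregating roughly 25 annotations per text into one label, as described here, can be approximated with a majority rule plus an agreement threshold (the 0.5 threshold and the "AMB" fallback label are illustrative assumptions, not the authors' procedure):

```python
from collections import Counter

def label_from_annotations(annotations, min_agreement=0.5):
    """Return the majority annotation, or 'AMB' (ambiguous) when no
    label reaches the agreement threshold."""
    counts = Counter(annotations)
    label, top = counts.most_common(1)[0]
    if top / len(annotations) < min_agreement:
        return "AMB"
    return label

if __name__ == "__main__":
    # 25 hypothetical annotations for one document
    print(label_from_annotations(["P"] * 18 + ["N"] * 4 + ["AMB"] * 3))  # P
```

Varying the threshold trades label coverage against label reliability, which is the core tension in distribution-based labelling.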
Conference Paper
Full-text available
In the paper we present the latest changes introduced to Inforex, a web-based system for qualitative and collaborative text corpora annotation and analysis. The most important change is the release of the source code: the system is now available in the GitHub repository (CLARIN-PL/Inforex) as an open-source project. It can be easily set up and run in a Docker container, which simplifies the installation process. The major improvements include: semi-automatic text annotation, multilingual text preprocessing using CLARIN-PL web services, morphological tagging of XML documents, an improved editor for annotation attributes, a batch annotation attribute editor, morphological disambiguation, and extended word sense annotation. This paper contains a brief description of these improvements. We also present two use cases in which various Inforex features were used and tested in real-life projects.
Conference Paper
Full-text available
In this paper we present a novel approach to the construction of an extensive, sense-level sentiment lexicon built on the basis of a wordnet. The main aim of this work is to create a high-quality sentiment lexicon in a partially automated way. We propose a method called Classifier-based Polarity Propagation, which utilises a very rich set of wordnet-based features, to recognize and assign specific sentiment polarity values to wordnet senses. We have demonstrated that, in comparison to the existing rule-based solutions using a specific, narrow set of semantic relations, our method allows for the construction of a more reliable sentiment lexicon, starting with the same seed of annotated synsets.
Conference Paper
Full-text available
In this paper we present a comprehensive overview of recent methods of sentiment propagation in a wordnet. Next, we propose a fully automated method called Classifier-based Polarity Propagation, which utilises a very rich set of features, most of which are based on wordnet relation types, multi-level bag-of-synsets and bag-of-polarities. We have evaluated our solution using a manually annotated part of plWordNet 3.1 emo, which contains more than 83k manual sentiment annotations covering more than 41k synsets. We have demonstrated that, in comparison to existing rule-based methods using a specific narrow set of semantic relations, our method achieves statistically significantly better results starting with the same seed synsets.
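A toy illustration of the bag-of-polarities idea: each synset is described by the polarity distribution of its direct wordnet neighbours, a feature vector a classifier could then consume (the graph, seed labels and synset names below are invented for illustration; the authors' actual feature set is much richer):

```python
from collections import Counter

# Toy wordnet: synset -> direct neighbours (invented for illustration)
GRAPH = {
    "good.1": ["great.1", "fine.1"],
    "great.1": ["good.1"],
    "fine.1": ["good.1", "penalty.1"],
    "penalty.1": ["fine.1"],
}
# Manually annotated seed synsets: '+' positive, '-' negative
SEED = {"great.1": "+", "penalty.1": "-"}

def polarity_features(synset, graph=GRAPH, seed=SEED):
    """Bag-of-polarities over direct neighbours:
    counts of positive, negative and unlabeled ('?') neighbours."""
    c = Counter(seed.get(n, "?") for n in graph[synset])
    return {p: c.get(p, 0) for p in ("+", "-", "?")}

if __name__ == "__main__":
    for s in ("good.1", "fine.1"):
        print(s, polarity_features(s))
```

In the full method such features would be computed over multiple relation types and propagation levels, then fed to a trained classifier rather than read off directly.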
Conference Paper
Full-text available
We present a large emotive lexicon of Polish which has been constructed by manual expansion of the emotive annotation defined for plWordNet 3.0 emo (a very large wordnet of Polish). The annotation encompasses: sentiment polarity, basic emotions and fundamental human values. Annotation scheme and revised guidelines for the annotation process are discussed. We present also statistics for the contemporary state of the development. Finally, the idea of the second plWordNet-based emotive lexicon created in controlled experiments is introduced. A method of selection of word senses for experiments is proposed and evaluated.
Deep learning has emerged as a powerful machine learning technique that learns multiple layers of representations or features of the data and produces state-of-the-art prediction results. Along with the success of deep learning in many application domains, it has also been used for sentiment analysis in recent years. This paper gives an overview of deep learning and then provides a comprehensive survey of its current applications in sentiment analysis.