Figure 1
Rating distributions within emotional categories. All values are normalized to the interval [0,1].


Source publication
Conference Paper
Full-text available
Analysis of emotions elicited by opinions, comments, or articles commonly exploits annotated corpora, in which the labels assigned to documents average the views of all annotators, or represent a majority decision. The models trained on such data are effective at identifying the general views of the population. However, their usefulness for predict...

Contexts in source publication

Context 1
... the acquired data consists of ten emotional categories: valence, arousal, and eight basic emotions (sadness, anticipation, joy, fear, surprise, disgust, trust, and anger). Mean text rating distributions within the emotional categories are presented in Figure 1. In total, 7k opinions × 53.46 annotators per opinion (on average) × 10 categories ≈ 3.74M single annotations were collected. ...
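The arithmetic above is easy to verify, and the caption's [0, 1] normalization is a standard min-max rescaling. A minimal Python sketch; the hypothetical 1-5 raw rating scale is an assumption for illustration, not a detail from the paper:

```python
# Back-of-the-envelope check of the annotation count quoted above, plus
# the min-max normalization behind the [0, 1] interval in Figure 1.
n_opinions = 7_000
avg_annotators = 53.46
n_categories = 10  # valence, arousal, and the eight basic emotions

total = n_opinions * avg_annotators * n_categories
print(f"{total / 1e6:.2f}M single annotations")  # -> 3.74M

def normalize(rating: float, lo: float, hi: float) -> float:
    """Min-max scale a raw rating onto [0, 1]."""
    return (rating - lo) / (hi - lo)

print(normalize(4, lo=1, hi=5))  # assumed 1-5 scale: 4 -> 0.75
```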
Context 2
... performance in the PEB scenario is the lowest for the valence category, which may result from its highest agreement level (α^int_c = 0.38) and flatter distribution, Figure 1. Simultaneously, the reasoning based on text only (TXT scenario) demonstrated the opposite dependency: its performance is greatest for the highest agreement (valence). ...
[Figure 6: R-squared results in the TXT+PEB scenario with the HerBERT model, in relation to the number of texts from the past set randomly selected to compute PEB(u, c), averaged over all users u (solid lines).]

Citations

... Others are concerned with identifying similar perspectives among groups of annotators based on demographic information (Bizzoni et al., 2022; Goyal et al., 2022; Sang & Stanton, 2022), personality traits (Labat et al., 2022), or annotation behaviour (Akhtar et al., 2020). Finally, some works are concerned with preserving the views of individual annotators' evaluations (Milkowski et al., 2021). We view this as a continuum from data- to human-centricism, where works that focus on the individual level of annotator granularity sit at the human end of the spectrum. ...
... Modelling human bias. The fifth (and last) category is focused on human biases. Milkowski et al. (2021) introduce a novel measure estimating Personal Emotional Bias (PEB) in evaluating opinions. PEB measures the extent to which the previously known annotations of a given user differ from the average annotations provided by all other annotators for a given emotional category, aggregated over all documents. ...
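Read as a formula, this says PEB(u, c) averages, over documents, the signed difference between user u's rating and the mean rating of all other annotators for category c. A minimal sketch under that reading, assuming annotations arrive as (user, document, category, rating) tuples; the data layout, function name, and the optional random subsampling of past texts (cf. Figure 6 of the source publication) are illustrative assumptions, not the authors' code:

```python
import random
from collections import defaultdict

def peb(annotations, user, category, sample_size=None, seed=0):
    """Mean signed difference between `user`'s rating and the average
    rating of all other annotators, over documents `user` annotated."""
    by_doc = defaultdict(list)  # doc -> [(annotator, rating), ...]
    for u, d, c, r in annotations:
        if c == category:
            by_doc[d].append((u, r))

    docs = [d for d, anns in by_doc.items() if any(u == user for u, _ in anns)]
    if sample_size is not None:  # random subset of past texts, cf. Figure 6
        docs = random.Random(seed).sample(docs, min(sample_size, len(docs)))

    diffs = []
    for d in docs:
        own = [r for u, r in by_doc[d] if u == user][0]
        others = [r for u, r in by_doc[d] if u != user]
        if others:
            diffs.append(own - sum(others) / len(others))
    return sum(diffs) / len(diffs) if diffs else 0.0

# Toy usage: user "u1" rates joy systematically higher than the others.
data = [("u1", "d1", "joy", 0.9), ("u2", "d1", "joy", 0.5), ("u3", "d1", "joy", 0.4),
        ("u1", "d2", "joy", 0.8), ("u2", "d2", "joy", 0.3), ("u3", "d2", "joy", 0.5)]
print(peb(data, "u1", "joy"))  # mean of +0.45 and +0.40 -> 0.425
```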
Article
Full-text available
In Artificial Intelligence research, perspectivism is an approach to machine learning that aims at leveraging data annotated by different individuals in order to model varied perspectives that influence their opinions and world view. We present the first survey of datasets and methods relevant to perspectivism in Natural Language Processing (NLP). We review datasets in which individual annotator labels are preserved, as well as research papers focused on analysing and modelling human perspectives for NLP tasks. Our analysis is based on targeted questions that aim to surface how different perspectives are taken into account, what the novelties and advantages of perspectivist approaches/methods are, and the limitations of these works. Most of the included works have a perspectivist goal, even if some of them do not explicitly discuss perspectivism. A sizeable portion of these works are focused on highly subjective phenomena in natural language where humans show divergent understandings and interpretations, for example in the annotation of toxic and otherwise undesirable language. However, in seemingly objective tasks too, human raters often show systematic disagreement. Through the framework of perspectivism we summarize the solutions proposed to extract and model different points of view, and how to evaluate and explain perspectivist models. Finally, we list the key concepts that emerge from the analysis of the sources and several important observations on the impact of perspectivist approaches on future research in NLP.
... All these differences can impact the annotation process. Studies such as Milkowski et al. (2021) have shown that individual differences among annotators can significantly affect emotion annotations in text. These individual differences introduce subjectivity into data assumed to be objective, leading to inconsistencies that can skew the training and evaluation of models designed to predict emotional reactions from text. ...
Preprint
Full-text available
This paper investigates the presence of political bias in emotion inference models used for sentiment analysis (SA) in social science research. Machine learning models often reflect biases in their training data, impacting the validity of their outcomes. While previous research has highlighted gender and race biases, our study focuses on political bias - an underexplored yet pervasive issue that can skew the interpretation of text data across a wide array of studies. We conducted a bias audit on a Polish sentiment analysis model developed in our lab. By analyzing valence predictions for names and sentences involving Polish politicians, we uncovered systematic differences influenced by political affiliations. Our findings indicate that annotations by human raters propagate political biases into the model's predictions. To mitigate this, we pruned the training dataset of texts mentioning these politicians and observed a reduction in bias, though not its complete elimination. Given the significant implications of political bias in SA, our study emphasizes caution in employing these models for social science research. We recommend a critical examination of SA results and propose using lexicon-based systems as a more ideologically neutral alternative. This paper underscores the necessity for ongoing scrutiny and methodological adjustments to ensure the reliability and impartiality of the use of machine learning in academic and applied contexts.
... This includes alternative measures, such as majority vote, but also new techniques from the field of AI that allow individual assessments to be captured. Here, Miłkowski et al. (2021) propose the Personal Emotional Bias (PEB) metric as a measure of an individual's tendency to annotate different categories of emotion. In further studies, it could be adapted to the annotation of pathos-related Argument Schemes to investigate these individual differences in the annotation of emotion-appealing arguments. ...
Article
Full-text available
In this paper, we present a model of pathos, delineate its operationalisation, and demonstrate its utility through an analysis of natural language argumentation. We understand pathos as an interactional persuasive process in which speakers are performing pathos appeals and the audience experiences emotional reactions. We analyse two strategies of such appeals in pre-election debates: pathotic Argument Schemes based on the taxonomy proposed by Walton et al. (Argumentation schemes, Cambridge University Press, Cambridge, 2008), and emotion-eliciting language based on psychological lexicons of emotive words (Wierzba in Behav Res Methods 54:2146–2161, 2021). In order to match the appeals with possible reactions, we collect real-time social media reactions to the debates and apply sentiment analysis (Alswaidan and Menai in Knowl Inf Syst 62:2937–2987, 2020) method to observe emotion expressed in language. The results point to the importance of pathos analysis in modern discourse: speakers in political debates refer to emotions in most of their arguments, and the audience in social media reacts to those appeals using emotion-expressing language. Our results show that pathos is a common strategy in natural language argumentation which can be analysed with the support of computational methods.
... This is why rigorous studies are essential before deploying data and models in real-world settings. When creating data, it is crucial to incorporate evaluation standards that account for different perspectives, such as "golden standards" [5], criteria for evaluating annotators [6,7,8], grouping annotators according to potential bias factors [9], or text visualization techniques for analyzing annotated datasets [10]. On the model level, explainable AI (XAI) techniques [11,12,13] are being used to demystify complex models and ensure transparency. ...
Preprint
Full-text available
This paper explores the correlation between linguistic diversity, sentiment analysis and transformer model architectures. We aim to investigate how different English variations impact transformer-based models for irony detection. To conduct our study, we used the EPIC corpus to extract five diverse English variation-specific datasets and applied the KEN pruning algorithm on five different architectures. Our results reveal several similarities between optimal subnetworks, which provide insights into the linguistic variations that share strong resemblances and those that exhibit greater dissimilarities. We discovered that optimal subnetworks across models share at least 60% of their parameters, emphasizing the significance of parameter values in capturing and interpreting linguistic variations. This study highlights the inherent structural similarities between models trained on different variants of the same language and also the critical role of parameter values in capturing these nuances.
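One way to make the "at least 60% shared parameters" claim operational is to compare the keep-masks of two pruned subnetworks. The Jaccard-style measure below is an assumption for illustration only; the excerpt does not specify whether the paper compares kept positions or parameter values:

```python
import numpy as np

def mask_overlap(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Fraction of kept weights shared by two boolean pruning masks."""
    kept_both = np.logical_and(mask_a, mask_b).sum()
    kept_either = np.logical_or(mask_a, mask_b).sum()
    return float(kept_both) / float(kept_either)

rng = np.random.default_rng(0)
a = rng.random(1_000) < 0.5  # hypothetical keep-mask for model A
b = rng.random(1_000) < 0.5  # hypothetical keep-mask for model B
print(f"overlap: {mask_overlap(a, b):.2f}")
```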
... Studies in this line of work underscore the importance of tailoring NLP models to individual beliefs and preferences to enhance the handling of offensive content and controversial topics. Models that incorporate personal perspectives, as demonstrated by Miłkowski et al. (2021) and Mireshghallah et al. (2022), offer superior predictions by acknowledging individual emotional responses. Kazienko et al. (2023) extend this approach by developing deep learning models that account for individual differences, significantly outperforming traditional models in subjective tasks. ...
Preprint
Full-text available
Large language models (LLMs) have significantly advanced Natural Language Processing (NLP) tasks in recent years. However, their universal nature poses limitations in scenarios requiring personalized responses, such as recommendation systems and chatbots. This paper investigates methods to personalize LLMs, comparing fine-tuning and zero-shot reasoning approaches on subjective tasks. Results demonstrate that personalized fine-tuning improves model reasoning compared to non-personalized models. Experiments on datasets for emotion recognition and hate speech detection show consistent performance gains with personalized methods across different LLM architectures. These findings underscore the importance of personalization for enhancing LLM capabilities in subjective text perception tasks.
... For the primary assessments, the data partitioning was derived from the past-present-future1-future2 design [39], portrayed in Figure 11. This segmentation was engineered to reflect the data availability of a real-world prediction system. ...
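A rough sketch of what such a chronological partition can look like, assuming time-sorted items and equal split fractions; the actual boundaries are fixed by the design in [39]:

```python
def chrono_split(items, fractions=(0.25, 0.25, 0.25, 0.25)):
    """Split time-ordered items into past/present/future1/future2."""
    n, cuts, acc = len(items), [], 0.0
    for f in fractions[:-1]:
        acc += f
        cuts.append(round(acc * n))
    past, present = items[:cuts[0]], items[cuts[0]:cuts[1]]
    future1, future2 = items[cuts[1]:cuts[2]], items[cuts[2]:]
    return past, present, future1, future2

texts = [f"text_{i:03d}" for i in range(100)]  # assumed already time-sorted
past, present, future1, future2 = chrono_split(texts)
print(len(past), len(present), len(future1), len(future2))  # 25 25 25 25
```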
... As a concept of contextual and human-centered processing, personalization in NLP was proposed by us and recently extensively explored in [20][21][22][96][97][98][99][100]. Here, we extend it to ChatGPT prompts as personalized in-context processing. ...
Article
Full-text available
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the artificial intelligence approach to human-model interaction. The first contact with the chatbot reveals its ability to provide detailed and precise answers in various areas. Several publications on ChatGPT evaluation test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. In contrast, the other tasks require more objective reasoning, like word sense disambiguation, linguistic acceptability, and question answering. We also evaluated the GPT-4 model on five selected subsets of NLP tasks. We automated the ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of its results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. For the GPT-4 model, the loss for semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (the lower the SOTA performance), the higher the ChatGPT loss. This especially applies to pragmatic NLP problems like emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.
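The name Random Contextual Few-Shot Personalization suggests sampling a few of the target user's own past annotations into the prompt. The sketch below is a hedged illustration of that idea only; the prompt wording, parameters, and helper name are assumptions, not the paper's implementation:

```python
import random

def personalized_prompt(past_annotations, query_text, k=3, seed=0):
    """Build a prompt that shows k randomly sampled (text, rating)
    pairs from the user's history before the query text."""
    shots = random.Random(seed).sample(past_annotations,
                                       min(k, len(past_annotations)))
    lines = ["Here is how this user rated some earlier texts:"]
    lines += [f'Text: "{t}" -> Rating: {r}' for t, r in shots]
    lines.append(f'Rate the next text as this user would.\n'
                 f'Text: "{query_text}" -> Rating:')
    return "\n".join(lines)

history = [("Great match!", 0.9), ("Awful weather", 0.2), ("Meh", 0.5)]
print(personalized_prompt(history, "What a lovely day"))
```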
... Much less often, multi-label classification or multivariate regression is considered; see, e.g., [51]. ...
... The emotional data contains individualized, self-reported perceptions of emotions elicited by a given text. The Sentimenti organisation carried out the annotation process for over 7,000 texts, which resulted in an average of 50 emotional annotations per text [51]. In order to obtain authentic results, the Computer Assisted Web Interview (CAWI) approach was used. ...
Article
Full-text available
Some tasks in content processing, e.g., natural language processing (NLP), like hate or offensive speech and emotional or funny text detection, are subjective by nature. Each human may perceive some content individually. The existing reasoning methods commonly rely on agreed output values, the same for all recipients. We propose fundamentally different – personalized solutions applicable to any subjective NLP task. Our five new deep learning models take into account not only the textual content but also the opinions and beliefs of a given person. They differ in their approaches to learning Human Bias (HuBi) and fusion with content (text) representation. The experiments were carried out on 14 tasks related to offensive, emotional, and humorous texts. Our personalized HuBi methods radically outperformed the generalized ones for all NLP problems. Personalization also has a greater impact on reasoning quality than commonly explored pre-trained and fine-tuned language models. We discovered a high correlation between human bias calculated using our dedicated formula and that learned by the model. Multi-task solutions achieved better outcomes than single-task architectures. Human and word embeddings also provided additional insights.
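The fusion idea described here, a learned per-annotator representation combined with the text representation before the prediction head, can be sketched compactly. Dimensions, names, and concatenation as the fusion operator are assumptions; the paper's five HuBi variants differ precisely in how the human bias and the content representation are fused:

```python
import torch
import torch.nn as nn

class PersonalizedRegressor(nn.Module):
    """Fuses a learned annotator embedding with a text vector (e.g. a
    pooled transformer output) before a multivariate regression head."""
    def __init__(self, n_users, text_dim=768, user_dim=32, n_outputs=10):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, user_dim)  # per-annotator bias
        self.head = nn.Sequential(
            nn.Linear(text_dim + user_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_outputs),  # e.g. 10 emotional categories
        )

    def forward(self, text_vec, user_id):
        fused = torch.cat([text_vec, self.user_emb(user_id)], dim=-1)
        return self.head(fused)

model = PersonalizedRegressor(n_users=100)
text_vec = torch.randn(4, 768)         # stand-in for pooled HerBERT outputs
user_id = torch.tensor([0, 7, 7, 42])
print(model(text_vec, user_id).shape)  # torch.Size([4, 10])
```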