Figure 1. Personalized model architectures: (a) the OneHot model using a one-hot encoded user ID and the HuBi-Formula model utilizing the pre-calculated HB feature; (b) HuBi-Medium, the learned human embedding model; (c) the HuBi-Complex model using human-word embeddings. The TXT baseline architecture is marked in diagram (a) with the green dotted line.


Source publication
Conference Paper
Full-text available
Many tasks in natural language processing, such as offensive, toxic, or emotional text classification, are subjective by nature. Humans tend to perceive textual content in their own individual way. Existing methods commonly rely on agreed output values that are the same for all consumers. Here, we propose personalized solutions to subjective tasks. Our four...

Contexts in source publication

Context 1
... six models were exploited in four scenarios: (1)-(3) three independent binary classification tasks on Wikipedia discussion texts (Attack, Aggression, Toxicity), and (4) prediction of emotional text perception for each of the ten emotional categories, i.e., multivariate regression with 10 continuous outputs (multitask). All models presented in Figure 1 are shown in the single-task scenario; in the multitask scenario, one model predicts all emotion dimensions simultaneously, and the human bias is learned for each dimension separately (the result is a vector, not a number). In the multitask case, where human embeddings are learned, there is a single embedding vector space for multidimensional prediction, the same as in the single-task variant. ...
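The multitask variant can be made concrete with a short sketch. This is a minimal illustration, assuming the per-dimension human bias enters additively on top of a text-based regressor; the class name and dimensions below are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

class MultitaskBiasHead(nn.Module):
    """Minimal sketch: multivariate regression over 10 emotion dimensions,
    with a human-bias value learned separately for each dimension, so the
    per-annotator bias is a vector rather than a single number."""
    def __init__(self, num_annotators: int, text_dim: int = 768, num_dims: int = 10):
        super().__init__()
        self.regressor = nn.Linear(text_dim, num_dims)
        # one trainable bias per (annotator, dimension) pair
        self.human_bias = nn.Embedding(num_annotators, num_dims)

    def forward(self, text_emb: torch.Tensor, annotator_ids: torch.Tensor) -> torch.Tensor:
        # assumption: the bias is additive; the paper's exact combination may differ
        return self.regressor(text_emb) + self.human_bias(annotator_ids)
```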
Context 2
... embedding is the only input to the TXT model, reflecting the non-personalized method (Figure 1a). It corresponds to the commonly used generalized methods known in NLP, with one unified output for all users. ...
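As a point of reference, the TXT baseline is trivial to sketch: the text embedding is the only input, so the prediction is identical for every user. The class name and dimensions here are illustrative only.

```python
import torch
import torch.nn as nn

class TxtBaseline(nn.Module):
    """Non-personalized TXT baseline: text embedding in, one unified
    prediction out, the same for all users."""
    def __init__(self, text_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(text_dim, num_classes)

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.classifier(text_emb)  # no user information anywhere
```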
Context 3
... the dimension of this vector can be quite large (it depends on the number of annotators), so we proposed a variant in which we represent users in the form of a calculated HB measure (a single number for each task). Both models are shown in Figure 1a. ...
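The trade-off between the two variants can be sketched as follows: the OneHot model widens the classifier input by one dimension per annotator, while the HuBi-Formula variant appends only the single pre-calculated HB value. Class names and dimensions are illustrative, and how the paper exactly fuses these inputs may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneHotModel(nn.Module):
    """Sketch: one-hot encoded user ID concatenated with the text embedding;
    the extra input width grows with the number of annotators."""
    def __init__(self, num_annotators: int, text_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.num_annotators = num_annotators
        self.classifier = nn.Linear(text_dim + num_annotators, num_classes)

    def forward(self, text_emb, annotator_ids):
        one_hot = F.one_hot(annotator_ids, self.num_annotators).float()
        return self.classifier(torch.cat([text_emb, one_hot], dim=-1))

class HuBiFormulaModel(nn.Module):
    """Sketch: the user is represented by a single pre-calculated HB value,
    so only one extra input dimension is needed regardless of annotator count."""
    def __init__(self, text_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(text_dim + 1, num_classes)

    def forward(self, text_emb, hb_value):
        return self.classifier(torch.cat([text_emb, hb_value.unsqueeze(-1)], dim=-1))
```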
Context 4
... is the main difference between this model and NCFs, which typically create a representation of both the user and the item (text). The HuBi-Medium architecture, with a real-valued, custom-length latent vector for each annotator in the dataset, is depicted in Figure 1b. This vector is initialized randomly and optimized using backpropagation. ...
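A minimal sketch of this idea follows, assuming the learned annotator vector is simply concatenated with the text embedding before classification; the fusion details in the paper may differ, and the embedding width here is an arbitrary example of the "custom length".

```python
import torch
import torch.nn as nn

class HuBiMedium(nn.Module):
    """Sketch: one trainable latent vector per annotator, randomly
    initialized and optimized by backpropagation. Unlike typical NCF,
    only the user side receives a learned representation; the text side
    comes from a language-model embedding."""
    def __init__(self, num_annotators: int, user_dim: int = 50,
                 text_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.user_emb = nn.Embedding(num_annotators, user_dim)  # random init by default
        self.classifier = nn.Linear(text_dim + user_dim, num_classes)

    def forward(self, text_emb, annotator_ids):
        user = self.user_emb(annotator_ids)
        return self.classifier(torch.cat([text_emb, user], dim=-1))
```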
Context 5
... the annotator embeddings, word embeddings are also randomly initialized and trained via backpropagation. The HuBi-Complex architecture is shown in Figure 1c. Its prediction is defined as: ...
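The prediction formula itself is elided in the excerpt above, so the following is only a loose sketch under the assumption that the score arises from an interaction between the learned annotator embedding and the learned word embeddings; every name and the dot-product interaction term are hypothetical.

```python
import torch
import torch.nn as nn

class HuBiComplexSketch(nn.Module):
    """Loose sketch only; the real HuBi-Complex prediction formula is not
    reproduced in the excerpt. Both embedding tables are randomly
    initialized and trained via backpropagation, as the text states."""
    def __init__(self, vocab_size: int, num_annotators: int, dim: int = 50):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)      # learned, not pretrained
        self.user_emb = nn.Embedding(num_annotators, dim)  # learned, not pretrained
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, token_ids, annotator_ids):
        words = self.word_emb(token_ids).mean(dim=1)       # (batch, dim), mean over tokens
        user = self.user_emb(annotator_ids)                # (batch, dim)
        return (words * user).sum(dim=-1) + self.bias      # hypothetical interaction term
```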

Citations

... Prior work has investigated methods to retain information about variability and uncertainty. Research has included the prediction of measures such as unbiased annotator standard deviation [2,3], the embedding of individual annotators to improve performance on the aggregated ground truth, with some investigation into how well the model's annotator uncertainty correlates with real uncertainty [4,5,6], and the prediction of the distribution of annotations over a given utterance [7]. Yet, gaps remain. ...
... Previous work has investigated the prediction of individual annotators on subjective tasks such as emotion recognition and hate speech [4,5,10]. Davani et al. introduced an encoder-based model with separate classification heads for each annotator. ...
... An alternative approach to learning individual annotators is through annotator embeddings [5]. Prior work from Kocoń et al. demonstrates that annotator-specific embeddings can be used to personalize model predictions and capture the bias of individual annotators. ...
Preprint
Emotion expression and perception are nuanced, complex, and highly subjective processes. When multiple annotators label emotional data, the resulting labels contain high variability. Most speech emotion recognition tasks address this by averaging annotator labels as ground truth. However, this process omits the nuance of emotion and inter-annotator variability, which are important signals to capture. Previous work has attempted to learn distributions to capture emotion variability, but these methods also lose information about the individual annotators. We address these limitations by learning to predict individual annotators and by introducing a novel method to create distributions from continuous model outputs that permit the learning of emotion distributions during model training. We show that this combined approach can result in emotion distributions that are more accurate than those seen in prior work, in both within- and cross-corpus settings.
... During the experiment, we exploited the following models [9,36]: ...
Preprint
Full-text available
The development of large language models, such as ChatGPT (GPT-3.5) and GPT-4, has revolutionized natural language processing (NLP) and opened up new possibilities in various fields. These models demonstrate remarkable capabilities in generating coherent and contextually relevant text, making them suitable for a wide range of applications. This work focuses on automatic text annotation in subjective problems and personalization using ChatGPT. The primary objective is to investigate ChatGPT's generative capabilities and evaluate its performance in classification and regression NLP tasks. Furthermore, the work also contributes a novel methodology for evaluating personalized ChatGPT and adapting it to address specific problem domains. The results obtained from multiple experimental setups showcase the potential of the method to automatically exploit ChatGPT to generate text annotations. However, the conclusions drawn from the research highlight the need for further, more detailed and extensive analysis across multiple problem domains and diverse datasets.
... Generalized models usually consist of two parts: a text encoder (language model), which creates the text representation e_t, and a classifier or regressor (usually a fully-connected layer) that produces the prediction ŷ. However, recent studies [28], [51], [52] show that this approach should not be considered correct, as adding information about the annotator significantly improves model quality and yields better results. An approach that combines information about the text and the human is called personalized. ...
... The comparison of generalized and personalized approaches is shown in Fig 1. A few existing architectures [28], [51] utilize this fact. Still, all of them are deterministic, meaning none models uncertainty via direct optimization of the negative log-likelihood. ...
... This can include information such as the deviation of responses from the majority vote, metadata about the user, a user identifier [28], the correlation of the text's context with historical evaluations, or other features unique to the recipient of the text. It can also be randomly initialized and tuned during the learning process by backpropagation [51]. ...
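One of the hand-crafted user features mentioned above, the deviation of a user's responses from the majority vote, is easy to sketch. The array layout and function name below are hypothetical.

```python
import numpy as np

def deviation_from_majority(annotations: np.ndarray) -> np.ndarray:
    """annotations[a, t] is annotator a's binary label for text t,
    with np.nan where the annotator did not label that text."""
    majority = (np.nanmean(annotations, axis=0) >= 0.5).astype(float)  # per-text vote
    deviation = np.abs(annotations - majority)  # 0 = agrees with majority, 1 = disagrees
    return np.nanmean(deviation, axis=1)        # one scalar feature per annotator

# Example: annotator 1 always follows the majority, annotator 2 rarely does.
votes = np.array([[1, 0, 1, 0],
                  [1, 0, 0, 0],
                  [0, 1, 0, 1]], dtype=float)
print(deviation_from_majority(votes))  # -> [0.25, 0.0, 0.75]
```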
Conference Paper
Full-text available
Designing predictive models for subjective problems in natural language processing (NLP) remains challenging. This is mainly due to their non-deterministic nature and different perceptions of the content by different humans. It may be solved by Personalized Natural Language Processing (PNLP), where the model exploits additional information about the reader to make more accurate predictions. However, current approaches require complete information about the recipients to be directly embedded. Moreover, recent methods focus on deterministic inference or simple frequency-based estimations of the probabilities. In this work, we overcome this limitation by proposing a novel approach to capture the uncertainty of the forecast using conditional Normalizing Flows. This allows us to model complex multimodal distributions and to compare various models using negative log-likelihood (NLL). In addition, the new solution allows for various interpretations of possible reader perception thanks to the available sampling function. We validated our method on three challenging, subjective NLP tasks, including emotion recognition and hate speech. The comparative analysis of generalized and personalized approaches revealed that our personalized solutions significantly outperform the baseline and provide more precise uncertainty estimates. The impact on text interpretability and studies of uncertainty are presented as well. The information brought by the developed methods makes it possible to build hybrid models whose effectiveness surpasses classic solutions. In addition, an analysis and visualization of the probabilities of the given decisions were carried out for texts with high annotation entropy and for annotators with mixed views.
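A minimal sketch of the core idea, direct NLL optimization with a conditional flow, is given below. A real model would stack several coupling layers conditioned on the joint text-and-user representation; this single conditional affine transform over a scalar label is illustrative only, and all names are hypothetical.

```python
import math
import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    """One conditional affine layer: shift and log-scale are predicted from
    the conditioning vector (e.g., text + user representation). Training
    minimizes the exact negative log-likelihood; sampling is the inverse map."""
    def __init__(self, cond_dim: int):
        super().__init__()
        self.net = nn.Linear(cond_dim, 2)  # outputs shift and log-scale

    def nll(self, y: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        shift, log_scale = self.net(cond).chunk(2, dim=-1)
        z = (y - shift) * torch.exp(-log_scale)              # label -> base space
        log_pz = -0.5 * z**2 - 0.5 * math.log(2 * math.pi)   # standard normal base
        log_det = -log_scale                                 # log |dz/dy|
        return -(log_pz + log_det).mean()

    def sample(self, cond: torch.Tensor) -> torch.Tensor:
        shift, log_scale = self.net(cond).chunk(2, dim=-1)
        return shift + torch.exp(log_scale) * torch.randn_like(shift)
```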
... These tasks often witness low annotator agreement due to varied text interpretation. While traditional methods prioritize majority consensus, personalized models aim for individual-specific outputs, often yielding superior results [29,30,41]. ...
... This study explores the resilience of personalized prediction architectures in NLP against malicious annotations. Specifically, we tested the robustness of leading personalized prediction architectures, User-ID [29,26] and HuBi-Medium [30,26], against poisoning attacks in aggression and sentiment prediction tasks. Simulated attacks varied in attacker behavior and number. ...
... This strategy endeavors to produce prediction outputs attuned to specific annotators by assimilating their unique characteristics. This can be achieved through various means, including multitask frameworks [13], demographic data inclusion [42,1,29], or direct annotator representation [30,41]. Recent studies advocate for these personalized models' superior efficacy across subjective tasks as well as more objective tasks, such as medical decision-making [3,8]. ...
... Similar works are conducted in Warner and Hirschberg (2012); Silva et al. (2016); Gitari et al. (2015). However, tasks related to the detection of social phenomena, like offensiveness and toxicity, are often subjective in nature (Kocoń et al., 2021). A recent survey among American adults found that, according to half of the participants, "it is hard to know what others might find offensive", and the majority of them acknowledged there were disagreements in what is perceived as sexist or racist (Pew, accessed 2022-12-03). ...
Conference Paper
Full-text available
Subjectivity and difference of opinion are key social phenomena, and it is crucial to take these into account in the annotation and detection of derogatory textual content. In this paper, we use four datasets provided by SemEval-2023 Task 11 and fine-tune a BERT model to capture the disagreement in the annotation. We find that individual annotator modeling and aggregation lowers the Cross-Entropy score by an average of 0.21 compared to direct training on the soft labels. Our findings further demonstrate that annotator metadata contributes an average 0.029 reduction in the Cross-Entropy score.
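The "individual annotator modeling and aggregation" step can be sketched as one classification head per annotator over a shared encoder output, with the per-annotator probabilities averaged into a disagreement-aware soft distribution. This is a generic sketch, not the paper's exact architecture; names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class PerAnnotatorHeads(nn.Module):
    """Sketch: separate classification head per annotator on top of a
    shared text representation; predictions are aggregated into soft labels
    instead of training on the soft labels directly."""
    def __init__(self, num_annotators: int, text_dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(text_dim, num_classes) for _ in range(num_annotators)
        )

    def forward(self, text_emb):
        # (batch, num_annotators, num_classes)
        return torch.stack([head(text_emb) for head in self.heads], dim=1)

    def soft_labels(self, text_emb):
        probs = torch.softmax(self.forward(text_emb), dim=-1)
        return probs.mean(dim=1)  # aggregated soft distribution per text
```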
... Jan Kocoń et al. [4] proposed deep learning approaches that consider both the content and the specificity of a given person. They performed experiments on four datasets, including Wikipedia texts annotated for attack and toxicity, along with ten emotional categories. ...
... Finally, they locate biased statements using established classification algorithms to utilize the data. Jan Kocoń et al. [4] https://ieeexplore.ieee.org/abstract/ ...
... However, there are methods to share a single model between users and learn a unique representation for each person. This representation is combined with the text representation to produce a user-informed prediction [38,22]. Even though these methods mitigate most of the memory-related issues, they still require optimizing the user embeddings, although this is easier than training the entire model. ...
... It is important to note that the sentiment analysis task is much less subjective than the hate speech detection or emotion recognition tasks [22,27], for which personalized methods from the literature achieved much better quality gains when annotator context was added relative to the baseline. For some emotions, up to 40 pp of improvement was reported, while for sentiment analysis, the quality gains for F1 and Acc measures are, respectively: 3pp and 3.1pp on S140, 3.2pp and 3.4pp on IMDB, and 3.2pp and 2.8pp on MHS. ...
Chapter
Full-text available
Data Maps is an interesting method of graphical representation of datasets, which allows observing the model’s behaviour for individual instances in the learning process (training dynamics). The method groups elements of a dataset into easy-to-learn, ambiguous, and hard-to-learn. In this article, we present an extension of this method, Differential Data Maps, which allows you to visually compare different models trained on the same dataset or analyse the effect of selected features on model behaviour. We show an example application of this visualization method to explain the differences between the three personalized deep neural model architectures from the literature and the HumAnn model we developed. The advantage of the proposed HumAnn is that there is no need for further learning for a new user in the system, in contrast to known personalized methods relying on user embedding. All models were tested on the sentiment analysis task. Three datasets that differ in the type of human context were used: user-annotator, user-author, and user-author-annotator. Our results show that with the new explainable AI method, it is possible to pose new hypotheses explaining differences in the quality of model performance, both at the level of features in the datasets and differences in model architectures.
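The training-dynamics statistics behind Data Maps are simple to sketch: per-instance confidence is the mean probability assigned to the gold label across epochs, and variability is its standard deviation. The bucketing thresholds below are hypothetical, chosen only to illustrate the easy/ambiguous/hard split.

```python
import numpy as np

def data_map_stats(gold_probs: np.ndarray):
    """gold_probs[e, i]: probability the model assigns to instance i's
    gold label at epoch e. Returns per-instance confidence and variability."""
    return gold_probs.mean(axis=0), gold_probs.std(axis=0)

def bucket(confidence, variability, conf_thr=0.5, var_thr=0.2):
    """Hypothetical thresholds: high variability -> ambiguous; otherwise
    high confidence -> easy-to-learn, low confidence -> hard-to-learn."""
    return np.where(variability >= var_thr, "ambiguous",
                    np.where(confidence >= conf_thr, "easy-to-learn", "hard-to-learn"))

probs = np.random.rand(6, 100)  # 6 epochs, 100 instances (toy data)
conf, var = data_map_stats(probs)
print(dict(zip(*np.unique(bucket(conf, var), return_counts=True))))
```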
... We treat sentences as a part of the document to train sequential sentence classification models. We also perform a personalized setup for sequence sentence classification in the new CLARIN-Emo dataset, similar to [17,26,13], and find that personalization has a positive impact on model performance in emotion recognition. ...
... where x_n is the count of samples of class n in the training set. Personalized approach: since emotion recognition is a subjective task [17], each annotator could have a different perspective on the same sentence. We decided to combine the SSC model with a personalized approach called UserID, proposed by [26]. ...
Chapter
Full-text available
In this paper, we investigate whether it is possible to automatically annotate texts with ChatGPT or generate both artificial texts and annotations for them. We prepared three collections of texts annotated with emotions at the level of sentences and/or whole documents. CLARIN-Emo contains the opinions of real people, manually annotated by six linguists. Stockbrief-GPT consists of real human articles annotated by ChatGPT. ChatGPT-Emo is an artificial corpus created and annotated entirely by ChatGPT. We present an analysis of these corpora and the results of Transformer-based methods fine-tuned on these data. The results show that manual annotation can provide better-quality data, especially for building personalized models. Keywords: ChatGPT, Emotion recognition, Automatic annotation
... As a concept of contextual and human-centered processing, personalization in NLP was proposed by us and recently extensively explored in [20][21][22][96][97][98][99][100]. Here, we extend it to ChatGPT prompts as personalized in-context processing. ...
Article
Full-text available
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. The first contact with the chatbot reveals its ability to provide detailed and precise answers in various areas. Several publications on ChatGPT evaluation test its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT's capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness, and stance detection. In contrast, the other tasks require more objective reasoning, like word sense disambiguation, linguistic acceptability, and question answering. We also evaluated the GPT-4 model on five selected subsets of NLP tasks. We automated the ChatGPT and GPT-4 prompting process and analyzed more than 49k responses. Our comparison of the results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. For the GPT-4 model, the loss for semantic tasks is significantly lower than for ChatGPT. We showed that the more difficult the task (lower SOTA performance), the higher the ChatGPT loss. This especially refers to pragmatic NLP problems like emotion recognition. We also tested the ability to personalize ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool's usefulness to society and how the learning and validation procedures for such systems should be established.
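The Random Contextual Few-Shot Personalization idea can be sketched as sampling a few of the target user's own past annotations and prepending them to the query prompt. The exact prompt wording and sampling procedure used in the paper are not reproduced here; this is a plausible reconstruction with hypothetical names.

```python
import random

def personalized_prompt(user_history, query_text, k=3, seed=0):
    """Sample k random (text, label) pairs previously annotated by the same
    user and prepend them as few-shot context for the new query."""
    rng = random.Random(seed)
    shots = rng.sample(user_history, k=min(k, len(user_history)))
    lines = [f"Text: {t}\nLabel: {l}" for t, l in shots]
    lines.append(f"Text: {query_text}\nLabel:")
    return "\n\n".join(lines)

history = [("Great movie!", "positive"), ("Awful plot.", "negative"),
           ("Meh.", "neutral"), ("Loved it.", "positive")]
print(personalized_prompt(history, "Not my kind of film."))
```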