Jan Kocoń

Wrocław University of Science and Technology | WUT · Department of Artificial Intelligence

PhD
R&D in AI, NLP & LLMs | Scientific Director @ PLLuM | Lead Data Scientist @ CLARIN | Assistant Professor @ Wroclaw Tech

About

99
Publications
25,467
Reads
1,700
Citations
Introduction
I am currently working on advanced personalization models based on deep learning in subjective natural language processing tasks such as recognizing emotions, sentiment, hate speech, humor, etc. in text. I also work on cross-lingual knowledge transfer problems and applications of language-agnostic models. My work is carried out mainly within the projects: CLARIN-PL (clarin-pl.eu) and CLARIN-PL-Biz (clarin.biz).

Publications (99)
Conference Paper
Full-text available
This paper presents a comprehensive study of sentiment analysis for Polish book reviews through the creation of a novel, manually annotated dataset and the evaluation of various language models. We introduce a detailed sentiment annotation scheme, addressing challenges encountered during the annotation process, and evaluate model performance on sen...
Article
Full-text available
The rapid evolution of large language models, in particular OpenAI’s GPT-3.5-turbo and GPT-4, indicates a growing interest in advanced computational methodologies. This paper proposes a novel approach to synthetic data generation and knowledge distillation through prompt engineering. The potential of large language models (LLMs) is used to address...
Preprint
Full-text available
We address the main problem of self-learning LLM: the question of what to learn. We propose a self-learning LLM framework that enables an LLM to independently learn previously unknown knowledge through self-assessment of their own hallucinations. Using the hallucination score, we introduce a new concept of Points in The Unknown (PiUs), along with o...
Preprint
Full-text available
Large language models (LLMs) have significantly advanced Natural Language Processing (NLP) tasks in recent years. However, their universal nature poses limitations in scenarios requiring personalized responses, such as recommendation systems and chatbots. This paper investigates methods to personalize LLMs, comparing fine-tuning and zero-shot reaso...
Preprint
Full-text available
Large language models are experiencing a significant surge of attention and rapid development. It is happening mainly due to the publication of OpenAI's ChatGPT models: GPT3.5-turbo and GPT-4. This article uses prompt engineering to present an innovative approach to synthetic data generation and knowledge distillation. Specifically, we focus on thr...
Preprint
Full-text available
The performance of machine learning models is closely linked to the quality of training data, underpinning the ‘garbage in, garbage out’ principle. Label noise in datasets is a key challenge in training and evaluation. This study introduces two innovative ChatGPT-based methods, ChatGPT-Predict and ChatGPT-Detect, for effective noise detection in la...
Preprint
Full-text available
The development of large language models, such as ChatGPT (GPT-3.5) and GPT-4, has revolutionized natural language processing (NLP) and opened up new possibilities in various fields. These models demonstrate remarkable capabilities in generating coherent and contextually relevant text, making them suitable for a wide range of applications. This wor...
Conference Paper
Full-text available
Sentiment analysis involves using WordNets enriched with emotional metadata, which are valuable resources. However, manual annotation is time-consuming and expensive, resulting in only a few WordNet Lexical Units being annotated. This paper introduces two new techniques for automatically propagating sentiment annotations from a partially annotated...
Conference Paper
Full-text available
Designing predictive models for subjective problems in natural language processing (NLP) remains challenging. This is mainly due to its non-deterministic nature and different perceptions of the content by different humans. It may be solved by Personalized Natural Language Processing (PNLP), where the model exploits additional information about the...
Conference Paper
Full-text available
Data annotated by humans is a source of knowledge by describing the peculiarities of the problem and therefore fueling the decision process of the trained model. Unfortunately, the annotation process for subjective natural language processing (NLP) problems like offensiveness or emotion detection is often very expensive and time-consuming. One of t...
Conference Paper
Full-text available
In the era of artificial intelligence, data is gold but costly to annotate. The paper demonstrates a groundbreaking solution to this dilemma using ChatGPT for text augmentation in sentiment analysis. We leverage ChatGPT's generative capabilities to create synthetic training data that significantly improves the performance of smaller models, making...
Conference Paper
Full-text available
This article compiles research on the extraction of human characteristics using three different methods: questionnaires, annotations, and biases. We have performed an analysis of how personalized perception of texts is affected by individual human profile and bias. To acquire comprehensive knowledge about individual user preferences, we have gath...
Chapter
Full-text available
In this paper, we investigate whether it is possible to automatically annotate texts with ChatGPT or generate both artificial texts and annotations for them. We prepared three collections of texts annotated with emotions at the level of sentences and/or whole documents. CLARIN-Emo contains the opinions of real people, manually annotated by six ling...
Chapter
Full-text available
Data Maps is an interesting method of graphical representation of datasets, which allows observing the model’s behaviour for individual instances in the learning process (training dynamics). The method groups elements of a dataset into easy-to-learn, ambiguous, and hard-to-learn. In this article, we present an extension of this method, Differential...
Article
Full-text available
Some tasks in content processing, e.g., natural language processing (NLP), like hate or offensive speech and emotional or funny text detection, are subjective by nature. Each human may perceive some content individually. The existing reasoning methods commonly rely on agreed output values, the same for all recipients. We propose fundamentally diffe...
Article
Full-text available
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. The first contact with the chatbot reveals its ability to provide detailed and precise answers in various areas. Several publications on ChatGPT evaluation test its effectiveness on well-kn...
Preprint
Full-text available
Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transforme...
Article
Full-text available
In this paper, we propose a sentiment analysis of Twitter data focused on the attitudes and sentiments of Polish migrants and stayers during the pandemic. We collected 9 million tweets and retweets between January and August 2021, and analysed them using MultiEmo, the multilingual, multilevel, multi-domain sentiment analysis corpus. We discovered t...
Preprint
Full-text available
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. The first contact with the chatbot reveals its ability to provide detailed and precise answers in various areas. Several publications on ChatGPT evaluation test its effectiveness on well-kn...
Conference Paper
Full-text available
Researchers in Natural Language Processing (NLP) and recommendation systems typically train machine learning models on large corpora. In many cases, the corpus is constructed using annotations from a third-party, such as crowd-sourced workers, volunteers, or real users of the social networking services. This opens the possibility of malicious agent...
Conference Paper
Full-text available
In recognizing hate speech in text, a frequently overlooked aspect is the specific recipient of the content. Information about the user can be considered as another potential modality in addition to the textual representation. In this work, we present the multi-modal hate speech detection problem as a task of personalized prediction based on text a...
Article
Full-text available
In this paper, we propose a sentiment analysis of Twitter data focused on the attitudes and sentiments of Polish migrants and stayers during the pandemic. We collected 9 million tweets and retweets between January and August 2021, and analysed them using MultiEmo, the multilingual, multilevel, multi-domain sentiment analysis corpus. We discovered t...
Conference Paper
Full-text available
For subjective NLP problems, such as classification of hate speech, aggression, or emotions, personalized solutions can be exploited. Then, the learned models infer about the perception of the content independently for each reader. To acquire training data, texts are commonly randomly assigned to users for annotation, which is expensive and highly...
Conference Paper
Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transforme...
Conference Paper
Full-text available
The paper addresses the important problem of multilingual and language-agnostic approaches to the aspect-based sentiment analysis (ABSA) task, using modern approaches based on transformer models. We propose a new dataset based on automatic translation of the Polish AspectEmo dataset together with cross-lingual transfer of tags describing aspect pol...
Conference Paper
Full-text available
Neuro-symbolic approaches explore ways to combine neural networks with traditional symbolic knowledge. These methods are gaining attention due to their efficiency and the requirement of fewer data compared to currently used deep models. This work investigated several neuro-symbolic models for sentiment analysis focusing on a variety of ways to add...
Conference Paper
Full-text available
Transformer models like BERT have significantly improved performance on many NLP tasks, e.g., sentiment analysis. However, their large number of parameters makes real-world applications difficult because of computational costs and latency. Many compression methods have been proposed to solve this problem using quantization, weight pruning, and know...
Conference Paper
Full-text available
As humans, we experience a wide range of feelings and reactions. One of these is laughter, often related to a personal sense of humor and the perception of funny content. Due to its subjective nature, recognizing humor in NLP is a very challenging task. Here, we present a new approach to the task of predicting humor in the text by applying the idea...
Conference Paper
Full-text available
In this article, we discuss the conditions surrounding the building of historical and literary corpora. We describe the assumptions and method of making the original corpus of the Polish novel (1864-1939). Then, we present the research procedure aimed at demonstrating the variability of the emotional value of the concept of “the city” and “the coun...
Chapter
Full-text available
In this work, we present an advanced semantic search engine dedicated to travel offers, allowing the user to create queries in the Natural Language. We started with the Polish language in focus. Search for e-commerce requires a different set of methods and algorithms than search for travel, search for corporate documents, for law documents, for med...
Chapter
Full-text available
We carried out extensive experiments on the MultiEmo dataset for sentiment analysis with texts in eleven languages. Two adapted versions of the LaBSE deep architecture were confronted against the LASER model. That allowed us to conduct cross-language validation of these language agnostic methods. The achieved results proved that LaBSE embeddings wi...
Chapter
Full-text available
In this paper, we present paragraph segmentation using cross-lingual knowledge transfer models. In our solution, we investigate the quality of multilingual models, such as mBERT and XLM-RoBERTa, as well as language independent models, LASER and LaBSE. We study the quality of segmentation in 9 different European languages, both for each language sep...
Conference Paper
Full-text available
A unified gold standard commonly exploited in natural language processing (NLP) tasks requires high inter-annotator agreement. However, there are many subjective problems that should respect users' individual points of view. Therefore, in this paper, we evaluate three different personalized methods on the task of hate speech detection. The user-cente...
Chapter
Full-text available
We propose and test multiple neuro-symbolic methods for sentiment analysis. They combine deep neural networks – transformers and recurrent neural networks – with external knowledge bases. We show that for simple models, adding information from knowledge bases significantly improves the quality of sentiment prediction in most cases. For medium-sized...
Preprint
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is v...
Chapter
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by organizers, using available data and are evaluated according to pre-established procedures. It has been organized since 2017 and each year the winning systems become the state-of-...
Chapter
Full-text available
In this article we present extended results obtained on the multidomain dataset of Polish text reviews collected within the Sentimenti project. We present preliminary results of classification models trained and tested on 7,000 texts annotated by over 20,000 individuals using valence, arousal, and eight basic emotions from Plutchik’s model. Additio...
Conference Paper
Full-text available
Humans' emotional perception is subjective by nature, in which each individual could express different emotions regarding the same textual content. Existing datasets for emotion analysis commonly depend on a single ground truth per data sample, derived from majority voting or averaging the opinions of all annotators. In this paper, we introduce a n...
Chapter
Full-text available
The aim of this paper is to investigate the applicability of language models to the problem of lexical substitution in a strongly inflected language. For this purpose, we focus on pre-trained models based on transformer architectures, in particular BERT and BART. We present a solution in the form of the BART-based sequence-to-sequence model. Then w...
Chapter
Full-text available
We developed and validated a language-agnostic method for sentiment analysis. Cross-language experiments carried out on the new MultiEmo dataset with texts in 11 languages proved that LaBSE embeddings with an additional attention layer implemented in the BiLSTM architecture outperformed other methods in most cases. Keywords: Cross-language NLP, Sentimen...
Article
Some tasks in content processing, e.g., natural language processing (NLP) like hate or offensive speech, emotional or funny texts detection are subjective by nature. Each human may perceive some content in their own individual way. The existing reasoning methods commonly rely on agreed output values, the same for all recipients. We propose fundamen...
Article
Full-text available
Emotion lexicons are useful in research across various disciplines, but the availability of such resources remains limited for most languages. While existing emotion lexicons typically comprise words, it is a particular meaning of a word (rather than the word itself) that conveys emotion. To mitigate this issue, we present the Emotion Meanings data...
Conference Paper
Full-text available
Aspect-based sentiment analysis (ABSA) is a text analysis method that categorizes data by aspects and identifies the sentiment assigned to each aspect. Aspect-based sentiment analysis can be used to analyze customer opinions by associating specific sentiments with different aspects of a product or service. Most of the work in this topic is thorough...
Conference Paper
Full-text available
Many tasks in natural language processing like offensive, toxic, or emotional text classification are subjective by nature. Humans tend to perceive textual content in their own individual way. Existing methods commonly rely on the agreed output values, the same for all consumers. Here, we propose personalized solutions to subjective tasks. Our four...
Conference Paper
Full-text available
Many publications prove that the creation of a multiobjective machine learning model is possible and reasonable. Moreover, we can see significant gains in expanding the knowledge domain, increasing prediction quality, and reducing the inference time. New developments in cross-lingual knowledge transfer open up a range of possibilities, particularly...
Article
Full-text available
Analysis of subjective texts like offensive content or hate speech is a great challenge, especially regarding the annotation process. Most current annotation procedures are aimed at achieving a high level of agreement in order to generate a high quality reference source. However, the annotation guidelines for subjective content may restrict the anno...
Conference Paper
Full-text available
Analysis of emotions elicited by opinions, comments, or articles commonly exploits annotated corpora, in which the labels assigned to documents average the views of all annotators, or represent a majority decision. The models trained on such data are effective at identifying the general views of the population. However, their usefulness for predict...
Conference Paper
Full-text available
There is content such as hate speech, offensive, toxic or aggressive documents, which are perceived differently by their consumers. They are commonly identified using classifiers solely based on textual content that generalize pre-agreed meanings of difficult problems. Such models provide the same results for each user, which leads to high misclass...
Chapter
Full-text available
This article presents MultiEmo, a new benchmark data set for the multilingual sentiment analysis task including 11 languages. The collection contains consumer reviews from four domains: medicine, hotels, products and university. The original reviews in Polish contained 8,216 documents consisting of 57,466 sentences. The reviews were manually annota...
Article
Full-text available
In this article we extend a WordNet structure with relations linking synsets to Desikan’s brain regions. Based on lexicographer files and WordNet Domains the mapping goes from synset semantic categories to behavioural and cognitive functions and then directly to brain lobes. A human brain connectome (HBC) adjacency matrix was utilised to capture tr...
Article
Full-text available
Multi-task learning (MTL) has been successfully utilized in numerous NLP tasks, including sequence labeling. In this work, we utilize three transformer-based models (XLM-R, HerBERT, mBERT) to improve recognition quality using MTL for selected low-resource language (Polish) and three disjoint sequence labeling tasks with different levels of inter-an...
Data
Presentation for the article: Propagation of emotions, arousal and polarity in WordNet using Heterogeneous Structured Synset Embeddings
Article
Full-text available
In this article, we present a novel technique for the use of language-agnostic sentence representations to adapt the model trained on texts in Polish (as a low-resource language) to recognize polarity in texts in other (high-resource) languages. The first model focuses on the creation of a language-agnostic representation of each sentence. The seco...
Conference Paper
Full-text available
In this article we present an extended version of PolEmo – a corpus of consumer reviews from 4 domains: medicine, hotels, products and school. Current version (PolEmo 2.0) contains 8,216 reviews having 57,466 sentences. Each text and sentence was manually annotated with sentiment in 2+1 scheme, which gives a total of 197,046 annotations. We obtaine...
Article
Full-text available
Sentiment analysis is a hot research topic of Natural Language Processing with its main focus on emotive analysis of textual opinions. The task of sentiment recognition is highly domain-dependent, thus, there is a great need for designing the methods with decent domain adaptation abilities. In this paper we present a brief overview of existing data...
Conference Paper
Full-text available
In this article, we present a novel multi-domain dataset of Polish text reviews, annotated with sentiment on different levels: sentences and the whole documents. The annotation was made by linguists in a 2+1 scheme (with inter-annotator agreement analysis). We present a preliminary approach to the classification of labelled data using logistic regr...
Conference Paper
Full-text available
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by organizers, using available data and are evaluated according to pre-established procedures. It has been organized since 2017 and each year the winning systems become the state-of-...
Conference Paper
Full-text available
In this paper we present a novel method for emotive propagation in a wordnet based on a large emotive seed. We introduce a sense-level emotive lexicon annotated with polarity, arousal and emotions. The data were annotated as a part of a large study involving over 20,000 participants. A total of 30,000 lexical units in Polish WordNet were described...
Preprint
Full-text available
This article introduces the issue of recognition and normalisation of temporal expressions for the Polish language. We describe what temporal information is and we present TimeML specification, adapted to Polish as a model for the description of temporal expressions. Classes of temporal expressions are presented as well as guidelines for annotation...
Conference Paper
Full-text available
This article presents the research in the recognition and normalization of Polish temporal expressions as the result of the first PolEval 2019 shared task. Temporal information extracted from the text plays a significant role in many information extraction systems, like question answering, event recognition or text summarization. A specification fo...
Chapter
Full-text available
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by organizers, using available data and are evaluated according to pre-established procedures. It has been organized since 2017 and each year the winning systems become the state-of-...
Conference Paper
Full-text available
In this article, we present a novel multidomain dataset of Polish text reviews. The data were annotated as part of a large study involving over 20,000 participants. A total of 7,000 texts were described with metadata, each text received about 25 annotations concerning polarity, arousal and eight basic emotions, marked on a multilevel scale. We pres...
Article
Full-text available
The article introduces a new set of Polish word embeddings, built using KGR10 corpus, which contains more than 4 billion words. These embeddings are evaluated in the problem of recognition of temporal expressions (timexes) for the Polish language. We described the process of KGR10 corpus creation and a new approach to the recognition problem using...
Preprint
Full-text available
The article introduces a new set of Polish word embeddings, built using KGR10 corpus, which contains more than 4 billion words. These embeddings are evaluated in the problem of recognition of temporal expressions (timexes) for the Polish language. We described the process of KGR10 corpus creation and a new approach to the recognition problem using...
Conference Paper
Full-text available
In the paper we present two systems for named entities recognition for Polish submitted to PolEval 2018 competition (Task 2). The first one, called Liner2, utilizes Conditional Random Fields with a rich set of features. The other one, called PolDeepNer, is an ensemble of three neural networks using a Bi-directional Long Short-Term Memory (Bi-LSTM)...
Presentation
Full-text available
In the paper we present two systems for named entities recognition for Polish submitted to PolEval 2018 competition (Task 2). The first one, called Liner2, utilizes Conditional Random Fields with a rich set of features. The other one, called PolDeepNer, is an ensemble of three neural networks using a Bi-directional Long Short-Term Memory (Bi-LSTM)...
Conference Paper
Full-text available
In this paper we present a novel approach to the construction of an extensive, sense-level sentiment lexicon built on the basis of a wordnet. The main aim of this work is to create a high-quality sentiment lexicon in a partially automated way. We propose a method called Classifier-based Polarity Propagation, which utilises a very rich set of wordne...
Conference Paper
Full-text available
In this paper we present a comprehensive overview of recent methods of sentiment propagation in a wordnet. Next, we propose a fully automated method called Classifier-based Polarity Propagation, which utilises a very rich set of features, most of which are based on wordnet relation types, multi-level bag-of-synsets and bag-of-polarities....
Conference Paper
Full-text available
In this paper we present our attempts in the PolEval 2017 Sentiment Analysis Task. The task is not only one of the first challenges in sentiment analysis focused on Polish language, but also represents a novel approach to sentiment analysis, namely, predicting the sentiment not of a sentence, or a document, but of a word or a phrase within the cont...
Conference Paper
Full-text available
We present a large emotive lexicon of Polish which has been constructed by manual expansion of the emotive annotation defined for plWordNet 3.0 emo (a very large wordnet of Polish). The annotation encompasses: sentiment polarity, basic emotions and fundamental human values. Annotation scheme and revised guidelines for the annotation process are dis...
Conference Paper
Full-text available