Jan Kocoń

Jan Kocoń
Wroclaw University of Science and Technology | WUT · Department of Artificial Intelligence

PhD

About

58
Publications
9,684
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
321
Citations
Introduction
I am currently working on advanced personalization models based on deep learning in subjective natural language processing tasks such as recognizing emotions, sentiment, hate speech, humor, etc. in text. I also work on cross-lingual knowledge transfer problems and applications of language-agnostic models. My work is carried out mainly within the projects: CLARIN-PL (clarin-pl.eu) and CLARIN-PL-Biz (clarin.biz).

Publications

Publications (58)
Preprint
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is v...
Article
Full-text available
Emotion lexicons are useful in research across various disciplines, but the availability of such resources remains limited for most languages. While existing emotion lexicons typically comprise words, it is a particular meaning of a word (rather than the word itself) that conveys emotion. To mitigate this issue, we present the Emotion Meanings data...
Conference Paper
Aspect-based sentiment analysis (ABSA) is a text analysis method that categorizes data by aspects and identifies the sentiment assigned to each aspect. Aspect-based sentiment analysis can be used to analyze customer opinions by associating specific sentiments with different aspects of a product or service. Most of the work in this topic is thorough...
Conference Paper
Many tasks in natural language processing like offensive, toxic, or emotional text classification are subjective by nature. Humans tend to perceive textual content in their own individual way. Existing methods commonly rely on the agreed output values, the same for all consumers. Here, we propose personalized solutions to subjective tasks. Our four...
Conference Paper
Many publications prove that the creation of a multiobjective machine learning model is possible and reasonable. Moreover, we can see significant gains in expanding the knowledge domain, increasing prediction quality, and reducing the inference time. New developments in cross-lingual knowledge transfer open up a range of possibilities, particularly...
Article
Full-text available
Analysis of subjective texts like offensive content or hate speech is a great challenge, especially regarding annotation process. Most of current annotation procedures are aimed at achieving a high level of agreement in order to generate a high quality reference source. However, the annotation guidelines for subjective content may restrict the anno...
Conference Paper
Full-text available
Analysis of emotions elicited by opinions, comments, or articles commonly exploits annotated corpora, in which the labels assigned to documents average the views of all annotators, or represent a majority decision. The models trained on such data are effective at identifying the general views of the population. However, their usefulness for predict...
Conference Paper
Full-text available
There is content such as hate speech, offensive, toxic or aggressive documents, which are perceived differently by their consumers. They are commonly identified using classifiers solely based on textual content that generalize pre-agreed meanings of difficult problems. Such models provide the same results for each user, which leads to high misclass...
Chapter
This article presents MultiEmo, a new benchmark data set for the multilingual sentiment analysis task including 11 languages. The collection contains consumer reviews from four domains: medicine, hotels, products and university. The original reviews in Polish contained 8,216 documents consisting of 57,466 sentences. The reviews were manually annota...
Article
Full-text available
In this article we extend a WordNet structure with relations linking synsets to Desikan’s brain regions. Based on lexicographer files and WordNet Domains the mapping goes from synset semantic categories to behavioural and cognitive functions and then directly to brain lobes. A human brain connectome (HBC) adjacency matrix was utilised to capture tr...
Article
Full-text available
Multi-task learning (MTL) has been successfully utilized in numerous NLP tasks, including sequence labeling. In this work, we utilize three transformer-based models (XLM-R, HerBERT, mBERT) to improve recognition quality using MTL for selected low-resource language (Polish) and three disjoint sequence labeling tasks with different levels of inter-an...
Data
Presentation for the article: Propagation of emotions, arousal and polarity in WordNet using Heterogeneous Structured Synset Embeddings
Article
Full-text available
In this article, we present a novel technique for the use of language-agnostic sentence representations to adapt the model trained on texts in Polish (as a low-resource language) to recognize polarity in texts in other (high-resource) languages. The first model focuses on the creation of a language-agnostic representation of each sentence. The seco...
Conference Paper
Full-text available
In this article we present an extended version of PolEmo – a corpus of consumer reviews from 4 domains: medicine, hotels, products and school. Current version (PolEmo 2.0) contains 8,216 reviews having 57,466 sentences. Each text and sentence was manually annotated with sentiment in 2+1 scheme, which gives a total of 197,046 annotations. We obtaine...
Article
Full-text available
Sentiment analysis is a hot research topic of Natural Language Processing with its main focus on emotive analysis of textual opinions. The task of sentiment recognition is highly domain-dependent, thus, there is a great need for designing the methods with decent domain adaptation abilities. In this paper we present a brief overview of existing data...
Conference Paper
Full-text available
In this article, we present a novel multi-domain dataset of Polish text reviews, annotated with sentiment on different levels: sentences and the whole documents. The annotation was made by linguists in a 2+1 scheme (with inter-annotator agreement analysis). We present a preliminary approach to the classification of labelled data using logistic regr...
Conference Paper
Full-text available
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by organizers, using available data and are evaluated according to pre-established procedures. It is organized since 2017 and each year the winning systems become the state-of-...
Conference Paper
Full-text available
In this paper we present a novel method for emotive propagation in a wordnet based on a large emotive seed. We introduce a sense-level emotive lexicon annotated with polarity, arousal and emotions. The data were annotated as a part of a large study involving over 20,000 participants. A total of 30,000 lexical units in Polish WordNet were described...
Preprint
Full-text available
This article introduces the issue of recognition and normalisation of temporal expressions for the Polish language. We describe what temporal information is and we present TimeML specification, adapted to Polish as a model for the description of temporal expressions. Classes of temporal expressions are presented as well as guidelines for annotation...
Conference Paper
Full-text available
This article presents the research in the recognition and normalization of Polish temporal expressions as the result of the first PolEval 2019 shared task. Temporal information extracted from the text plays a significant role in many information extraction systems, like question answering, event recognition or text summarization. A specification fo...
Chapter
Full-text available
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by organizers, using available data and are evaluated according to pre-established procedures. It is organized since 2017 and each year the winning systems become the state-of-...
Conference Paper
Full-text available
In this article, we present a novel multidomain dataset of Polish text reviews. The data were annotated as part of a large study involving over 20,000 participants. A total of 7,000 texts were described with metadata, each text received about 25 annotations concerning polarity, arousal and eight basic emotions, marked on a multilevel scale. We pres...
Article
Full-text available
The article introduces a new set of Polish word embeddings, built using KGR10 corpus, which contains more than 4 billion words. These embeddings are evaluated in the problem of recognition of temporal expressions (timexes) for the Polish language. We described the process of KGR10 corpus creation and a new approach to the recognition problem using...
Preprint
Full-text available
The article introduces a new set of Polish word embeddings, built using KGR10 corpus, which contains more than 4 billion words. These embeddings are evaluated in the problem of recognition of temporal expressions (timexes) for the Polish language. We described the process of KGR10 corpus creation and a new approach to the recognition problem using...
Conference Paper
Full-text available
In the paper we present two systems for named entities recognition for Polish submitted to PolEval 2018 competition (Task 2). The first one, called Liner2, utilizes Conditional Random Fields with a rich set of features. The other one, called PolDeepNer, is an ensemble of three neural networks using a Bi-directional Long Short-Term Memory (Bi-LSTM)...
Presentation
Full-text available
In the paper we present two systems for named entities recognition for Polish submitted to PolEval 2018 competition (Task 2). The first one, called Liner2, utilizes Conditional Random Fields with a rich set of features. The other one, called PolDeepNer, is an ensemble of three neural networks using a Bi-directional Long Short-Term Memory (Bi-LSTM)...
Conference Paper
Full-text available
In this paper we present a novel approach to the construction of an extensive, sense-level sentiment lexicon built on the basis of a wordnet. The main aim of this work is to create a high-quality sentiment lexicon in a partially automated way. We propose a method called Classifier-based Polarity Propagation, which utilises a very rich set of wordne...
Conference Paper
Full-text available
In this paper we present a comprehensive overview of recent methods of the sentiment propagation in a wordnet. Next, we propose a fully automated method called Classifier-based Polarity Propagation , which utilises a very rich set of features , where most of them are based on wordnet relation types, multi-level bag-of-synsets and bag-of-polarities....
Conference Paper
Full-text available
In this paper we present our attempts in the PolEval 2017 Sentiment Analysis Task. The task is not only one of the first challenges in sentiment analysis focused on Polish language, but also represents a novel approach to sentiment analysis, namely, predicting the sentiment not of a sentence, or a document, but of a word or a phrase within the cont...
Conference Paper
Full-text available
We present a large emotive lexicon of Polish which has been constructed by manual expansion of the emotive annotation defined for plWordNet 3.0 emo (a very large wordnet of Polish). The annotation encompasses: sentiment polarity, basic emotions and fundamental human values. Annotation scheme and revised guidelines for the annotation process are dis...
Conference Paper
Full-text available
In this article we present the result of the research on the recognition of genuine Polish suicide notes (SNs). We provide useful method to distinguish between SNs and other types of discourse, including counterfeited SNs. The method uses a wide range of word-based and semantic features and it was evaluated using Polish Corpus of Suicide Notes, whi...
Article
A key challenge of the Information Extraction in Natural Language Processing is the ability to recognise and classify temporal expressions (timexes). It is a crucial source of information about when something happens, how often something occurs or how long something lasts. Timexes extracted automatically from text, play a major role in many Informa...
Conference Paper
Full-text available
In the paper we present an adaptation of Liner2 framework to solve the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.
Data
In the paper we present an adaptation of Liner2 framework to solve the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.
Presentation
Full-text available
In the paper we present an adaptation of Liner2 framework to solve the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.
Presentation
Full-text available
Tematem wystąpienia będą narzędzia i zasoby do ekstrakcji informacji z tekstu opracowane i udostępniane poprzez Centrum Technologii Językowych CLARIN-PL. Zostaną zaprezentowane narzędzia do automatycznego rozpoznawania odniesień (w tym nazw własnych, wyrażeń temporalnych oraz wyznaczników sytuacji), normalizacji wyrażeń temporalnych (normalizacja l...
Conference Paper
Full-text available
In this article we present the result of the recent research in the recognition of events in Polish. Event recognition plays a major role in many natural language processing applications such as question answering or automatic summarization. We adapted TimeML specification (the well known guideline for English) to Polish language. We annotated 540...
Presentation
Full-text available
Tematem wystąpienia będą narzędzia i zasoby do ekstrakcji określonych informacji z tekstów. zaprezentowane zostaną trzy zagadnienia: wykrywanie wyrażeń przestrzennych, wykrywanie wyznaczników sytuacji oraz wykrywanie czasowników z podmiotem domyślnym w kontekście zadania rozpoznawania koreferencji. Dla każdego z zagadnień zostanie omówiony zakres r...
Presentation
Full-text available
Tematem wystąpienia będzie zagadnienie automatycznego rozpoznawania jednostek identyfikacyjnych (głównie nazw własnych oraz przymiotników pochodzących od nazw własnych) oraz wyrażeń temporalnych w tekstach. Zostanie przedstawiony zakres rozpoznawanych informacji (definicja i kategorie jednostek identyfikacyjnych oraz wyrażeń temporalnych) oraz zost...
Article
Full-text available
p> Temporal Expressions in Polish Corpus KPWr This article presents the result of the recent research in the interpretation of Polish expressions that refer to time. These expressions are the source of information when something happens, how often something occurs or how long something lasts. Temporal information, which can be extracted from te...
Article
Full-text available
p> Towards an event annotated corpus of Polish The paper presents a typology of events built on the basis of TimeML specification adapted to Polish language. Some changes were introduced to the definition of the event categories and a motivation for event categorization was formulated. The event annotation task is presented on two levels – onto...
Conference Paper
Full-text available
In this article we present the result of the recent research in the recognition of Polish temporal expressions. The temporal information extracted from the text plays major role in many information extraction systems, like question answering, event recognition or discourse analysis. We prepared a broad description of Polish temporal expressions, ca...
Article
Full-text available
In the paper we present an extended version of the graph-based unsupervised Word Sense Disambiguation algorithm. The algorithm is based on the spreading activation scheme applied to the graphs dynamically built on the basis of the text words and a large wordnet. The algorithm, originally proposed for English and Princeton WordNet, was adapted to Po...
Conference Paper
Full-text available
Polish named entities are mostly out-of-vocabulary words, i.e. they are not described in morphological lexicons, and their proper analysis by Polish morphological analysers is difficult.The existing approaches to guessing unknown word lemmas and descriptions do not provide results on a satisfactory level. Moreover, lemmatisation of multiword named...
Article
Full-text available
In the paper we present a customizable and open-source framework for proper names recognition called Liner2. The framework consists of several universal methods for sequence chunking which include: dictionary look-up, pattern matching and statistical processing. The statistical processing is performed using Conditional Random Fields and a rich set...
Conference Paper
Full-text available
Many text processing tasks require to recognize and classify Named Entities. Currently available morphological analysers for Polish cannot handle unknown words (not included in analyser's lexicon). Polish is a language with rich inflection, so comparing two words (even having the same lemma) is a non-trivial task. The aim of the similarity function...
Conference Paper
Full-text available
The aim of this paper is to present a system for semantic text annotation called Inforex. Inforex is a web-based system designed for managing and annotating text corpora on the semantic level including annotation of Named Entities (NE), anaphora, Word Sense Disambiguation (WSD) and relations between named entities. The system also supports manual t...

Projects

Projects (4)
Project
Liner2 (Marcińczuk, et al., 2013), is a generic framework which can be used to solve various tasks based on sequence labeling, i.e. recognition of named entities, temporal expressions, mentions of events. It provides a set of modules (based on statistical models, dictionaries, rules and heuristics) which recognize and annotate certain types of phrases. The framework was already used for recognition of named entities (different levels of granularity, including boundaries, coarse- and fine-grained categories) (Marcińczuk, et al, 2012), temporal expressions (Kocoń and Marcińczuk, 2016) and event mentions (Kocoń and Marcińczuk, 2016) for Polish.
Project
Development of a web system for text corpora construction. Inforex allows parallel access and sharing resources among many users. The system assists semantic annotation of texts on several levels, such as marking text references, creating new references, or marking word senses.