About
91 Publications
13,467 Reads
561 Citations
Introduction
I mainly work on information extraction from text for Polish, including named entities, semantic relations between named entities, temporal expressions, spatial expressions and events. I am also involved in work on sentiment analysis, plagiarism detection, and language error detection.
Publications
This study examines transformer-based models and their effectiveness in named entity recognition tasks. The study investigates data representation strategies, including single, merged, and context, which respectively use one sentence, multiple sentences, and sentences joined with attention to context per vector. Analysis shows that training models...
This study leverages transformer-based models and focuses on data representation strategies in the named entity recognition task, including "single" (one sentence per vector), "merged" (multiple sentences per vector), and "context" (sentences joined with attention to context). Performance analysis reveals that models trained with a single strategy...
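As an illustration only, the three data representation strategies named above could be sketched as follows. The abstract is truncated and does not give exact definitions, so the function names, the `max_tokens` budget, the `window` parameter, and the boolean mask that marks the target sentence are all assumptions made for this sketch, not the authors' implementation.

```python
from typing import List, Tuple

Sentence = List[str]  # a sentence is a list of tokens


def single(sentences: List[Sentence]) -> List[Sentence]:
    """'single': one sentence per input sequence."""
    return [list(sent) for sent in sentences]


def merged(sentences: List[Sentence], max_tokens: int) -> List[Sentence]:
    """'merged': greedily pack consecutive sentences into one sequence.

    Assumption: a sequence is closed when adding the next sentence would
    exceed max_tokens. A sentence longer than max_tokens still gets its
    own (over-long) sequence; truncation is left to the caller.
    """
    sequences: List[Sentence] = []
    current: Sentence = []
    for sent in sentences:
        if current and len(current) + len(sent) > max_tokens:
            sequences.append(current)
            current = []
        current = current + list(sent)
    if current:
        sequences.append(current)
    return sequences


def context(sentences: List[Sentence], window: int = 1) -> List[Tuple[Sentence, List[bool]]]:
    """'context': each sentence is joined with its neighbours.

    Assumption: up to `window` sentences on each side are added as context,
    and the mask is True only for tokens of the target sentence, so that
    labels are predicted for those tokens alone.
    """
    out = []
    for i, sent in enumerate(sentences):
        left = [tok for s in sentences[max(0, i - window):i] for tok in s]
        right = [tok for s in sentences[i + 1:i + 1 + window] for tok in s]
        tokens = left + list(sent) + right
        mask = [False] * len(left) + [True] * len(sent) + [False] * len(right)
        out.append((tokens, mask))
    return out
```

Under these assumptions, "single" and "context" yield one training example per sentence, while "merged" yields fewer, longer examples.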
Presentation of PolEval 2022 Task 2: Abbreviation disambiguation
Task: Decide if a given phrase is an abbreviation and, if so, give its expanded forms — base and inflected.
Without relying on any user reports, we use algorithmic detection and Bayesian statistical methods to analyse two large data streams (329 k users) of Reddit content to study the correlations between username toxicity (of various types, such as offensive or sexually explicit) and their online toxic behavior (personal attacks, sexual harassment among...
Samurai Labs specializes in developing information technologies to combat cyberbullying on the Internet.
Our main areas of interest are text processing for detecting personal attacks, hate speech, suicidal content, and pedophilia.
We carry out our tasks by combining symbolic methods with machine learning methods...
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by organizers, using available data, and are evaluated according to pre-established procedures. It has been organized since 2017, and each year the winning systems become the state-of-...
In this paper, we study the language used by suicidal users on the Reddit social media platform. To do that, we first collect a large-scale dataset of Reddit posts and annotate it with highly trained and expert annotators under a rigorous annotation scheme. Next, we perform a multifaceted analysis of the dataset, including: (1) the analysis of user activ...
To solve the punctuation restoration task, we build a unary classifier based on a neural network (NN). The NN was trained to predict whether or not a punctuation mark follows a word. We used pre-trained attention-based language models for Polish (Polish RoBERTa and HerBERT) to encode each word as a vector of numeric values. Thanks to the attenti...
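The prediction target described above (is there a punctuation mark after a word, or not) can be illustrated with a small label-preparation sketch. It covers only how training labels could be derived from punctuated text; the encoder and classifier are not shown. The function name `make_examples` and the set of punctuation marks in `PUNCT` are assumptions for illustration, not taken from the paper.

```python
from typing import List, Tuple

# Assumed set of punctuation marks; the paper's actual inventory may differ.
PUNCT = ".,?!:;"


def make_examples(text: str) -> List[Tuple[str, int]]:
    """Turn punctuated text into (word, label) pairs.

    label is 1 if a punctuation mark follows the word, otherwise 0.
    """
    examples: List[Tuple[str, int]] = []
    for token in text.split():
        word = token.rstrip(PUNCT)
        if not word:
            # Stand-alone punctuation token: attach it to the previous word.
            if examples:
                examples[-1] = (examples[-1][0], 1)
            continue
        examples.append((word, 1 if len(word) < len(token) else 0))
    return examples
```

A model would then receive the words with punctuation stripped and be trained to reproduce the labels.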
Our solution to post-correction OCR results is based on a cascade of heuristics designed to fix specific groups of OCR errors. We identified four main types of OCR errors: punctuation, capitalisation, similar alphabetical characters, and noise in the form of random characters. We created a set of heuristics for each group to correct the most common...
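A cascade of heuristics of the kind described above might look like the sketch below, with one toy rule for each of the four error groups. The individual regular expressions, the function names, and the order of the cascade are assumptions chosen for illustration; the paper's actual heuristics are not given in this summary.

```python
import re


def remove_noise(text: str) -> str:
    """Noise: drop tokens made only of stray symbols (assumed symbol set)."""
    text = re.sub(r"(?<!\S)[~^|_*#]+(?!\S)", "", text)
    return re.sub(r" {2,}", " ", text).strip()


def fix_similar_characters(text: str) -> str:
    """Similar characters: a digit 0 or 1 between letters becomes o or l."""
    text = re.sub(r"(?<=[^\W\d_])0(?=[^\W\d_])", "o", text)
    return re.sub(r"(?<=[^\W\d_])1(?=[^\W\d_])", "l", text)


def fix_punctuation(text: str) -> str:
    """Punctuation: remove whitespace inserted before a punctuation mark."""
    return re.sub(r"\s+([,.;:!?])", r"\1", text)


def fix_capitalisation(text: str) -> str:
    """Capitalisation: upper-case the first letter of each sentence."""
    return re.sub(
        r"(^|[.!?]\s+)([^\W\d_])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


# Assumed order: later heuristics rely on the output of earlier ones.
CASCADE = [remove_noise, fix_similar_characters, fix_punctuation, fix_capitalisation]


def post_correct(text: str) -> str:
    """Apply every heuristic in sequence to the OCR output."""
    for heuristic in CASCADE:
        text = heuristic(text)
    return text
```

For example, `post_correct("to jest d0m ~| , a to k1ucz .  nowe zdanie")` returns `"To jest dom, a to klucz. Nowe zdanie"`.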
In the article we present a single-run approach to recognizing nested named entities using neural networks with transformers. The main advantage of this approach is that a single model is trained to recognize all entity types. The model can identify all entities in a single pass. Our main contribution is the simplified representation of nested name...
In the paper, we deal with the problem of unsupervised text document clustering for the Polish language. Our goal is to compare the modern approaches based on language modeling (doc2vec and BERT) with the classical ones, i.e., TF-IDF and wordnet-based. The experiments are conducted on three datasets containing qualification descriptions. The experi...
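For orientation, the classical TF-IDF baseline mentioned above can be sketched in plain Python. This shows only how documents are turned into weighted term vectors and compared; a clustering algorithm would then work on these similarities. The smoothed IDF formula, whitespace tokenization, and the function names are assumptions for this sketch and not necessarily what the paper used; the doc2vec, BERT, and wordnet-based variants are not shown.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (term -> weight) for a list of documents."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            # Smoothed IDF (assumed variant): log((1 + N) / (1 + df)) + 1
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Documents that share vocabulary receive a higher cosine similarity than documents with no terms in common, which is the signal a clustering step would group on.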
This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language...
In the paper, we deal with the problem of spatial expression recognition. The goal of this task is to recognize information structures in text that represent a relative spatial relationship between two objects (a trajector and a landmark) indicated by a preposition of location, for example, a book on the table. We used the Corpus of Polish Spatial Te...
Overview of natural language processing resources and tools available within the CLARIN-PL infrastructure
The talk presented the results of an analysis of existing solutions for clustering qualifications from the Integrated Qualifications Register (Zintegrowany Rejestr Kwalifikacji, ZRK)
In the paper, we present work in progress on fine-grained named entity recognition for Polish. Recent work on language modeling and deep learning has had a significant impact on many natural language processing tasks, including named entity recognition. However, the focus was mainly on the evaluation of coarse-grained named entity models. In our...
The deep learning approach using FastText language models significantly outperformed the CRF-based model in the fine-grained named entity recognition task (82 categories). ✔ Up to 10 pp improvement in F-measure. ✔ Shorter training time: 60 minutes vs. several days. ✔ 4x faster processing. ✗ Larger model: 11 GB vs. 0.5 GB. ✗ Requires more memory: 13 GB vs. 3 GB.
In the paper, we present work in progress on fine-grained named entity recognition for Polish. Recent work on language modeling and deep learning has had a significant impact on many natural language processing tasks, including named entity recognition. However, the focus was mainly on the evaluation of coarse-grained named entity models. In ou...
Sharing resources and managing text corpora in the CLARIN-PL infrastructure
In the paper we present the latest changes introduced to Inforex, a web-based system for qualitative and collaborative text corpora annotation and analysis. One of the most important developments is the release of the source code. The system is now available on GitHub (https://github.com/CLARIN-PL/Inforex) as an open source project. The system c...
In this article we discuss the current state of the art in named entity recognition for Polish. We present publicly available resources and open-source tools for named entity recognition. The overview includes various kinds of resources, i.e. guidelines, annotated corpora (NKJP, KPWr, CEN, PST) and lexicons (NELexiconS, PNET, Gazetteer). We present...
This paper summarises the PolEval 2019 shared task on lemmatization of proper names and multi-word phrases for Polish. The participating system has to generate a lemma for each phrase marked in the input set of documents following the KPWr lemmatization guidelines. Each document contains a plain text with a set of phrases marked with XML tags. The...
This article presents the research in the recognition and normalization of Polish temporal expressions as the result of the first PolEval 2019 shared task. Temporal information extracted from the text plays a significant role in many information extraction systems, like question answering, event recognition or text summarization. A specification fo...
Summary of PolEval 2019 Task 2 on Lemmatization of Proper Names and Multi-word Phrases
We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams pa...
In the paper we present two systems for named entity recognition for Polish submitted to the PolEval 2018 competition (Task 2). The first one, called Liner2, utilizes Conditional Random Fields with a rich set of features. The other one, called PolDeepNer, is an ensemble of three neural networks using a Bi-directional Long Short-Term Memory (Bi-LSTM)...
A preliminary study in zero anaphora coreference resolution for Polish
Zero anaphora is an element of the coreference resolution task that has not yet been directly addressed in Polish and, in most studies, it has been left as the most challenging aspect for further investigation. This article presents an initial study of this problem. The prepara...
In the paper we present an adaptation of the Liner2 framework to solve the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.
In the paper we present a tool for lemmatization of multi-word common noun phrases and named entities for Polish called Polem. The tool is based on a set of manually crafted rules and heuristics utilizing a set of dictionaries (including morphological, named entities and inflection patterns). The accuracy of lemmatization obtained by the tool reach...
A key challenge of information extraction in natural language processing is the ability to recognise and classify temporal expressions (timexes). They are a crucial source of information about when something happens, how often something occurs or how long something lasts. Timexes extracted automatically from text play a major role in many Informa...
The talk will cover tools and resources for information extraction from text, developed and made available by the CLARIN-PL Language Technology Centre. Tools will be presented for the automatic recognition of mentions (including proper names, temporal expressions, and situation markers) and for the normalization of temporal expressions (normalization l...
An important task of the CLARIN-PL Language Technology Centre is to provide tools that make corpus work convenient. During the lecture, participants will learn the basics of processing and annotating texts in the Inforex system, using selected corpora as examples, both training corpora (KPWr, PCSN) and applied ones. Pr...
In this paper, the problem of spatial relation recognition in Polish is examined. We present the different ways of distributing spatial information throughout a sentence by reviewing the lexical and grammatical signals of various relations between objects. We focus on the spatial usage of prepositions and their meaning, determined by the ‘conceptua...
In this article we present the results of recent research on the recognition of events in Polish. Event recognition plays a major role in many natural language processing applications such as question answering or automatic summarization. We adapted the TimeML specification (the well-known guideline for English) to the Polish language. We annotated 540...
In the paper we cover the problem of spatial expression recognition in text for the Polish language. A spatial expression is a text fragment which describes the location of two or more physical objects relative to each other. The first part of the paper discusses a Polish corpus annotated with spatial expressions and inter-annotator agreement. In the secon...
The talk will cover a tool for analysing geographical references in literary texts called Mapa Literacka (Literary Map). The tool automatically recognizes references to geographical objects in Polish texts, categorizes them semantically, geolocates them, and presents the results on a map. References to obj...
The talk will cover tools and resources for extracting specific information from texts. Three topics will be presented: detection of spatial expressions, detection of situation markers, and detection of verbs with a null subject in the context of coreference resolution. For each topic, the talk will discuss the scope of r...
The talk will cover the automatic recognition of identifying units (mainly proper names and adjectives derived from proper names) and temporal expressions in texts. The scope of the recognized information will be presented (the definition and categories of identifying units and temporal expressions), and...
Temporal Expressions in Polish Corpus KPWr
This article presents the results of recent research on the interpretation of Polish expressions that refer to time. These expressions are a source of information about when something happens, how often something occurs or how long something lasts. Temporal information, which can be extracted from te...
Towards an event annotated corpus of Polish
The paper presents a typology of events built on the basis of the TimeML specification adapted to the Polish language. Some changes were introduced to the definition of the event categories and a motivation for event categorization was formulated. The event annotation task is presented on two levels – onto...
Presentation of tools for the recognition of spatial information in texts for Polish, including: Liner2, a basic tool for named entity recognition; Serel, a basic tool for the recognition of semantic relations between named entities; Literary Map, an application for the visualisation of toponyms on a map (geolocalization); and SpatialPL, a tool for the recognition of spatial...
The paper investigates the accuracy of a Named Entity Recognition (NER) algorithm based on the Hidden Markov Model in the domain of Polish stock exchange reports. The task of NER was limited to the recognition and classification of Named Entities representing persons and companies. The algorithm was tested on a small Polish domain corpus of stock...
This article describes a heuristic approach to zero subject detection in Polish. It focuses on the zero subject detection as a crucial step in end-to-end coreference resolution. The zero subject verbs are recognized using a set of manually created rules utilizing information from different sources, including: a dependency parser, a shallow relation...
In this paper we discuss the performance of existing tools for coreference resolution for Polish from the perspective of information extraction tasks. We take into consideration the source of mentions, i.e., gold standard vs mentions recognized automatically. We evaluate three existing tools, i.e., IKAR, Ruler and Bartek on the KPWr corpus. We show...
In this article we present the results of recent research on the recognition of Polish temporal expressions. The temporal information extracted from the text plays a major role in many information extraction systems, like question answering, event recognition or discourse analysis. We prepared a broad description of Polish temporal expressions, ca...
Conditional Random Fields (CRFs) have been proven to be very useful in many sequence labelling tasks from the field of natural language processing, including named entity recognition (NER). The advantage of CRFs over other statistical models (like Hidden Markov Models) is that they can utilize a large set of features describing a sequence of observ...
A method for the recognition of the compositionality of Multi Word Expressions (MWEs) is proposed. First, we study associations between MWEs and the structure of wordnet lexico-semantic relations. A simple method of splitting plWordNet’s MWEs into compositional and non-compositional on the basis of the hypernymy structure is discussed. However, our...
In this paper we cover the problem of recognition of semantic relations between proper names (PNs) in running text. We focus on the manual rule creation approach and discuss to what extent the existing tools can be used for this task. As a result of our initial research we developed a rule-based toolset for recognition of relations between PNs call...
The paper presents WordNetLoom – an application for WordNet development used in the construction of a Polish WordNet called plWordNet. WordNetLoom provides two means of interaction: a form-based, implemented initially, and a graph-based introduced recently. The graphical, active presentation of WordNet structure enables direct work on the structure...
We report on our efforts aimed at building an Open Domain Question Answering system for Polish. Our contribution is twofold: we gathered a set of question-answer pairs from various Polish sources and we performed an empirical evaluation of two re-ranking methods. The gathered collection contains factoid, list, non-factoid and yes-no questions, whic...
Feature extraction from text corpora is an important step in Natural Language Processing (NLP), especially for Machine Learning (ML) techniques. Various NLP tasks have many common steps, e.g. low level act of reading a corpus and obtaining text windows from it. Some high-level processing steps might also be shared, e.g. testing for morpho-syntactic...
In this paper we present a formalism for text annotation called WCCL Match. The need for a new formalism originates from our work on Question Answering for Polish. We examined several existing formalisms and concluded that none of them fulfills our requirements. The new formalism was designed on top of an existing language for writing morpho...
In the paper we present a customizable and open-source framework for proper names recognition called Liner2. The framework consists of several universal methods for sequence chunking which include: dictionary look-up, pattern matching and statistical processing. The statistical processing is performed using Conditional Random Fields and a rich set...
In the paper we present preliminary work on the automatic construction of rules for the recognition of semantic relations between pairs of proper names in Polish texts. Our goal was to check the feasibility of automatic rule construction using an existing inductive logic programming (ILP) system as an alternative or supporting method for manual rule creatio...
The aim of this paper is to present a system for semantic text annotation called Inforex. Inforex is a web-based system designed for managing and annotating text corpora on the semantic level including annotation of Named Entities (NE), anaphora, Word Sense Disambiguation (WSD) and relations between named entities. The system also supports manual t...
In this paper we present several optimizations introduced to a Conditional Random Fields-based model for proper name recognition in Polish running texts. The proposed optimizations refer to word-level segmentation problems, gazetteer incompleteness, the problem of unambiguous generalization features, feature construction and selection, and finally reco...
The Polish Corpus of Suicide Notes (henceforth PCSN) is constructed to meet the needs of forensic linguistics. Suicide notes are messages created in a borderline situation, shortly before death. Hence the annotation schema requires a complex description of a document structure, the textual content, as well as its linguistic properties. TEI was selected a...
In this paper we analyse the importance of data generalisation and usage of local context in the problem of the Proper Name recognition. We present an extended set of features that provide generalised description of the data and encode linguistic information. To utilize the rich set of features we applied Conditional Random Fields (CRF) — a modern...
In the paper we present a Proper Name Recognition algorithm based on the Hidden Markov Model (HMM). Recognition of Proper Names (PNs) is treated as the basis for the Named Entity Recognition problem in general. The proposed method combines a domain-dependent method based on HMM with domain-independent methods based on gazetteers and hand-...
The paper presents WordnetLoom - a new version of an application supporting the development of the Polish wordnet called plWordNet. The primary user interface of WordnetLoom is a graph-based, graphical, active presentation of wordnet structure. Linguists can work directly on the structure of synsets linked by relation links. The new version is compa...
The accuracy of a Named Entity Recognition algorithm based on the Hidden Markov Model is investigated. The algorithm was limited to recognition and classification of Named Entities representing persons. The algorithm was tested on two small Polish domain corpora of stock exchange and police reports. Comparison with the baseline algorithms based on th...
We present a limited prototype of the CLARIN Language Technology Infrastructure (LTI) node, which provides several types of web services for Polish. The functionality encompasses morpho-syntactic processing, shallow semantic processing of corpus on the basis of the SuperMatrix system and plWordNet browsing. We take the prototype as the starting poi...
The WordNet Weaver application supports extension of a new wordnet. One of its functions is to suggest lexical units semantically close to a given unit. Suggestions arise from activation-area attachment – multi-criteria voting based on several algorithms that score semantic relatedness. We present the contributing algorithms and the method of combi...
We present an environment for the recognition and translation of Named Entities (NEs). The environment consists of a new formalism for the Named Entity Recognition and Translation (NERT), a parsing mechanism that reads the rules, recognizes Named Entities in given texts and suggests their translation, as well as a set of tools for the evaluation. W...
This paper deals with the task of definition extraction with the training corpus suffering from the problems of small size, high noise and heavy imbalance. A previous approach, based on manually constructed shallow grammars, turns out to be hard to better even by such robust classifiers as SVMs, AdaBoost and simple ensembles of classifiers. However...
Manual construction of a wordnet can be facilitated by a system that suggests semantic relations acquired from corpora. Such systems tend to produce many wrong suggestions. We propose a method of filtering a raw list of noun pairs potentially linked by hypernymy, and test it on Polish. The method aims for good recall and sufficient precision. The c...
The paper deals with the task of definition extraction from a small and noisy corpus of instructive texts. Three approaches are presented: Partial Parsing, Machine Learning and a sequential combination of both. We show that applying ML methods with the support of a trivial grammar gives results better than a relatively complicated partial grammar,...
In the paper, the application of general Memory-Based Learning to Event Recognition in the domain of reports of stock issuers is investigated. A multi-classifier scheme is applied in which the boundaries of annotations are identified first and then a heuristic algorithm for merging them into pairs is applied. A modified method based only on positive exa...