Michał Marcińczuk

Wroclaw University of Science and Technology | WUT · Faculty of Computer Science and Management

Ph.D.

About

84
Publications
9,568
Reads
414
Citations
Introduction
I mainly work on information extraction from Polish text, including named entities, semantic relations between named entities, temporal expressions, spatial expressions and events. I am also involved in work on sentiment analysis, plagiarism detection, and language error detection.
Additional affiliations
April 2016 - present
Wroclaw University of Science and Technology
Position
  • Professor (Assistant)
Description
  • Researcher in the CLARIN-PL project.

Publications (84)
Presentation
Full-text available
Samurai Labs specializes in developing information technologies to combat cyberbullying on the Internet. Our main areas of interest are text processing for detecting personal attacks, hate speech, suicidal content, and pedophilia. We carry out our tasks by combining symbolic methods with machine learning methods...
Article
Full-text available
In this paper, we study the language used by suicidal users on the Reddit social media platform. To do that, we first collect a large-scale dataset of Reddit posts and have it annotated by highly trained expert annotators under a rigorous annotation scheme. Next, we perform a multifaceted analysis of the dataset, including: (1) the analysis of user activ...
Presentation
Full-text available
To solve the punctuation restoration task, we build a unary classifier based on a neural network (NN). The NN was trained to predict whether or not a punctuation mark follows a word. We used pre-trained attention-based language models for Polish (Polish RoBERTa and HerBERT) to encode the word as a vector of numeric values. Thanks to the attenti...
Presentation
Full-text available
Our solution to post-correction OCR results is based on a cascade of heuristics designed to fix specific groups of OCR errors. We identified four main types of OCR errors: punctuation, capitalisation, similar alphabetical characters, and noise in the form of random characters. We created a set of heuristics for each group to correct the most common...
Article
In the paper, we deal with the problem of unsupervised text document clustering for the Polish language. Our goal is to compare the modern approaches based on language modeling (doc2vec and BERT) with the classical ones, i.e., TF-IDF and wordnet-based. The experiments are conducted on three datasets containing qualification descriptions. The experi...
Conference Paper
Full-text available
This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language...
Article
In the article we present a single-run approach to recognizing nested named entities using neural networks with transformers. The main advantage of this approach is that a single model is trained to recognize all entity types. The model can identify all entities in a single pass. Our main contribution is the simplified representation of nested name...
Chapter
In the paper, we deal with the problem of spatial expression recognition. The goal of this task is to recognize in text information structures that represent a relative spatial relationship between two objects (a trajector and a landmark) indicated by a preposition of location, for example, a book on the table. We used the Corpus of Polish Spatial Te...
Presentation
Full-text available
Overview of the natural language processing resources and tools available within the CLARIN-PL infrastructure
Presentation
Full-text available
The talk presented the results of an analysis of existing solutions for clustering qualifications from the Integrated Qualifications Register (Zintegrowany Rejestr Kwalifikacji, ZRK)
Conference Paper
Full-text available
In the paper, we present a work in progress on fine-grained named entity recognition for Polish. The recent works on language modeling and deep learning had a significant impact on many natural language processing tasks, including named entity recognition. However, the focus was mainly on the evaluation of coarse-grained named entity models. In our...
Poster
Full-text available
The deep learning approach utilizing FastText language models significantly outperformed the CRF-based model in the fine-grained named entity recognition task (82 categories). ✔ Up to 10 pp improvement in F-measure. ✔ Shorter training time: 60 minutes vs. several days. ✔ 4x faster processing. ✗ Larger model: 11 GB vs. 0.5 GB. ✗ Requires more memory: 13 GB vs. 3 GB.
Preprint
Full-text available
In the paper, we present a work in progress on fine-grained named entity recognition for Polish. The recent works on language modeling and deep learning had a significant impact on many natural language processing tasks, including named entity recognition. However, the focus was mainly on the evaluation of coarse-grained named entity models. In ou...
Presentation
Full-text available
Sharing resources and managing text corpora in the CLARIN-PL infrastructure
Poster
Full-text available
In the paper we present the latest changes introduced to Inforex, a web-based system for qualitative and collaborative text corpora annotation and analysis. One of the most important changes is the release of the source code. The system is now available in a GitHub repository (https://github.com/CLARIN-PL/Inforex) as an open source project. The system c...
Conference Paper
Full-text available
In the paper we present the latest changes introduced to Inforex, a web-based system for qualitative and collaborative text corpora annotation and analysis. One of the most important changes is the release of the source code. The system is now available in a GitHub repository (https://github.com/CLARIN-PL/Inforex) as an open source project. The system c...
Conference Paper
Full-text available
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by the organizers, using available data, and are evaluated according to pre-established procedures. It has been organized since 2017 and each year the winning systems become the state-of-...
Article
In this article we discuss the current state of the art in named entity recognition for Polish. We present publicly available resources and open-source tools for named entity recognition. The overview includes various kinds of resources, i.e. guidelines, annotated corpora (NKJP, KPWr, CEN, PST) and lexicons (NELexiconS, PNET, Gazetteer). We present...
Conference Paper
Full-text available
This paper summarises the PolEval 2019 shared task on lemmatization of proper names and multi-word phrases for Polish. The participating system has to generate a lemma for each phrase marked in the input set of documents following the KPWr lemmatization guidelines. Each document contains a plain text with a set of phrases marked with XML tags. The...
Conference Paper
Full-text available
This article presents the research in the recognition and normalization of Polish temporal expressions as the result of the first PolEval 2019 shared task. Temporal information extracted from the text plays a significant role in many information extraction systems, like question answering, event recognition or text summarization. A specification fo...
Presentation
Full-text available
Summary of PolEval 2019 Task 2 on Lemmatization of Proper Names and Multi-word Phrases
Conference Paper
Full-text available
We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams pa...
Conference Paper
Full-text available
In the paper we present two systems for named entity recognition for Polish submitted to the PolEval 2018 competition (Task 2). The first one, called Liner2, utilizes Conditional Random Fields with a rich set of features. The other one, called PolDeepNer, is an ensemble of three neural networks using a Bi-directional Long Short-Term Memory (Bi-LSTM)...
Presentation
Full-text available
In the paper we present two systems for named entity recognition for Polish submitted to the PolEval 2018 competition (Task 2). The first one, called Liner2, utilizes Conditional Random Fields with a rich set of features. The other one, called PolDeepNer, is an ensemble of three neural networks using a Bi-directional Long Short-Term Memory (Bi-LSTM)...
Article
Full-text available
A preliminary study in zero anaphora coreference resolution for Polish. Zero anaphora is an element of the coreference resolution task that has not yet been directly addressed in Polish and, in most studies, it has been left as the most challenging aspect for further investigation. This article presents an initial study of this problem. The preparati...
Conference Paper
Full-text available
In the paper we present an adaptation of the Liner2 framework to solve the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.
Presentation
Full-text available
In the paper we present a tool for lemmatization of multi-word common noun phrases and named entities for Polish called Polem. The tool is based on a set of manually crafted rules and heuristics utilizing a set of dictionaries (including morphological, named entities and inflection patterns). The accuracy of lemmatization obtained by the tool reach...
Article
A key challenge of Information Extraction in Natural Language Processing is the ability to recognise and classify temporal expressions (timexes). They are a crucial source of information about when something happens, how often something occurs or how long something lasts. Timexes extracted automatically from text play a major role in many Informa...
Conference Paper
Full-text available
In the paper we present an adaptation of the Liner2 framework to solve the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.
Data
In the paper we present an adaptation of the Liner2 framework to solve the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.
Presentation
Full-text available
In the paper we present an adaptation of the Liner2 framework to solve the BSNLP 2017 shared task on multilingual named entity recognition. The tool is tuned to recognize and lemmatize named entities for Polish.
Presentation
Full-text available
The talk will cover tools and resources for information extraction from text developed and made available by the CLARIN-PL Language Technology Centre. It will present tools for automatic recognition of mentions (including proper names, temporal expressions and situation markers), normalization of temporal expressions (normalization l...
Presentation
Full-text available
An important task of the CLARIN-PL Language Technology Centre is to provide tools that make corpus work convenient. During the lecture, participants will become familiar with the basic issues of processing and annotating texts in the Inforex system, using selected corpora as examples, both training corpora (KPWr, PCSN) and applied ones. Pr...
Article
Full-text available
In this paper, the problem of spatial relation recognition in Polish is examined. We present the different ways of distributing spatial information throughout a sentence by reviewing the lexical and grammatical signals of various relations between objects. We focus on the spatial usage of prepositions and their meaning, determined by the ‘conceptua...
Conference Paper
Full-text available
In this article we present the results of recent research on the recognition of events in Polish. Event recognition plays a major role in many natural language processing applications such as question answering or automatic summarization. We adapted the TimeML specification (the well-known guideline for English) to the Polish language. We annotated 540...
Conference Paper
In the paper we cover the problem of spatial expression recognition in text for the Polish language. A spatial expression is a text fragment that describes the location of two or more physical objects relative to each other. The first part of the paper discusses a Polish corpus annotated with spatial expressions and the inter-annotator agreement. In the secon...
Presentation
Full-text available
The talk will present Mapa Literacka (Literary Map), a tool for analysing geographical references in literary texts. The tool automatically recognizes references to geographical objects in Polish-language texts, categorizes them semantically, geolocates them, and presents the results on a geographical map. References to obje...
Presentation
Full-text available
The talk will cover tools and resources for extracting specific information from texts. Three topics will be presented: detection of spatial expressions, detection of situation markers, and detection of verbs with an implied subject in the context of the coreference resolution task. For each topic, the talk will discuss the scope of r...
Presentation
Full-text available
The talk will address the automatic recognition of identifying units (mainly proper names and adjectives derived from proper names) and temporal expressions in texts. It will present the scope of the recognized information (the definition and categories of identifying units and temporal expressions) and ...
Article
Full-text available
Temporal Expressions in Polish Corpus KPWr. This article presents the results of recent research on the interpretation of Polish expressions that refer to time. These expressions are a source of information about when something happens, how often something occurs or how long something lasts. Temporal information, which can be extracted from te...
Article
Full-text available
Towards an event annotated corpus of Polish. The paper presents a typology of events built on the basis of the TimeML specification adapted to the Polish language. Some changes were introduced to the definition of the event categories and a motivation for event categorization was formulated. The event annotation task is presented on two levels – onto...
Presentation
Full-text available
Presentation of tools for recognizing spatial information in Polish texts, including: Liner2, a basic tool for named entity recognition; Serel, a basic tool for recognition of semantic relations between named entities; Literary Map, an application for visualisation of toponyms on a map (geolocalization); and SpatialPL, a tool for recognition of spatial...
Article
Full-text available
The paper investigates the accuracy of a Named Entity Recognition (NER) algorithm based on the Hidden Markov Model in the domain of Polish stock exchange reports. The task of NER was limited to the recognition and classification of Named Entities representing persons and companies. The algorithm was tested on a small Polish domain corpus of stock...
Conference Paper
This article describes a heuristic approach to zero subject detection in Polish. It focuses on the zero subject detection as a crucial step in end-to-end coreference resolution. The zero subject verbs are recognized using a set of manually created rules utilizing information from different sources, including: a dependency parser, a shallow relation...
Conference Paper
Full-text available
In this paper we discuss the performance of existing tools for coreference resolution for Polish from the perspective of information extraction tasks. We take into consideration the source of mentions, i.e., gold standard vs mentions recognized automatically. We evaluate three existing tools, i.e., IKAR, Ruler and Bartek on the KPWr corpus. We show...
Conference Paper
Full-text available
In this article we present the results of recent research on the recognition of Polish temporal expressions. The temporal information extracted from the text plays a major role in many information extraction systems, like question answering, event recognition or discourse analysis. We prepared a broad description of Polish temporal expressions, ca...
Conference Paper
Full-text available
Conditional Random Fields (CRFs) have been proven to be very useful in many sequence labelling tasks from the field of natural language processing, including named entity recognition (NER). The advantage of CRFs over other statistical models (like Hidden Markov Models) is that they can utilize a large set of features describing a sequence of observ...
Conference Paper
A method for the recognition of the compositionality of Multi Word Expressions (MWEs) is proposed. First, we study associations between MWEs and the structure of wordnet lexico-semantic relations. A simple method of splitting plWordNet’s MWEs into compositional and non-compositional on the basis of the hypernymy structure is discussed. However, our...
Poster
Full-text available
In this paper we cover the problem of recognition of semantic relations between proper names (PNs) in running text. We focus on the manual rule creation approach and discuss to what extent the existing tools can be used for this task. As a result of our initial research we developed a rule-based toolset for recognition of relations between PNs call...
Article
Full-text available
The paper presents WordNetLoom – an application for WordNet development used in the construction of a Polish WordNet called plWordNet. WordNetLoom provides two means of interaction: a form-based, implemented initially, and a graph-based introduced recently. The graphical, active presentation of WordNet structure enables direct work on the structure...
Conference Paper
Full-text available
We report on our efforts aimed at building an Open Domain Question Answering system for Polish. Our contribution is twofold: we gathered a set of question-answer pairs from various Polish sources and we performed an empirical evaluation of two re-ranking methods. The gathered collection contains factoid, list, non-factoid and yes-no questions, whic...
Article
Feature extraction from text corpora is an important step in Natural Language Processing (NLP), especially for Machine Learning (ML) techniques. Various NLP tasks have many common steps, e.g. low level act of reading a corpus and obtaining text windows from it. Some high-level processing steps might also be shared, e.g. testing for morpho-syntactic...
Article
Full-text available
In this paper we cover the problem of recognition of semantic relations between proper names (PNs) in running text. We focus on the manual rule creation approach and discuss to what extent the existing tools can be used for this task. As a result of our initial research we developed a rule-based toolset for recognition of relations between PNs call...
Article
In this paper we present a formalism for text annotation called WCCL Match. The need for a new formalism originates from our works related to Question Answering for Polish. We examined several existing formalisms to conclude that none of them fulfills our requirements. The new formalism was designed on top of an existing language for writing morpho...
Article
Full-text available
In the paper we present a customizable and open-source framework for proper names recognition called Liner2. The framework consists of several universal methods for sequence chunking which include: dictionary look-up, pattern matching and statistical processing. The statistical processing is performed using Conditional Random Fields and a rich set...
Conference Paper
In the paper we present preliminary work on the automatic construction of rules for recognition of semantic relations between pairs of proper names in Polish texts. Our goal was to check the feasibility of automatic rule construction using an existing inductive logic programming (ILP) system as an alternative or supporting method for manual rule creatio...
Conference Paper
Full-text available
The aim of this paper is to present a system for semantic text annotation called Inforex. Inforex is a web-based system designed for managing and annotating text corpora on the semantic level including annotation of Named Entities (NE), anaphora, Word Sense Disambiguation (WSD) and relations between named entities. The system also supports manual t...
Conference Paper
Full-text available
In this paper we present several optimizations introduced to Conditional Random Fields-based model for proper names recognition in Polish running texts. The proposed optimizations refer to word-level segmentation problems, gazetteers incompleteness, problem of unambiguous generalization features, feature construction and selection, and finally reco...
Conference Paper
Full-text available
The Polish Corpus of Suicide Notes (henceforth PCSN) is constructed to meet the needs of forensic linguistics. Suicide notes are messages created in a borderline situation, shortly before death. Hence the annotation schema requires a complex description of the document structure, the textual content, as well as its linguistic properties. TEI was selected a...
Conference Paper
In this paper we analyse the importance of data generalisation and usage of local context in the problem of the Proper Name recognition. We present an extended set of features that provide generalised description of the data and encode linguistic information. To utilize the rich set of features we applied Conditional Random Fields (CRF) — a modern...
Article
Full-text available
In the paper we present a Proper Name Recognition algorithm based on the Hidden Markov Model (HMM). Recognition of the Proper Names (PN) is treated as the basis for Named Entity Recognition problem in general. The proposed method is based on combining domain-dependent method based on HMM with domain independent methods based on gazetteers and hand-...
Conference Paper
Full-text available
The paper presents WordnetLoom, a new version of an application supporting the development of the Polish wordnet called plWordNet. The primary user interface of WordnetLoom is a graph-based, graphical, active presentation of wordnet structure. Linguists can work directly on the structure of synsets linked by relation links. The new version is compa...
Conference Paper
The accuracy of a Named Entity Recognition algorithm based on the Hidden Markov Model is investigated. The algorithm was limited to recognition and classification of Named Entities representing persons. The algorithm was tested on two small Polish domain corpora of stock exchange and police reports. Comparison with the baseline algorithms based on th...
Conference Paper
Full-text available
We present a limited prototype of the CLARIN Language Technology Infrastructure (LTI) node, which provides several types of web services for Polish. The functionality encompasses morpho-syntactic processing, shallow semantic processing of corpus on the basis of the SuperMatrix system and plWordNet browsing. We take the prototype as the starting poi...
Conference Paper
The WordNet Weaver application supports extension of a new wordnet. One of its functions is to suggest lexical units semantically close to a given unit. Suggestions arise from activation-area attachment – multi-criteria voting based on several algorithms that score semantic relatedness. We present the contributing algorithms and the method of combi...
Conference Paper
Full-text available
We present an environment for the recognition and translation of Named Entities (NEs). The environment consists of a new formalism for the Named Entity Recognition and Translation (NERT), a parsing mechanism that reads the rules, recognizes Named Entities in given texts and suggests their translation, as well as a set of tools for the evaluation. W...
Conference Paper
Full-text available
This paper deals with the task of definition extraction with the training corpus suffering from the problems of small size, high noise and heavy imbalance. A previous approach, based on manually constructed shallow grammars, turns out to be hard to better even by such robust classifiers as SVMs, AdaBoost and simple ensembles of classifiers. However...
Chapter
Manual construction of a wordnet can be facilitated by a system that suggests semantic relations acquired from corpora. Such systems tend to produce many wrong suggestions. We propose a method of filtering a raw list of noun pairs potentially linked by hypernymy, and test it on Polish. The method aims for good recall and sufficient precision. The c...
Conference Paper
Full-text available
The paper deals with the task of definition extraction from a small and noisy corpus of instructive texts. Three approaches are presented: Partial Parsing, Machine Learning and a sequential combination of both. We show that applying ML methods with the support of a trivial grammar gives results better than a relatively complicated partial grammar,...