Conference Paper

KRNNT: Polish Recurrent Neural Network Tagger

Abstract

The article presents a state-of-the-art, complete part-of-speech tagger for Polish based on recurrent neural networks. Unlike a fixed context window, these networks give access to the full left and right context of a sentence. The tagger uses an external morphological analyzer. In contrast to the best Polish taggers, it does not use the word form as a feature for the classifier, it has no separate classifier for unknown words, and its predictions are not limited to the tags provided by the morphological analyzer. It is also more accurate: it achieves a 28% error reduction and 7 percentage points higher accuracy for unknown words. The tagger may also run faster than others by utilizing a GPU. It participated in the PolEval competition and won two subtasks.
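The "error reduction" figure in the abstract is a relative measure, which is easy to confuse with a gain in accuracy points. The sketch below shows one common way to relate the two. The function names and the 90% baseline accuracy are hypothetical and chosen only for illustration; the abstract does not state the baseline figures.

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Fraction of the baseline's errors that the new system removes."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline makes no errors; reduction is undefined")
    return (baseline_error - new_error) / baseline_error


def accuracy_after_reduction(baseline_accuracy, reduction):
    """Accuracy implied by removing a given fraction of the baseline's errors."""
    return 1.0 - (1.0 - baseline_accuracy) * (1.0 - reduction)


# Hypothetical baseline of 90% accuracy (an assumption, not a figure from the paper).
hypothetical_baseline = 0.90
implied_accuracy = accuracy_after_reduction(hypothetical_baseline, 0.28)
print(f"implied accuracy: {implied_accuracy:.3f}")
```

Under this assumed baseline, a 28% error reduction corresponds to a gain of a few accuracy points, not 28 points.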
... In 2016, for Polish taggers "reaching the goal of 97% seemed very distant" (Kobyliński and Kieraś 2016). However, the next year, thanks to the PolEval competition (Kobyliński and Ogrodniczuk 2017), new deep learning approaches arose, reaching 94% accuracy (Krasnowska-Kieraś 2017, Wróbel 2017). Kobyliński et al. (2018) used a meta-algorithm to achieve 94.7%. ...
... The most common solution is to randomly pick a lemma from interpretations from a morphological analyzer consistent with the predicted tag. KRNNT (Wróbel 2017) improves this process by learning the most common lemma for text form and tag pair. ...
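The lemma selection described in this excerpt can be sketched as a frequency table keyed by (text form, tag). This is a minimal illustration and not KRNNT's actual code: the function names, the fallback order, and the toy training triples are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def learn_lemmas(training_tokens):
    """Map each (form, tag) pair to the lemma seen most often in training data."""
    counts = defaultdict(Counter)
    for form, tag, lemma in training_tokens:
        counts[(form, tag)][lemma] += 1
    return {key: lemmas.most_common(1)[0][0] for key, lemmas in counts.items()}


def choose_lemma(form, tag, learned, analyzer_lemmas):
    """Prefer the learned lemma; otherwise fall back to an analyzer candidate.

    analyzer_lemmas is assumed to hold the lemmas from the morphological
    analyzer that are consistent with the predicted tag.
    """
    if (form, tag) in learned:
        return learned[(form, tag)]
    if analyzer_lemmas:
        return analyzer_lemmas[0]  # assumed fallback: first candidate
    return form  # assumed last resort: the form itself


# Toy data invented for illustration only.
toy_training = [
    ("ludzi", "subst:pl:gen:m1", "człowiek"),
    ("ludzi", "subst:pl:gen:m1", "człowiek"),
    ("ludzi", "subst:pl:gen:m1", "ludzie"),
]
learned = learn_lemmas(toy_training)
print(choose_lemma("ludzi", "subst:pl:gen:m1", learned, ["ludzie", "człowiek"]))
```

Compared with picking a random consistent interpretation, the table makes the choice deterministic and favors the lemma that the training data supports most often.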
Conference Paper
This paper presents the winning solution to the PolEval 2020 task on morphosyntactic tagging of Middle, New and Modern Polish. The goal of the task is to disambiguate morphological analysis. The solution is a full neural network pipeline (tokenization and morphosyntactic tagging) from raw text to annotated text. It does not require any external dependencies. However, the output of a morphological analyzer can be exploited to increase the scores. Finally, the tagger exceeds the threshold of 97%, obtaining a score of 97.3% for contemporary texts.
... The proposed Alium system uses a rule-based approach to perform the recognition and normalization phase. We did not take part in the competition but we compared the results of Alium with our Liner2 system (see Table 1), which performs the recognition of named entities (Marcińczuk et al. 2013, Marcińczuk et al. 2017), events (Kocoń and Marcińczuk 2016) and temporal expressions (Kocoń and Marcińczuk 2015, 2017, Kocoń and Marcińczuk 2017). Liner2 is an open-source system available with the configuration used for this task in CLARIN-PL DSpace repository: http://hdl.handle.net/11321/531 ...
... lemma: the lemma of the token determined by the KRNNT tagger (Wróbel 2017); morph_tags: the morphosyntactic tags of the token determined by KRNNT. ...
Chapter
In this paper I present a summary of my results from the competition that took place this year and was organized by PolEval. One of the tasks of this competition was the detection of offensive comments in social media. By joining this competition, I set myself a goal to compare some of the popular text classification models used on Kaggle or recommended by Google. That’s why during the competition I went through models such as: Ngrams and MLP, word embedding and sepCNN, Flair from Zalando with different embedding, combination of LSTM and GRU with word embedding trained from scratch. (http://2019.poleval.pl/files/poleval2019.pdf)
... KRNNT [44] also uses bi-directional recurrent neural networks (Gated Recurrent Units) for morphological tagging. In contrast with Toygger, KRNNT does not use word embeddings for feature representation. ...
... The compared systems are AvgPer [27], KRNNT [44], MorphoDiTaPL [25], Neuroparser [31], and Toygger. Table 7 Comparison of PolEval17 MD/tagging systems. ...
Article
In this paper we discuss the current state of the art in part-of-speech tagging for Polish. We introduce the problem of POS tagging and point out the key issues in tagging inflected languages, which make this task more difficult in the case of Polish than e.g. English. We also discuss the most important language resources connected with POS tagging, as well as the task of morphological analysis, as it is commonly used as a preliminary step in tagging. We describe the methods that have been applied to the problem of POS tagging for Polish to date and discuss the most current, neural-network based methods in more detail. Finally, we conclude with a general view of this field in the context of Polish and discuss possible future research directions.
... Experimental results verify that based on a morphological analyzer using a neural network in [32] level of error identification for Polish text comes up to 93.3-99.9%. In [33], the author demonstrated the obtained accuracy of up to 28% to identify errors among the words being in common use and 7% for the unknown Polish words. ...
Article
Citation: Lytvyn, V.; Pukach, P.; Vysotska, V.; Vovk, M.; Kholodna, N. Identification and Correction of Grammatical Errors in Ukrainian Texts Based on Machine Learning Technology. Mathematics 2023, 11, 904. Abstract: A machine learning model for correcting errors in Ukrainian texts has been developed. It was established that the neural network has the ability to correct simple sentences written in Ukrainian; however, the development of a full-fledged system requires the use of spell-checking using dictionaries and the checking of rules, both simple and those based on the result of parsing dependencies or other features. In order to save computing resources, a pre-trained BERT (Bidirectional Encoder Representations from Transformer) type neural network was used. Such neural networks have half as many parameters as other pre-trained models and show satisfactory results in correcting grammatical and stylistic errors. Among the ready-made neural network models, the pre-trained neural network model mT5 (a multilingual variant of T5 or Text-to-Text Transfer Transformer) showed the best performance according to the BLEU (bilingual evaluation understudy) and METEOR (metric for evaluation of translation with explicit ordering) metrics.
... There were a few approaches to Polish language lemmatisation, for example the KRNNT tagger [4], based on a recurrent neural network. In 2019 lemmatisation was a part of a PolEval competition. ...
Preprint
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence. As a result, developing an efficient lemmatisation algorithm is a complex task. In recent years, deep learning models used for this task have outperformed other methods, including other machine learning algorithms. In this paper a Polish lemmatizer based on the Google T5 model is presented. The training was run with different context lengths. The model achieves the best results for Polish language lemmatisation.
... As the training data we have provided a complete text of Wikipedia with morphosyntactic data provided by KRNNT tagger [37], categorization of articles into Wikipedia categories and WD types, Wikipedia redirections and internal links. ...
Chapter
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by the organizers, using available data, and are evaluated according to pre-established procedures. It has been organized since 2017, and each year the winning systems become the state of the art in Polish language processing in the respective tasks. In 2019 we organized six different tasks, creating an even greater opportunity for NLP researchers to evaluate their systems in an objective manner. Keywords: Temporal expressions, Lemmatization, Entity linking, Machine translation, Automatic speech recognition, Cyberbullying detection
... POS-tagging has applications in many NLP pipelines, including Document Classification, Named Entity Recognition, Sentiment Analysis, and Question answering [1]. Task-performing software, usually labeled POS-tagger, can be either rule-based using lookup-tables, dictionaries, and/or extracted linguistic rules [2]; stochastic-based using various machine learning technologies from support vector machines [3] to the recurrent neural network (RNN) [4] and Deep Neural Networks [5]; or hybrid, combining the two, with an example being the TreeTagger, a software employing both lookup tables and dictionaries with the Hidden Markov Models (HMM) stochastic approach [6]. ...
Article
In a setting where multiple automatic annotation approaches coexist and advance separately but none completely solve a specific problem, the key might be in their combination and integration. This paper outlines a scalable architecture for Part-of-Speech tagging using multiple standalone annotation systems as feature generators for a stacked classifier. It also explores automatic resource expansion via dataset augmentation and bidirectional training in order to increase the number of taggers and to maximize the impact of the composite system, which is especially viable for low-resource languages. We demonstrate the approach on a preannotated dataset for Serbian using nested cross-validation to test and compare standalone and composite taggers. Based on the results, we conclude that given a limited training dataset, there is a payoff from cutting a percentage of the initial training set and using it to fine-tune a machine-learning-based stacked classifier, especially if it is trained bidirectionally. Moreover, we found a measurable impact on the usage of multiple tagsets to scale-up the architecture further through transfer learning methods.
... The tagger participated in PolEval 2017 POS Tagging competition and won task B and task C. Additionally, results for PolEval 2020 Morphosyntactic tagging of Middle, New and Modern Polish are reported. The paper is an extension of Language Technology Conference paper [25]. ...
Chapter
The article presents a state-of-the-art, complete part-of-speech tagger for Polish which uses recurrent neural networks. Unlike a fixed context window, the networks give access to the full left and right context of a sentence. The tagger uses an external morphological analyzer. In contrast to the best Polish taggers, it does not use the word form as a feature for the classifier, there is no separate classifier for unknown words, and predictions are not limited to the tags provided by the morphological analyzer. The accuracy is higher: it achieves a 28% error reduction and 7 percentage points higher accuracy for unknown words. The tagger may also run faster than others by utilizing a GPU. The tagger participated in the PolEval 2017 POS Tagging competition and won task B and task C. Additionally, results for PolEval 2020 Morphosyntactic tagging of Middle, New and Modern Polish are reported. The paper is an extension of the Language & Technology Conference paper [25].
... As the training data we have provided a complete text of Wikipedia with morphosyntactic data provided by KRNNT tagger (Wróbel, 2017), categorization of articles into Wikipedia categories and WD types, Wikipedia redirections and internal links. ...
Conference Paper
PolEval is a SemEval-inspired evaluation campaign for natural language processing tools for Polish. Submitted tools compete against one another within certain tasks selected by the organizers, using available data, and are evaluated according to pre-established procedures. It has been organized since 2017, and each year the winning systems become the state of the art in Polish language processing in the respective tasks. In 2019 we organized six different tasks, creating an even greater opportunity for NLP researchers to evaluate their systems in an objective manner.
Chapter
In this paper we present a new approach to the problem of lemmatisation in inflectional languages on the example of Polish. We made an introduction to the problem domain, described the solution used – the Transformer architecture and learning process on lexical data – and presented experimental results showing a high degree of generalization of the new solution. At the very end, we presented conclusions and plans for future research.