
UralicNLP: An NLP Library for Uralic Languages

Abstract

In the past years, the natural language processing (NLP) tools and resources for small Uralic languages have received a major uplift. The open-source Giellatekno infrastructure has served a key role in gathering these tools and resources in an open environment for researchers to use. However, many of the crucially important NLP tools, such as FSTs and CGs, require specialized tools with a learning curve. This paper presents UralicNLP, a Python library whose goal is to mask the actual implementation behind a Python interface. This not only lowers the threshold for using the tools provided in the Giellatekno infrastructure but also makes it easier to incorporate them into research code written in Python.
Mika Hämäläinen¹
¹ Department of Digital Humanities, University of Helsinki
DOI: 10.21105/joss.01345
Submitted: 09 March 2019
Published: 09 May 2019
License
Authors of papers retain copyright and release the work under a Creative Commons
Attribution 4.0 International License (CC-BY).
Introduction
In the past years the natural language processing (NLP) tools and resources for the
small Uralic languages have received a major uplift. The open-source infrastructure by
Giellatekno (Moshagen, Pirinen, & Trosterud, 2013) has served a key role in gathering
these tools and resources in an open environment for researchers to use.
However, many of the crucially important NLP tools, such as FSTs (finite-state
transducers) (cf. Beesley & Karttunen, 2003) for processing morphology and CGs (constraint
grammars) (cf. Karlsson, Voutilainen, Heikkilä, & Anttila, 1995) for syntax, require
specialized tools with a learning curve. For a researcher who is not familiar with
them, using these tools can be challenging, which can ultimately lead to the resources
being ignored altogether.
This paper presents UralicNLP, a Python library, the goal of which is to mask the actual
implementation behind a Python interface. This not only lowers the threshold to use the
tools provided in the Giellatekno infrastructure but also makes it easier to incorporate
them as a part of research code written in Python.
Functionalities
This section describes the current functionalities of the Python library. At the time of
writing, the library focuses on low-level NLP tasks. Additionally, semantic models are
provided for a limited number of languages.
Morphology
The FST models provided in the Giellatekno infrastructure are built on HFST (Helsinki
Finite-State Technology) (Lindén et al., 2013), which is an open-source tool for compil-
ing and running scripts that follow the FST formalism. UralicNLP uses the compiled
FST models available through the Online Dictionary of Uralic Languages (Hämäläinen &
Rueter, 2018).
The library provides morphological analysis on a word level for all supported languages.
This means that it will output all the possible morphological readings for an input word
form. The morphological analyzers typically provide a lemma, a part-of-speech tag and a
list of morphological tags such as the number and case of the word form. The list of possible
readings may include weights indicating the probability of the analysis. However, these
are not currently implemented in any of the FST models. For example, for the Finnish
word voit, the analyzer gives the readings voi (butter) as a noun in the nominative plural
and voida (can) as a verb in the second person singular.
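As a minimal sketch of how such readings might be consumed in Python, the helper below splits a plus-separated analysis string into a lemma, a part-of-speech tag and the remaining morphological tags. The tag names (N, Pl, Nom, V, Sg2) and the example strings are assumptions made for illustration; a real analyzer may emit more tags per reading. The wrapper analyze_word assumes that uralicApi.analyze returns (analysis, weight) pairs and needs the library and a downloaded language model, so it is defined but not called here.

```python
from typing import List, NamedTuple


class Reading(NamedTuple):
    lemma: str
    pos: str
    tags: List[str]


def parse_reading(analysis: str) -> Reading:
    """Split a plus-separated analysis such as 'voi+N+Pl+Nom'."""
    parts = analysis.split("+")
    if len(parts) < 2:
        raise ValueError(f"expected lemma+POS[+tags], got {analysis!r}")
    return Reading(lemma=parts[0], pos=parts[1], tags=parts[2:])


def analyze_word(word: str, lang: str) -> List[Reading]:
    """Hedged wrapper: assumes analyze() yields (analysis, weight) pairs."""
    # Requires `pip install uralicNLP` and a downloaded model for `lang`.
    from uralicNLP import uralicApi
    return [parse_reading(analysis) for analysis, _weight in uralicApi.analyze(word, lang)]


# Illustrative analysis strings only (hypothetical, simplified tag sets).
examples = ["voi+N+Pl+Nom", "voida+V+Sg2"]
readings = [parse_reading(a) for a in examples]
```

Keeping the parsing separate from the library call means the same helper can be applied to readings stored in a file or produced by another tool.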
Hämäläinen, M. (2019). UralicNLP: An NLP Library for Uralic Languages. Journal of Open Source Software, 4(37), 1345. https://doi.org/10.21105/joss.01345
Given a lemma, part-of-speech tag and morphological tags separated by a plus sign, it
is possible to use UralicNLP to generate word forms. This inflection mechanism can
be useful in various natural language generation tasks. For instance, given the Finnish
word kissa and the morphological tags plural and genitive, the library inflects the word
as kissojen.
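The plus-separated input format described above can be sketched as follows. The helper build_generation_query is a hypothetical name, and the tag spellings N, Pl and Gen are assumed rather than taken from this paper. The wrapper inflect assumes that uralicApi.generate accepts such a query string and returns (form, weight) pairs; it needs the library and a downloaded model, so it is defined but not called here.

```python
from typing import Iterable, List


def build_generation_query(lemma: str, pos: str, tags: Iterable[str]) -> str:
    """Join lemma, part-of-speech tag and morphological tags with plus signs."""
    parts = [lemma, pos, *tags]
    # An empty part or an embedded '+' would make the query ambiguous.
    if any(not part or "+" in part for part in parts):
        raise ValueError(f"invalid query parts: {parts!r}")
    return "+".join(parts)


def inflect(lemma: str, pos: str, tags: Iterable[str], lang: str) -> List[str]:
    """Hedged wrapper: assumes generate() yields (form, weight) pairs."""
    # Requires `pip install uralicNLP` and a downloaded model for `lang`.
    from uralicNLP import uralicApi
    query = build_generation_query(lemma, pos, tags)
    return [form for form, _weight in uralicApi.generate(query, lang)]


# Assumed tag names for "noun, plural, genitive".
query = build_generation_query("kissa", "N", ["Pl", "Gen"])
```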
Disambiguation
Whereas the morphological functionality does the analysis only on the word level, the dis-
ambiguator applies CG rules to rule out the morphological readings that are not suitable
in the context by using the VISL CG-3 tool (Bick & Didriksen, 2015). These CG rules
originate from the Giellatekno repository, but they are downloaded through the Online
Dictionary of Uralic Languages.
Depending on the language, the disambiguator can often output multiple readings because
the rules are not sufficient to fully disambiguate the sentence. It is important to take this
into account when using the functionality.
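One way to take residual ambiguity into account is to separate fully resolved tokens from those that still carry several readings before any downstream processing. The sketch below does not call the library; the (word form, readings) structure and the sample readings are hypothetical stand-ins for disambiguator output.

```python
from typing import List, Tuple

# Hypothetical shape: one (word form, remaining readings) pair per token.
TokenReadings = Tuple[str, List[str]]


def partition_by_ambiguity(sentence: List[TokenReadings]):
    """Split tokens into resolved, still ambiguous, and unanalyzed groups."""
    resolved, ambiguous, unknown = [], [], []
    for word, readings in sentence:
        if len(readings) == 1:
            resolved.append((word, readings[0]))
        elif readings:
            ambiguous.append((word, readings))
        else:
            unknown.append(word)
    return resolved, ambiguous, unknown


# Made-up sample in which one token keeps two readings after disambiguation.
sentence = [
    ("kissa", ["kissa+N+Sg+Nom"]),
    ("voit", ["voi+N+Pl+Nom", "voida+V+Sg2"]),
    ("xyz", []),
]
resolved, ambiguous, unknown = partition_by_ambiguity(sentence)
```

Downstream code can then decide explicitly what to do with the ambiguous group, for example keeping all readings or applying a task-specific heuristic, instead of silently taking the first reading.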
Lexical Lookup
The API of the Online Dictionary of Uralic Languages provides essentially the same
data as the Giellatekno multilingual XML dictionaries, in a JSON format. The actual
contents of the data depend on the language, but information such as semantic tags, URLs
to audio files, example sentences and translations in multiple languages is often
provided.
In order to use the lexical lookup, the ISO code of the minority language needs to be
specified. This restricts the query to the dictionary of that language. Queries can be
made either with a lemma or with an inflectional form. It is also possible to query in one
of the languages the minority language words are translated to.
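A lookup call might be wrapped as in the sketch below. It assumes that the ISO code is a three-letter code such as sms for Skolt Sami, which the text above does not state explicitly, and that uralicApi.dictionary_search takes the query word and the language code. The helper names are hypothetical, and the structure of the returned JSON is deliberately left uninterpreted because it depends on the language.

```python
import re

# Assumption: language codes are three lowercase letters (ISO 639-3 style).
_ISO_CODE = re.compile(r"[a-z]{3}")


def normalize_language_code(code: str) -> str:
    """Trim and lowercase a language code, rejecting anything not three letters."""
    cleaned = code.strip().lower()
    if not _ISO_CODE.fullmatch(cleaned):
        raise ValueError(f"not a three-letter language code: {code!r}")
    return cleaned


def lookup(word: str, lang: str):
    """Hedged wrapper around the dictionary query; returns the raw result."""
    # Requires `pip install uralicNLP` and network access to the dictionary API.
    from uralicNLP import uralicApi
    return uralicApi.dictionary_search(word, normalize_language_code(lang))


code = normalize_language_code(" SMS ")
```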
Semantics
UralicNLP provides an easy-to-use programmatic interface to the SemFi and SemUr databases
(Hämäläinen, 2018a). These databases contain semantic information about words given their
syntactic relations. For example, the database can be used to list out all the verbs that can
have koira (dog) as a subject together with the frequency of the co-occurrence of the verbs
and the noun koira in a corpus. SemFi has previously been used in the computationally
creative task of poem generation (Hämäläinen, 2018b).
SemUr consists of databases for endangered Uralic languages that have been translated
automatically from SemFi. SemFi and SemUr are structurally identical SQLite
databases, which makes it possible to query them with the same methods provided by
UralicNLP.
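The benefit of structurally identical databases can be illustrated with a toy example in which one query function serves two in-memory SQLite databases. The table name, column names, relation label and frequencies below are invented for illustration and are not the actual SemFi or SemUr schema or data; the second database uses placeholder tokens rather than real words.

```python
import sqlite3
from typing import List, Tuple


def make_demo_db(rows: List[Tuple[str, str, str, int]]) -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical co-occurrence table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cooccurrence "
        "(head TEXT, relation TEXT, dependent TEXT, frequency INTEGER)"
    )
    conn.executemany("INSERT INTO cooccurrence VALUES (?, ?, ?, ?)", rows)
    return conn


def verbs_with_subject(conn: sqlite3.Connection, noun: str) -> List[Tuple[str, int]]:
    """List verbs that take `noun` as subject, most frequent first."""
    cursor = conn.execute(
        "SELECT head, frequency FROM cooccurrence "
        "WHERE relation = 'nsubj' AND dependent = ? "
        "ORDER BY frequency DESC, head ASC",
        (noun,),
    )
    return cursor.fetchall()


# Toy rows with made-up frequencies.
finnish_db = make_demo_db([
    ("haukkua", "nsubj", "koira", 12),
    ("juosta", "nsubj", "koira", 7),
    ("syödä", "dobj", "luu", 3),
])
other_db = make_demo_db([("verb_a", "nsubj", "noun_x", 2)])

finnish_hits = verbs_with_subject(finnish_db, "koira")
other_hits = verbs_with_subject(other_db, "noun_x")
```

Because both databases share one schema, the query logic is written once and only the connection changes between languages.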
Universal Dependency Parser
UralicNLP comes with functionality to parse Treebanks. The parsed Treebanks can be
queried effectively with the different universal dependency annotations such as
part-of-speech, dependency relation and lemma. The queries support regular expressions. This
functionality is useful with the growing number of UD Treebanks available for Uralic
languages.
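To make the idea of regular-expression queries over UD annotations concrete, the sketch below parses a small CoNLL-U fragment with the standard library and filters tokens by field. It is an independent illustration, not the library's own implementation; the function names and the sample sentence are hypothetical.

```python
import re
from typing import Dict, List

# The ten tab-separated columns of the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into sentences, each a list of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


def find_tokens(sentences, **patterns: str) -> List[Dict[str, str]]:
    """Return tokens whose named fields fully match the given regular expressions."""
    compiled = {field: re.compile(pattern) for field, pattern in patterns.items()}
    return [
        token
        for sentence in sentences
        for token in sentence
        if all(rx.fullmatch(token.get(field, "")) for field, rx in compiled.items())
    ]


SAMPLE = "\n".join([
    "# text = Koira haukkuu.",
    "\t".join(["1", "Koira", "koira", "NOUN", "_", "Case=Nom|Number=Sing", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "haukkuu", "haukkua", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

sentences = parse_conllu(SAMPLE)
hits = find_tokens(sentences, upos="NOUN|VERB", lemma="h.*")
```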
Distribution
UralicNLP is distributed as an installable package through PyPI under the name
uralicNLP¹. The source code is released under the Apache open source license on GitHub.
References
Beesley, K. R., & Karttunen, L. (2003). Finite-state morphology.
Stanford, CA: CSLI Publications.
Bick, E., & Didriksen, T. (2015). CG-3 — beyond classical constraint grammar. In Pro-
ceedings of the 20th nordic conference of computational linguistics, NODALIDA 2015, may
11-13, 2015, vilnius, lithuania (pp. 31–39). University of Southern Denmark, Denmark;
Linköping University Electronic Press, Linköpings universitet.
Hämäläinen, M. (2018a). Extracting a semantic database with syntactic relations for
Finnish to boost resources for endangered Uralic languages. In Proceedings of the logic
and engineering of natural language semantics 15 (LENLS15). Retrieved from
https://helda.helsinki.fi/handle/10138/282733
Hämäläinen, M. (2018b). Harnessing NLG to create Finnish poetry automatically. In
Proceedings of the ninth international conference on computational creativity (pp. 9–15).
Hämäläinen, M., & Rueter, J. (2018). Advances in synchronized XML-MediaWiki dictio-
nary development in the context of endangered Uralic languages. In Proceedings of the
eighteenth EURALEX international congress (pp. 967–978).
Karlsson, F., Voutilainen, A., Heikkilä, J., & Anttila, A. (1995). Constraint grammar: A
language-independent system for parsing unrestricted text. Walter de Gruyter.
Lindén, K., Axelson, E., Drobac, S., Hardwick, S., Kuokkala, J., Niemi, J., Pirinen, T.
A., et al. (2013). HFST a system for creating NLP tools. In International workshop on
systems and frameworks for computational morphology (pp. 53–71). Springer.
doi:10.1007/978-3-642-40486-3_4
Moshagen, S. N., Pirinen, T. A., & Trosterud, T. (2013). Building an open-source devel-
opment infrastructure for language technology projects. In Proceedings of the 19th nordic
conference of computational linguistics (NODALIDA 2013); may 22-24; 2013; oslo uni-
versity; norway. NEALT proceedings series 16 (pp. 343–352). University of Tromsø,
Norway; Linköping University Electronic Press.
¹ pip install uralicNLP