Project

AKU Open Language Technology for Uralic Languages

Goal: The goal of the project is to research and develop open source tools and infrastructure for Uralic languages. This includes dictionary work, FSTs and parsers.


Project log

Mika Hämäläinen
added a research item
Many endangered Uralic languages have multilingual machine-readable dictionaries saved in an XML format. However, the dictionaries cover translations very inconsistently between language pairs; for instance, the Livonian dictionary has some translations to Finnish, Latvian and Estonian, and the Komi-Zyrian dictionary has some translations to Finnish, English and Russian. We utilize graph-based approaches to augment such dictionaries by predicting new translations to existing and new languages based on different dictionaries for endangered languages and Wiktionaries. Our study focuses on the lexical resources for Komi-Zyrian (kpv), Erzya (myv) and Livonian (liv). We evaluate our approach with human judges fluent in the three endangered languages in question. Based on the evaluation, the method predicted good or acceptable translations 77% of the time. Furthermore, we train a neural prediction model to predict the quality of the automatically predicted translations with an 81% accuracy. The resulting extensions to the dictionaries are made available on the online dictionary platform used by the speakers of these languages.
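The abstract above does not spell out which graph algorithm is used, so the following is only a minimal sketch of one common graph-based idea, pivot-style link prediction: two words in different languages that share a translation neighbour, but are not yet linked, become a candidate translation pair. The function names and the toy placeholder words are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(pairs):
    """Build an undirected translation graph.

    Each node is a (language_code, word) tuple; each pair is one
    translation attested in some dictionary.
    """
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def predict_translations(graph):
    """Propose new translation edges via shared neighbours (hypothetical scoring).

    Two nodes in different languages that are both linked to the same
    pivot node, but not to each other, are scored by the number of
    pivots they share.
    """
    scores = defaultdict(int)
    for pivot, neighbours in graph.items():
        for a, b in combinations(sorted(neighbours), 2):
            if a[0] != b[0] and b not in graph.get(a, set()):
                scores[(a, b)] += 1
    return dict(scores)


# Toy placeholder data, not real dictionary entries.
pairs = [
    (("kpv", "kpv_word"), ("fin", "fin_word")),
    (("myv", "myv_word"), ("fin", "fin_word")),
    (("kpv", "kpv_word"), ("rus", "rus_word")),
    (("myv", "myv_word"), ("rus", "rus_word")),
]
predictions = predict_translations(build_graph(pairs))
```

In this toy graph the Komi-Zyrian and Erzya placeholders share two pivots (Finnish and Russian), so they receive the highest score. A real system would presumably add a threshold or a learned quality model on top of such scores, as the abstract suggests.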
Mika Hämäläinen
added a research item
We present a new approach to inducing a bilingual dictionary between the endangered Erzya and Moksha languages automatically based on existing dictionaries in other languages. This work is, for the most part, complementary to the Mordvin research done by PhD László Keresztes, who has demonstrated the alignment of the two language forms in morphology, lexicon and syntax (1990), and then gone on to contemplate syncretisms of the individual languages (1999). In this paper, we present an automatic data-driven method for deriving new lexicographic links between Erzya and Moksha vocabularies. We strive to describe steps in Mordvin studies involving semantic alignment of the Erzya and Moksha lexicon. We briefly remind ourselves of previous dictionary and vocabulary work with the Mordvin languages, dating back to the 17th century. The semantic alignment of this lexicon is important in measuring linguistic distance between the closely related but distinct literary languages of Erzya and Moksha. This alignment fits into a larger scheme including etymological, morphological and syntactic alignment of the two languages.
Jack Rueter
added a research item
Parentheticals with speech verbs occur in three positions in the sentence; they are initial, medial and final. The parenthetical initiates, finalizes or temporarily pauses the flow of the sentence. We know that the Erzya language has a wealth of suffixes, which allow the language a liberal freedom of sentence constituent ordering. The freedom, however, is not infinite: at the beginning of the sentence, the subject of discussion is usually familiar, while, at the end, it is less familiar -- something new. Certain phenomena in the constituent ordering provide contrastivity. In preparation of this paper, the position of speech-verb parentheticals is taken into consideration. If we place the speech element initially, the order of words in the final parenthetical will change. We might consider that the change of ordering is caused by "displacement". The initial position, which, more often than not, is dedicated to familiar information (the subject), is taken, and now the only element acceptable before the verb is an adverb. Let us take a two-word sentence: c'oras' sy 'the man is coming'. The word "c'oras'" in this sentence makes reference to an inferable or definite entity within the context. Therefore, it takes sentence-initial position. Had this same word come second, the verb would have been given notable emphasis: "sy c'oras'" 'The man IS coming'. The ordering might be used in correcting erroneous information provided by a partner in dialogue, e.g. "Son mer's' -- c'oras' a sy." 'he/she said -- the man is not coming.'
Mika Hämäläinen
added a research item
Measuring the semantic similarity of different texts has many important applications in Digital Humanities research such as information retrieval, document clustering and text summarization. The performance of different methods depends on the length of the text, the domain and the language. This study focuses on experimenting with some of the current approaches to Finnish, which is a morphologically rich language. At the same time, we propose a simple method, TFW2V, which shows high efficiency in handling both long text documents and limited amounts of data. Furthermore, we design an objective evaluation method which can be used as a framework for benchmarking text similarity approaches.
Jack Rueter
added 2 research items
This article discusses statistics of first person singular pronoun form variation found in Erzya-language literature with insight from a paradigm by Bubrikh (1930) based on the Kozlovka vernacular, which had apparently been designated as the base of Standard Erzya before the appearance of Evsevs'ev's "Èrzân' grammatika" in 1929. The long and short forms of personal pronouns and personal pronoun + postposition alternation with postposition stems and person indexing, appear to correlate with features of the Erzya language other than semantics.
The Erzya language has two sets of dative personal pronouns. Although the semantics are the same, there seems to be a discourse merit to the variation. This paper provides a brief presentation of the Prague School of V. Mathesius, F. Daneš and E. Hajičová with an analysis of some early Erzya writings according to Daneš's progression. The Erzya language has many sets of long and short personal pronouns, i.e. pronoun drop for nominative and genitive in favor of person indexing on the conjugated predicate or the possessum.
Jack Rueter
added a research item
The Linguistic Association of Finland (Suomen kielitieteellinen yhdistys, SKY) held the symposium 'Case in and across languages' in Helsinki on 27-29 August 2009.
Jack Rueter
added a research item
The purpose of this paper is to illuminate the usage of the absolute, indeterminate genitive and determinate genitive forms in the Erzya language with the help of literary sources, where possible. Stress will be placed on the notions of definiteness vs. neutral deixis, [±]animate, proper vs. common noun and partitivity.
Jack Rueter
added 3 research items
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami. Problems involve getting the community to take part in things above the pencil-and-paper level. At times, it seems that the native speakers and the dictionary oriented are lacking the technical understanding to utilize the infrastructures which might make their work more meaningful in the future, i.e. multiple reuse of all of their input. Therefore, our system integrates with the existing tools and infrastructures for Uralic languages, masking the technical complexities behind a user-friendly UI.
Building language resources for endangered languages, especially in the case of dictionaries, requires a substantial amount of manual work. This, however, is a time-consuming undertaking, and it is also why we propose an automated method for expanding the knowledge in the existing dictionaries. In this paper, we present an approach to automatically combine conceptually divided translations from multilingual dictionaries for small Uralic languages. This is done for the noun dictionaries of Skolt Sami, Erzya, Moksha and Komi-Zyrian in such a way that the combined translations are included in the dictionaries of each language and then evaluated by professional linguists fluent in these languages. Inclusion of the method as a part of the new crowdsourced MediaWiki based pipeline for editing the dictionaries is discussed. The method can be used there not only to expand the existing dictionaries but also to provide the editors with translations when they are adding a new lexical entry to the system.
There are two main written Komi varieties, Permyak and Zyrian. These are mutually intelligible but derive from different parts of the same Komi dialect continuum, representing the varieties prominent in the vicinity and in the cities of Kudymkar and Syktyvkar, respectively. Hence, they share a vast number of features, as well as the majority of their lexicon, yet the overlap in their dialects is very complex. This paper evaluates the degree of difference in these written varieties based on changes required for computational resources in the description of these languages when adapted from the Komi-Zyrian original. Primarily these changes include the FST architecture, but we are also looking at its application to the Universal Dependencies annotation scheme in the morphologies of the two languages.
Mika Hämäläinen
added a research item
Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice. We present the first approach to automatically detect the dialect of a speaker based on a dialect transcript and transcript with audio recording in a dataset consisting of 23 different dialects. Our results show that the best accuracy is obtained by combining both modalities, as text alone reaches an overall accuracy of 57%, whereas text and audio reach 85%. Our code, models and data have been released openly on GitHub and Zenodo.
Mika Hämäläinen
added a research item
This study describes the ongoing development of the finite-state description for an endangered minority language, Komi-Zyrian. This work is located in the context where large written and spoken language corpora are available, which creates a set of unique challenges that have to be, and can be, addressed. We describe how we have designed the transducer so that it can benefit from existing open-source infrastructures and therefore be as reusable as possible.
Mika Hämäläinen
added a research item
We present our infrastructure for the documentation of Uralic languages, which consists of tools for editing dictionaries in such a way that the entries are structured in the XML (Extensible Markup Language) format. From the XML dictionaries we can generate code for morphological analyzers that are useful for all kinds of NLP activities. In this article we show the advantages of digital, machine-readable documentation. We also describe the system in the context of endangered Uralic languages.
Mika Hämäläinen
added a research item
We investigate both rule-based and machine learning methods for the task of compound error correction and evaluate their efficiency for North Sámi, a low resource language. The lack of error-free data needed for a neural approach is a challenge to the development of these tools, which is not shared by bigger languages. In order to compensate for that, we used a rule-based grammar checker to remove erroneous sentences and insert compound errors by splitting correct compounds. We describe how we set up the error detection rules, and how we train a bi-RNN based neural network. The precision of the rule-based model tested on a corpus with real errors (81.0%) is slightly better than the neural model (79.4%). The rule-based model is also more flexible with regard to fixing specific errors requested by the user community. However, the neural model has a better recall (98%). The results suggest that an approach that combines the advantages of both models would be desirable in the future. Our tools and data sets are open-source and freely available on GitHub and Zenodo.
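The synthetic-error step described above (inserting compound errors by splitting correct compounds) can be sketched roughly as follows. This is an illustration only: the abstract does not say how compound boundaries are obtained, so the sketch assumes a hypothetical lookup table `compound_parts`, and it uses English stand-in words rather than North Sámi data.

```python
def insert_compound_errors(tokens, compound_parts):
    """Return a corrupted copy of `tokens` in which every known compound
    is written as separate words (the error type to be corrected).

    `compound_parts` is a hypothetical mapping from a correct compound
    to the list of its parts.
    """
    corrupted = []
    for token in tokens:
        corrupted.extend(compound_parts.get(token, [token]))
    return corrupted


def make_training_pair(sentence, compound_parts):
    """Build one (erroneous, correct) sentence pair for model training."""
    tokens = sentence.split()
    corrupted = " ".join(insert_compound_errors(tokens, compound_parts))
    return corrupted, sentence


# English stand-in example; the table below is invented for illustration.
parts = {"notebook": ["note", "book"]}
pair = make_training_pair("she bought a notebook", parts)
```

Pairs built this way give a model the erroneous sentence as input and the original sentence as the target, which is one way to work around the lack of naturally occurring error data.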
Jack Rueter
added a research item
This paper discusses three groups of derivatives in the Erzya Mordvinian language etymologically associated with the deictic pronoun se 'that (egocentric, distal)'. Each group of derivatives is, historically, from a different period, and therefore each one is dealt with separately. The individual groups are examined for deictic and discourse features. Examples have been taken from grammars, and there are also ones produced by native speakers. It appears that the more opaque the derivation the less deictic its use. Finally, not all of the groups are equally acceptable among native speakers, and it would appear that at least the latter two groups merit an in-depth investigation according to dialect and age group of users.
Mika Hämäläinen
added a research item
In this paper, we present our free and open-source online dictionary editing system that has been developed for editing the new edition of the Finnish-Skolt Sami dictionary. We describe how the system can be used in post-editing a dictionary and how NLP methods have been incorporated as a part of the workflow. In practice, this means the use of FSTs (finite-state transducers) to enhance connections between lexemes and to generate inflection paradigms automatically. We also discuss our work in the wider context of lexicography of endangered languages. Our solutions are based on the open-source work conducted in the Giella infrastructure, which means that our system can be easily extended to other endangered languages as well. We have collaborated closely with Skolt Sami community lexicographers in order to build the system for their needs. As a result of this collaboration, the latest Finnish-Skolt Sami dictionary was edited and published using our system.
Mika Hämäläinen
added a research item
Texts written in Old Literary Finnish represent the first literary work ever written in Finnish starting from the 16th century. There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods on such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches 96.3% accuracy in texts written by Agricola and 87.7% accuracy in other contemporary out-of-domain text. Our method has been made freely available on Zenodo and GitHub.
Mika Hämäläinen
added a research item
We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages. We present a method for automatically extracting a substantially large amount of training data from FSTs for 22 languages, out of which 17 are endangered. The neural models follow the same tagset as the FSTs in order to make it possible to use them as fallback systems together with the FSTs. The source code, models and datasets have been released on Zenodo.
Mika Hämäläinen
added 2 research items
This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language. The treebank contains 76 fully annotated sentences, applies 14 parts-of-speech, as well as seven augmented or new features, some of which are unique to Apurinã. The construction of the treebank has also served as an opportunity to develop a finite-state description of the language and facilitate the transfer of open-source infrastructure possibilities to an endangered language of the Amazon. The source materials used in the initial treebank represent fieldwork practices where not all tokens of all sentences are equally annotated. For this reason, establishing regular annotation practices for the entire Apurinã treebank is an ongoing project.
This study presents a new dataset on rumor detection in Finnish language news headlines. We have evaluated two different LSTM based models and two different BERT models, and have found very significant differences in the results. A fine-tuned FinBERT reaches the best overall accuracy of 94.3% and rumor label accuracy of 96.0%. However, a model fine-tuned on Multilingual BERT reaches the best factual label accuracy of 97.2%. Our results suggest that the performance difference is due to a difference in the original training data. Furthermore, we find that a regular LSTM model works better than one trained with a pretrained word2vec model. These findings suggest that more work needs to be done for pretrained models in the Finnish language as they have been trained on small and biased corpora.
Mika Hämäläinen
added a research item
The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, calling endangered languages low-resourced is a bit of an overstatement. In this paper, I inspect the relation of the endangered with the low-resourced from my own experiences.
Mika Hämäläinen
added a research item
This paper presents the current lexical, morphological, syntactic and rule-based machine translation work for Erzya and Moksha that can and should be used in the development of a roadmap for Mordvin linguistic research. We seek to illustrate and outline initial problem types to be encountered in the construction of an Apertium-based shallow-transfer machine translation system for the Mordvin language forms. We indicate reference points within Mordvin Studies and other parts of Uralic studies, as a point of departure for outlining linguistic studies with a means for measuring their own progress and developing a roadmap for further studies.
Mika Hämäläinen
added a research item
We present an open online infrastructure for editing and visualizing dictionaries of different Uralic languages (e.g. Erzya, Moksha, Skolt Sami and Komi-Zyrian). Our infrastructure integrates fully into the existing Giellatekno one in terms of XML dictionaries and FST morphology. Our code is open source.
Mika Hämäläinen
added 2 research items
Our study presents a series of experiments on speech recognition with endangered and extinct Samoyedic languages, spoken in Northern and Southern Siberia. To the best of our knowledge, this is the first time a functional ASR system has been built for an extinct language. For the Kamas language, we achieve a Label Error Rate of 15%, and conclude through careful error analysis that this quality is already very useful as a starting point for refined human transcriptions. Our results with the related Nganasan language are more modest, with the best model having an error rate of 33%. We show, however, through experiments where Kamas training data is enlarged incrementally, that the Nganasan results are in line with what is expected under the low-resource circumstances of the language. Based on this, we provide recommendations for scenarios in which further language documentation or archive processing activities could benefit from modern ASR technology. All training data and processing scripts have been published on Zenodo with clear licences to ensure further work on this important topic.
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami. Problems involve getting the community to take part in things above the pencil-and-paper level. At times, it seems that the native speakers and the dictionary oriented are lacking the technical understanding to utilize the infrastructures which might make their work more meaningful in the future, i.e. multiple reuse of all of their input. Therefore, our system integrates with the existing tools and infrastructures for Uralic languages, masking the technical complexities behind a user-friendly UI.
Mika Hämäläinen
added a research item
This document describes shared development of finite-state description of two closely related but endangered minority languages, Erzya and Moksha. It touches upon morpholexical unity and diversity of the two languages and how this provides a motivation for shared open-source FST development. We describe how we have designed the transducers so that they can benefit from existing open-source infrastructures and are as reusable as possible.
Mika Hämäläinen
added a research item
Our study presents a dialect normalization method for different Finland Swedish dialects covering six regions. We tested 5 different models, and the best model improved the word error rate from 76.45 to 28.58. Contrary to results reported in earlier research on Finnish dialects, we found that training the model with one word at a time gave best results. We believe this is due to the size of the training data available for the model. Our models are accessible as a Python package. The study provides important information about the adaptability of these methods in different contexts, and gives important baselines for further study.
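For readers unfamiliar with the metric reported above, word error rate is conventionally the word-level edit distance between the system output and the reference, divided by the number of reference words. The sketch below shows that standard definition; it assumes the reported figures (76.45 and 28.58) are percentages, and it is not the evaluation script used in the study.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent: word-level Levenshtein distance
    (substitutions + deletions + insertions) over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 100 when the output is much longer than the reference.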
Mika Hämäläinen
added a research item
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character level NMT models using both multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dialectal approach. We study the influence dialectal adaptation has on the perceived creativity of computer generated poetry. Our results suggest that the more the dialect deviates from standard Finnish, the lower the scores people tend to give on an existing evaluation metric. However, on a word association test, people associate creativity and originality more with dialect and fluency more with standard Finnish.
Jack Rueter
added a research item
Rueter, J.M., University of Helsinki, Finland (jack.rueter@helsinki.fi); A Corpus of National Mordvin Languages: principles of development and perspectives of functionality. Abstract: This paper addresses the issue of a national corpus for language documentation of the Moksha and Erzya literary languages in coordination with dialect archives comprising over 80 years of fieldwork (including Shoksha and Karatai). It shows the necessary development of computer-assisted research tools and ongoing research aligned with a consistent and systematic open research project. Keywords: Mordvin languages, Erzya, Moksha, dialects, archive materials, text corpora, computational tools, Universal Dependencies, morphologically enriched dictionaries, morphological analyzers, shallow-transfer translation, GiellaLT, Apertium, HFST.
Mika Hämäläinen
added a research item
We present our ongoing development of a synchronized XML-MediaWiki dictionary to solve the problem of XML dictionaries in the context of small Uralic languages. XML is good at representing structured data, but it does not fare well in a situation where multiple users are editing the dictionary simultaneously. Furthermore, XML is overly complicated for non-technical users due to its strict syntax that has to be maintained valid at all times. Our system solves these problems by making a synchronized editing of the same dictionary data possible both in a MediaWiki environment and XML files in an easy fashion. In addition, we describe how the dictionary knowledge in the MediaWiki-based dictionary can be enhanced by an additional Semantic MediaWiki layer for more effective searches in the data. In addition, an API access to the lexical information in the dictionary and morphological tools in the form of an open source Python library is presented.
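To illustrate why strict XML syntax is a burden for hand editing, the sketch below reads dictionary entries with Python's standard library and rejects a file in which a single closing tag is missing. The element and attribute names (`e`, `lemma`, `t`, `lang`) are hypothetical placeholders chosen for this example; they are not the actual schema of the project's dictionaries, and this is not the project's Python library.

```python
import xml.etree.ElementTree as ET


def read_entries(xml_text):
    """Parse a toy XML dictionary into (lemma, [(language, translation)]) tuples.

    Raises ValueError if the XML is not well-formed, which is what a
    single mistyped tag in a hand-edited file would cause.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        raise ValueError(f"dictionary XML is not well-formed: {err}") from err

    entries = []
    for entry in root.iter("e"):  # hypothetical entry element
        lemma = entry.findtext("lemma", default="")
        translations = [(t.get("lang"), t.text or "") for t in entry.iter("t")]
        entries.append((lemma, translations))
    return entries


# Placeholder content, not real dictionary data.
good = '<r><e><lemma>x</lemma><t lang="fin">y</t></e></r>'
bad = '<r><e><lemma>x</lemma></r>'  # the closing </e> tag is missing
```

Calling `read_entries(bad)` raises an error for the whole file, not just for the broken entry, which is the kind of fragility a wiki-style editing front end can hide from non-technical editors.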
Jack Rueter
added a research item
The purpose of this article is to outline morphological facts about the two literary languages Erzya and Moksha, which can be used for estimating the distinctive character of these individual language forms. Whereas earlier morphological evaluations of the linguistic distance between Erzya and Moksha have placed them in the area of 90% cohesion, this one does not. This study evaluates the languages on the basis of non-ambiguity, parallel sets of ambiguity and divergent ambiguity. Non-ambiguity is found in combinatory function to morphological formant alignment, e.g. молян go+V+Ind+Prs+ScSg1. Parallel sets of ambiguity is found in combinatory-function set to morphological formant alignment where both languages share the same sets of ambiguous readings, e.g. саизь vs сявозь take+V+Ind+ScPl3+OcSg3, ScPl3+OcPl3. Divergent ambiguity is found in forms with non-symmetric alignments of combinatory functions, e.g. саинек take+V+Ind+Prt1+ScPl1, +Prt1+ScPl1+OcSg3, +Prt1+ScPl1+OcPl3 vs сявоме take+V+Ind+Prt1+ScPl1, сявоськ take+V+Ind+Prt1+ScPl1+OcSg3, +Prt1+ScPl1+OcPl3. This morphological evaluation will establish the preparatory work in syntactic disambiguation necessary for facilitating Erzya↔Moksha machine translation, whereas machine translation will enhance the usage of mutual language resources. Results show that the Erzya and Moksha languages, in the absence of loan words from the 20th century, share less than 50% of their vocabularies, 63% of their regular nominal declensions and 48% of their regular finite conjugations.
Mika Hämäläinen
added 2 research items
We present advances in the development of an FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little gold-standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological analysis. The language is severely endangered and the work presented in this paper forms a part of a greater whole in its revitalization efforts. Furthermore, we intersperse our description with facilitation and description practices not well documented in the infrastructure. Currently, the analyzer covers over 30,000 Skolt Sami words in 148 inflectional paradigms and over 12 derivational forms.
We present a method for conducting morphological disambiguation for South Sámi, which is an endangered language. Our method uses an FST-based morphological analyzer to produce an ambiguous set of morphological readings for each word in a sentence. These readings are disambiguated with a Bi-RNN model trained on the related North Sámi UD Treebank and some synthetically generated South Sámi data. The disambiguation is done on the level of morphological tags ignoring word forms and lemmas; this makes it possible to use North Sámi training data for South Sámi without the need for a bilingual dictionary or aligned word embeddings. Our approach requires only minimal resources for South Sámi, which makes it usable and applicable in the contexts of any other endangered language as well.
Mika Hämäläinen
added a research item
We present an open online infrastructure for editing and visualization of dictionaries of different Uralic languages (e.g. Erzya, Moksha, Skolt Sami and Komi-Zyrian). Our infrastructure integrates fully into the existing Giellatekno one in terms of XML dictionaries and FST morphology. Our code is open source, and the system is being actively used in editing a Skolt Sami dictionary set to be published in 2020.
Jack Rueter
added 8 research items
This article illustrates the initial development of a finite-state two-level model for Komi-Zyrian from 1996 (Izhevsk, Udmurtia) as presented in Permistika 6, 2000. It illustrates the use of Latin letter representations for the Cyrillic spelling of Komi-Zyrian, variation in spelling conventions of the language, and stem alternation observed in inflectional paradigms. Subsequently it provides a set of two-level rules to address stem alternations and provides initial statistics on the lemma and stem materials presently available for the description (1996).
This paper attempts to evaluate some of the systematic differences in Uralic Universal Dependencies treebanks from a perspective that would help to introduce reasonable improvements in treebank annotation consistency within this language family. The study finds that the coverage of Uralic languages in the project is already relatively high, and the majority of typically Uralic features are already present and can be discussed on the basis of existing treebanks. Some of the idiosyncrasies found in individual treebanks stem from language-internal grammar traditions, and could be a target for harmonization in later phases.
Optical Character Recognition (OCR) can substantially improve the usability of digitized documents. Language modeling using word lists is known to improve OCR quality for English. For morphologically rich languages, however, even large word lists do not reach high coverage on unseen text. Morphological analyzers offer a more sophisticated approach, which is useful in many language processing applications. This paper investigates language modeling in the open-source OCR engine Tesseract using morphological analyzers. We present experiments on two Uralic languages, Finnish and Erzya. According to our experiments, word lists may still be superior to morphological analyzers in OCR even for languages with rich morphology. Our error analysis indicates that morphological analyzers can cause a large amount of real word OCR errors.
Mika Hämäläinen
added a research item
This paper will provide a brief description of Skolt Sami and how it might be construed as a pluricentric language. Historical factors are identified that might contribute to a pluricentric identity: geographic location and political history; shortages of language documentation, and the establishment of a normative body for the development of a standard language. Skolt Sami is assessed in the context of Sami languages and is forwarded as one of a closely related yet distinct language group. Here the issue then becomes one of facilitating diversity even for under-documented languages. And we aptly describe opportunities in language technology that have been utilized to this end. Finally, brief insight is given for other Uralic languages with regard to pluricentric character and possibilities for language users to facilitate the maintenance of their individual language needs.
Mika Hämäläinen
added 9 research items
In this paper, we identify the need for a standardized formalism for the structured XML dictionaries of endangered Uralic languages in the Giella infrastructure. For this purpose, we have decided to use the TEI formalism as it is a standardized way of representing data and it is commonly used in the field of lexicography. This paper focuses on describing the issues and challenges faced in the conversion of the Giella XML into TEI. A full conversion scheme is introduced in this paper contrasting the peculiarities of the two XML formalisms. We incorporate the new TEI-based XML structure into our existing online dictionary system as an output format.
We describe a MediaWiki-based online dictionary for endangered Uralic languages. The system makes it possible to synchronize edits done in XML-based dictionaries and edits done in the MediaWiki system. This makes it possible to integrate the system with the existing open-source Giellatekno infrastructure that provides and utilizes XML formatted dictionaries for use in a variety of NLP tasks. As our system provides an online dictionary, the XML-based dictionaries become available for a wider audience and the dictionary editing process can be crowdsourced for community engagement with a full integration to the existing XML dictionaries. We present how new automatically produced data is encoded and incorporated into our system in addition to our preliminary experiences with crowdsourcing.
This paper introduces the second version of SemFi, a semantic database for Finnish with syntactic relations. The previous version of SemFi has been used in poem generation, and thus it has applications in NLG. In addition to extending SemFi, this paper describes and evaluates its translation into four endangered Uralic languages, Skolt Sami, Erzya, Moksha and Komi-Zyrian, all of which are greatly under-resourced. The translated dataset is known as SemUr.
Mika Hämäläinen
added a project goal
The goal of the project is to research and develop open source tools and infrastructure for Uralic languages. This includes dictionary work, FSTs and parsers.