About
93
Publications
6,144
Reads
416
Citations
Introduction
I make finite-state morpho-lexical descriptions for some of the endangered and low-resource languages I speak (Erzya, Moksha, Komi-Zyrian) and comprehend to varying degrees (Skolt Sami, Võro, Komi-Permyak, Lushootseed, Apurinã, Sakurabia, Udmurt, Livonian, Olonets-Karelian, Tenino, Wasco-Wishram, Paiute, etc.), mainly in the Giella infrastructure in Tromsø, Norway. I work in Universal Dependencies, and now with Apertium shallow-transfer machine translation. https://rueter.github.io/
Additional affiliations
May 2011 - November 2011
Kone Foundation
Position
- Language Programme author
Description
- Together with secretary Anna Talasniemi, I authored a five-year funding program (2013–2017) targeting scholarly studies and facilitation for minority Finno-Ugric languages as well as the minority languages of Finland. Some projects funded by this program are still underway today.
Publications
Pokémon have, for a long time, been an object of scientific research. As the world’s highest grossing media franchise, it is evident why Pokémon has also gained popularity in the research community. Because Pokémon is such an international phenomenon, it is important that Pokémon names are included in NLP tools developed for endangered languages as...
This paper presents a methodology for training a transformer-based model to classify lexical and morphosyntactic features of Skolt Sami, an endangered Uralic language characterized by complex morphology. The goal of our approach is to create an effective system for understanding and analyzing Skolt Sami, given the limited data availability and ling...
We present a novel digital humanities method for representing our Twitch chatters as user embeddings created by a large language model (LLM). We cluster these embeddings automatically using affinity propagation and further narrow this clustering down through manual analysis. We analyze the chat of one stream by each Twitch streamer: SmallAnt, DougD...
After the introduction of large language models (LLMs), science has not remained the same. Researchers from several different fields of science have been rushing to conduct research on LLMs. This is due to the fact that LLMs are no longer something only machine learning experts can understand. As the middle L in LLM stands for language, it is evide...
Old Permic, also known as Old Komi, is an extinct variety of Komi that was spoken in the late Middle Ages in the lower Vychegda river basin in Northeastern European Russia, in an area that currently is not Komi-speaking. This language variety is attested in fragmentary records from the 14th to 17th century written both in the Old Permic alphabet an...
Description of Mordvin language corpora development at the Language Bank of Finland.
We present our work towards building an infrastructure for documenting endangered languages with the focus on Uralic languages in particular. Our infrastructure consists of tools to write dictionaries so that entries are structured in XML format. These dictionaries are the foundation for rule-based NLP tools such as FSTs. We also work actively tow...
In this paper, we present an FST based approach for conducting morphological analysis, lemmatization and generation of Lushootseed words. Furthermore, we use the FST to generate training data for an LSTM based neural model and train this model to do morphological analysis. The neural model reaches a 71.9% accuracy on the test data. Furthermore, we...
In this paper, we present an approach for translating word embeddings from a majority language into 4 minority languages: Erzya, Moksha, Udmurt and Komi-Zyrian. Furthermore, we align these word embeddings and present a novel neural network model that is trained on English data to conduct sentiment analysis and then applied on endangered language da...
This article deals with the partitive function of the Erzya verbs of ingestion. These are the verbs 'eat', 'drink', 'smoke', 'breathe', etc., which are often referred to in studies of Erzya and Moksha as taking ablative-case direct objects with an ablative function. There is no individual case in Erzya or Moksha that strictly takes the object fun...
The book describes various approaches in rule-based language technology. The authors are leading experts in this research field. The book is the first of its kind and it gives a comprehensive picture of the state-of-the-art in rule-based language technology. The book shows the suitability of the technology to all language types, including languages...
This chapter is an overview of Moksha Mordvin. It contains a general typological profile and information on demographics, geography, and variation. The phonology of Moksha is examined in detail. The morphology section covers the properties of nominals (number, case, possession), verbs (tense, mood, aspect, person), and other parts of speech. The sy...
In this article, we approach finite-state description practices that must be instilled in the developer. Thoughts are presented accompanied by reference to concrete experiences with different languages and their description. We contend that finite-state description of languages leads to development in the describer-developer. This presupposes regul...
Neural Machine Translation (NMT) has made significant strides in breaking down language barriers around the globe. For lesser-resourced languages like Moksha and Erzya, however, the development of robust NMT systems remains a challenge due to the scarcity of parallel corpora. This paper presents a novel approach to address this challenge by leverag...
European Partitives in Comparison studies structures that express parts, amounts, and proportions typically in relation with wholes, with each other, or with measures. Examples of partitive are some friends, some water, some of my friends, some of this water, or a group of friends and a glass of water. The volume presents four studies on partitives...
Transcriptions in different languages are a ubiquitous data format in linguistics and in many other fields in the humanities. However, the majority of these resources remain both under-used and under-studied. This may be the case even when the materials have been published in print, but is certainly the case for the majority of unpublished transcri...
Presentation on verbs of ingestion in the Erzya language and how they operate with four case-marked noun phrases. In the short grammar tradition starting from nominative-genitive alternation when speaking of whole objects. Definite genitive vs definite inessive alternation for marking ongoing aspect when attached to the causee in a causative struct...
This document is dedicated to a young man, who, despite the number of times he has traveled around the Sun, is always open to new thoughts on ways to include languages, especially the smaller ones, and the people who speak them in far-reaching and sustainable open-source development. Since Trond Trosterud in Tromsø is attributed a terrific track re...
Abstract
This presentation approaches the partitive function in Erzya verbs of ingestion. The verbs of ingestion, ‘eat’, ‘drink’, ‘smoke’, ‘breathe’, etc. are notorious in Erzya and Moksha for their collocation with direct objects in the ablative case, with partitive function. As Erzya does not have one specific case for marking the direct object,...
Many endangered Uralic languages have multilingual machine readable dictionaries saved in an XML format. However, the dictionaries cover translations very inconsistently between language pairs, for instance, the Livonian dictionary has some translations to Finnish, Latvian and Estonian, and the Komi-Zyrian dictionary has some translations to Finni...
We present a new approach to inducing a bilingual dictionary between the endangered Erzya and Moksha languages automatically based on existing dictionaries in other languages. This work is, for the most part, complementary to the Mordvin research done by PhD László Keresztes, who has demonstrated the alignment of the two language forms in morpholog...
The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813-1852). The Finno-Ugrian Society is publishing Castrén's manuscripts as new critical and digital editions, and at the same time different research groups have als...
This study discusses the way different numerals and related expressions are currently annotated in the Universal Dependencies project, with specific focus on the Uralic language family. We analyse different annotation conventions between individual treebanks, and aim to highlight some areas where further development work and systematization could p...
We present the first openly available corpus for detecting depression in Thai. Our corpus is compiled by expert verified cases of depression in several online blogs. We experiment with two different LSTM based models and two different BERT based models. We achieve a 77.53% accuracy with a Thai BERT model in detecting depression. This establishes a...
Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice. We present the first approach to automatically detect the dialect of a speaker based on a dialect transcript and transcript with audio recording in a dataset consisting of...
There are two main written Komi varieties, Zyrian and Permyak. These are mutually intelligible but derive from different parts of the same Komi dialect continuum, representing the varieties prominent in the vicinity and in the cities of Syktyvkar and Kudymkar, respectively. Hence, they share a vast number of features, as well as the majority of th...
Problems: WALS (71) The Prohibitive, (112) Negative Morphemes, and (120) Zero Copula for Predicate Nominals. The corpora consulted in this research contain approximately 3.5 million words from over 120 publications of the Erzya literary language. Versions of these corpora exist on servers of the Max-Planck-Institute in Leipzig and the General Linguistics...
This study describes the ongoing development of the finite-state description for an endangered minority language, Komi-Zyrian. This work is located in the context where large written and spoken language corpora are available, which creates a set of unique challenges that have to be, and can be, addressed. We describe how we have designed the transd...
We present our infrastructure for the documentation of Uralic languages, which consists of tools for writing dictionaries so that entries are structured in XML (Extensible Markup Language) format. From the XML dictionaries we can generate code for morphological analyzers, which are useful for all kinds of...
In this paper, we present our free and open-source online dictionary editing system that has been developed for editing the new edition of the Finnish-Skolt Sami dictionary. We describe how the system can be used in post-editing a dictionary and how NLP methods have been incorporated as a part of the workflow. In practice, this means the use of FST...
This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language. The treebank contains 76 fully annotated sentences, applies 14 parts-of-speech, as well as seven augmented or new features-some of which are unique to Apurinã. The construction of the treebank has also served as an opportunity to develop finite-sta...
This study presents a new dataset on rumor detection in Finnish language news headlines. We have evaluated two different LSTM based models and two different BERT models, and have found very significant differences in the results. A fine-tuned FinBERT reaches the best overall accuracy of 94.3% and rumor label accuracy of 96.0% of the time. However,...
We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages. We present a method for automatically extracting substantially large amount of training data from FSTs for 22 languages, out of which 17 are endangered. The neural models follow the same tagset as the FSTs in order to make it possible...
This paper presents the current lexical, morphological, syntactic and rule-based machine translation work for Erzya and Moksha that can and should be used in the development of a roadmap for Mordvin linguistic research. We seek to illustrate and outline initial problem types to be encountered in the construction of an Apertium-based shallow-transfe...
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami...
This document describes shared development of finite-state description of two closely related but endangered minority languages, Erzya and Moksha. It touches upon morpholexical unity and diversity of the two languages and how this provides a motivation for shared open-source FST development. We describe how we have designed the transducers so that...
This study uses a character level neural machine translation approach trained on a long short-term memory-based bi-directional recurrent neural network architecture for diacritization of Medieval Arabic. The results improve on the online tool used as a baseline. A diacritization model has been published openly through an easy to use Python packa...
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character level NMT models both by using a multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dial...
jack.rueter@helsinki.fi Abstract: This document addresses the creation of a national corpus for language documentation of the Moksha and Erzya literary languages, in coordination with dialect archives comprising more than 80 years of fieldwork (incl. Shoksha, Karatai). It also presents ongoing research and the development of the necessary computational...
We present advances in the development of a FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little golden standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological an...
This article illustrates the initial development of a finite-state two-level model for Komi-Zyrian from 1996 (Izhevsk, Udmurtia) as presented in Permistika 6, 2000. It illustrates the use of Latin letter representations for the Cyrillic spelling of Komi-Zyrian, variation in spelling conventions of the language, stem alternation observed in inflectio...
This paper will provide a brief description of Skolt Sami and how it might be construed as a pluricentric language. Historical factors are identified that might contribute to a pluricentric identity: geographic location and political history; shortages of language documentation, and the establishment of a normative body for the development of a sta...
We present an open online infrastructure for editing and visualization of dictionaries of different Uralic languages (e.g. Erzya, Moksha, Skolt Sami and Komi-Zyrian). Our infrastructure integrates fully into the existing Giellatekno one in terms of XML dictionaries and FST morphology. Our code is open source, and the system is being actively used i...
The article examines poetry recitation through qualitative corpus-based analysis, identifying prosodic features and their functional significance. The analysis is based on four recited poems. Finally, it describes the operation of a computational algorithm that produces prosodic features for speech synthesis so that the output imitates real poetry recitation...
This paper attempts to evaluate some of the systematic differences in Uralic Universal Dependencies treebanks from a perspective that would help to introduce reasonable improvements in treebank annotation consistency within this language family. The study finds that the coverage of Uralic languages in the project is already relatively high, and the...
Timothy Feist: A Grammar of Skolt Saami. Mémoires de la Société Finno-Ougrienne 273. Finno-Ugrian Society. Helsinki 2015. 414 p.
https://doi.org/10.33339/fuf.86126
This is an assessment of the merits of the English-language Skolt Sami Grammar written by Timothy Feist with respect to existing scholarship already available in English, Finnish and G...
We describe a MediaWiki-based online dictionary for endangered Uralic languages. The system makes it possible to synchronize edits done in XML-based dictionaries and edits done in the MediaWiki system. This makes it possible to integrate the system with the existing open-source Giellatekno infrastructure that provides and utilizes XML formatted dic...
Endangered Uralic languages present a high variety of inflectional forms in their morphology. This results in a high number of homonyms in inflections, which introduces a lot of morphological ambiguity in sentences. Previous research has employed constraint grammars to address this problem; however, CGs are often unable to fully disambiguate a sente...
In this paper, we identify the need for a standardized formalism for the structured XML dictionaries of endangered Uralic languages in the Giella infrastructure. For this purpose, we have decided to use the TEI formalism, as it is a standardized way of representing data and it is commonly used in the field of lexicography. This paper focuses on describing...
We approach the problem of expanding the set of cognate relations with a sequence-to-sequence NMT model. The language pair of interest, Skolt Sami and North Sami, has too limited a set of parallel data for an NMT model as such. We solve this problem on the one hand, by training the model with North Sami cognates with other Uralic languages and, on...
We present an open online infrastructure for editing and visualizing dictionaries of different Uralic languages (e.g. Erzya, Moksha, Skolt Sami and Komi-Zyrian). Our infrastructure integrates fully into the existing Giellatekno one in terms of XML dictionaries and FST morphology. Our code is open source.
This paper presents multiple methods for normalizing the most deviant and infrequent historical spellings in a corpus consisting of personal correspondence from the 15th to the 19th century. The methods include machine translation (neural and statistical), edit distance and rule-based FST. Different normalization methods are compared and evaluated....
We present an open source Python library to automatically produce syntactically correct Finnish sentences when only lemmas and their relations are provided. The tool automatically resolves morphosyntax in the sentence, such as agreement and government rules, and uses Omorfi to produce the correct morphological forms. In this paper, we discuss how cas...
We present our ongoing development of a synchronized XML-MediaWiki dictionary to solve the problem of XML dictionaries in the context of small Uralic languages. XML is good at representing structured data, but it does not fare well in a situation where multiple users are editing the dictionary simultaneously. Furthermore, XML is overly complicated...
Building language resources for endangered languages, especially in the case of dictionaries, requires a substantial amount of manual work. This, however, is a time-consuming undertaking, and it is also why we propose an automated method for expanding the knowledge in the existing dictionaries. In this paper, we present an approach to automatically...
In this introduction we have tried to present concisely the history of language technology for Uralic languages up until today, and a bit of a desiderata from the point of view of why we organised this special issue. It is of course not possible to cover everything that has happened in a short introduction like this. We have attempted to cover the...
This paper is directed at developing research methods for establishing affinities between individually documented language materials and descriptions of the Erzya dialects. To this end Erzya is assessed in its relation to its Mordvinic counterpart, Moksha. First, an outline is given illustrating: (a) lexical diversity as documented for Erzya versu...
Optical Character Recognition (OCR) can substantially improve the usability of digitized documents. Language modeling using word lists is known to improve OCR quality for English. For morphologically rich languages, however, even large word lists do not reach high coverage on unseen text. Morphological analyzers offer a more sophisticated approach,...
This paper describes porting Oahpa, a set of advanced interactive language learning programs, to two new languages, both of which are spoken in Estonia: Estonian and Võro. Our programs offer a platform where the user can practice vocabulary and the generation of morphologically complex forms both in isolation and within sentential contexts. An overview...
The purpose of this article is to outline morphological facts about the two literary languages Erzya and Moksha, which can be used for estimating the distinctive character of these individual language forms. Whereas earlier morphological evaluations of the linguistic distance between Erzya and Moksha have placed them in the area of 90% cohesion, th...
This article outlines the multiple use of electronic source materials from the Livonian-Estonian-Latvian Dictionary of 2012 in a “Kone Foundation” funded project for developing finite-state morphological parsers. It provides an introduction to the project, the language-independent Giellatekno infrastructure at Tromsø, Norway, and the materials util...
This article deals with Erzya, their affiliations, where they live and where their language is spoken. A geographical presentation is outlined for where Erzya has been traditionally spoken over the past one hundred years, as documented in the collections of Heikki Paasonen. Use of the language as a medium of communication is assessed in the rural a...
This article approaches the +OmstO formative used in deverbal inflection from a concatenational perspective. It describes the morphological distinction between the elative-case non-finite in +Om+stO (sams → samsto 'to arrive'; oznoms → oznomsto 'to pray'; molÍems → molÍemste 'to go') and its counterpart the elative-case deverbal noun in +OmA+stO (sams...
This dissertation is a synchronic description of adnominal person in the highly synthetic morphological system of Erzya as attested in extensive Erzya-language written-text corpora consisting of nearly 140 publications with over 4.5 million words and over 285,000 unique lexical items. Insights for this description have been obtained from several sou...
Case in Erzya, A synthesis of morphology, semantics, syntactic function, and compatibility with number, person and definiteness
The Erzya language is an agglutinative Uralic language, and from a morphological perspective Erzya can be seen to have three basic word classes: (1) those that generally take no inflection at all; (2) those that generally...
Suomen kielitieteellinen yhdistys (SKY) [The Linguistic Association of Finland], or
The symposium 'Case in and across languages' was held in Helsinki on 27-29 August 2009.
The purpose of this paper is to illuminate the usage of the absolute, indeterminate genitive and determinate genitive forms in the Erzian language with the help of literary sources, where possible. Stress will be placed on the notions of definiteness vs. neutral deixis, [±]animate, proper vs. common noun and partitivity.
This article discusses statistics of first person singular pronoun form variation found in Erzya-language literature with insight from a paradigm by Bubrikh (1930) based on the Kozlovka vernacular, which had apparently been designated as the base of Standard Erzya before the appearance of Evsevs'ev's "Èrzân' grammatika" in 1929. The long and short...
This paper discusses three groups of derivatives in the Erzya Mordvinian language etymologically associated with the deictic pronoun se 'that (egocentric, distal)'. Each group of derivatives is, historically, from a different period, and therefore each one is dealt with separately. The individual groups are examined for deictic and discourse features...
Parentheticals with speech verbs occur in three positions in the sentence; they are initial, medial and final. The parenthetical initiates, finalizes or temporarily pauses the flow of the sentence. We know that the Erzya language has a wealth of suffixes, which allow the language a liberal freedom of sentence constituent ordering. The freedom, howe...
The Erzya language has two sets of dative personal pronouns. Although the semantics are the same, there seems to be a discourse merit to the variation. This paper provides a brief presentation of the Prague School of V. Mathesius, F. Daneš and E. Hajičová with an analysis of some early Erzya writings according to Daneš's progression. The Erzya lang...