Jack Rueter

University of Helsinki | HY · Digital Humanities

Doctor of Philosophy

Publications: 73
Reads: 2,855
Citations: 278
Citations since 2016: 273 (59 research items)
Introduction
I make finite-state morpho-lexical descriptions for some of the endangered and low-resource languages I speak (Erzya, Moksha, Komi-Zyrian) and comprehend to varying degrees (Skolt Sami, Võro, Komi-Permyak, Lushootseed, Apurinã, Sakurabia, Udmurt, Livonian, Olonets-Karelian, Tenino, Wasco-Wishram, Paiute, etc.), mainly in the Giella infrastructure in Tromsø, Norway. I also work with Universal Dependencies and, more recently, with Apertium shallow-transfer machine translation. https://github.com/rueter
Additional affiliations
May 2011 - November 2011
Kone Foundation
Position
  • Language Programme author
Description
  • Together with the secretary Anna Talasniemi, I authored a five-year funding program (2013–2017) targeting scholarly studies and facilitation for minority Finno-Ugric languages as well as the minority languages of Finland. Some projects funded by this five-year program are still underway today.
September 1997 - June 2004
Mordovian State University
Position
  • Language Assistant
Description
  • Fifth-year Finnish language instruction for Erzya- and Moksha-language students, and Finno-Ugric language history, in the Department of Finno-Ugric Language Studies.

Publications (73)
Presentation
Full-text available
Presentation on verbs of ingestion in the Erzya language and how they operate with four case-marked noun phrases. In the short grammar tradition starting from nominative-genitive alternation when speaking of whole objects. Definite genitive vs definite inessive alternation for marking ongoing aspect when attached to the causee in a causative struct...
Article
Full-text available
This document is dedicated to a young man, who, despite the number of times he has traveled around the Sun, is always open to new thoughts on ways to include languages, especially the smaller ones, and the people who speak them in far-reaching and sustainable open-source development. Since Trond Trosterud in Tromsø is attributed a terrific track re...
Presentation
Full-text available
Abstract: This presentation approaches the partitive function in Erzya verbs of ingestion. The verbs of ingestion, ‘eat’, ‘drink’, ‘smoke’, ‘breathe’, etc., are notorious in Erzya and Moksha for their collocation with direct objects in the ablative case, with partitive function. As Erzya does not have one specific case for marking the direct object,...
Conference Paper
Full-text available
Many endangered Uralic languages have multilingual machine-readable dictionaries saved in an XML format. However, the dictionaries cover translations very inconsistently between language pairs; for instance, the Livonian dictionary has some translations to Finnish, Latvian and Estonian, and the Komi-Zyrian dictionary has some translations to Finni...
Preprint
Full-text available
The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813-1852). The Finno-Ugrian Society is publishing Castrén's manuscripts as new critical and digital editions, and at the same time different research groups have...
Article
Full-text available
We present a new approach to inducing a bilingual dictionary between the endangered Erzya and Moksha languages automatically based on existing dictionaries in other languages. This work is, for the most part, complementary to the Mordvin research done by PhD László Keresztes, who has demonstrated the alignment of the two language forms in morpholog...
Conference Paper
Full-text available
The study forms a technical report of various tasks that have been performed on the materials collected and published by Finnish ethnographer and linguist, Matthias Alexander Castrén (1813-1852). The Finno-Ugrian Society is publishing Castrén's manuscripts as new critical and digital editions, and at the same time different research groups have als...
Conference Paper
Full-text available
This study discusses the way different numerals and related expressions are currently annotated in the Universal Dependencies project, with specific focus on the Uralic language family. We analyse different annotation conventions between individual treebanks, and aim to highlight some areas where further development work and systematization could p...
Conference Paper
Full-text available
We present the first openly available corpus for detecting depression in Thai. Our corpus is compiled by expert verified cases of depression in several online blogs. We experiment with two different LSTM based models and two different BERT based models. We achieve a 77.53% accuracy with a Thai BERT model in detecting depression. This establishes a...
Preprint
Full-text available
We present the first openly available corpus for detecting depression in Thai. Our corpus is compiled by expert verified cases of depression in several online blogs. We experiment with two different LSTM based models and two different BERT based models. We achieve a 77.53% accuracy with a Thai BERT model in detecting depression. This establishes a...
Conference Paper
Full-text available
Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice. We present the first approach to automatically detect the dialect of a speaker based on a dialect transcript and transcript with audio recording in a dataset consisting of...
Article
Full-text available
There are two main written Komi varieties, Permyak and Zyrian. These are mutually intelligible but derive from different parts of the same Komi dialect continuum, representing the varieties prominent in the vicinity and in the cities of Syktyvkar and Kudymkar, respectively. Hence, they share a vast number of features, as well as the majority of th...
Article
Full-text available
Problems: WALS (71) The Prohibitive, (112) Negative Morphemes, and (120) Zero Copula for Predicate Nominals. The corpora consulted in this research contain approximately 3.5 million words from over 120 publications of the Erzya literary language. Versions of these corpora exist on servers of the Max-Planck-Institute in Leipzig and the General Linguistics...
Preprint
Full-text available
Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice. We present the first approach to automatically detect the dialect of a speaker based on a dialect transcript and transcript with audio recording in a dataset consisting of...
Conference Paper
Full-text available
This study describes the ongoing development of the finite-state description for an endangered minority language, Komi-Zyrian. This work is located in the context where large written and spoken language corpora are available, which creates a set of unique challenges that have to be, and can be, addressed. We describe how we have designed the transd...
Article
Full-text available
We present our infrastructure for the documentation of Uralic languages, which consists of tools for writing dictionaries in such a way that the entries are structured in the XML (Extensible Markup Language) format. From the XML dictionaries we can generate code for morphological analyzers, which are useful for all kinds of...
Conference Paper
Full-text available
In this paper, we present our free and open-source online dictionary editing system that has been developed for editing the new edition of the Finnish-Skolt Sami dictionary. We describe how the system can be used in post-editing a dictionary and how NLP methods have been incorporated as a part of the workflow. In practice, this means the use of FST...
Conference Paper
Full-text available
This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language. The treebank contains 76 fully annotated sentences, applies 14 parts-of-speech, as well as seven augmented or new features, some of which are unique to Apurinã. The construction of the treebank has also served as an opportunity to develop finite-sta...
Preprint
Full-text available
This study presents a new dataset on rumor detection in Finnish language news headlines. We have evaluated two different LSTM based models and two different BERT models, and have found very significant differences in the results. A fine-tuned FinBERT reaches the best overall accuracy of 94.3% and rumor label accuracy of 96.0% of the time. However,...
Preprint
Full-text available
This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language. The treebank contains 76 fully annotated sentences, applies 14 parts-of-speech, as well as seven augmented or new features, some of which are unique to Apurinã. The construction of the treebank has also served as an opportunity to develop fini...
Conference Paper
Full-text available
This study presents a new dataset on rumor detection in Finnish language news headlines. We have evaluated two different LSTM based models and two different BERT models, and have found very significant differences in the results. A fine-tuned FinBERT reaches the best overall accuracy of 94.3% and rumor label accuracy of 96.0% of the time. However,...
Preprint
Full-text available
We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages. We present a method for automatically extracting a substantially large amount of training data from FSTs for 22 languages, out of which 17 are endangered. The neural models follow the same tagset as the FSTs in order to make it possible...
Conference Paper
Full-text available
We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages. We present a method for automatically extracting a substantially large amount of training data from FSTs for 22 languages, out of which 17 are endangered. The neural models follow the same tagset as the FSTs in order to make it possible...
Preprint
Full-text available
This paper presents the current lexical, morphological, syntactic and rule-based machine translation work for Erzya and Moksha that can and should be used in the development of a roadmap for Mordvin linguistic research. We seek to illustrate and outline initial problem types to be encountered in the construction of an Apertium-based shallow-transfe...
Conference Paper
Full-text available
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami...
Article
Full-text available
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami...
Preprint
Full-text available
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami...
Conference Paper
Full-text available
This document describes shared development of finite-state description of two closely related but endangered minority languages, Erzya and Moksha. It touches upon morpholexical unity and diversity of the two languages and how this provides a motivation for shared open-source FST development. We describe how we have designed the transducers so that...
Preprint
Full-text available
This study uses a character-level neural machine translation approach trained on a long short-term memory-based bi-directional recurrent neural network architecture for diacritization of Medieval Arabic. The results improve on the online tool used as a baseline. A diacritization model has been published openly through an easy to use Python packa...
Preprint
Full-text available
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character-level NMT models using both multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dial...
Preprint
Full-text available
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character-level NMT models using both multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dial...
Conference Paper
Full-text available
jack.rueter@helsinki.fi Abstract: This document addresses the creation of a national corpus for language documentation of the Moksha and Erzya literary languages, in coordination with dialect archives covering more than 80 years of fieldwork (incl. Shoksha, Karatai). It also presents current research and the development of the necessary computer...
Conference Paper
Full-text available
We present advances in the development of a FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little gold standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological an...
Preprint
Full-text available
We present advances in the development of a FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little gold standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological an...
Article
Full-text available
This article illustrates the initial development of a finite-state two-level model for Komi-Zyrian from 1996 (Izhevsk, Udmurtia) as presented in Permistika 6, 2000. It illustrates the use of Latin letter representations for the Cyrillic spelling of Komi-Zyrian, variation in spelling conventions of the language, stem alternation observed in inflectio...
Chapter
This paper will provide a brief description of Skolt Sami and how it might be construed as a pluricentric language. Historical factors are identified that might contribute to a pluricentric identity: geographic location and political history; shortages of language documentation, and the establishment of a normative body for the development of a sta...
Conference Paper
Full-text available
We present an open online infrastructure for editing and visualization of dictionaries of different Uralic languages (e.g. Erzya, Moksha, Skolt Sami and Komi-Zyrian). Our infrastructure integrates fully into the existing Giellatekno one in terms of XML dictionaries and FST morphology. Our code is open source, and the system is being actively used i...
Conference Paper
Full-text available
We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character-level NMT models using both multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dial...
Conference Paper
Full-text available
This paper presents the current lexical, morphological, syntactic and rule-based machine translation work for Erzya and Moksha that can and should be used in the development of a roadmap for Mordvin linguistic research. We seek to illustrate and outline initial problem types to be encountered in the construction of an Apertium-based shallow-transfe...
Conference Paper
Full-text available
The article examines poetry recitation through a qualitative corpus-based analysis, identifying prosodic features and their functional significance. The analysis is based on four recited poems. Finally, it describes the operation of a computational algorithm that produces prosodic features for speech synthesis so that the synthesis imitates real poetry recitation...
Conference Paper
Full-text available
This paper attempts to evaluate some of the systematic differences in Uralic Universal Dependencies treebanks from a perspective that would help to introduce reasonable improvements in treebank annotation consistency within this language family. The study finds that the coverage of Uralic languages in the project is already relatively high, and the...
Article
Timothy Feist: A Grammar of Skolt Saami. Mémoires de la Société Finno-Ougrienne 273. Finno-Ugrian Society. Helsinki 2015. 414 p. https://doi.org/10.33339/fuf.86126 This is an assessment of the merits of the English-language Skolt Sami Grammar written by Timothy Feist with respect to existing scholarship already available in English, Finnish and G...
Conference Paper
Full-text available
We describe a MediaWiki-based online dictionary for endangered Uralic languages. The system makes it possible to synchronize edits done in XML-based dictionaries and edits done in the MediaWiki system. This makes it possible to integrate the system with the existing open-source Giellatekno infrastructure that provides and utilizes XML formatted dic...
Conference Paper
Full-text available
Endangered Uralic languages present a high variety of inflectional forms in their morphology. This results in a high number of homonyms in inflections, which introduces a lot of morphological ambiguity in sentences. Previous research has employed constraint grammars to address this problem; however, CGs are often unable to fully disambiguate a sente...
Conference Paper
Full-text available
In this paper, we identify the need for a standardized formalism for the structured XML dictionaries of endangered Uralic languages in the Giella infrastructure. For this purpose, we have decided to use the TEI formalism, as it is a standardized way of representing data and it is commonly used in the field of lexicography. This paper focuses on describing...
Conference Paper
Full-text available
We approach the problem of expanding the set of cognate relations with a sequence-to-sequence NMT model. The language pair of interest, Skolt Sami and North Sami, has too limited a set of parallel data for an NMT model as such. We solve this problem on the one hand, by training the model with North Sami cognates with other Uralic languages and, on...
Conference Paper
Full-text available
We present an open online infrastructure for editing and visualization of dictionaries of different Uralic languages (e.g. Erzya, Moksha, Skolt Sami and Komi-Zyrian). Our infrastructure integrates fully into the existing Giellatekno one in terms of XML dictionaries and FST morphology. Our code is open source.
Conference Paper
Full-text available
This paper presents multiple methods for normalizing the most deviant and infrequent historical spellings in a corpus consisting of personal correspondence from the 15th to the 19th century. The methods include machine translation (neural and statistical), edit distance and rule-based FST. Different normalization methods are compared and evaluated....
Conference Paper
Full-text available
We present an open source Python library to automatically produce syntactically correct Finnish sentences when only lemmas and their relations are provided. The tool automatically resolves morphosyntax in the sentence, such as agreement and government rules, and uses Omorfi to produce the correct morphological forms. In this paper, we discuss how cas...
Conference Paper
Full-text available
We present our ongoing development of a synchronized XML-MediaWiki dictionary to solve the problem of XML dictionaries in the context of small Uralic languages. XML is good at representing structured data, but it does not fare well in a situation where multiple users are editing the dictionary simultaneously. Furthermore, XML is overly complicated...
Article
Full-text available
Building language resources for endangered languages, especially in the case of dictionaries, requires a substantial amount of manual work. This, however, is a time-consuming undertaking, and it is also why we propose an automated method for expanding the knowledge in the existing dictionaries. In this paper, we present an approach to automatically...
Article
Full-text available
In this introduction we have tried to present concisely the history of language technology for Uralic languages up until today, and a bit of a desiderata from the point of view of why we organised this special issue. It is of course not possible to cover everything that has happened in a short introduction like this. We have attempted to cover the...
Article
Full-text available
This paper is directed at developing research methods for establishing affinities between individually documented language materials and descriptions of the Erzya dialects. To this end Erzya is assessed in its relation to its Mordvinic counterpart, Moksha. First, an outline is given illustrating: (a) lexical diversity as documented for Erzya versu...
Article
Full-text available
Optical Character Recognition (OCR) can substantially improve the usability of digitized documents. Language modeling using word lists is known to improve OCR quality for English. For morphologically rich languages, however, even large word lists do not reach high coverage on unseen text. Morphological analyzers offer a more sophisticated approach,...
Conference Paper
Full-text available
This paper describes porting Oahpa, a set of advanced interactive language learning programs, to two new languages, both of which are spoken in Estonia: Estonian and Võro. Our programs offer a platform where the user can practice vocabulary and the generation of morphologically complex forms both in isolation and within sentential contexts. An overview...
Preprint
Full-text available
The purpose of this article is to outline morphological facts about the two literary languages Erzya and Moksha, which can be used for estimating the distinctive character of these individual language forms. Whereas earlier morphological evaluations of the linguistic distance between Erzya and Moksha have placed them in the area of 90% cohesion, th...
Article
Full-text available
This article outlines the multiple use of electronic source materials from the Livonian-Estonian-Latvian Dictionary of 2012 in a “Kone Foundation” funded project for developing finite-state morphological parsers. It provides an introduction to the project, the language-independent Giellatekno infrastructure at Tromsø, Norway, and the materials util...
Article
This article deals with Erzya, their affiliations, where they live and where their language is spoken. A geographical presentation is outlined for where Erzya has been traditionally spoken over the past one hundred years, as documented in the collections of Heikki Paasonen. Use of the language as a medium of communication is assessed in the rural a...
Article
Full-text available
This article approaches the +OmstO formative used in deverbal inflection from a concatenational perspective. It describes the morphological distinction between the elative-case non-finite in +Om+stO (sams → samsto 'to arrive'; oznoms → oznomsto 'to pray'; molÍems → molÍemste 'to go') and its counterpart, the elative-case deverbal noun in +OmA+stO (sams...
Article
Full-text available
This dissertation is a synchronic description of adnominal person in the highly synthetic morphological system of Erzya as attested in extensive Erzya-language written-text corpora consisting of nearly 140 publications with over 4.5 million words and over 285,000 unique lexical items. Insight for this description has been obtained from several sou...
Presentation
Full-text available
Case in Erzya: A synthesis of morphology, semantics, syntactic function, and compatibility with number, person and definiteness. The Erzya language is an agglutinative Uralic language, and from a morphological perspective Erzya can be seen to have three basic word classes: (1) those that generally take no inflection at all; (2) those that generally...
Presentation
Full-text available
The Linguistic Association of Finland (Suomen kielitieteellinen yhdistys, SKY): the symposium 'Case in and across languages' was held in Helsinki on 27-29 August 2009.
Article
Full-text available
The purpose of this paper is to illuminate the usage of the absolute, indeterminate genitive and determinate genitive forms in the Erzian language with the help of literary sources, where possible. Stress will be placed on the notions of definiteness vs. neutral deixis, [±]animate, proper vs. common noun and partitivity.
Article
Full-text available
This article discusses statistics of first person singular pronoun form variation found in Erzya-language literature with insight from a paradigm by Bubrikh (1930) based on the Kozlovka vernacular, which had apparently been designated as the base of Standard Erzya before the appearance of Evsevs'ev's "Èrzân' grammatika" in 1929. The long and short...
Article
Full-text available
This paper discusses three groups of derivatives in the Erzya Mordvinian language etymologically associated with the deictic pronoun se 'that (egocentric, distal)'. Each group of derivatives is, historically, from a different period, and therefore each one is dealt with separately. The individual groups are examined for deictic and discourse features...
Article
Full-text available
Parentheticals with speech verbs occur in three positions in the sentence; they are initial, medial and final. The parenthetical initiates, finalizes or temporarily pauses the flow of the sentence. We know that the Erzya language has a wealth of suffixes, which allow the language a liberal freedom of sentence constituent ordering. The freedom, howe...
Article
Full-text available
The Erzya language has two sets of dative personal pronouns. Although the semantics are the same, there seems to be a discourse merit to the variation. This paper provides a brief presentation of the Prague School of V. Mathesius, F. Daneš and E. Hajičová with an analysis of some early Erzya writings according to Daneš's progression. The Erzya lang...

Projects (2)
Project
The goal of the project is to research and develop open source tools and infrastructure for Uralic languages. This includes dictionary work, FSTs and parsers.