Figure 3 - uploaded by Mika Hämäläinen
A diagram showing some triggers used in description of ALGG type nouns

Source publication
Conference Paper
Full-text available
We present advances in the development of an FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little gold standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological an...

Context in source publication

Context 1
... Skolt Sami is not a language with entirely simple concatenation strategies, we can make a few observations of the interplay between concatenation (compare concatenation and phenomena in Figure 2), on the one hand, and the compound of concatenational morphology with accompanying triggers V2VV and XYY2XY, on the other (Figure 3). The lemma for the word algg 'beginning' is the same as the nominative singular and has no morpho-phonological changes, hence no triggers are present when coding +N+Sg+Nom. In the genitive and accusative singular, however, coding +N+Sg+Acc co-occurs with coda vowel lengthening, indicated with the trigger V2VV (lengthening, i.e. one vowel becomes two), and consonant cluster weakening, indicated with the trigger XYY2XY (i.e. the consonant cluster alternation between -lgg and -lg). The .yaml test content can be further utilized as in-line testing code by simply flipping content left-to-right for analysis reading, as shown in Figure 4. ...
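The interplay of tags and triggers described in this excerpt can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not the authors' FST source: the function names, the vowel set, the `PARADIGM` table and the `lemma+tags: form` test layout are all hypothetical, and the generated string simply follows the excerpt's description (one vowel becomes two, -lgg becomes -lg), so it may differ from the real orthographic form produced by the analyzer.

```python
import re

# Assumption: a rough vowel set, sufficient only for this toy example.
VOWELS = "aeiouõâåä"


def v2vv(form: str) -> str:
    """V2VV: lengthen the first vowel found (one vowel becomes two)."""
    return re.sub(f"([{VOWELS}])", r"\1\1", form, count=1)


def xyy2xy(form: str) -> str:
    """XYY2XY: weaken a word-final cluster XYY to XY, e.g. -lgg to -lg."""
    return re.sub(f"([^{VOWELS}])([^{VOWELS}])\\2$", r"\1\2", form)


# Hypothetical lookup tables: trigger names to functions, tag strings to triggers.
TRIGGERS = {"V2VV": v2vv, "XYY2XY": xyy2xy}
PARADIGM = {
    "+N+Sg+Nom": [],                  # no morpho-phonological change
    "+N+Sg+Acc": ["V2VV", "XYY2XY"],  # lengthening plus cluster weakening
}


def generate(lemma: str, tags: str) -> str:
    """Apply the triggers that co-occur with the given tag string."""
    form = lemma
    for name in PARADIGM[tags]:
        form = TRIGGERS[name](form)
    return form


def flip(tests: dict) -> dict:
    """Turn generation tests (lexical -> surface) into analysis tests."""
    return {surface: lexical for lexical, surface in tests.items()}


if __name__ == "__main__":
    generation_tests = {
        "algg+N+Sg+Nom": generate("algg", "+N+Sg+Nom"),
        "algg+N+Sg+Acc": generate("algg", "+N+Sg+Acc"),
    }
    print(generation_tests)
    print(flip(generation_tests))
```

Keeping the triggers as separate named functions mirrors the idea that a tag string only selects which changes apply, while each change is defined once. Flipping the mapping is a loose analogue of reusing the same test content for analysis reading.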

Citations

... [footnote 30: See Rueter & Hämäläinen (2020)] In other words, an FST consists of an initial state and a finite number of medial and final states that are connected by a finite number of transitions that map input strings to output strings as regular relations. The input describing the regular language * is often called the upper or lexical tape, and the output describing the regular language * the lower or surface tape, which in the context of morphological analysis correspond to the morphological deep and surface representations of words. Transducers can also be weighted. ...
Thesis
Full-text available
This thesis explores the use of Natural Language Processing (NLP) on the Akkadian language documented from 2400 BCE to 100 CE. The methods and tools proposed in this thesis aim to fill the gaps left in previous research in Computational Assyriology, contributing to the transformation of transliterated cuneiform tablets into richly annotated text corpora, as well as to the quantitative lexicographic analysis of cuneiform texts. Three contributions of this thesis address the task of transforming Akkadian from its basic Latinized representation, transliteration, into linguistically annotated text corpora. These include (I) neural network-based automatic phonological transcription of transliterated cuneiform text, which is essential for normalizing the diverse spelling variations encountered in the Akkadian writing system; (II) finite-state-based automatic morphological analysis of Akkadian that allows deconstructing word forms into morphological labels, lemmata and part-of-speech tags to improve the useability of Akkadian corpora for quantitative analysis; and (III) creation of a morphological gold standard, and a standardized Universal Dependencies approved morphological label set for Akkadian morphology as the byproduct of an Akkadian treebank. Three contributions address the previously unexplored quantitative analysis of Akkadian lexical semantics using word association measures and word embeddings in order to better understand the language in its own terms. One of these contributions is (IV) an algorithmic method for reducing the distortion caused by fully or partially duplicated sequences in Akkadian texts. This algorithm solves over-representation issues encountered in pointwise mutual information (PMI)-based collocation analysis, and according to preliminary results, also in PMI-based word embeddings. 
Two contributions (V and VI) are quantitative case studies that demonstrate the use of PMI and word embeddings in Akkadian lexicography, and compare the results with previous qualitative philological research. The last contribution (VII) is a hybrid approach, where PMI is applied to social network analysis of the Neo-Assyrian pantheon in order to reinforce the statistical relevance between the actors. These "semantic" social networks are used to study the position of the Assyrian main god, Aššur, within the pantheon. In addition to the contributions, this thesis presents the first survey of Computational Assyriology, which covers six decades of research on automatic artifact reconstruction, optical character recognition, linguistic annotation, and quantitative analysis of cuneiform texts.
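The description of an FST in the citing excerpt above (states, transitions, an upper or lexical tape mapped to a lower or surface tape) can be made concrete with a minimal Python sketch. It is an illustration only: the `transduce` function, the transition table and the English example are assumptions made for this sketch and are not taken from the thesis or from any FST toolkit. The transducer is deterministic and unweighted.

```python
from typing import Dict, List, Optional, Set, Tuple

# (state, input symbol) -> (next state, output string)
Transitions = Dict[Tuple[int, str], Tuple[int, str]]


def transduce(
    symbols: List[str],
    transitions: Transitions,
    start: int,
    finals: Set[int],
) -> Optional[str]:
    """Map an upper-tape symbol sequence to a lower-tape string.

    Returns None if the input is rejected (no matching transition,
    or the run ends in a non-final state).
    """
    state = start
    output = []
    for symbol in symbols:
        key = (state, symbol)
        if key not in transitions:
            return None
        state, out = transitions[key]
        output.append(out)
    return "".join(output) if state in finals else None


# Hypothetical toy lexicon: lexical "cat+N+Sg" / "cat+N+Pl" to surface forms.
TOY: Transitions = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t"),
    (3, "+N"): (4, ""),    # the tag has no surface realization
    (4, "+Sg"): (5, ""),
    (4, "+Pl"): (5, "s"),  # plural tag surfaces as -s
}

if __name__ == "__main__":
    print(transduce(["c", "a", "t", "+N", "+Pl"], TOY, start=0, finals={5}))
    print(transduce(["c", "a", "t", "+N"], TOY, start=0, finals={5}))
```

The input is passed as a list of symbols rather than a plain string because morphological tags such as `+N` are single multi-character symbols on the lexical tape.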
... Finally, the rule correspondence along with the recognized word inflections is shown as the output. The whole of the above sequence is done in Kimmo [11]. Figure 1 shows the overall procedure of the lexical-surface rule-based workflow of word inflection recognition, which comprises states and directed transitions between them. ...
Article
Full-text available
The Tamil language has rich morphological inflections. A statistical study of word inflections [1] in Tamil shows that almost all nouns can be inflected to a degree of at least three folds. Recent work in this field claims that inflectional morphology grows with the inclusion of colloquial ways [2] of communication and expression in the language. Tamil is conversed and communicated in both prose and regional verse forms. Modern morphological analyzers face the challenge of extracting the root morpheme of the inflected word of interest. As the degree of inflection folding increases, the corresponding algorithms and tools developed for the purpose fail to prove robust and stray from accurate root morpheme extraction. Researchers have tried and deployed differing methods to address the growing issue. Most of these published methods can be categorized into classes such as command-based extractors, script-based extractors and rule-based extractors. The latter consistently maintains extraction robustness even as the inflection of a morpheme increases. Rule-based morpheme extractors see every word in two forms: lexical form and surface form. These extractors try to establish a correspondence between the two forms through a rule of the language. This article is the outcome of an attempt to address the issue of manifold inflections through rule-based Lexical-Surface (LS) morpheme extractors [3].
... Despite the low number of speakers, they had the presentations of the Sami cultural event simultaneously interpreted from Skolt Sami to Finnish and from other Sami languages to Skolt Sami by professional interpreters. Thanks to Rueter's continuous efforts for the digital revitalization of the language, Skolt Sami has an extensive digital multilingual dictionary [30] and FST morphology [27]. The situation of Skolt Sami is fortunate in the sense that it is one of many Sami languages. ...
Preprint
Full-text available
The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, calling endangered languages low-resourced is a bit of an overstatement. In this paper, I inspect the relation of the endangered with the low-resourced from my own experiences.
... Despite the low number of speakers, they had the presentations of the Sami cultural event simultaneously interpreted from Skolt Sami to Finnish and from other Sami languages to Skolt Sami by professional interpreters. Thanks to Rueter's continuous efforts for the digital revitalization of the language, Skolt Sami has an extensive digital multilingual dictionary [30] and FST morphology [27]. The situation of Skolt Sami is fortunate in the sense that it is one of many Sami languages. ...
Chapter
Full-text available
The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, calling endangered languages low-resourced is a bit of an overstatement. In this paper, I inspect the relation of the endangered with the low-resourced from my own experiences.
Conference Paper
Full-text available
Many endangered Uralic languages have multilingual machine-readable dictionaries saved in an XML format. However, the dictionaries cover translations very inconsistently between language pairs, for instance, the Livonian dictionary has some translations to Finnish, Latvian and Estonian, and the Komi-Zyrian dictionary has some translations to Finnish, English and Russian. We utilize graph-based approaches to augment such dictionaries by predicting new translations to existing and new languages based on different dictionaries for endangered languages and Wiktionaries. Our study focuses on the lexical resources for Komi-Zyrian (kpv), Erzya (myv) and Livonian (liv). We evaluate our approach by human judges fluent in the three endangered languages in question. Based on the evaluation, the method predicted good or acceptable translations 77% of the time. Furthermore, we train a neural prediction model to predict the quality of the automatically predicted translations with an 81% accuracy. The resulting extensions to the dictionaries are made available on the online dictionary platform used by the speakers of these languages.
Article
Full-text available
We present an open-source online dictionary editing system, Ve′rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami. Problems involve getting the community to take part in things above the pencil-and-paper level. At times, it seems that native speakers and dictionary-oriented contributors lack the technical understanding to utilize the infrastructures that might make their work more meaningful in the future, i.e. multiple reuse of all of their input. Therefore, our system integrates with the existing tools and infrastructures for Uralic languages, masking the technical complexities behind a user-friendly UI.
Article
Full-text available
We present our infrastructure for the documentation of Uralic languages, which consists of tools for writing dictionaries in such a way that the entries are structured in XML (Extensible Markup Language) format. From the XML dictionaries we can generate code for morphological analyzers that are useful for all kinds of NLP activities. In this article we show the advantages of digital, machine-readable documentation. We also describe the system in the context of endangered Uralic languages.
Conference Paper
Full-text available
We investigate both rule-based and machine learning methods for the task of compound error correction and evaluate their efficiency for North Sámi, a low-resource language. The lack of error-free data needed for a neural approach is a challenge to the development of these tools, which is not shared by bigger languages. In order to compensate for that, we used a rule-based grammar checker to remove erroneous sentences and insert compound errors by splitting correct compounds. We describe how we set up the error detection rules, and how we train a bi-RNN-based neural network. The precision of the rule-based model tested on a corpus with real errors (81.0%) is slightly better than the neural model (79.4%). The rule-based model is also more flexible with regard to fixing specific errors requested by the user community. However, the neural model has a better recall (98%). The results suggest that an approach that combines the advantages of both models would be desirable in the future. Our tools and data sets are open-source and freely available on GitHub and Zenodo.
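The data-generation idea mentioned in this abstract, inserting compound errors by splitting correct compounds, can be sketched roughly as follows. This is a hypothetical illustration and not the authors' pipeline: the function name `make_error_pairs`, the whitespace tokenization, the hand-written `compound_parts` lookup and the English example words are all assumptions. In the paper the compounds would come from North Sámi resources and the sentences would first be filtered by a grammar checker.

```python
from typing import Dict, List, Tuple


def make_error_pairs(
    sentence: str,
    compound_parts: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Create (erroneous, correct) training pairs from one correct sentence.

    Each known compound in the sentence is split into its parts, one
    compound at a time, so every pair contains exactly one inserted error.
    """
    tokens = sentence.split()
    pairs = []
    for i, token in enumerate(tokens):
        parts = compound_parts.get(token)
        if parts:
            corrupted = tokens[:i] + parts + tokens[i + 1:]
            pairs.append((" ".join(corrupted), sentence))
    return pairs


if __name__ == "__main__":
    # Hypothetical English stand-ins for real compounds.
    lookup = {"notebook": ["note", "book"], "sunlight": ["sun", "light"]}
    for wrong, right in make_error_pairs("the notebook lay in sunlight", lookup):
        print(wrong, "->", right)
```

Producing one error per pair keeps each training example easy to align with its correction. Whether the original work inserts one or several errors per sentence is not stated in the abstract.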
Conference Paper
Full-text available
In this paper, we present our free and open-source online dictionary editing system that has been developed for editing the new edition of the Finnish-Skolt Sami dictionary. We describe how the system can be used in post-editing a dictionary and how NLP methods have been incorporated as a part of the workflow. In practice, this means the use of FSTs (finite-state transducers) to enhance connections between lexemes and to generate inflection paradigms automatically. We also discuss our work in the wider context of lexicography of endangered languages. Our solutions are based on the open-source work conducted in the Giella infrastructure, which means that our system can be easily extended to other endangered languages as well. We have collaborated closely with Skolt Sami community lexicographers in order to build the system for their needs. As a result of this collaboration, the latest Finnish-Skolt Sami dictionary was edited and published using our system.