Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020) , pages 250–257
Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC
FST Morphology for the Endangered Skolt Sami Language
Jack Rueter, Mika Hämäläinen
Department of Digital Humanities
University of Helsinki
{jack.rueter, mika.hamalainen}@helsinki.fi
Abstract
We present advances in the development of an FST-based morphological analyzer and generator for Skolt Sami. Like other minority
Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little gold-standard material for it, on the other.
This makes NLP approaches for its study difficult without a solid morphological analysis. The language is severely endangered and the
work presented in this paper forms a part of a greater whole in its revitalization efforts. Furthermore, we intersperse our description with
facilitation and description practices not well documented in the infrastructure. Currently, the analyzer covers over 30,000 Skolt Sami
words in 148 inflectional paradigms and over 12 derivational forms.
Keywords: Skolt Sami, endangered languages, morphology
1. Introduction
Skolt Sami is a minority language belonging to the Sami
branch of the Uralic language family. With only around 300
native speakers, it is considered a severely endangered
language (Moseley, 2010), which, despite its pluricentric
potential, is decidedly focusing on one mutual language
(Rueter and Hämäläinen, 2019). In this paper, we present our
open-source FST morphology for the language, which is part
of the wider context of its ongoing revitalization efforts.
The intricacies of Skolt Sami morphology include qual-
ity and quantity variation in the word stem as well as
suprasegmental palatalization before subsequent affixes.
Like Northern Sami and Estonian, Skolt Sami has conso-
nant quantity and quality variation that surpasses that of
Finnish, i.e. Skolt Sami has as many as three lengths in
the vowel and consonant quantities in a given word.
The finite-state description of Skolt Sami involves develop-
ing strategies for reusability of open-source documentation
in other minority languages. In other words, the FST de-
scription is designed in such a fashion that it can be ap-
plied to other languages as well with minimal modifica-
tions. Skolt Sami, like many other minority Uralic lan-
guages, attests to a fair degree of regular morphology, i.e.,
its nouns are marked for the categories of number, pos-
session and numerous case forms with regular diminutive
derivation, and its verbs are conjugated for tense, mood
and person in addition to undergoing several regular derivations.
Morphological descriptions have been developed in
the GiellaLT (Sami Language technology) infrastructure at
the Norwegian Arctic University in Tromsø, using Helsinki
Finite-State Technology (HFST) (Lindén et al., 2013).
Working in the GiellaLT infrastructure, it is possible to apply
ready-made solutions to multiple language learning, facilitation
and empowerment tasks. Leading into the digital age, there are
ongoing implementations, such as keyboards¹ for various platforms,
and corpora², being expanded to provide developers, researchers
and language community members access to language materials
directly. The trick is to find new uses and reuses for data sets
and technologies as well as to bring development closer to the
language community. If development follows the North Sámi lead,
any project can benefit from the work already done.
¹ http://divvun.no/keyboards/index.html/
² http://gtweb.uit.no/korp/
Extensive work has already been done on data and tool
development in the GiellaLT infrastructure (Moshagen et al.,
2013; Moshagen et al., 2014), and previous work also exists
for Skolt Sami³ (Sammallahti and Mosnikoff, 1991; Sammallahti,
2015; Feist, 2015). There are online and click-in-text
dictionaries (Rueter, 2017),⁴ spell checkers (Morottaja et al.,
2018)⁵ (these are implemented in OpenOffice, but some of the
more prominent languages are also supported in MS Word), as
well as rule-based language learning (Antonsen et al., 2013;
Uibo et al., 2015). For languages with extensive description
and documentation, there are syntax checkers (Wiechetek et al.,
2019), machine translation (Antonsen et al., 2017) and speech
synthesis and recognition (Hjortnaes et al., 2020), just to
mention the tip of the iceberg (Rueter, 2014). From the point
of view of language learners and researchers, the development
and application of these tools points to well-organized
morphosyntactic and lexical descriptions of the language in focus.
By well-organized descriptions, we mean approaching the
tasks at hand with applied reusability. Reusability is
illustrated in the construction of a morphological analyzer for
linguists, which, because it is able to recognize and analyze
regular morphological forms, can also serve as a morphological
spell checker. In fact, this same analyzer can be reversed and
used as a generator, which is useful in providing language
learners with fixed, analogous and random tasks in morphology.
The same morphological analyzer, when augmented by glosses,
can immediately begin to provide online dictionary and
click-in-text analyses.
³ http://oahpa.no/sms/useoahpa/background.eng.html/, read further in this article for subsequent developments in http://oahpa.no/nuorti/
⁴ The forerunner https://sanit.oahpa.no/read/, an online dictionary here, and on analogous pages of other dictionaries (e.g., https://saan.oahpa.no/read/), can be dragged to the tool bar of Firefox and Google Chrome
⁵ http://divvun.no/korrektur/korrektur.html/
The development of an optimal morphological analyzer and
glossing for a language like Skolt Sami requires concise
morphological and lexical work, on the one hand, and ac-
cess to corpora including language learning materials, on
the other. Corpora provide access to language in use, and
language learning materials help to establish a received un-
derstanding of the language. To this end, the morphologi-
cal analyzer for Skolt Sami has been constructed to analyze
and generate a pedagogically enhanced orthography, for in-
dication of short and long diphthongs preceding geminates
as well as mid low front vowels, as might be rendered in a
pronouncing dictionary. One such example might be seen
in the word kuẹ′tt ‘hut’ as opposed to the literal norm kue′tt,
where the dot below the e not only indicates a slightly lowered
pronunciation of the vowel but also assists in identifying the
paradigm type, kuẹ′tt : kue′đid ‘hut+N+Pl+Acc’
versus kue′ll : kuõ′lid ‘fish+N+Pl+Acc’.
By focusing on the construction of a pedagogically enhanced
analyzer-generator, teaching resources can be developed
that target randomly generated morphological tasks for the
language learner as in the North Sami learning tool Davvi.⁶
In any given language reader, there are texts with words in
various forms and an accompanying vocabulary. While vo-
cabulary translation can readily be utilized as a fixed task in
language learning, inflectional tasks, especially in morpho-
logically rich languages, can be developed as random exer-
cises. Although the contextual word forms in the reader are
quite limited, it is possible to construct randomized mor-
phological exercises where the student is expected to in-
flect nouns, adjectives and verbs alike in forms that have
been taught but not explicitly given for the random words
provided in the reader vocabulary, e.g. the student
may select vocabulary from reader A, chapters 1–5, with a
randomized task for nouns in the plural comitative with a third
person singular possessive suffix: +N+Pl+Com+PxSg3.
Essentially, all nouns in the selected vocabulary available for
this reading are inadvertently presented to the learner.
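The randomized task selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name make_tasks and the three-word stand-in vocabulary are hypothetical, and the step that would hand each lemma-plus-tags string to the generator for the expected answer is left out.

```python
import random

def make_tasks(vocabulary, tags, n, seed=None):
    """Pick n random lemmas and pair each with the tag string to be generated."""
    rng = random.Random(seed)  # a seed makes a task set reproducible
    chosen = rng.sample(vocabulary, k=min(n, len(vocabulary)))
    # Each task is the generator-side string, e.g. "algg+N+Pl+Com+PxSg3"
    return [f"{lemma}{tags}" for lemma in chosen]

# Stand-in for the noun vocabulary of the selected reader chapters (hypothetical)
nouns = ["algg", "radio", "kue′tt"]
tasks = make_tasks(nouns, "+N+Pl+Com+PxSg3", n=2, seed=0)
print(tasks)
```

The learner would see only the lemma and the requested categories; the word form returned by the generator for each string would serve as the answer key.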
2. Related Work
In the past, multiple methods have been proposed for auto-
matically learning morphology for a given language. One
of these is Morfessor (Creutz and Lagus, 2007), which is a
set of tools designed to learn morphology from raw textual
data. It has been developed with Finnish in mind, and this
means that it is intended to perform well with extensive reg-
ular morphology, i.e. morphologically rich languages, too.
Bergmanis and Goldwater (2017) present another statistical
approach that can also take
spelling variation into account. Their approach is based on
the notion of a morphological chain consisting of child-
parent pairs. When analyzing the morphology of a lan-
guage, the approach takes several features into account such
as presence of the parent in the training data, semantic sim-
ilarity, likely affixes and so on.
Such statistical approaches, however, are data-hungry. This
is a problem for various reasons in the case of Skolt Sami.
The scarce quantity of textual data is one limitation, but it
is an even greater one given that the language is still being
standardized and the users provide a variety of forms and
vocabulary when expressing themselves in their native
language. This means an even greater variety in morphology
that the statistical model should be able to capture from a
limited dataset.
⁶ http://oanpa.no/davvi/morfas/
In the absence of a reasonably sized descriptive corpus of
the language, annotated or not, the most accurate way to
model the morphology is by using a rule-based methodol-
ogy.
FSTs (Finite-State Transducers) have been shown in the
past to be an effective way to model the morphology even
for languages with an abundance of morphological features
(cf. Beesley and Karttunen, 2003). Perhaps one of the
largest-scale FSTs to model the morphology of a language
is the one developed for Finnish (Pirinen et al., 2017). This
tool, Omorfi, serves as the state-of-the-art morphological
analyzer for Finnish.
3. The FST Model Development Pipeline
Developing a morphological description of a language pre-
supposes a language-learning and documentary approach.
Other people have learned the language and become profi-
cient in it before you, so extract paradigms from grammars,
readers and research to build the language model. If you
are the first researcher to describe the language, take hints
from the language learners, if there are any: they may
still be developing their own understanding of the language's
morphosyntax, and, at times, they may provide you with
informative interpretations of the language.
Idiosyncrasies of a language can, sometimes, be captured
through comparison to those of another. When a descrip-
tion of Skolt Sami, Finnish, Estonian, etc. introduces alien
phenomena, such as word-stem quality and quantity vari-
ation as well as suprasegmental palatalization, it is a good
idea to try describing them both separately and in tandem.
Word-stem quality variation affects both consonants and
vowels. In consonants, an analogous English example might
be illustrated with the f:v variation found in the English
words life, lives and loaf, loaves. From a historical
perspective, the verb to live will serve as an instance where long
and short vowels accentuate a distinction between nouns
and verbs. In a like manner, the English verb paradigm
(sing, sang, sung) provides a sample of vowel variation
with regular semantic alignment in other verbs, such as
swim and drink. These seemingly peripheral phenomena
of English, however, are central to the description of Skolt
Sami morphology, where consonant quality and quantity
variation permeate the verbal and nominal inflection sys-
tems. Suprasegmental palatalization is yet another phe-
nomenon to be dealt with, as it may present its own influ-
ence on sound variations in both the consonants and vowels
in the same coda of a word stem. These require sound vari-
ation modeling in what is referred to as a two-level model,
where awareness of underlying hypothetical sound patterns
and surface-level reflexes are united to facilitate analysis
and generation of paradigmatic stem type variation, e.g.
an underlying sw{iau}m could be configured with a ^VowI
trigger to call the form swim, ^VowA the form swam, and
^VowU the form swum.
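The English illustration can be made concrete with a minimal Python sketch. It is not the two-level rule formalism used in the project; it merely mimics the idea that a trigger selects one member of the array in the underlying form. The function name realize and the trigger table are assumptions made for this example.

```python
import re

# Trigger names follow the English illustration in the text; the mapping is assumed.
TRIGGER_TO_VOWEL = {"^VowI": "i", "^VowA": "a", "^VowU": "u"}

def realize(underlying: str, trigger: str) -> str:
    """Replace the first {...} array in the stem with the vowel the trigger selects."""
    vowel = TRIGGER_TO_VOWEL[trigger]
    match = re.search(r"\{([^{}]+)\}", underlying)
    if match is None:
        return underlying  # no stem-internal variation, e.g. "radio"
    if vowel not in match.group(1):
        raise ValueError(f"{vowel!r} is not licensed by the array {match.group(0)}")
    return underlying[:match.start()] + vowel + underlying[match.end():]

forms = [realize("sw{iau}m", t) for t in ("^VowI", "^VowA", "^VowU")]
print(forms)  # ['swim', 'swam', 'swum']
```

The array documents which surface vowels are possible, while the trigger supplied by the morphology decides which one is realized in a given form.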
Theoretically speaking, Skolt Sámi has vowel and consonant
quantity variation in three lengths, i.e. monophthongs
and diphthongs as well as geminates and consonant clusters
are subject to three lengths. One problem with the initial
finite-state description of Skolt Sami was that attempts
were made to describe Skolt Sami according to the
complementary distribution of quantity found in North Sámi.⁷
By chance, the author set out to describe vowel and conso-
nant quantity as separate conjoined phenomena, and when
the instance of short vowel and shortened consonant in tan-
dem presented itself, only a little extra implementation was
required for identifying this new variation. In fact, the phe-
nomenon had been described earlier as allegro versus largo,
but it had been ignored in some of the linguistic literature
(Koponen and Rueter, 2016).
Preparing the description of a single word is much like writ-
ing a terse dictionary entry. The required information con-
sists of a head word form or lemma, a stem form from
which to derive all required stems, a continuation lexicon
indicating paradigm type (part of speech is also interest-
ing), and finally a gloss or note. The word radio ‘radio’
might be presented as follows:
radio+N:radio N_RADIO "radio" ;
The LEMMA:STEM CONTINUATION-LEXICON NOTE presentation
represents one line of code consisting of four
pieces of data. First comes the index, which consists of the
lemma and part-of-speech tag. Second, after a separating
colon, comes the stem, which, together with the continuation
lexicon (third constituent), makes paradigm compilation possible
by indicating what base all subsequent concatenated morphology
connects to; the loanword ‘radio’ has no stem-internal
variation. Finally, there is the optional NOTE constituent,
where a gloss has been provided.
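To make the four constituents explicit, the following Python sketch splits one such entry into its parts. It is an illustration only: the class LexcEntry, the function parse_entry and the regular expression are assumptions made for this example, the gloss is assumed to be written in straight double quotes, and the sketch is not part of the GiellaLT tooling.

```python
import re
from typing import NamedTuple, Optional

class LexcEntry(NamedTuple):
    lemma: str            # index: lemma plus part-of-speech tag, e.g. "radio+N"
    stem: str             # base that all later morphology attaches to
    continuation: str     # continuation lexicon, e.g. "N_RADIO"
    note: Optional[str]   # optional gloss or note

# LEMMA:STEM CONTINUATION-LEXICON "NOTE" ;
ENTRY = re.compile(
    r'^\s*(?P<lemma>[^:\s]+):(?P<stem>\S+)\s+'
    r'(?P<cont>\S+)'
    r'(?:\s+"(?P<note>[^"]*)")?\s*;\s*$'
)

def parse_entry(line: str) -> LexcEntry:
    """Split one simplified lexicon line into its four pieces of data."""
    m = ENTRY.match(line)
    if m is None:
        raise ValueError(f"not a LEMMA:STEM CONTINUATION NOTE line: {line!r}")
    return LexcEntry(m["lemma"], m["stem"], m["cont"], m["note"])

entry = parse_entry('radio+N:radio N_RADIO "radio" ;')
print(entry)
```

Entries without a gloss are accepted as well, in which case the note field is None.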
The Continuation lexicon name has been written in upper-
case letters to distinguish it from the remainder of the code
line. In this language, continuation lexicon names are ini-
tially marked for part of speech, hence the initial ‘N_’. This
part-of-speech increment is more of a mnemonic note to
help facilitate faster manual coding. After initial denom-
inal derivation lexica, nouns, adjectives and numerals are
directed to mutual handling of case, number and possessive
marking.
This initial line of code may encode even more complex
data. One such entry might be observed in the noun ve′rdd
‘stream’, which exhibits necessary information for complex
stem variation:
ve′rdd+N:vẹ^1VOW{′Ø}rdd N_KAQLBB "flow, stream" ;
The index ve′rdd+N: (LEMMA constituent and part-of-speech
tag), as such, is readily comprehensible. The part-of-speech
tag may also be preceded by tags indicating variants
in order of preference (+v1, +v2) and homonymy
(+Hom1, +Hom2), and it may be followed by tags indicating
semantics (+Sem...) and part-of-speech subtypes (e.g.
+Prop for proper nouns, +Dem as in demonstrative pronoun).
Tags, of course, may be inserted at the root or in
subsequent continuation lexica; this is simply a matter of
taste and the complexity of the continuation lexicon network.
⁷ In North Sámi, there is a three-way gradation system where
grade one has an extra-long vowel and short consonant, grade two
has a long vowel with a long consonant, and grade three has a
short vowel with an extra-long consonant.
The STEM vẹ^1VOW{′Ø}rdd in combination with the
CONTINUATION-LEXICON N_KAQLBB is what captures
the proliferation of six separate stem forms used in regular
inflection: ve′rdd ‘SG+NOM’, vee′rd ‘SG+GEN’, vẹrdda
‘SG+ILL’, vii′rdi ‘PL+GEN’, ve′rdstes ‘SG+LOC+PXSG3’,
vẹẹrdaž ‘DIMIN+SG+NOM’. While vowel and consonant
variation might be considered peripheral in English, these
extensive patterns are wide-spread in Skolt Sami inflection.
Some verb types may even have as many as eleven separate
stem forms used in regular inflection and derivation.
Hence, consonant and vowel quality together with quantity
in both provide a challenge for the description of the regular
inflectional paradigms of Skolt Sámi.
The continuation lexicon N_KAQLBB mnemonically points
to the Skolt Sámi word 0lbb ‘calf (anim.)’ as a reference
to paradigm type.
Reference to paradigms has traditionally been done using
numbers. This entails access to a set of paradigm descrip-
tions, because no one can be expected to memorize large
sets of paradigm types by number alone. Using familiar
words to allude to paradigm types, however, may be straightforward
from a native speaker’s perspective, but they too
will require documentation in test code. Test codes might
be located adjacent to the appropriate affix continuation
lexicon or in a separate set of test files (see also the noun
algg ‘beginning’ in Figure 1, below). The NOTE section, of
course, is open for virtually any type of data.
Development of guidelines helps newcomers join a tra-
dition and construct analogous, parallel descriptions in
the same or similar infrastructures. The presupposition
of a willingness to adapt new projects to the practices
of established analogous work is an important element in
open-source FST development at GiellaLT, which has been
adopted as the basis for guideline development. At Giel-
laLT documentation is sometimes sparse, incomplete or dif-
ficult to find, and therefore it is imperative that all possi-
ble reference be made to shared practices. For maximal-
ized short term achievement (2 to 5 years), the project lan-
guages to consult first are North Sami (sme) and South
Sami (sma), whereas the experience from the Skolt Sami
language project is discussed here.
Skolt Sami specific descriptive materials have been dealt
with in the light of work in closely related languages. Here,
practice with analogous work in other Sami and Uralic lan-
guages has been helpful in learning mnemonic methods that
can be applied as well as lexicon code line writing and
sound variation modeling. Each language has many of its
own requirements, but, wherever possible, we should seek
out ways to align all projects.
The tag sets used with various language parsers at GiellaLT
are extensive and have been directly adapted to work in the
Skolt Sami project to ensure a high usability of tools
already implemented and in mutual use in many language
projects. Ordering of tags reflects parsing no later than
2005, e.g. N+Sg+Nom giehta ... (Sjur Moshagen and
Trosterud, 2005). Inflection types are indicated mnemonically
by use of a frequent representative of the type, a
strategy also observed in Omorfi, e.g. an initial continuation
class marking N_ALGG (algg ‘beginning’) is given
for nouns with a coda structure in V_high C_1 C_2 C_2. Inflection
type naming of this kind draws the developer’s attention to
the familiar word and helps to minimize specification con-
sultation required when inflection types are only numeri-
cally coded, e.g. 1, 2, 3... Both systems, however, require
set specifications for each inflection type.
In order to enable morpho-lexical variation detection, FST
description presupposes a degree of wrong form genera-
tion. Indeed, wrong form coverage is what facilitates in-
telligent spell checking suggestions, e.g. generation of a
four-year-old’s simple past rendition, swimmed, with a hint
tag +regular-past-error could be useful. For extended cov-
erage, more inflection types and extensions are described
than would otherwise be assumed from mere phonologi-
cal descriptions. There is diversity in the spoken language,
which has meant that certain stem types or individual forms
must be provided with multiple realizations. Here we want
to avoid assigning multiple paradigms to individual lem-
mas where the distinction between the paradigms may lie
in only one or two forms (cf. Iva, 2007).
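The swimmed example can be sketched in Python to show how deliberately generated wrong forms, marked with a hint tag, allow a misspelling to be traced back to its lemma. This is a toy illustration: the three-verb lexicon, the simplified consonant doubling rule and the +V+Past tags are assumptions, and only the +regular-past-error tag is taken from the text.

```python
# Toy lexicon of irregular pasts (assumed for this sketch)
IRREGULAR_PAST = {"swim": "swam", "sing": "sang", "drink": "drank"}
VOWELS = "aeiou"

def regular_past(verb: str) -> str:
    """Overregularized past tense with a simplified final-consonant doubling rule."""
    if (len(verb) >= 3 and verb[-1] not in VOWELS
            and verb[-2] in VOWELS and verb[-3] not in VOWELS):
        return verb + verb[-1] + "ed"  # swim -> swimmed
    return verb + "ed"                 # sing -> singed

def analyses(form: str) -> list:
    """Return correct analyses, or error-tagged ones for overregularized forms."""
    out = []
    for lemma, past in IRREGULAR_PAST.items():
        if form == past:
            out.append(f"{lemma}+V+Past")
        elif form == regular_past(lemma):
            out.append(f"{lemma}+V+Past+regular-past-error")
    return out

print(analyses("swimmed"))  # ['swim+V+Past+regular-past-error']
print(analyses("swam"))     # ['swim+V+Past']
```

A spell checker or learning tool could use the error-tagged analysis to suggest swam and to tell the learner which kind of mistake was made.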
In Skolt Sami building a slightly more demanding descrip-
tion of the phonology has meant the inclusion of otherwise
pedagogical characters and graphemes. Special filtering is
available for converting pedagogic target transducers into
normative transducers and spell relaxes extend these in turn
to descriptive transducers. These same methods are shared
by other language projects in the GiellaLT infrastructure. In
the long run, tweaking the description for pedagogic targeting
means that even more uses are being made available,
and that basic work is almost immediately available for
continuation projects already realized or under construction
in other language projects, i.e. syntactic disambiguation,
text-to-speech, etymology suggestion.
3.1. Development of the two-level description
Skolt Sami Finite-state transducer development reuses de-
scriptive materials for both concatenation strategies and
testing. Work in the GiellaLT infrastructure begins with
generation-analysis code test files (yaml), with content as
in (Figure 1). Each line contains a lemma, subsequent tag
set and resulting output word form or forms following a
colon, e.g. algg+N+Sg+Gen: aalg.
The lines of description in the yaml test file (lemma + tag
set + resulting word forms) are readily copied to a lexc affix
description file for further editing and implementation
as code (Figure 2). Here it can be observed that concatenational
morphology is added after the colon, but at the
same time there is a certain amount of further required
morphological quality and quantity change.
Editing in the continuation lexica in the affixes/*.lexc files
entails stripping the lemma and the part of the target word
forms that can serve as the stem. Since Skolt Sami is not
a language with entirely simple concatenation strategies,
we can make a few observations of the interplay between
simple morphological concatenation and the complementary
two-level model facilitation.
Figure 1: A diagram showing file content for yaml
analyzer-generator testing
Figure 2: A diagram showing LEXICON development for
ALGG type nouns
The lemma for the word algg ‘beginning’ is the same as
the nominative singular and has no morpho-phonological
changes, hence no triggers are present when coding
+N+Sg+Nom. In the genitive and accusative singular,
however, coding (e.g. +N+Sg+Acc) co-occurs with coda vowel
lengthening indicated with the trigger V2VV (lengthening,
i.e. one vowel becomes two) and consonant cluster weakening
indicated with the trigger XYY2XY (i.e. the consonant
cluster alternation between -lgg and -lg). Compare the
concatenation and these phenomena in Figure 2, on the one hand,
with the compound of concatenational morphology and the
accompanying triggers V2VV and XYY2XY in Figure 3, on the other.
Figure 3: A diagram showing some triggers used in de-
scription of ALGG type nouns
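The effect of the two triggers on algg can be imitated with a short Python sketch. In the project itself the changes are carried out by parallel two-level rules, so the sequential string functions, the simplified vowel inventory and the function names below are assumptions made purely for illustration.

```python
import re

VOWELS = "aeiouâõäå"  # simplified vowel inventory, assumed for this sketch

def v2vv(stem: str) -> str:
    """V2VV: lengthen the last vowel of the stem (one vowel becomes two)."""
    return re.sub(rf"([{VOWELS}])(?=[^{VOWELS}]*$)", r"\1\1", stem, count=1)

def xyy2xy(stem: str) -> str:
    """XYY2XY: weaken a final consonant cluster XYY to XY, e.g. -lgg to -lg."""
    return re.sub(rf"([^{VOWELS}])([^{VOWELS}])\2$", r"\1\2", stem)

TRIGGERS = {"^V2VV": v2vv, "^XYY2XY": xyy2xy}

def apply_triggers(stem: str, triggers) -> str:
    """Apply the listed triggers to the stem in order."""
    for name in triggers:
        stem = TRIGGERS[name](stem)
    return stem

nominative = apply_triggers("algg", [])                   # no triggers
genitive = apply_triggers("algg", ["^V2VV", "^XYY2XY"])   # lengthening + weakening
print(nominative, genitive)  # algg aalg
```

The nominative singular needs no trigger at all, while the genitive form aalg results from the combination of vowel lengthening and cluster weakening.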
The .yaml code test content can be further utilized as
in-line testing code by simply flipping content left-to-right
for analysis reading, as shown in Figure 4. Implicit in the
test data, we can observe five different stems for the
monophthong noun algg: algg ‘Sg+Nom’, aalg ‘Sg+Gen’,
a′lǧǧe ‘Sg+Ill’, algstan ‘Sg+Loc+PxSg1’,
aa′lje ‘Dimin+N+Sg+Gen’.
Figure 4: A diagram showing some test data for ALGG
type noun analysis
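Flipping a generation test into an analysis test is a purely mechanical step, which the following Python sketch illustrates. The exact layout of the test files shown in the figures is not reproduced here; the sketch assumes the simple "lemma+tags: form" line format quoted earlier, optionally with a bracketed, comma-separated list of forms, and the function names are hypothetical.

```python
def parse_generation_test(line: str):
    """Split 'lemma+tags: form' (or a bracketed list of forms) into its parts."""
    analysis, _, forms = line.partition(":")
    forms = forms.strip().strip("[]")
    return analysis.strip(), [f.strip() for f in forms.split(",") if f.strip()]

def flip_for_analysis(line: str):
    """Turn one generation test into 'form: lemma+tags' analysis tests."""
    analysis, forms = parse_generation_test(line)
    return [f"{form}: {analysis}" for form in forms]

flipped = flip_for_analysis("algg+N+Sg+Gen: aalg")
print(flipped)  # ['aalg: algg+N+Sg+Gen']
```

A line with several accepted word forms yields one analysis test per form, so the same test data serves both directions of the transducer.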
Although there are instances of single stems taking numerous
affixes, e.g. biografia or radio, above, most
nominals and verbs require multiple stems. The extensive
stem variation observed in the noun algg, above,
is surpassed in the verb tie′tted ‘to know’. It uses the
following 10 stems in regular inflection: tie′tt- ‘Inf’,
tie′đ- ‘Ind+Prt+Sg3’, tiõt’t- ‘Imprt+ConNeg’, tiõđ- ‘Deriv’,
tiõ′t’t- ‘Ind+Prt+Pl3’, tiõ′đ- ‘Pot’, teât’t- ‘Imprt+Pl3’,
teâtt- ‘Ind+Prs+Sg3’, teâđ- ‘Cond’, teä′t’t- ‘Ind+Prs+Pl3’. The
vowel quality variation in Skolt Sami and North Sami is
analogous to what is observed in Germanic irregular verbs,
e.g. sing, sang, sung.
Skolt Sami provides a challenge deserving of the integration
of morpholexical and two-level model descriptions, as
originally introduced by Koskenniemi (1983). Integration of the
concatenation lexicon and the morphophonological two-level
description has required both intuition and a working
knowledge of the target language. Whereas concatenation
alludes to simply adding one morpheme to another,
morphophonology draws our attention to changes required in the
stems; hence the challenge of defining 10 separate stems
for a single lemma in Skolt Sami provided above. (More
extensive descriptions of quality, quantity and suprasegmental
variation are provided in Feist (2015) and Sammallahti (2015).)
The two-level model utilizes parallel constraints for
phonological description. As mentioned above, descriptive
grammars of the Skolt Sami language indicate multiple
simultaneous, coordinated variation in the stem. Thus work on
the two-level model initially opted to provide separate
triggers for each individual phenomenon, here ^V2VV quantity,
^VowRaise quality and ^PAL palatalization.
In brief, triggers are an artificial means of replacing the
natural phonological features occurring in the morphology.
They can be used for causing phenomena subsequent
(right-context here) or preceding (left-context). For example, if
front-back vowel harmony is highly predictable on the basis
of the preceding stem, the individual stems can be marked
with {front} or {back} triggers in order to elicit the front or
back allomorphs of subsequent suffixes, i.e. triggers are set
for right-context phenomena. A trigger also provides for
manipulation of the harmony reflexes necessary for incorrect
morphology, i.e. something needed in recognizing
misspellings in intelligent computer-assisted language
learning and spell checker suggestions; recall
the instance of swimmed, above.
The two-level model rules facilitate simultaneous variation
of many features in the same word. Left and right con-
texts play an important role in this description, whereas
both contexts can contain morpho-phonological phenom-
ena seen to precede or follow the change elicited by a given
rule, or they can disregard them. Triggers are used in rule
writing, because the actual morphophonology of the words
does not necessarily reflect ideal, consistent trigger
patterning.
Zero-to-surface-entity rules present in the early phases of
the project have been corrected by adding multicharacter
archiphones to the individual stems. Stem-internal change
such as matters of vowel quantity and quality are indicated
with these symbols. For purposes of phenomenon recog-
nition, curly brackets have been used for displaying arrays
of variation, e.g. {eöâä} indicates there is a vowel vari-
ation of four separate qualities as required in the various
stems. Parallel multiple-character symbols have been im-
plemented for suprasegmentals, length markers, etc. Stem
variation in the word.
Modeling quantity in Skolt Sami has meant a divorce from
the description of other Sami languages. Quantity variation
is generally viewed as a coordinated phenomenon affecting
vowel and consonant length simultaneously (see
reference to North Sámi and complementary distribution
of quantity, above). Skolt Sami deviates here: the
predictable ‘extra long vowel + short consonant’, ‘long vowel
+ long consonant’ and ‘short vowel + extra long consonant’
combinations are supplemented by a fourth ‘extra short
vowel + extra short consonant’ pattern. The four-way
split required little new coding, since the original quantity
modeling had treated vowel and consonant length as separate
phenomena. When the fourth pattern became more apparent
after the first half year, all triggers were present,
and actually little work was required to implement their
use. Since the fourth pattern alternates with the
long-vowel-long-consonant pattern, algstan (allegro) versus
aalgstan (largo), both ‘begin+N+Sg+Loc+PxSg1’, more language
documentation was required, as this variation was
found to permeate the inflection and derivation pattern of
the language.
Modeling quality in Skolt Sami has introduced multi-character
symbols in the stem. These multi-character symbols
contain arrays of realizations in commented curly
brackets, e.g. t%{ie%}%{eöâä%}%{′Ø%}tt ‘to know’,
above. Each array indicates a mnemonic list of variables.
These lists are easy to interpret and consistent with guesser
and cognate search development, where sound change is
consistently traceable (Kimmo Koskenniemi and Heiki-Jaan
Kaalep, p.c.). Moreover, array notations are analogous
with inflection group identifying model words as in
N_ALGG and N_KAQLBB, above.
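One way to read the array notation is as a compact list of candidate realizations per position. The Python sketch below expands such a stem into every combination of array members. This over-generation is an illustration only: in the actual description the triggers and two-level rules select one realization per form, the escape characters are dropped here for readability, the prime is assumed to be the palatalization mark, and the function name expand is hypothetical.

```python
import re
from itertools import product

# Either a {...} array or a single literal character
TOKEN = re.compile(r"\{([^{}]*)\}|([^{}])")

def expand(stem: str):
    """List every surface candidate licensed by the arrays in the stem."""
    slots = []
    for array, literal in TOKEN.findall(stem):
        if array:
            # "Ø" in an array stands for the empty realization
            slots.append(["" if ch == "Ø" else ch for ch in array])
        else:
            slots.append([literal])
    return ["".join(combo) for combo in product(*slots)]

candidates = expand("t{ie}{eöâä}{′Ø}tt")
print(len(candidates))  # 16
```

The sixteen candidates contain stems listed earlier, such as tie′tt and teâtt, alongside combinations that the rules would never produce, which shows why ordered triggers are needed to pick the right one.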
Variation in the multi-character symbols as well as the
unmarked consonants is modeled with triggers. Triggers
are used to elicit vowel length and height, suprasegmental
palatalization (which may affect the realization of both the
preceding vowel and subsequent consonantism), as well as
consonant length and quality. In the Skolt Sami project,
vowel length is triggered with the multi-character symbols
%^V2VV (short to long) and %^VV2V (long to short).
To avoid balancing problems introduced with flag diacritics
and further unexpected complications, triggers are ordered
and follow the stem before concatenated suffixes. The tie′đ-
stem required for rendering the form V+Pot+Sg3: tie′đež
is elicited with the consecutive triggers: %^VOWRaise,
%^PALE, %^PAL and %^CC2C, i.e. vowel raising (which
would regularly render ), suprasegmental coloring
(rendering )ie), palatalization (′) and consonant quality
change via shortening. The large number of triggers
demanded a large memory, and to alleviate the problem a
reversed-intersect function was implemented in the GiellaLT
infrastructure as recommended by a member of the
HFST team.
3.2. Deviation from Point of Departure on
GiellaLT
The Skolt Sami project has seen departure from previous
work in the infrastructure but simultaneously adherence to
a mnemonic system of description. In the course of the
project, the policy of lemma followed by a simple ortho-
graphic stem has not been retained. The number of nominal
stem types has risen to 308 from the 56 described in (Sam-
mallahti and Mosnikoff, 1991), while the number of verbal
stem types is 115 as compared to 30 (ibidem.). Adjectives
and numerals share inflection types with nouns. Before the
commencement of the project in 2013, for instance, only 280
verbs and 828 nouns were partially facilitated by the system,
whereas by the end of 2018 the analogous figures were
4844 verb stems with over 40 conjugation forms as well as
numerous verbal and nominal derivations and 23683 noun
stems with over 98 declensional forms as well as additional
derivations, and the entire lemma count exceeded 36000.
Multi-character symbol development favors mnemonic forms.
Arrays enclosed in curly brackets are used for indicating
vowel quality and quantity variation, a practice analogous
to inflection-type model words that hint at the type of
stem variation. With regard to length, triggers have been
drafted to reflect specific nuances of coda description,
e.g. %^VV2V indicates vowel shortening, %^CCC2CC geminate
shortening, and %^XYY2XY consonant cluster shortening.
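As a rough illustration of what the three length triggers named above stand for, the following Python sketch treats each trigger as a string rewrite applied to the rightmost matching position in a stem. Everything in it is an assumption made for illustration: the function names, the reduced vowel and consonant inventories, and the ASCII toy stems are hypothetical, and the transducer itself expresses these changes as finite-state rules, not as Python regular expressions.

```python
import re

# Hypothetical, simplified inventories; not the Skolt Sami alphabet.
VOWELS = "aeiou"
CONSONANTS = "bdfghjklmnprstv"

# Each toy trigger maps to (pattern, replacement).
TOY_TRIGGERS = {
    # long (doubled) vowel -> short vowel
    "VV2V": (rf"([{VOWELS}])\1", r"\1"),
    # three identical consonants -> two (toy reading of geminate shortening)
    "CCC2CC": (rf"([{CONSONANTS}])\1\1", r"\1\1"),
    # consonant X followed by a doubled, different consonant YY -> XY
    "XYY2XY": (rf"([{CONSONANTS}])(?!\1)([{CONSONANTS}])\2", r"\1\2"),
}


def rewrite_last(pattern, replacement, stem):
    """Rewrite only the rightmost match of pattern in stem."""
    matches = list(re.finditer(pattern, stem))
    if not matches:
        return stem
    last = matches[-1]
    return stem[: last.start()] + last.expand(replacement) + stem[last.end():]


def apply_triggers(stem, triggers):
    """Apply the named toy triggers to stem in the order given."""
    for name in triggers:
        pattern, replacement = TOY_TRIGGERS[name]
        stem = rewrite_last(pattern, replacement, stem)
    return stem


print(apply_triggers("kaalll", ["VV2V", "CCC2CC"]))  # kall
print(apply_triggers("varss", ["XYY2XY"]))           # vars
```

The triggers are looked up by name and applied strictly in list order, which mirrors the idea that the trigger sequence attached to a stem, rather than the surrounding orthographic context, decides which changes take place.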
Triggers have been fashioned for and subsequent affixes.
The stem has been filled with multiple-character symbols to
indicate which letters and graphemes undergo change and
what kind of change. Ordered triggers have been applied
to bring about these changes regardless of the orthographic
context, which simplifies the generation of incorrect forms,
a necessity in the recognition of ill-formed word forms and
their alignment with the desired words.
Word Class   glossed  unglossed  inflections  derivations
Adjectives      4190        166           16            3
Nouns          21640        712           99           3+
Verbs           4845         23           33           6+
total          30675        901          148          12+
Figure 5: Morpholexical coverage
Trigger ordering is aligned with the orthographic
realization of phonological phenomena. Thus, changes in
penultimate syllables precede those in ultimate syllables,
which is similar to vowel changes preceding suprasegmental
marking and subsequent consonants. A special context marker
Pen is used before each trigger effecting change in the
penultimate syllable. The trigger count in a given stem may
reach six.
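The ordering convention for penultimate and ultimate syllables can be sketched as a small check. The sketch assumes, purely for illustration, that the triggers of a stem are available as a list of (name, affects_penultimate) pairs. The function name, the data layout and the assignment of particular triggers to syllables are hypothetical and do not reflect how the lexc source stores triggers.

```python
def penultimate_first(triggers):
    """Return True if every penultimate-syllable trigger precedes
    every ultimate-syllable trigger in the sequence."""
    seen_ultimate = False
    for _name, affects_penultimate in triggers:
        if affects_penultimate and seen_ultimate:
            return False
        if not affects_penultimate:
            seen_ultimate = True
    return True


# Hypothetical sequences: (trigger name, affects penultimate syllable?)
ordered = [("VV2V", True), ("PALE", False), ("PAL", False), ("CC2C", False)]
misordered = [("PAL", False), ("VV2V", True)]

print(penultimate_first(ordered), len(ordered))  # True 4
print(penultimate_first(misordered))             # False
```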
4. Lexical and Morphological Coverage
In the absence of gold-annotated data, we do not conduct an
evaluation typical of current mainstream NLP, but rather
describe the coverage of forms and lexemes in the
transducer. Here we limit our discussion to the most
extensive paradigms, i.e. adjectives, nouns and verbs
(see Figure 5). In addition to statistics on the glossed and
unglossed lexicon, where glossed is a loose term for the
presence of at least one single-word translation for a Skolt
Sami word in the Akusanat dictionary (Hämäläinen and
Rueter, 2018), we discuss regular inflection and
derivation. Inflection refers to conjugation and declension,
whereas derivation indicates part-of-speech transformation
brought about by morphological means. As a result of this
work, the Skolt Sami transducer represents a lexicon of over
30,000 lemmas with a coverage of over 2.3 million
inflectional forms, not to mention derived forms or
compound nouns.
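The order of magnitude of the 2.3 million figure can be checked against the counts in Figure 5. The sketch below multiplies the glossed lemma count of each word class by its number of inflectional forms. That every glossed lemma has a complete paradigm, and that the figure was obtained in this way, are assumptions made here and not statements from the text.

```python
# (glossed lemmas, inflectional forms per lemma), taken from Figure 5
coverage = {
    "Adjectives": (4190, 16),
    "Nouns": (21640, 99),
    "Verbs": (4845, 33),
}

forms_per_class = {
    word_class: lemmas * forms
    for word_class, (lemmas, forms) in coverage.items()
}
total_lemmas = sum(lemmas for lemmas, _ in coverage.values())
total_forms = sum(forms_per_class.values())

print(forms_per_class)
print(total_lemmas)  # 30675
print(total_forms)   # 2369285
```

Under these assumptions nouns account for the bulk of the forms, since they combine the largest lemma count with the largest paradigm.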
Adjectives in Skolt Sami may have special attribute forms
for use in the noun phrase, as is the situation in other Sami
languages. Adjectives also decline in the same case forms as
nouns, which brings us to a total of approximately 16
paradigmatic forms associated with the declension of each
adjective. Regular derivation is generally limited to
comparative and superlative inflection with all cases as
well as nominalization, which goes on to feed regular noun
inflection.
Nouns, like adjectives, can be declined in seven cases for
singular and plural with the addition of the partitive⁸. In
contrast to the adjectives, however, number and case can
be augmented with possession markers for three persons
and two numbers, which brings the number of paradigmatic
cells in declension to nearly 100. Nouns can further be
derived as regular diminutives (this again feeds regular
derivation) and two types of adjectives with the meanings
‘without X’ (privative) and ‘full of X’ (both of which can
be further derived as nouns, and the former is regularly
derived as a verb).
⁸ The partitive has no morphological distinction for number.
The verbal paradigm is also relatively extensive. Each tense
and additional mood, with the exception of the imperative,
has three categories for person, two for number and an
indefinite personal form (seven forms in all). Thus, in
addition to two tenses in the indicative, the subjunctive
and the potential mood, there are five more forms for the
imperative, which brings us to a total of 33 forms in a
given conjugation paradigm. Non-finite derivation, i.e.
participles in addition to deverbal nouns and verbs, adds
feeders to nominal and verbal derivation alike.
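The count of 33 can be reproduced from the categories listed for the verbal paradigm. In this short sketch the variable names are illustrative, while the numbers are those given in the text.

```python
persons = 3
numbers = 2
indefinite = 1
forms_per_tense_or_mood = persons * numbers + indefinite  # 7

# Two indicative tenses plus the subjunctive and the potential mood.
non_imperative_sets = 2 + 1 + 1
imperative_forms = 5

verb_forms = non_imperative_sets * forms_per_tense_or_mood + imperative_forms
print(verb_forms)  # 33
```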
A large percentage of this regular inflection is in place and
available in UralicNLP, a Python library for Uralic minority
languages (Hämäläinen, 2019). The lexical database for Skolt
Sami is also undergoing rigorous scrutiny and development in
the editing of the forthcoming Moshnikoff Skolt Sami
dictionary in Veʹrdd⁹, an open-source dictionary environment
for collaboration between minority language community
editors and developers (Alnajjar et al., 2020). Veʹrdd
‘stream, flow’ also provides an interface for feedback into
the dictionary system.
5. Discussion and Future Work
The FSTs are released in the GiellaLT infrastructure as a
continuously updated bleeding-edge release. Efforts have
been made to bring the writing of the FST lexc materials
into an easier MediaWiki-based framework (Rueter and
Hämäläinen, 2017). All edits to the FSTs made in the
MediaWiki platform are automatically synchronized with those
uploaded to GiellaLT.
According to statistics at GiellaLT for online dictionary
usage, the Skolt Sami–Finnish dictionary enjoys great
popularity among the language community. It is second only
to North Sami–Norwegian (Trosterud, p.c. 2019–06–04).
The statistics indicate where definitions need elaboration
and reveal shortcomings of the transducer (analysis of
misspelled words).
In order to make the FSTs more accessible to other
researchers conducting NLP tasks focused on Skolt Sami,
the FSTs have been made available through UralicNLP
(Hämäläinen, 2019). This is a specialized Python library
for NLP for Uralic languages which makes using FSTs
easier by providing a documented programmatic interface.
The library also uses precompiled models, which
further facilitates the reuse of our FSTs.
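A minimal usage sketch follows. It assumes that UralicNLP has been installed (e.g. with pip) and that the Skolt Sami models, identified by the ISO 639-3 code sms, have been downloaded once with uralicApi.download("sms"). The helper names split_reading and analyze_skolt_sami are hypothetical, and the exact tag strings returned depend on the version of the transducer.

```python
def split_reading(reading):
    """Split an FST reading such as 'lemma+V+Pot+Sg3' into lemma and tags."""
    lemma, *tags = reading.split("+")
    return lemma, tags


def analyze_skolt_sami(word):
    """Return (lemma, tags) pairs for a Skolt Sami word form.

    Requires the third-party uralicNLP package and downloaded 'sms' models.
    """
    # Imported lazily so that split_reading works without the library.
    from uralicNLP import uralicApi

    # Each result holds the reading first, followed by its weight.
    return [split_reading(result[0]) for result in uralicApi.analyze(word, "sms")]


print(split_reading("lemma+V+Pot+Sg3"))  # ('lemma', ['V', 'Pot', 'Sg3'])
```

Generation works in the opposite direction: uralicApi.generate takes a reading and a language code and returns the matching surface forms.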
Modeling diphthongs is still a challenge for Skolt Sami. Fu-
ture work will attempt to develop separate triggers for the
first and second element. Thus, the treatment of diphthongs
will be analogous to that of quantity. Front and fronted
diphthongs in particular still show unresolved variation in
the paradigms of a number of nouns.
FSTs provide a good starting point for development of
higher level NLP tools that embrace the new neural network
methods. For instance, FSTs can be used to generate paral-
lel sentences out of lexica and abstract syntax descriptions
to be used for neural machine translation in scenarios with-
out any real parallel data (Hämäläinen and Alnajjar, 2019).
Neural models for morphological tagging can also benefit
from readings provided by FSTs (Ens et al., 2019).
6. Conclusions
We have presented the current state of our ongoing project
of modeling Skolt Sami morphology. The transducers are
made available in a continuously updated fashion through
multiple channels, to promote their use in any tasks that
contribute to the revitalization of the language.
⁹ https://akusanat.com/verdd/
The highly phonological Skolt Sami orthography has
strengthened the notion that one description might be
utilized in multiple tools, i.e. text-to-speech,
orthographic, pedagogical, etc. This has led to the addition
of two extra characters in the alphabet and the addition of
a pedagogic dictionary type generator.
Mnemonic formation of inflection type indicators has
been followed by the formulation of mnemonic multiple-
character symbols and triggers. Triggers have been or-
dered, and regular inflection has been modeled to exceed
mere finite conjugation and nominal declension. Additional
trigger work may be required for the description of diph-
thong quality change and derivation, but this must be done
in collaboration with the language community, language re-
searchers and the normative body.
7. Bibliographical References
Alnajjar, K., Hämäläinen, M., and Rueter, J. (2020). On
editing dictionaries for Uralic languages in an online
environment. In Proceedings of the Sixth International
Workshop on Computational Linguistics of Uralic Lan-
guages.
Antonsen, L., Johnson, R., Trosterud, T., and Uibo, H.
(2013). Generating modular grammar exercises with
finite-state transducers. In Proceedings of the second
workshop on NLP for computer-assisted language learn-
ing at NoDaLiDa 2013, pages 27–38.
Antonsen, L., Gerstenberger, C., Kappfjell, M.,
Nystø Rahka, S., Olthuis, M.-L., Trosterud, T., and
Tyers, F. M. (2017). Machine translation with North
Saami as a pivot language. In Proceedings of the 21st
Nordic Conference on Computational Linguistics, pages
123–131, Gothenburg, Sweden, May. Association for
Computational Linguistics.
Beesley, K. R. and Karttunen, L., (2003). Finite-State Mor-
phology, pages 451–454. Stanford, CA: CSLI Publica-
tions.
Bergmanis, T. and Goldwater, S. (2017). From Segmen-
tation to Analyses: A Probabilistic Model for Unsuper-
vised Morphology Induction. In Proceedings of the 15th
Conference of the European Chapter of the Association
for Computational Linguistics: Volume 1, Long Papers,
pages 337–346.
Creutz, M. and Lagus, K. (2007). Unsupervised models
for morpheme segmentation and morphology learning.
ACM Transactions on Speech and Language Processing,
4(1), January.
Ens, J., Hämäläinen, M., Rueter, J., and Pasquier, P. (2019).
Morphosyntactic disambiguation in an endangered lan-
guage setting. In Proceedings of the 22nd Nordic Con-
ference on Computational Linguistics, pages 345–349.
Feist, T., (2015). A Grammar of Skolt Saami, volume 273,
pages 137–216. Helsinki: Suomalais-Ugrilainen Seura.
Hämäläinen, M. and Alnajjar, K. (2019). A template
based approach for training NMT for low-resource Uralic
languages - a pilot with Finnish. In Proceedings of the
2019 2nd International Conference on Algorithms, Com-
puting and Artificial Intelligence, pages 520–525.
Hämäläinen, M. (2019). UralicNLP: An NLP library for
Uralic languages. Journal of Open Source Software,
4(37):1345.
Hjortnaes, N., Partanen, N., Rießler, M., and Tyers,
F. M. (2020). Towards a speech recognizer for Komi, an
endangered and low-resource Uralic language. In Proceedings
of the Sixth International Workshop on Computational
Linguistics of Uralic Languages, pages 31–37,
Wien, Austria, 10–11 January. Association for Compu-
tational Linguistics.
Hämäläinen, M. and Rueter, J. (2018). Advances in Syn-
chronized XML-MediaWiki Dictionary Development in
the Context of Endangered Uralic Languages. In Pro-
ceedings of the Eighteenth EURALEX International
Congress, pages 967–978.
Iva, S. (2007). Võru kirjakeele sõnamuutmissüsteem. [The
Inflection System of the Võro Literary Language.] PhD
thesis. University of Tartu.
Koponen, E. and Rueter, J. (2016). The first complete
scientific grammar of Skolt Saami in English. In
Finnisch-Ugrische Forschungen, 2016(63), pages 254–
266. Suomalais-Ugrilainen Seura.
Koskenniemi, K. (1983). Two-Level Morphology: A Gen-
eral Computational Model for Word-Form Recognition
and Production. Helsinki: University of Helsinki, De-
partment of General Linguistics.
Lindén, K., Axelson, E., Drobac, S., Hardwick, S.,
Kuokkala, J., Niemi, J., Pirinen, T. A., and Silfverberg,
M. (2013). HFST a system for creating NLP tools. In
International Workshop on Systems and Frameworks for
Computational Morphology, pages 53–71. Springer.
Morottaja, P., Olthuis, M.-L., Trosterud, T., and An-
tonsen, L. (2018). Anarâškielâ tivvoomohjelm –
Kielâ- já ortografiafeeilâi kuorrâm tivvoomohjelmáin.
Dutkansearvvi dieđalaš áigečála, 1(2):63–259.
Christopher Moseley, editor. (2010). Atlas of the World's
Languages in Danger. UNESCO Publishing, 3rd edition.
Online version: http://www.unesco.org/languages-atlas/.
Moshagen, S. N., Pirinen, T. A., and Trosterud, T. (2013).
Building an open-source development infrastructure for
language technology projects. In Proceedings of the
19th Nordic Conference of Computational Linguistics
(NODALIDA 2013); May 22-24; 2013; Oslo University;
Norway., number 85 in 16, pages 343–352. Linköping
University Electronic Press; Linköpings universitet.
Moshagen, S., Rueter, J., Pirinen, T., Trosterud, T., and
Tyers, F. M. (2014). Open-source infrastructures for
collaborative work on under-resourced languages. The
LREC 2014 Workshop “CCURL 2014 - Collaboration
and Computing for Under-Resourced Languages in the
Linked Open Data Era”.
Pirinen, T. A., Listenmaa, I., Johnson, R., Tyers, F. M.,
and Kuokkala, J. (2017). Open morphology of Finnish.
LINDAT/CLARIN digital library at the Institute of For-
mal and Applied Linguistics, Charles University.
Rueter, J. and Hämäläinen, M. (2017). Synchronized Me-
diawiki Based Analyzer Dictionary Development. In
Proceedings of the Third Workshop on Computational
Linguistics for Uralic Languages, pages 1–7.
Rueter, J. and Hämäläinen, M. (2019). Skolt Sami, the
makings of a pluricentric language, where does it stand?
In Rudolf Muhr, et al., editors, European Pluricentric
Languages in Contact and Conflict, Bern, Switzerland.
Peter Lang.
Rueter, J. (2014). The Livonian-Estonian-Latvian Dictio-
nary as a threshold to the era of language technological
applications. Eesti ja soome-ugri keeleteaduse ajakiri,
5(1):251–259.
Rueter, J. (2017). DEMO: Giellatekno open-source click-
in-text dictionaries for bringing closely related languages
into contact. In Proceedings of the Third Workshop on
Computational Linguistics for Uralic Languages, pages
8–9, St. Petersburg, Russia, January. Association for
Computational Linguistics.
Sammallahti, P. and Mosnikoff, J. (1991). Suomi-Koltansaame
sanakirja. Lääʹdd-sääʹm sääʹnnǩeʹrjj [Finnish-Skolt Sami
Dictionary]. Ohcejohka: Girjegiisá Oy.
Sammallahti, P., (2015). Vuõʹlǧǧe jååʹtted ooudâs, De fas
johttájedje, Taas mentiin: Sääʹmǩiõllsaž lookkâmǩeʹrjj,
Nuortalašgiel lohkosat, Koltansaamen lukemisto, volume 14,
pages 150–171. Oulu: Oulun Yliopisto.
Sjur Moshagen, P. S. and Trosterud, T. (2005). Twol at
work. CSLI Studies in Computational Linguistics ON-
LINE, pages 94–105.
Uibo, H., Pruulmann-Vengerfeldt, J., Rueter, J., and Iva, S.
(2015). Oahpa! Õpi! Opiq! Developing free online programs
for learning Estonian and Võro. In Proceedings of
the fourth workshop on NLP for computer-assisted lan-
guage learning, pages 51–64, Vilnius, Lithuania, May.
LiU Electronic Press.
Wiechetek, L., Moshagen, S. N., and Omma, T. (2019). Is
this the end? two-step tokenization of sentence bound-
aries. In Proceedings of the Fifth International Workshop
on Computational Linguistics for Uralic Languages,
pages 141–153, Tartu, Estonia, January. Association for
Computational Linguistics.
8. Language Resource References
Sammallahti, P. and Mosnikoff, J., (1991). Suomi-Koltansaame
sanakirja. Lääʹdd-sääʹm sääʹnnǩeʹrjj [Finnish-Skolt Sami
Dictionary], pages 180–202. Ohcejohka: Girjegiisá Oy.