Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP), pages 40–46
July 14, 2023 ©2023 Association for Computational Linguistics
Modelling the Reduplicating Lushootseed Morphology with an FST and LSTM
Jack Rueter
University of Helsinki
first.last@helsinki.fi
Mika Hämäläinen
Metropolia University of
Applied Sciences
first.last@metropolia.fi
Khalid Alnajjar
Rootroo Ltd
first@rootroo.com
Abstract
In this paper, we present an FST based approach for conducting morphological analysis, lemmatization and generation of Lushootseed words. Furthermore, we use the FST to generate training data for an LSTM based neural model and train this model to do morphological analysis. The neural model reaches a 71.9% accuracy on the test data. Furthermore, we discuss reduplication types in the Lushootseed language forms. The approach involves the use of both attested instances of reduplication and bare stems for applying a variety of reduplications to, as it is unclear just how much variation can be attributed to the individual speakers and authors of the source materials. That is, there may be areal factors that can be aligned with certain types of reduplication and their frequencies.
1 Introduction
A significant proportion of the world’s languages
face the threat of endangerment to varying degrees.
This endangered status poses certain constraints on
the extent to which modern NLP research can be
conducted with such languages. This is due to the
fact that many endangered languages lack exten-
sive textual resources that are readily accessible
online. Furthermore, even with available resources,
there is concern about the quality of the data, as it
may be influenced by various factors such as the
author’s level of fluency, accuracy of spelling, and
inconsistencies in character encoding at the most
basic level (see Hämäläinen 2021).
Reduplication appears in many languages of the world (Raimy, 2000). While full reduplication is observed as a repeated word form, partial reduplication exhibits extensive variation, both regular and irregular. This paper focuses on a finite-state description of the partial reduplication patterns found in the Lushootseed language forms (lut and slh). The most predominant forms of reduplication in Lushootseed are the distributive (Distr) and the diminutive (Dim), which can, in fact, appear in tandem, but there are restrictions delimiting their use (see Broselow 1983, Bates 1986, Urbanczyk 1994). In addition to Distr and Dim, however, we also find a third and slightly less frequent random or out-of-control distributive (OC) (see Bates et al. 1994, Urbanczyk 1996).
The base of these three types of reduplication can be found in the initial two to three phonemes of the word root, most often referred to with the notation C₁VC₂, but the authors of this paper will surround the vowel with parentheses to indicate the possibility of its absence, C₁(V)C₂, and thus accommodate the radical CC mentioned in Beck (1999: 24) and Crowgey (2019: 39, 42).
The radical consists of simple and compound letters alike, e.g., q̓ʷ, gʷ, ƛ̓, all of which add to the issues of facilitating the extensive variation in Lushootseed reduplication. First, the concept of compound letters involved in regular reduplication segments is a very important part of the finite-state description for Lushootseed. Although the 46 phonemes canonize the extensive alphabet, they create their own demands on the description.
Our facilitation of Lushootseed reduplication with a finite-state machine¹ is based on the use of a five-place-holder segment concatenated directly before the radical. We number these right-to-left away from the radical, {p5}{p4}{p3}{p2}{p1}, where the odd-numbered place holders represent consonants and the even-numbered ones vowels. The system is set up so that the place holders {p3}{p2}{p1} are used with Distr, Dim and OC reduplication, whereas the more remote place holders {p5}{p4} are used to deal with Distr + Dim combinations. Admittedly, theory sees the distributive losing the third phoneme due to a principle of antigemination (see Broselow 1983: 326–329 and Urbanczyk 1994: 515; see also Hess 1967: 7 and Snyder 1968: 22). We have assumed the absence of geminates and have therefore left them out of the equation. Perhaps further studies will require their addition to our finite-state description of reduplication throughout the Lushootseed vocabulary.

¹ Our code is published at https://github.com/giellalt/lang-lut
2 Related work
Several different methods are currently in use to model the morphology of endangered languages computationally. In this section, we cover some of the existing rule-based, statistical and neural approaches. Our method embraces the rule-based tradition because machine-learning based methods rely on large amounts of annotated data that we currently do not have for Lushootseed.
In the rule-based research, morphology has mainly been modelled using a finite-state transducer (FST) built with one of several technologies such as HFST (Lindén et al., 2013), OpenFST (Allauzen et al., 2007) or Foma (Hulden, 2009). Such an approach has been successful in describing languages of a variety of different morphological groups, such as polysynthetic languages (e.g. Plains Cree (Snoek et al., 2014), East Cree (Arppe et al., 2017) and Odawa (Bowers et al., 2017)), agglutinative languages (e.g. Komi-Zyrian (Rueter et al., 2021), San Mateo Huave (Tyers and Castro, 2023), Skolt Sami (Rueter and Hämäläinen, 2020), Sakha (Ivanova et al., 2022) and Erzya (Rueter et al., 2020)) and fusional languages (e.g. Akkadian (Sahala et al., 2020) and Arabic (Shaalan and Attia, 2012)).
For statistical approaches, Tang (2006) has done research on English morphology with an approach that comprises two interrelated components: morphological rule learning and morphological analysis. The morphological rules are acquired by means of statistical learning from a list of words. In another line of work, Kumar et al. (2009) developed a machine learning technique that utilizes sequence labeling and kernel methods for training, which enables the model to effectively capture the non-linear associations between various aspects of the morphological features found in Tamil.
With the emergence of UniMorph (McCarthy et al., 2020), which continues to include only partial morphological descriptions of each language, a great deal of neural-based research has emerged to conduct morphological analysis. The typical models used are LSTM-based (Matteson et al., 2018; Akyürek et al., 2019) and Transformer-based (see Kodner et al. 2022) models.
3 Materials and methods
The materials used for this paper come from the Lushootseed dictionary of Bates et al. (1994) and language learning binders by Zalmai Zahir and Peggy kʷiʔalq Ahvakana (Book 1 dᶻixʷ 'First', Book 2 dəgʷi 'You', Book 3 sʔəɬəd 'Food', Book 4 ʔalʔal 'House'), as well as a binder of transcriptions of recordings from the University of Washington archives, received in 2003 on the Muckleshoot Reservation.
The method involves a mnemonic descriptive approach, implemented for a decidedly deterministic, machine- and human-friendly solution, if there is such a thing. To this end, we adhere to a three-phoneme segment approach to Lushootseed description and simply start with the labeling 123. Here ‹1› indicates the first consonant of the radical (root), ‹2› the vowel (which seems to be absent or latent in at least a few roots), and ‹3› the second consonant. We then introduce a series of five ordered place holders to precede the root.
The insertion of place holders is convenient in this finite-state description if they come before the root. Although there are numerous segments of regular morphology, inserting a series of five place holders immediately before the root can be seen as just another step in regular concatenation. Here it might be mentioned that theoretical distinctions between inflection and clitics do not take precedence over orthographic practices (cf. Beck 2018).
The five place holders, numbered away from the first three letters of the root, are set so that the odd numbers correlate with the consonants and the even numbers with the vowels. Thus, {p3} correlates with kʷ, {p2} with a, and {p1} with t:

{p5} {p4} {p3} {p2} {p1} kʷ a t
kʷatač: kʷatač 'climb'
s‹kʷatač: skʷatač 'mountain'
s {p5}:0 {p4}:0 {p3}:kʷ {p2}:a {p1}:0 kʷ a t a č : skʷakʷətač 'mountains'
s {p5}:0 {p4}:0 {p3}:kʷ {p2}:a {p1}:0 kʷ a:0 t a č : skʷakʷtač 'hill'
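The place-holder copying in the examples above can be sketched in plain Python. This is a minimal illustration, not the actual lexc/twolc implementation; the compound-letter list and the vowel-reduction/loss switch are simplifying assumptions standing in for the two-level rules.

```python
# Illustrative sketch of the place-holder copying, not the paper's
# lexc/twolc code. Compound letters such as kʷ are treated as single
# segments; the root-vowel options stand in for the two-level rules.
COMPOUND = ["kʷ", "gʷ", "q̓ʷ", "ƛ̓"]  # a few compound letters, for illustration

def segment(word):
    """Split a word into phoneme segments, matching compound letters first."""
    out, i = [], 0
    while i < len(word):
        for seg in COMPOUND:
            if word.startswith(seg, i):
                out.append(seg)
                i += len(seg)
                break
        else:
            out.append(word[i])
            i += 1
    return out

def reduplicate_distr(root, root_vowel="keep"):
    """Copy C1 and V into {p3}{p2} ({p1} stays empty), then emit the root."""
    segs = segment(root)
    c1, v, rest = segs[0], segs[1], "".join(segs[2:])
    copy = c1 + v                     # {p3}:C1 {p2}:V {p1}:0
    if root_vowel == "drop":          # root vowel deleted
        body = c1 + rest
    elif root_vowel == "reduce":      # root vowel reduced to schwa
        body = c1 + "ə" + rest
    else:
        body = root
    return copy + body

# the examples above: skʷatač 'mountain' → 'mountains' and 'hill'
assert "s" + reduplicate_distr("kʷatač", "reduce") == "skʷakʷətač"
assert "s" + reduplicate_distr("kʷatač", "drop") == "skʷakʷtač"
```

Treating compound letters as single segments is the key point: a naive character-level copy would split kʷ and produce an ill-formed reduplicant.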
With this as a point of departure, we can then enumerate four predominant tendencies: one total reduplication, one partial to the left, and two partial to the right. First, total reduplication is 123123, which is extremely regular and typically distributive in meaning. Second comes the diminutive with extensive variation: 1213, 12123, 1i13, 1i123, 1iq13. Third, and less frequent in the materials, is the out-of-control pattern 12323.
4 FST models
The finite-state description of Lushootseed involves several layers of experience. It addresses issues involving orthography, morphophonology, concatenation and symmetric tagging for subsequent machine readability. The orthography, which is canonized by the language's reduplication patterns, uses lower-case letters with multiple diacritics, as no precomposed letters are available for nearly half of the alphabet. The concatenative morphology is symmetric, with the exception of the possessive person marking strategy, but involves abbreviated or short-hand forms for some consecutive morphemes. The variation in multiple reduplication patterns appears to be partially monolectic or geographic in nature, but there is definitely also breathing room for variation in where individual derivations are used. In general, both preposed and postposed affixing is present, and, in particular, there is asymmetry in the possessive person marking strategy. For language-independent comparison, we use flag diacritics in our models, which allow supra-segmental concatenation and facilitate regular tagging practices for use in downstream language technology, including work with Python libraries.
4.1 Orthography
Although there are established keyboard layouts provided on official language-community sites², there are other keyboards, which may include non-standard diacritic and letter combinations, that are visibly present on the net and in easily accessible language materials. This has meant the establishment of spellrelax files to allow for recognizing, for example, a non-word-internal single right quotation mark in place of a combining comma above diacritic, or even small letter L with middle tilde ‹U+026B› in place of small letter L with belt ‹U+026C›.
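The spellrelax idea can be illustrated with a minimal sketch. This is not the actual spellrelax file; only the two substitutions named above are included, and a real description covers many more variants.

```python
# A sketch of the spellrelax idea: map non-standard keyboard substitutes
# onto the intended characters. Only the two pairs named in the text are
# shown here; an actual spellrelax file covers many more variants.
RELAX = {
    "\u2019": "\u0313",  # right single quotation mark → combining comma above
    "\u026B": "\u026C",  # ɫ (L with middle tilde) → ɬ (L with belt)
}

def relax(text):
    """Normalize known non-standard characters to canonical ones."""
    for variant, canonical in RELAX.items():
        text = text.replace(variant, canonical)
    return text

assert relax("\u026Bu") == "\u026Cu"
```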
4.2 Concatenation and Tagging
Reduplication has been dealt with as a problematic feature in earlier descriptions of the languages, where it is regarded as nonlinear (see Urbanczyk 1996). Our solution has been to introduce a segment of five place holders that facilitate copying values directly to predefined positions. As our concatenation in compilation reads right-to-left, memory retention is minimized to the three phonemes before the place holder series {p5}{p4}{p3}{p2}{p1}. If these place holders are to be used, the machine has already seen the reduplication trigger, which appears to the left of the word stem.

² https://tulaliplushootseed.com/software-and-fonts/
The relatively mnemonic triggers have been named according to relative position in the radical model C₁VC₂, i.e., 123. Thus, the distributive reduplication C₁VC₂C₁VC₂ is labeled distr_trigger_123123. Analogically, the diminutive reduplications C₁VC₁C₂, C₁iC₁C₂, C₁iʔC₁C₂ and C₁iʔC₁VC₂ are represented by the triggers dim_trigger1213, dim_trigger1i13, dim_trigger1iq13 and dim_trigger1iq123, respectively. OC reduplication (out of control, random) in C₁VC₂VC₂ is represented by OC_12323.
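The trigger mnemonics can be made concrete with a small sketch. The trigger names are those given above; the expansion function and the hypothetical radical b-a-d are our illustration, reading the ‹q› in the trigger names as the glottal stop ʔ.

```python
# The trigger mnemonics spelled out: '1' = C1, '2' = V, '3' = C2 of the
# radical; 'i' is a literal vowel and the 'q' of the trigger names is
# rendered as the glottal stop ʔ. The names are the paper's; the
# expansion function is an illustration, not the FST itself.
PATTERNS = {
    "distr_trigger_123123": "123123",
    "dim_trigger1213": "1213",
    "dim_trigger1i13": "1i13",
    "dim_trigger1iq13": "1iʔ13",
    "dim_trigger1iq123": "1iʔ123",
    "OC_12323": "12323",
}

def expand(trigger, c1, v, c2):
    """Expand a trigger pattern over a radical C1, V, C2."""
    mapping = {"1": c1, "2": v, "3": c2}
    return "".join(mapping.get(ch, ch) for ch in PATTERNS[trigger])

# with a hypothetical radical C1=b, V=a, C2=d:
assert expand("distr_trigger_123123", "b", "a", "d") == "badbad"
assert expand("OC_12323", "b", "a", "d") == "badad"
assert expand("dim_trigger1iq13", "b", "a", "d") == "biʔbd"
```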
The reduplication gʷaadgʷad in lə=bə=ləcu–gʷaadgʷad (source: Beck 2018, example 13) 'talking' could be illustrated as C₁VVC₂C₁VC₂, i.e., trigger_122123. The underlying use of our place holders, however, would show the following transformation:

{p5}:1 {p4}:2 {p3}:0 {p2}:2 {p1}:3 1 2 3
{p5}:gʷ {p4}:a {p3}:0 {p2}:a {p1}:d gʷ a d
Reduplication triggers are accompanied by dia-
critic flags, which make it possible to position tags
in the output. Flag diacritics are also used to ad-
dress the symmetrical tagging of prefixes after the
lemma, on the one hand, and to disallow simulta-
neous tagging for two possessive markers, on the
other.
5 Current state
Presently, the lexicon is extremely small. It contains 110 verbs and 283 nouns, which might explain the low coverage rate of 70% (1822 unrecognized tokens out of a total of 6186 tokens in the test corpus).
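The coverage figure can be verified with a quick arithmetic check (the token counts are those reported above):

```python
# Sanity check on the reported coverage: 1822 unrecognized tokens out of
# a total of 6186 tokens in the test corpus.
total_tokens = 6186
unrecognized = 1822
coverage = (total_tokens - unrecognized) / total_tokens
assert abs(coverage - 0.705) < 0.001  # ≈ 70%, matching the reported rate
```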
The two-level model has 31 rules governing reduplication copying patterns in the place holders and vowel loss or permutation in the root. The vowel system has been complemented by vowels with acute and grave accents, which might be useful in pedagogical applications of the language model and in work with language variation across the continuum of the language community.
source                        target
ɬuləč̓ič̓č̓ič̓əlpyaqid           N Pl Nom
addəxʷtubuʔqʷəxʷ              N Sg Nom RemPst Ptc PxSg2 Clt
bəaddəxʷtubuʔbuʔqʷ            N Pl Nom Anew RemPst Ptc PxSg2

Table 1: Examples of the training data
tag Anew Clt Hab Irr Pl Ptc PxPl1 PxPl2 PxSP3 PxSg1 PxSg2 RemPst Sg
precision 0.77 0.96 1.00 0.98 0.94 0.91 0.90 0.89 0.80 0.83 0.92 0.81 0.87
recall 0.97 0.77 0.89 0.97 0.95 0.89 0.79 0.55 0.61 0.90 0.91 0.99 0.82
F1-score 0.86 0.86 0.94 0.98 0.94 0.90 0.84 0.68 0.69 0.87 0.91 0.89 0.84
Table 2: Per tag results of the neural model
The lexc continuation lexica number 135 in total. These continuation lexica provide coverage for regular nominal and verbal inflection, which utilizes a shared set of morphology controlled in part by flag diacritics.
6 Neural Extension
No matter how extensive an FST is, it still cannot cover the entire lexicon of a language. For this reason, we also experiment with training neural models to do morphological analysis based on the FST described in this paper. The goal is not to replace the FST, but to develop a neural "fallback" model that can be used when a word is not covered by the FST.
We follow the approach suggested by Hämäläinen et al. (2021), using the code that has been made available in UralicNLP (Hämäläinen, 2019). This approach consists of querying the FST for all the possible morphological forms of a given lemma. For a given input, the FST thus produces all possible inflections and their morphological readings.
We limit our data to nouns only, and we use a list of 214 Lushootseed nouns for which we generate all the possible morphological forms. This way, we produce a dataset consisting of around 756,000 inflectional form–morphological reading pairs, i.e., an average of 3536 inflectional forms per lemma. We split this data into 70% training, 15% validation and 15% testing. The test data contains only words that are completely unseen by the model during training. This means that in testing, the model needs to analyze lemmas and word forms it has not seen before, even as part of a partial paradigm.
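The lemma-disjoint split described above can be sketched as follows. This is a minimal illustration with toy data, not the actual experiment code; the split proportions are the only detail taken from the text.

```python
# A sketch of a lemma-disjoint 70/15/15 split: every (form, reading)
# pair of a given lemma lands in exactly one split, so test lemmas and
# their whole paradigms are unseen in training.
import random

def split_by_lemma(pairs_by_lemma, seed=0):
    """Split pairs 70/15/15 by lemma, keeping each paradigm together."""
    lemmas = sorted(pairs_by_lemma)
    random.Random(seed).shuffle(lemmas)
    n = len(lemmas)
    a, b = n * 70 // 100, n * 85 // 100
    pick = lambda subset: [p for lemma in subset for p in pairs_by_lemma[lemma]]
    return pick(lemmas[:a]), pick(lemmas[a:b]), pick(lemmas[b:])

# toy data: 20 lemmas with 10 (form, reading) pairs each
data = {f"lemma{i}": [(f"lemma{i}+form{j}", "N Sg Nom") for j in range(10)]
        for i in range(20)}
train, dev, test = split_by_lemma(data)
assert len(train) == 140 and len(dev) == 30 and len(test) == 30
lemma_of = lambda pair: pair[0].split("+")[0]
assert {lemma_of(p) for p in train}.isdisjoint({lemma_of(p) for p in test})
```

Splitting by lemma rather than by pair is what makes the reported accuracy a measure of generalization to unseen paradigms.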
For the model itself, we use the Python library OpenNMT (Klein et al., 2017) to train an LSTM-based recurrent neural network architecture with the default settings of the library. The task is defined as a character-level neural machine translation problem: each word form is split into whitespace-separated characters on the source side, and the morphological readings produced by the FST are split into separate morphological tokens on the target side. Examples of the training data can be seen in Table 1.
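The data format can be sketched as follows. This is a minimal illustration; the example form ʔalʔal 'house' is taken from the materials section, the tag set shown is hypothetical, and the exact file layout expected by OpenNMT is not reproduced here.

```python
# A sketch of the character-level formatting: source side is the word
# form split into whitespace-separated characters, target side is the
# sequence of morphological tags from the FST (cf. Table 1).
def to_example(form, tags):
    """Format one training pair for character-level translation."""
    return " ".join(form), " ".join(tags)

src, tgt = to_example("ʔalʔal", ["N", "Sg", "Nom"])
assert src == "ʔ a l ʔ a l"
assert tgt == "N Sg Nom"
```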
The overall accuracy of the model is 71.9%.
This is measured by counting how many full mor-
phological readings the model predicted correctly
for each word form in the test corpus. The results per morphological tag can be seen in Table 2. These results exclude the N (noun) and Nom (nominative) tags because all morphological forms in the dataset carry them.
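The evaluation described above can be sketched as follows. This is a minimal illustration with toy readings, treating each reading as a set of tags; it is not the actual evaluation code.

```python
# A sketch of the evaluation: full-reading accuracy over word forms,
# plus per-tag precision/recall/F1 (readings modeled as tag sets).
def full_accuracy(gold, pred):
    """Fraction of word forms whose entire predicted reading is correct."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def per_tag_prf(gold, pred, tag):
    """Precision, recall and F1 for one morphological tag."""
    tp = sum(tag in g and tag in p for g, p in zip(gold, pred))
    fp = sum(tag not in g and tag in p for g, p in zip(gold, pred))
    fn = sum(tag in g and tag not in p for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = [{"N", "Pl", "Nom"}, {"N", "Sg", "Nom"}, {"N", "Pl", "Nom"}]
pred = [{"N", "Pl", "Nom"}, {"N", "Pl", "Nom"}, {"N", "Pl", "Nom"}]
assert full_accuracy(gold, pred) == 2 / 3
prec, rec, f1 = per_tag_prf(gold, pred, "Pl")
assert (prec, rec) == (2 / 3, 1.0)
```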
7 Discussion and Conclusions
In order to further test the accuracy of our Lushoot-
seed description, more test data and descriptions of
regular inflection will be needed. The challenge is
to continue with the outline given for an inflectional
complex (see Lonsdale 2001) and define what can
actually be described as regular.
More time will be required to model more recent
reanalyses of the morphological complexes. This
means we may need to establish whether a six-
placeholder segment is required to aptly describe
Lushootseed reduplication and put our description
in line with a hypothesis of antigemination.
The idea of describing morphological complexes
as series of aligned clitics is very interesting (see
Beck 2018). This will actually provide fuel for
future work with syntax, since most of the semantic
information is already present in the word roots
where the clitics conglomerate.
Limitations
The FST does not yet have an extensive coverage
of the Lushootseed vocabulary, so it does not work
on all domains of text. Also, writing an FST takes
a lot of time and requires special knowledge of the
language. The neural model is limited to nouns only, but unlike the FST it can handle out-of-vocabulary words. However, we have only tested its accuracy on words that are known to the FST, which means that words following very different inflection patterns will most likely not be analyzed correctly. Furthermore, the neural model was not trained on derivational morphology, which means that derived words might also result in erroneous predictions.
Ethics statement
When dealing with an endangered language it is
important to make sure that the research also con-
tributes to the language community. This is the
reason why we open-source our FST and neural
model. We also work on data that has been given to us by speakers of Lushootseed with the intention that we build morphological descriptions and tools for the language. This means that we are not conducting our research without regard for the language community.
Acknowledgments
This research is supported by FIN-CLARIN and
Academy of Finland (grant 345610 Kielivarojen ja
kieliteknologian tutkimusinfrastruktuuri).
References
Ekin Akyürek, Erenay Dayanık, and Deniz Yuret. 2019.
Morphological analysis using a sequence decoder.
Transactions of the Association for Computational
Linguistics, 7:567–579.
Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wo-
jciech Skut, and Mehryar Mohri. 2007. Openfst: A
general and efficient weighted finite-state transducer
library: (extended abstract of an invited talk). In
Implementation and Application of Automata: 12th
International Conference, CIAA 2007, Prague, Czech
Republic, July 16-18, 2007, Revised Selected Papers
12, pages 11–23. Springer.
Antti Arppe, Marie-Odile Junker, and Delasie Torko-
rnoo. 2017. Converting a comprehensive lexical
database into a computational model: The case of
East Cree verb inflection. In Proceedings of the 2nd
Workshop on the Use of Computational Methods in
the Study of Endangered Languages, pages 52–56,
Honolulu. Association for Computational Linguis-
tics.
Dawn Bates. 1986. An analysis of Lushootseed diminutive reduplication. In Proceedings of the Twelfth Annual Meeting of the Berkeley Linguistics Society, pages 1–13.
Dawn Bates, Thom Hess, and Vi Hilbert. 1994. Lushootseed Dictionary. Edited by Dawn Bates. University of Washington Press, Seattle and London.
D. Beck. 1999. Words and prosodic phrasing in Lushootseed narrative. In T. A. Hall and U. Kleinhenz, editors, Studies on the Phonological Word, pages 23–46.
David Beck. 2018. Aspectual affixation in Lushootseed:
A minor reanalysis. In Wa7 xweysás i nqwal’utteníha
i ucwalmícwa: He loves the people’s languages. Es-
says in honour of Henry Davis. UBC Occasional
Papers in Linguistics.
Dustin Bowers, Antti Arppe, Jordan Lachler, Sjur
Moshagen, and Trond Trosterud. 2017. A morphological parser for Odawa. In Proceedings of the 2nd
Workshop on the Use of Computational Methods in
the Study of Endangered Languages, pages 1–9, Hon-
olulu. Association for Computational Linguistics.
Ellen Broselow. 1983. Salish double reduplications: Subjacency in morphology. Natural Language & Linguistic Theory, 1(3).
Joshua Crowgey. 2019. Braiding Language (by Computer): Lushootseed Grammar Engineering. Ph.D. dissertation, University of Washington.
Mika Hämäläinen. 2019. UralicNLP: An NLP library for Uralic languages. Journal of Open Source Software.
Mika Hämäläinen. 2021. Endangered languages are not
low-resourced! Multilingual Facilitation.
Mika Hämäläinen, Niko Partanen, Jack Rueter, and
Khalid Alnajjar. 2021. Neural morphology dataset
and models for multiple languages, from the large to
the endangered. In Proceedings of the 23rd Nordic
Conference on Computational Linguistics (NoDaL-
iDa), pages 166–177, Reykjavik, Iceland (Online).
Linköping University Electronic Press, Sweden.
Thom Hess. 1967. Snohomish Grammatical Structure. Unpublished Ph.D. dissertation, University of Washington.
Mans Hulden. 2009. Foma: a finite-state compiler and
library. In Proceedings of the Demonstrations Ses-
sion at EACL 2009, pages 29–32, Athens, Greece.
Association for Computational Linguistics.
Sardana Ivanova, Jonathan Washington, and Francis
Tyers. 2022. A free/open-source morphological analyser and generator for Sakha. In Proceedings of the
Thirteenth Language Resources and Evaluation Con-
ference, pages 5137–5142, Marseille, France. Euro-
pean Language Resources Association.
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senel-
lart, and Alexander Rush. 2017. OpenNMT: Open-
source toolkit for neural machine translation. In Pro-
ceedings of ACL 2017, System Demonstrations, pages
67–72, Vancouver, Canada. Association for Compu-
tational Linguistics.
Jordan Kodner, Salam Khalifa, Khuyagbaatar Bat-
suren, Hossep Dolatian, Ryan Cotterell, Faruk Akkus,
Antonios Anastasopoulos, Taras Andrushko, Arya-
man Arora, Nona Atanalov, Gábor Bella, Elena
Budianskaya, Yustinus Ghanggo Ate, Omer Gold-
man, David Guriel, Simon Guriel, Silvia Guriel-
Agiashvili, Witold Kieraś, Andrew Krizhanovsky,
Natalia Krizhanovsky, Igor Marchenko, Magdalena
Markowska, Polina Mashkovtseva, Maria Nepomni-
ashchaya, Daria Rodionova, Karina Scheifer, Alexan-
dra Sorova, Anastasia Yemelina, Jeremiah Young,
and Ekaterina Vylomova. 2022. SIGMORPHON–
UniMorph 2022 shared task 0: Generalization and
typologically diverse morphological inflection. In
Proceedings of the 19th SIGMORPHON Workshop
on Computational Research in Phonetics, Phonology,
and Morphology, pages 176–203, Seattle, Washing-
ton. Association for Computational Linguistics.
Arun Kumar, V Dhanalakshmi, RU Rekha, KP Soman,
S Rajendran, et al. 2009. Morphological analyzer
for agglutinative languages using machine learning
approaches. In 2009 International Conference on
Advances in Recent Technologies in Communication
and Computing, pages 433–435. IEEE.
Krister Lindén, Erik Axelson, Senka Drobac, Sam Hard-
wick, Juha Kuokkala, Jyrki Niemi, Tommi A Piri-
nen, and Miikka Silfverberg. 2013. Hfst—a system
for creating nlp tools. In Systems and Frameworks
for Computational Morphology: Third International
Workshop, SFCM 2013, Berlin, Germany, September
6, 2013 Proceedings 3, pages 53–71. Springer.
Deryle Lonsdale. 2001. A two-level implementation for Lushootseed morphology. In Papers for ICSNL 36 (L. Bar-el, L. Watt, and I. Wilson, eds.). UBCWPL 6:203–214.
Andrew Matteson, Chanhee Lee, Youngbum Kim, and
Heuiseok Lim. 2018. Rich character-level informa-
tion for Korean morphological analysis and part-of-
speech tagging. In Proceedings of the 27th Inter-
national Conference on Computational Linguistics,
pages 2482–2492, Santa Fe, New Mexico, USA. As-
sociation for Computational Linguistics.
Arya D. McCarthy, Christo Kirov, Matteo Grella,
Amrit Nidhi, Patrick Xia, Kyle Gorman, Ekate-
rina Vylomova, Sabrina J. Mielke, Garrett Nico-
lai, Miikka Silfverberg, Timofey Arkhangelskiy, Na-
taly Krizhanovsky, Andrew Krizhanovsky, Elena
Klyachko, Alexey Sorokin, John Mansfield, Valts
Ernštreits, Yuval Pinter, Cassandra L. Jacobs, Ryan
Cotterell, Mans Hulden, and David Yarowsky. 2020.
UniMorph 3.0: Universal Morphology. In Proceed-
ings of the Twelfth Language Resources and Evalua-
tion Conference, pages 3922–3931, Marseille, France.
European Language Resources Association.
Eric Raimy. 2000. The phonology and morphology of
reduplication. de Gruyter.
Jack Rueter and Mika Hämäläinen. 2020. FST mor-
phology for the endangered Skolt Sami language.
In Proceedings of the 1st Joint Workshop on Spo-
ken Language Technologies for Under-resourced lan-
guages (SLTU) and Collaboration and Computing
for Under-Resourced Languages (CCURL), pages
250–257, Marseille, France. European Language Re-
sources association.
Jack Rueter, Mika Hämäläinen, and Niko Partanen.
2020. Open-source morphology for endangered
mordvinic languages. In Proceedings of Second
Workshop for NLP Open Source Software (NLP-OSS),
pages 94–100, Online. Association for Computa-
tional Linguistics.
Jack Rueter, Niko Partanen, Mika Hämäläinen, and
Trond Trosterud. 2021. Overview of open-source
morphology development for the Komi-Zyrian lan-
guage: Past and future. In Proceedings of the Seventh
International Workshop on Computational Linguis-
tics of Uralic Languages, pages 29–39, Syktyvkar,
Russia (Online). Association for Computational Lin-
guistics.
Aleksi Sahala, Miikka Silfverberg, Antti Arppe, and
Krister Lindén. 2020. BabyFST - towards a finite-state based computational model of Ancient Babylonian. In Proceedings of the Twelfth Language
Resources and Evaluation Conference, pages 3886–
3894, Marseille, France. European Language Re-
sources Association.
Khaled Shaalan and Mohammed Attia. 2012. Handling
unknown words in Arabic FST morphology. In Pro-
ceedings of the 10th International Workshop on Fi-
nite State Methods and Natural Language Processing,
pages 20–24, Donostia–San Sebastián. Association
for Computational Linguistics.
Conor Snoek, Dorothy Thunder, Kaidi Lõo, Antti Arppe,
Jordan Lachler, Sjur Moshagen, and Trond Trosterud.
2014. Modeling the noun morphology of Plains Cree.
In Proceedings of the 2014 Workshop on the Use of
Computational Methods in the Study of Endangered
Languages, pages 34–42, Baltimore, Maryland, USA.
Association for Computational Linguistics.
Warren Snyder. 1968. Southern Puget Sound Salish
Texts, Place Names, and Dictionary, volume 9. Sacra-
mento; Sacramento Anthropological Society.
Xuri Tang. 2006. English morphological analysis with
machine-learned rules. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, pages 35–41, Huazhong Normal University, Wuhan, China. Tsinghua University Press.
Francis M. Tyers and Samuel Herrera Castro. 2023. Towards a finite-state morphological analyser for San Mateo Huave. In Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 30–37, Remote. Association for Computational Linguistics.
Suzanne Urbanczyk. 1994. Double reduplication in
parallel. Proceedings of the June 1994 Prosodic
Morphology Workshop. Utrecht.
Suzanne Urbanczyk. 1996. Morphological tem-
plates in reduplication. University of Mas-
sachusetts/University of British Columbia.
... Miller and De Santo (2023) present an analysis of tone patterns in Thai reduplication and propose a computational model to handle these patterns within a finite-state framework. FST-based techniques are also utilized for morphological analysis, lemmatization, and word generation in Lushootseed (Rueter et al., 2023). Additionally, FSTs are employed for morphological analysis (Alblwi et al., 2023) and transliteration (Malik et al., 2008) in Urdu. ...
Article
Full-text available
Reduplication is a highly productive process in Bengali word formation, with significant implications for various natural language processing (NLP) applications, such as parts-of-speech tagging and sentiment analysis. Despite its importance, this area has not been extensively explored in computational linguistics, especially for low-resource languages like Bengali. This study first demonstrates that a two-way finite-state transducer (FST) can effectively capture complete reduplication generation processes in Bengali. Second, it is shown that the formation of partial reduplication requires a set of 2-way FSTs due to the diverse patterns involved in Bengali partial reduplications. Third, the research highlights the utility of the reduplication generation process in identifying Bengali reduplication instances, achieving a commendable F1-Score of 88.11%. This method outperforms current state-of-the-art methods for identifying reduplicated expressions in Bengali text. This research contributes valuable insights into the computational representation of reduplication in Bengali, offering potential enhancements for NLP tasks in low-resource language scenarios.
... There is also work on using finite state transducers to do morphological tagging and segmentation. Some languages where such taggers exist are Haida (Lachler et al., 2018), Michif (Davis et al., 2021), Cree (Snoek et al., 2014), Lushootseed (Rueter et al., 2023), Wixarika (Mager et al., 2018a), Nahuatl (Pugh and Tyers, 2021) and Guaraní (Kuznetsova and Tyers, 2021). Languages where custom methods have been used for morphological tagging and segmentation include Inuktitut (Khandagale et al., 2022;Le and Sadat, 2021), Seneca (Liu et al., 2021), Quechua (Llitjós et al., 2005), Shipibo-Konibo (Mercado-Gonzales et al., 2018) and Mapugundun (Molineaux, 2023). ...
Conference Paper
Full-text available
We present our work towards building an infrastructure for documenting endangered languages with the focus on Uralic languages in particular. Our infrastructure consists of tools to write dictionaries so that entries are struc-tured in XML format. These dictionaries are the foundation for rule-based NLP tools such as FSTs. We also work actively towards enhancing these dictionaries and tools by using the latest state-of-the-art neural models by generating training data through rules and lexica.
Conference Paper
Full-text available
This study describes the ongoing development of the finite-state description for an endangered minority language, Komi-Zyrian. This work is located in the context where large written and spoken language corpora are available, which creates a set of unique challenges that have to be, and can be, addressed. We describe how we have designed the transducer so that it can benefit from existing open-source infrastructures and therefore be as reusable as possible.
Conference Paper
We train neural models for morphological analysis, generation and lemmatization for morphologically rich languages. We present a method for automatically extracting a substantial amount of training data from FSTs for 22 languages, 17 of which are endangered. The neural models follow the same tagset as the FSTs in order to make it possible to use them as fallback systems together with the FSTs. The source code, models and datasets have been released on Zenodo.
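The FST-to-training-data idea can be sketched with a toy generator standing in for the transducer. The mini paradigm, the Finnish-like endings and the `lemma+N+Sg+Case` tag format below are illustrative assumptions, not the released datasets:

```python
# Sketch of turning a rule-based generator into (surface form, analysis)
# training pairs, mirroring the idea of extracting data from FSTs.

def generate(lemma: str, case: str) -> str:
    """A toy stand-in for FST generation: attach a case ending to the lemma."""
    endings = {"Nom": "", "Gen": "n", "Ine": "ssa"}
    return lemma + endings[case]

lemmas = ["talo", "kala"]
cases = ["Nom", "Gen", "Ine"]

training_data = [
    (generate(lemma, case), f"{lemma}+N+Sg+{case}")
    for lemma in lemmas
    for case in cases
]
for surface, analysis in training_data:
    print(surface, "\t", analysis)
```

Since the neural model is trained on exactly these FST-style analysis strings, it can share a tagset with the transducer and serve as a fallback when the FST has no analysis.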
Chapter
The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, calling endangered languages low-resourced is a bit of an overstatement. In this paper, I inspect the relation of the endangered with the low-resourced from my own experiences.
Conference Paper
This document describes the shared development of a finite-state description of two closely related but endangered minority languages, Erzya and Moksha. It touches upon the morpholexical unity and diversity of the two languages and how this motivates shared open-source FST development. We describe how we have designed the transducers so that they can benefit from existing open-source infrastructures and are as reusable as possible.
Conference Paper
Akkadian is a fairly well-resourced extinct language that does not yet have a comprehensive morphological analyzer available. In this paper we describe a general finite-state based morphological model for Babylonian, a southern dialect of the Akkadian language, that can achieve coverage of up to 97.3% and recall of up to 93.7% on token-level lemmatization and POS tagging from transcribed input. Since Akkadian word forms exhibit a high degree of morphological ambiguity, in that only 20.1% of running word tokens receive a single unambiguous analysis, we attempt a first pass at weighting our finite-state transducer, using existing extensive Akkadian corpora which have been partially validated for their lemmas and parts of speech but not for the full morphological analyses. The resultant weighted finite-state transducer yields a moderate improvement, so that for 57.4% of the word tokens the highest-ranked analysis is the correct one. We conclude with a short discussion on how morphological ambiguity in the analysis of Akkadian could be further reduced with improvements in the training data used in weighting the finite-state transducer, as well as through other, context-based techniques.
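The ranking step a weighted transducer performs can be sketched in a few lines: each surface form maps to several candidate analyses, each carrying a weight, and the lowest-weight analysis wins (assuming, as weighted FST toolkits typically do, a tropical semiring where lower is better). The word form, analyses and weights below are invented for illustration, not taken from the Babylonian model:

```python
# Toy disambiguation over weighted analyses: pick the lowest-weight candidate.
analyses = {
    "iprus": [
        ("iprusum+N+Sg+Nom", 1.2),
        ("eperum+V+Prs+3Sg", 0.4),
        ("iprusum+N+Sg+Gen", 2.7),
    ],
}

def best_analysis(word: str) -> str:
    """Return the analysis with the lowest weight (most probable)."""
    return min(analyses[word], key=lambda pair: pair[1])[0]

print(best_analysis("iprus"))
```

In the paper's setting the weights come from partially validated corpus counts, which is why the top-ranked analysis is correct for only 57.4% of tokens and context-based reranking is suggested as future work.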
Conference Paper
We present advances in the development of an FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology on the one hand, while there is little gold-standard material for it on the other. This makes NLP approaches to its study difficult without a solid morphological analysis. The language is severely endangered, and the work presented in this paper forms part of a greater whole in its revitalization efforts. Furthermore, we intersperse our description with facilitation and description practices not well documented in the infrastructure. Currently, the analyzer covers over 30,000 Skolt Sami words in 148 inflectional paradigms and over 12 derivational forms.
Article
We introduce Morse, a recurrent encoder-decoder model that produces morphological analyses of each word in a sentence. The encoder turns the relevant information about the word and its context into a fixed size vector representation and the decoder generates the sequence of characters for the lemma followed by a sequence of individual morphological features. We show that generating morphological features individually rather than as a combined tag allows the model to handle rare or unseen tags and to outperform whole-tag models. In addition, generating morphological features as a sequence rather than, for example, an unordered set allows our model to produce an arbitrary number of features that represent multiple inflectional groups in morphologically complex languages. We obtain state-of-the-art results in nine languages of different morphological complexity under low-resource, high-resource, and transfer learning settings. We also introduce TrMor2018, a new high-accuracy Turkish morphology data set. Our Morse implementation and the TrMor2018 data set are available online to support future research. ¹ See https://github.com/ai-ku/Morse.jl for a Morse implementation in Julia/Knet (Yuret, 2016 ) and https://github.com/ai-ku/TrMor2018 for the new Turkish data set.
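The whole-tag versus feature-sequence contrast that Morse exploits can be made concrete with a toy check: a whole-tag model can only emit tag combinations it saw in training, while a model that decodes features one by one can compose an unseen combination from seen parts. The tags below are invented for illustration and are not Morse's actual tagset:

```python
# Why per-feature decoding handles unseen tag combinations: decompose combined
# tags into individual features and test coverage of a novel combination.

training_tags = {"Noun+Sg+Nom", "Noun+Pl+Nom", "Noun+Sg+Gen"}
training_features = {f for tag in training_tags for f in tag.split("+")}

unseen_tag = "Noun+Pl+Gen"   # never seen as a whole tag in training

whole_tag_model_can_predict = unseen_tag in training_tags
feature_model_can_predict = all(f in training_features
                                for f in unseen_tag.split("+"))
print(whole_tag_model_can_predict, feature_model_can_predict)  # False True
```

Generating features as an ordered sequence additionally lets the decoder emit a variable number of features, which matters for multi-group inflection in morphologically complex languages.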
Article
In the past years the natural language processing (NLP) tools and resources for small Uralic languages have received a major uplift. The open-source Giellatekno infrastructure has served a key role in gathering these tools and resources in an open environment for researchers to use. However, many of the crucially important NLP tools, such as FSTs and CGs, require specialized tooling with a learning curve. This paper presents UralicNLP, a Python library whose goal is to mask the actual implementation behind a Python interface. This not only lowers the threshold to using the tools provided in the Giellatekno infrastructure but also makes it easier to incorporate them as part of research code written in Python.