Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020), pages 36–40
Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC
Morphological Disambiguation of South Sámi with FSTs and Neural Networks

Mika Hämäläinen, Linda Wiechetek
University of Helsinki, UiT The Arctic University of Norway
Finland, Norway
mika.hamalainen@helsinki.fi, linda.wiechetek@uit.no
Abstract
We present a method for conducting morphological disambiguation for South Sámi, which is an endangered language. Our method uses an FST-based morphological analyzer to produce an ambiguous set of morphological readings for each word in a sentence. These readings are disambiguated with a Bi-RNN model trained on the related North Sámi UD Treebank and some synthetically generated South Sámi data. The disambiguation is done on the level of morphological tags, ignoring word forms and lemmas; this makes it possible to use North Sámi training data for South Sámi without the need for a bilingual dictionary or aligned word embeddings. Our approach requires only minimal resources for South Sámi, which makes it usable and applicable in the context of any other endangered language as well.

Keywords: Sámi languages, disambiguation, endangered languages
1. Introduction
Sámi languages are a part of the Uralic language family, and like many other Uralic languages, they are endangered. The languages of this family are synthetic, meaning that they exhibit a great deal of inflectional and derivational morphology, which makes their computational processing far from trivial.
In this paper, we present a method for morphological disambiguation of South Sámi (ISO 639-3 code sma) by using a morphological FST (finite-state transducer) analyzer and a Bi-RNN (bi-directional recurrent neural network) trained on North Sámi (ISO 639-3 code sme) data and synthetically generated South Sámi data. The disambiguation process takes in all the morphological readings produced by the FST and uses the neural network to pick the contextually correct disambiguated reading.
North and South Sámi are not direct neighbors in the dialect continuum, but they share a large part of the lexicon and many grammatical features, such as an elaborate case system, non-finite clause constructions, and a large number of verbal and nominal derivations. However, they differ in a number of respects in lexicon, morphology and syntax.
One of the important differences is the omission of the copula verb in South Sámi, a phenomenon that is absent or much less frequent in North Sámi. The typical word order is SOV (subject-object-verb) in South Sámi, and SVO (subject-verb-object) in North Sámi. The case system is slightly different as well: South Sámi distinguishes between the inessive (place) and elative (source) cases (Bergsland, 1994), whereas in North Sámi these are merged into one morpho-syntactic case, the locative.
In addition to the aforementioned differences, the homonymies are not the same either. In North Sámi, regular noun homonymies are genitive/accusative and comitative singular/locative plural. In South Sámi, on the other hand, they are illative plural/accusative plural and essive (underspecified as regards number)/inessive plural/comitative singular.
Even in the context of morphologically rich languages, simple POS (part-of-speech) tagging is often not enough, as it only reduces some of the ambiguity and does not suffice for lemmatization, for instance. At the same time, without lemmatization, and given the small amount of data available for these languages, modern NLP methods such as word embeddings cannot be used as reliably as in the case of majority languages.
South Sámi, with its estimated 500 speakers, is categorized as severely endangered by UNESCO (Moseley, 2010) and is spoken in Norway and Sweden. Its bilingual users frequently face greater challenges regarding literacy in the lesser-used language than in the majority language due to reduced access to language arenas (Outakoski, 2013; Lindgren et al., 2016).
The central tools used for the disambiguation of Sámi languages are finite-state transducers and Constraint Grammars. Constraint Grammar is a rule-based formalism for writing disambiguation and syntactic annotation grammars (Karlsson, 1990; Karlsson et al., 1995). Constraint Grammar relies on a bottom-up analysis of running text: possible but unlikely analyses are discarded step by step with the help of morpho-syntactic context. The vislcg3 implementation¹ is used in particular.
South Sámi has several Constraint Grammars, including a morpho-syntactic disambiguator, a shallow syntactic analyzer, and a dependency analyzer (Antonsen et al., 2010; Antonsen and Trosterud, 2011). Antonsen and Trosterud (2011) use a fairly small Constraint Grammar (115 rules) for South Sámi part-of-speech (POS) and lemma disambiguation, resulting in a precision of 0.87 and a recall of 0.98 for full morpho-syntactic disambiguation. While these are very good results with a comparatively small workload, they require the work of a linguist with knowledge of the language, or a linguist together with a language expert. However, we want to show how grammatical tools can be built in the absence of these.
Whereas our paper deals with South Sámi disambiguation, the main purpose of this work is to demonstrate that a disambiguator can be built with relatively few resources based on a morpho-syntactically related language. This is useful not only in the wider context of Sámi languages, but also for other endangered languages, as it quickly provides the language community with much-needed resources while there are children, the future speakers, learning the language. Our approach follows the previously established ideology of using FSTs together with neural networks to solve the problem of disambiguation (Ens et al., 2019).

¹ http://visl.sdu.dk/constraint_grammar.html (accessed 2018-10-08); see also Didriksen (2010).

Sentence:        Gos dáppe lea máddi? 'Where is the South here?'
FST output:      ['gos+Adv+Subqst', 'gos+Adv'], ['dáppe+Adv'], ['leat+V+IV+Ind+Prs+Sg3'], ['máddat+V+TV+Imprt+Du2', 'máddat+V+TV+PrsPrc', 'máddi+N+Sg+Nom'], [?+CLB]
Source sequence: Adv Subqst Adv IV Ind Prs Sg3 V Du2 Imprt N Nom PrsPrc Sg TV V CLB
Target sequence: Adv Adv Mood=Ind Number=Sing Person=3 Tense=Pres VerbForm=Fin V Case=Nom Number=Sing N CLB

Table 1: An example of the training data
2. Related Work
Parallel texts have been used to deal with morphological tagging in the context of low-resource languages (Buys and Botha, 2016). The authors use aligned parallel sentences to train their Wsabie-based model to tag the low-resource language based on the morphological tags of the sentences of a more resourced language in the training data. A limitation of this approach is that the morphological relatedness of the high-resource and low-resource languages has to be high.
Andrews et al. (2017) have proposed a method for POS (part-of-speech) tagging of low-resource languages. They use a bilingual dictionary between a low-resource and a high-resource language. In addition, their system requires monolingual data for building cross-lingual word embeddings. The resulting POS tagger is trained as an LSTM neural network, and their approach performs consistently better than the other approaches on the benchmarks they report.
Lim et al. (2018) present an approach for syntactic parsing of Komi-Zyrian and North Sámi data using multilingual word embeddings. They use pre-trained word embeddings of two high-resource languages, Finnish and Russian, and train monolingual word embeddings for the low-resource languages from small corpora. They project these individual word embeddings into a single space by using bilingual dictionaries for alignment. The parser was implemented as an LSTM-based model, and its performance is higher for POS tagging than for syntactic parsing. The most important finding for our purposes is that including a related high-resource language improves the accuracy of their method.
DsDs (Plank and Agić, 2018) is a neural network based part-of-speech tagger intended to be used in the context of low-resource languages. Its core idea is to use a bi-LSTM model to project POS tags from one language to another with the help of lexical information and word embeddings. The authors' experiments in a low-resource setting reveal that including word embeddings can boost the model, and that lexical information can also help to a smaller degree.
The scope of a great part of the related work is limited to POS tagging. Nevertheless, the morphologically rich Uralic languages call for more full-blown morphological disambiguation than mere POS tagging in order to make higher-level NLP tools usable for these languages. Moreover, our approach cannot count on the existence of high-quality bilingual dictionaries between morphologically similar languages, nor on aligned word embeddings, as such resources are not easily available for endangered languages.
3. Data and Tools
The training data for South Sámi disambiguation comes from the Universal Dependencies treebank of the related North Sámi language (Sheyanova and Tyers, 2017). Out of all the Sámi languages, North Sámi has by and large the biggest amount of NLP resources available, and its use as a starting point for related languages therefore makes perfect sense. The treebank consists of 26K tokens and comes pre-divided into training and testing datasets.
In addition to the treebank, we use FSTs for both North Sámi and South Sámi with UralicNLP (Hämäläinen, 2019). These transducers are integrated in the open GiellaLT infrastructure (Moshagen et al., 2014) for Uralic languages. The FSTs take in a word in an inflectional form and produce all the possible morphological readings for it.
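As a rough illustration of this lookup step, the sketch below mocks the FST with a tiny hand-written table built from the Table 1 example; a real run would query the GiellaLT transducers (for instance through UralicNLP) instead of a Python dict, so the table and helper name here are our own illustrative assumptions.

```python
# Minimal mock of an FST lookup: maps a surface form to all of its
# morphological readings ('lemma+Tag1+Tag2+...'). A real system queries
# the GiellaLT transducer rather than a dictionary like this one.
MOCK_FST = {
    "gos": ["gos+Adv+Subqst", "gos+Adv"],
    "dáppe": ["dáppe+Adv"],
    "lea": ["leat+V+IV+Ind+Prs+Sg3"],
    "máddi": ["máddat+V+TV+Imprt+Du2",
              "máddat+V+TV+PrsPrc",
              "máddi+N+Sg+Nom"],
}

def analyze(word):
    """Return every possible reading for a word, like an FST would."""
    return MOCK_FST.get(word, [word + "+?"])

# Ambiguous readings for the whole Table 1 sentence:
readings = [analyze(w) for w in "gos dáppe lea máddi".split()]
```

Note how `máddi` receives three competing readings (two verbal, one nominal); it is exactly this kind of set that the downstream neural model must reduce to one reading per word.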
In order to evaluate our system, we use a small dataset for South Sámi that has been disambiguated automatically by a Constraint Grammar and checked manually. Currently, the dataset is not publicly available. The data consists of 1,994 disambiguated sentences, and we use it only for the evaluation.

          North Sámi   South Sámi
Average   3.1          1.8

Table 2: Average ambiguity
Table 2 shows the average morphological ambiguity in the North Sámi training set and the South Sámi test set when the FSTs are used to produce all morphological readings for every word in the corpus. As we can see, North Sámi exhibits a much higher degree of morphological ambiguity than South Sámi.
For generating more data, we use the South Sámi lemmas from the South Sámi-Norwegian dictionary located in the GiellaLT infrastructure (Moshagen et al., 2014). The dictionary has 11,438 POS-tagged South Sámi lemmas. We only use this dictionary for South Sámi words and omit all the Norwegian translations in our method.
Template                                    Target morphology
(N Sg Nom) (N Sg Ill) (V IV Ind Prs Sg3)    (N Case=Nom Number=Sing) (N Case=Ill Number=Sing) (V Mood=Ind Number=Sing Person=3 Tense=Pres VerbForm=Fin)
(N Sg Nom) (Adv) (V TV Ger)                 (N Case=Nom Number=Sing) (Adv) (V VerbForm=Ger)
(N Sg Nom) (N Sg Ine)                       (N Case=Nom Number=Sing) (N Case=Ine Number=Sing)
mannem (N Sg Acc) (V TV Ind Prs Sg1)        (Pron Case=Acc Number=Sing Person=1 PronType=Prs) (N Case=Acc Number=Sing) (V Mood=Ind Number=Sing Person=1 Tense=Pres VerbForm=Fin)
(N Sg Nom) (N Sg Ela) (V IV Ind Prs Sg3)    (N Case=Nom Number=Sing) (N Case=Ela Number=Sing) (V Mood=Ind Number=Sing Person=3 Tense=Pres VerbForm=Fin)
altemse (V TV Ind Prs Sg1) (N Ess)          (Pron Case=Acc Number=Sing Person=3 PronType=Prs) (V Mood=Ind Number=Sing Person=1 Tense=Pres VerbForm=Fin) (N Case=Ess)

Table 3: Templates for generating South Sámi data
4. Neural Disambiguation
We train a sequence-to-sequence Bi-RNN model using OpenNMT (Klein et al., 2017) with the default settings, except for the encoder, where we use a BRNN (bi-directional recurrent neural network) instead of the default RNN (recurrent neural network), as the BRNN has been shown to provide a performance gain in a variety of tasks. We use the default of two layers for both the encoder and the decoder, and the default attention model, which is the general global attention presented by Luong et al. (2015). We experiment with two models: one trained with the North Sámi treebank only, and another trained with both the template-generated South Sámi text and the North Sámi data. Both models are trained for 60,000 training steps with the same random seed value.
The North Sámi data gives us the target sequence, that is, the correct morphological tags and the POS tag. However, the source sequence has to be generated automatically before the training. For this, we use the North Sámi FST analyzer to produce all the possible morphological readings for each word in the treebank. The model is trained to map a sorted list of the homonymous readings of each word, with word boundaries indicated by a separator character, to the disambiguated readings from the UD (Universal Dependencies) treebank on the sentence level. In other words, the only thing the model sees are morphological tags on both the source and the target side. Lemmas and word forms are dropped so that the model can be used for South Sámi without the need for aligned word embeddings or dictionaries. This is illustrated in Table 1.
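Building such a source sequence can be sketched as follows; this is a simplified reconstruction of the representation just described, and the `|` boundary token and helper name are our own assumptions rather than the paper's actual code.

```python
def to_source_sequence(ambiguous_readings, boundary="|"):
    """Turn the FST readings of one sentence into a flat tag sequence.

    ambiguous_readings holds one list of readings per word, each reading
    shaped like 'lemma+Tag1+Tag2+...'. Lemmas are stripped, and the tags
    of all readings of a word are pooled and sorted, so the model never
    sees word forms or lemmas.
    """
    words = []
    for readings in ambiguous_readings:
        tags = set()
        for reading in readings:
            tags.update(reading.split("+")[1:])  # drop the lemma
        words.append(" ".join(sorted(tags)))
    return f" {boundary} ".join(words)

# The Table 1 example sentence, 'Gos dáppe lea máddi?':
sent = [
    ["gos+Adv+Subqst", "gos+Adv"],
    ["dáppe+Adv"],
    ["leat+V+IV+Ind+Prs+Sg3"],
    ["máddat+V+TV+Imprt+Du2", "máddat+V+TV+PrsPrc", "máddi+N+Sg+Nom"],
]
source = to_source_sequence(sent)
```

On the Table 1 sentence this reproduces the per-word tag groups of the source sequence, e.g. `Adv Subqst` for the first word and `Du2 Imprt N Nom PrsPrc Sg TV V` for the three-way ambiguous final word.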
For producing synthetic data, we wrote six small templates that reflect some common morpho-syntactic differences between South Sámi and North Sámi, for instance, the absence of the elative and inessive cases in North Sámi, both of which are merged into the single locative case. For each template, we produce 20 different ambiguous sentences by selecting words fitting the template at random from the South Sámi dictionary and inflecting them accordingly with the FST. Once the words are inflected, we can analyze them to get the ambiguous readings. The templates can be seen in Table 3.
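The generation loop can be sketched roughly as below. This is a mock: the `LEMMAS` lists and the `inflect` helper stand in for the South Sámi dictionary and the FST generation step, neither of which is reproduced here, and the placeholder surface forms are not real inflections.

```python
import random

# Stand-ins for the dictionary and the FST; a real run would draw lemmas
# from the GiellaLT South Sámi dictionary and inflect them with the
# transducer instead of formatting a placeholder string.
LEMMAS = {"N": ["gåetie", "johke"], "V": ["båetedh", "vuejnedh"]}

def inflect(lemma, tags):
    """Hypothetical FST generation call: lemma + tags -> surface form."""
    return f"{lemma}<{'+'.join(tags)}>"  # placeholder, not a real form

def generate(template, n=20, seed=0):
    """Fill each (POS, tag, ...) slot of a template with a random lemma
    of the right part of speech, n times, producing n sentences."""
    rng = random.Random(seed)
    sentences = []
    for _ in range(n):
        words = [inflect(rng.choice(LEMMAS[slot[0]]), slot[1:])
                 for slot in template]
        sentences.append(" ".join(words))
    return sentences

# The third template from Table 3: (N Sg Nom) (N Sg Ine)
sents = generate([("N", "Sg", "Nom"), ("N", "Sg", "Ine")])
```

Once generated, each sentence would be re-analyzed with the FST to obtain its ambiguous readings, exactly as for the treebank data.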
5. Results and Evaluation
We evaluate the two models, the one trained only with the North Sámi treebank and the one with additional template-generated training data, against the disambiguated gold standard that exists for South Sámi. As the South Sámi gold standard follows the GiellaLT FST tags, we converted the tags automatically into the UD format, since the neural network is trained to output UD tags.
The evaluation results are shown in Table 4. The first column shows the percentage of sentences that have been fully disambiguated correctly; the second column shows this on the word level, i.e. how many words were disambiguated fully correctly; and the last column shows the accuracy in POS tagging. The results indicate that adding the small amount of synthetically generated data to the training boosted the results significantly.
                         Fully correct   Fully correct   POS
                         sentences       words           correct
N. Sámi only             12.0%           37.6%           59.7%
N. Sámi & templates      13.0%           42.2%           66.4%

Table 4: Evaluation results of the two different models on South Sámi data
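The three accuracies can be computed straightforwardly; the helper below is our own sketch (not the paper's evaluation code), assuming gold and predicted analyses come as per-word tag strings whose final tag is the POS, following the target representation of Table 1.

```python
def evaluate(gold, pred):
    """Compute sentence-, word- and POS-level accuracy.

    gold/pred: lists of sentences; each sentence is a list of per-word
    analysis strings such as 'Case=Nom Number=Sing N', where the last
    tag is the POS.
    """
    sent_ok = word_ok = pos_ok = n_words = 0
    for g_sent, p_sent in zip(gold, pred):
        sent_ok += g_sent == p_sent          # fully correct sentence
        for g, p in zip(g_sent, p_sent):
            n_words += 1
            word_ok += g == p                # fully correct word
            pos_ok += g.split()[-1] == p.split()[-1]  # POS only
    return sent_ok / len(gold), word_ok / n_words, pos_ok / n_words
```

A sentence counts toward the first metric only when every word in it is disambiguated correctly, which is why the sentence-level figures in Table 4 are so much lower than the word-level ones.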
As for the incorrectly disambiguated morphological readings, there is a degree to how incorrect they are. This is shown in Table 5, which breaks the errors down by how many morphological tags were predicted wrong. In both cases, more than half of the wrongly disambiguated words differ by only one tag from the gold standard. The results for the model trained on the additional template data show that even the errors this model makes are closer to the correct reading.
Below, we take a closer look at the actual sentences and their analyses, shedding some light on the shortcomings of the neural network and suggesting improvements. In ex. (1), our system erroneously picks the nominal singular nominative instead of the adverb reading for daelie. The nominal reading, however, is very rare.
                         1 tag    2 tags   3 tags   more tags
North Sámi only          58.4%    11.8%    15.8%    14.0%
North Sámi & templates   60.8%    12.1%    15.1%    12.0%

Table 5: Errors based on the number of erroneous morphological tags on the word level
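One way to operationalize the per-word error counts behind Table 5 is to count the gold tags missing from the predicted reading; the helper below is our own sketch of such a counter, since the paper does not specify the exact counting scheme.

```python
def tag_errors(gold_reading, pred_reading):
    """Count the morphological tags of the gold reading that the
    predicted reading fails to reproduce. Readings are space-separated
    tag strings, e.g. 'Case=Nom Number=Sing N'."""
    gold_tags = set(gold_reading.split())
    pred_tags = set(pred_reading.split())
    return len(gold_tags - pred_tags)
```

Under this counting, predicting the illative instead of the nominative while getting number and POS right, for example, would be a one-tag error.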
(1) Daelie           dle   geajnam      gaavnem!
    now;then.SG.NOM  so    street.ACC   find.PRS.SG1
    'Then I find the street!'
Negation verbs pose a problem for the neural network. Of the 238 instances, only very few negation verbs, despite not being homonymous with any other forms, are analyzed as such. In ex. (2), im 'I don't' is analyzed as an indicative past tense verb in the 1st person singular (the last of which is correct) despite the fact that im is not ambiguous.
(2) Im           sïjhth       gåabph    gih,  men  tjidtjie  jeahta  månnoeh.
    not.NEG.SG1  want.CONNEG  anywhere  then  but  mother    says    us.DU1.NOM
    'I don't want anywhere then, but mother says us two will go.'
There are other difficulties related to negation in the system. In the following example, the neural network predicts more tokens than the sentence contains, i.e. a negation verb (correctly) and a connegative form (erroneously), the latter usually being preceded by the negation verb.

(3) - Aellieh!
      not.NEG.IMPRT.SG2
    '- Don't!'
6. Discussion and Conclusions
Uralic languages are highly ambiguous in terms of their morphology, and linguistic resources such as annotated corpora are quite limited for these languages. This poses challenges for the use of modern NLP methods that have been successfully employed on high-resource languages. In order to overcome these limitations, we proposed a representation based on the ambiguous morphological tags of each word in a sentence.
We have presented a viable way of disambiguating South Sámi based on an FST and North Sámi training data, with minimal templates needed to cover some of the morpho-syntactic differences between the two languages. The preliminary results look promising, especially since there are nine different Sámi languages, not to mention similar situations for other endangered languages where data for a related language is available.
Our method is a hybrid pipeline of rule-based FSTs that produce the possible morphological readings and a neural network that does the disambiguation. This makes it possible to replace the FST with some other rule-based solution or a neural network based morphological analyzer, given that recent research has shown promising results for the use of neural networks in the morphology of endangered languages (Schwartz et al., 2019; Silfverberg and Tyers, 2019).
Moreover, our pipeline can be further enhanced by rules. In our experiments, we had the neural network disambiguate among all the possible morphological readings. Instead, it is possible to disambiguate first with a rule-based tool such as a Constraint Grammar, and use the neural network only to resolve the remaining ambiguity. That way, we do not need to guess what we already know: if the morphology of a word is already certain, the neural network should not be used to guess it again. This would allow for combining the best of the two worlds: the accuracy of the rule-based methods and the scalability of a neural network.
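This division of labour can be sketched as below; the helper names are hypothetical, with `cg_disambiguate` standing in for a Constraint Grammar pass and `neural_pick` for a call to the trained model.

```python
def disambiguate(sentence_readings, cg_disambiguate, neural_pick):
    """Hybrid pipeline sketch: let rules prune first, then ask the
    neural model only about words that remain ambiguous.

    sentence_readings: one list of FST readings per word.
    cg_disambiguate:   rule-based pass returning pruned reading lists.
    neural_pick:       chooses one reading for the word at index i,
                       given the whole (pruned) sentence as context.
    """
    pruned = cg_disambiguate(sentence_readings)
    result = []
    for i, readings in enumerate(pruned):
        if len(readings) == 1:      # the rules already settled this word
            result.append(readings[0])
        else:                       # residual ambiguity -> neural network
            result.append(neural_pick(pruned, i))
    return result
```

The neural model is thus never asked to re-guess a reading that the rules have already fixed, which is exactly the point made above.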
An interesting question for the future is how far one could get in disambiguation with our proposed method if one were only to train the model using templates. As even a small number of templates was enough to improve the results noticeably, an entirely template-based approach does not seem to be out of the question, especially if the templates were constructed with more generative freedom, for example by following a formalism deriving from CFG (context-free grammar). The use of synthetically generated source data is known to improve NMT (neural machine translation) models when the target data is of high quality (see Sennrich et al., 2016). Also, some promising work has been conducted on fully synthetically generated parallel data in NMT (Hämäläinen and Alnajjar, 2019).
This year has been particularly good for Uralic languages, with small UD treebanks recently published for Skolt Sámi, Karelian, Livvi, Komi-Permyak and Moksha. This means that in the future we can try different variations of our method with these languages as well, with minimal modifications to the current approach, as all of these languages have rule-based FSTs available in the GiellaLT infrastructure.
7. Acknowledgments
We would like to thank Lene Antonsen and Anja Regina Fjellheim Labj for their work on the South Sámi Constraint Grammar disambiguator within the GiellaLT infrastructure and for making their automatically annotated and manually corrected South Sámi corpus available to us.
8. Bibliographical References
Andrews, N., Dredze, M., Van Durme, B., and Eisner, J.
(2017). Bayesian modeling of lexical resources for low-
resource settings. In Proceedings of the 55th Annual
Meeting of the Association for Computational Linguis-
tics (Volume 1: Long Papers), pages 1029–1039. Asso-
ciation for Computational Linguistics.
Antonsen, L. and Trosterud, T. (2011). Next to nothing – a cheap South Saami disambiguator. Pages 131–137.
Antonsen, L., Wiechetek, L., and Trosterud, T. (2010).
Reusing grammatical resources for new languages. In
Proceedings of the 7th International Conference on Lan-
guage Resources and Evaluation (LREC 2010), pages
2782–2789, Stroudsburg. The Association for Compu-
tational Linguistics.
Bergsland, K. (1994). Sydsamisk grammatikk. Davvi Girji.
Buys, J. and Botha, J. A. (2016). Cross-lingual morpholog-
ical tagging for low-resource languages. In Proceedings
of the 54th Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Papers), pages
1954–1964. Association for Computational Linguistics.
Didriksen, T., (2010). Constraint Grammar Manual: 3rd
version of the CG formalism variant. GrammarSoft ApS,
Denmark.
Ens, J., Hämäläinen, M., Rueter, J., and Pasquier, P. (2019). Morphosyntactic disambiguation in an endangered language setting. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 345–349.
Hämäläinen, M. and Alnajjar, K. (2019). A template based approach for training NMT for low-resource Uralic languages – a pilot with Finnish. In Proceedings of the 2019 2nd International Conference on Algorithms, Computing and Artificial Intelligence, pages 520–525.
Karlsson, F., Voutilainen, A., Heikkilä, J., and Anttila, A. (1995). Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter, Berlin.
Karlsson, F. (1990). Constraint Grammar as a Framework
for Parsing Running Text. In Hans Karlgren, editor,
Proceedings of the 13th Conference on Computational
Linguistics (COLING 1990), volume 3, pages 168–173,
Helsinki, Finland. Association for Computational Lin-
guistics.
Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A. M.
(2017). OpenNMT: Open-Source Toolkit for Neural Ma-
chine Translation. In Proc. ACL.
Lim, K., Partanen, N., and Poibeau, T. (2018). Multilingual dependency parsing for low-resource languages: Case studies on North Saami and Komi-Zyrian. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
Lindgren, E., Sullivan, K., Outakoski, H., and Westum, A. (2016). Researching literacy development in the globalised North: studying tri-lingual children's English writing in Finnish, Norwegian and Swedish Sápmi. In David R. Cole et al., editors, Super Dimensions in Globalisation and Education, Cultural Studies and Transdisciplinarity in Education, pages 55–68. Springer, Singapore.
Luong, M.-T., Pham, H., and Manning, C. D. (2015).
Effective approaches to attention-based neural machine
translation. arXiv preprint arXiv:1508.04025.
Moseley, C., editor. (2010). Atlas of the World's Languages in Danger. UNESCO Publishing, 3rd edition. Online version: http://www.unesco.org/languages-atlas/.
Outakoski, H. (2013). Davvisámegielat čálamáhtu konteaksta [The context of North Sámi literacy]. Sámi dieđalaš áigečála, 1/2015:29–59.
Plank, B. and Agić, Ž. (2018). Distant supervision from disparate sources for low-resource part-of-speech tagging. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 614–620. Association for Computational Linguistics.
Schwartz, L., Chen, E., Hunt, B., and Schreiner, S. L. (2019). Bootstrapping a neural morphological analyzer for St. Lawrence Island Yupik from a finite-state transducer. In Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages, Volume 1 (Papers), pages 87–96, Honolulu, February. Association for Computational Linguistics.
Sennrich, R., Haddow, B., and Birch, A. (2016). Improv-
ing neural machine translation models with monolingual
data. In Proceedings of the 54th Annual Meeting of the
Association for Computational Linguistics (Volume 1:
Long Papers), pages 86–96, Berlin, Germany, August.
Association for Computational Linguistics.
Silfverberg, M. and Tyers, F. (2019). Data-driven morphological analysis for Uralic languages. In Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages, pages 1–14, Tartu, Estonia, January. Association for Computational Linguistics.
9. Language Resource References
H¨
am¨
al¨
ainen, M. (2019). UralicNLP: An NLP library for
Uralic languages. Journal of Open Source Software,
4(37):1345. 10.21105/joss.01345.
Moshagen, S., Rueter, J., Pirinen, T., Trosterud, T., and Ty-
ers, F. M. (2014). Open-Source Infrastructures for Col-
laborative Work on Under-Resourced Languages. The
LREC 2014 Workshop “CCURL 2014 - Collaboration
and Computing for Under-Resourced Languages in the
Linked Open Data Era”.
Sheyanova, M. and Tyers, F. M. (2017). Annotation schemes in North Sámi dependency parsing. In Proceedings of the 3rd International Workshop for Computational Linguistics of Uralic Languages, pages 66–75.