Conference PaperPDF Available

Finding Sami Cognates with a Character-Based NMT Approach

Authors:

Abstract

We approach the problem of expanding the set of cognate relations with a sequence-to-sequence NMT model. The language pair of interest, Skolt Sami and North Sami, has too limited a set of parallel data for an NMT model as such. We solve this problem on the one hand, by training the model with North Sami cognates with other Uralic languages and, on the other, by generating more synthetic training data with an SMT model. The cognates found using our method are made publicly available in the Online Dictionary of Uralic Languages.
Proceedings of the Workshop on Computational Methods for
Endangered Languages
0-5.& Proceedings of the 3rd Workshop on
Computational Methods for Endangered Languages
Vol. 1 Papers
24*$-&

Finding Sami Cognates with a Character-Based
NMT Approach
Mika Hämäläinen
University of Helsinki.*,")"."-"*/&/)&-3*/,*;
Jack Rueter
jack.rueter@helsinki.$+"$,25&4&2)&-3*/,*;
0--074)*3"/%"%%*4*0/"-702,3"4 )<133$)0-"2$0-02"%0&%53$*-$.&-
"240'4)& 0.154"4*0/"-*/(5*34*$30..0/3
:*324*$-&*3#205()440805'02'2&&"/%01&/"$$&33#8*/(5*34*$3"4$)0-"24)"3#&&/"$$&14&%'02*/$-53*0/*/20$&&%*/(30'4)&!02,3)01
0/0.154"4*0/"-&4)0%3'02/%"/(&2&%"/(5"(&3#8"/"54)02*9&%"%.*/*342"4020'$)0-"202.02&*/'02."4*0/1-&"3&$0/4"$4
$53$)0-"2"%.*/$0-02"%0&%5
&$0..&/%&%*4"4*0/
"̈."̈-"̈*/&/*,""/%5&4&2"$,*/%*/(".*0(/"4&37*4)")"2"$4&2"3&%1120"$) Proceedings of the
Workshop on Computational Methods for Endangered Languages 0-24*$-&
6"*-"#-&"4 )<133$)0-"2$0-02"%0&%53$*-$.&-60-*33
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 39–45,
Honolulu, Hawai‘i, USA, February 26–27, 2019.
39
Finding Sami Cognates with a Character-Based NMT Approach
Mika Hämäläinen
Department of Digital Humanities
University of Helsinki
mika.hamalainen@helsinki.fi
Jack Rueter
Department of Digital Humanities
University of Helsinki
jack.rueter@helsinki.fi
Abstract
We approach the problem of expanding the
set of cognate relations with a sequence-to-
sequence NMT model. The language pair of
interest, Skolt Sami and North Sami, has too
limited a set of parallel data for an NMT model
as such. We solve this problem on the one
hand, by training the model with North Sami
cognates with other Uralic languages and, on
the other, by generating more synthetic train-
ing data with an SMT model. The cognates
found using our method are made publicly
available in the Online Dictionary of Uralic
Languages.
1 Introduction
Sami languages have received a fair share of in-
terest in purely linguistic study of cognate rela-
tions. Although various schools of Finno-Ugric
studies have postulated contrastive interpretations
of where the Sami languages should be located
within the language family, there is strong ev-
idence demonstrating regular sound correspon-
dence between Samic and Balto-Finnic, on the one
hand, and Samic and Mordvin, on the other. The
importance of this correspondence is accentuated
by the fact that the Samic might provide insight
for second syllable vowel quality, as not all Samic-
Mordvin vocabulary is attested in Balto-Finnic (cf.
Korhonen,1981). The Sami languages themselves
(there are seven written languages) also exhibit
regular sound correspondence, even though cog-
nates, at times, may be opaque to the layman.
One token of cognate relation studies is the Álgu
database (Kotus,2006), which contains a set of
inter-Sami cognates. Cognates have applicabil-
ity in NLP research for low-resource languages
as they can, for instance, be used to induce the
predicate-argument structures from bilingual vec-
tor spaces (Peirsman and Padó,2010).
The main motivation for this work is to extend
the known cognate information available in the
Online Dictionary of Uralic Languages (Hämäläi-
nen and Rueter,2018). This dictionary, at its cur-
rent stage, only has cognate relations recorded in
the Álgu database.
Dealing with true cognates in a non-attested hy-
pothetical proto-language presupposes adherence
to a set of sound correlations posited by a given
school of thought. Since Proto-Samic is one such
language, we have taken liberties to interpret the
term cognate in the context of this paper more
broadly, i.e. not only words that share the same
hypothesized origin in Proto-Samic are considered
cognates (hence forth: true cognates), but also
items that might be deemed loan words acquired
from another language at separate points in the
temporal-spatial dimensions. This more permis-
sive definition makes it possible to tackle the prob-
lem computationally easier given the limitation
imposed by the scarcity of linguistic resources.
Our approach does not presuppose a seman-
tic similarity of the meaning of the cognate can-
didates, but rather explores cognate possibilities
based on grapheme changes. The key idea is that
the system can learn what kinds of changes are
possible and typical for North Sami cognates with
other Uralic languages in general. Taking leverage
from this more general level knowledge, the model
can learn the cognate features between North Sami
and Skolt Sami more specifically.
We assimilate this problem with that of normal-
ization of historical spelling variants. On a higher
level, historical variation within one language can
be seen as discovering cognates of different tem-
poral forms of the language. Therefore, we want
to take the work done in that vein for the first time
in the context of cognate detection. Using NMT
(neural machine translation) on a character level
has been shown to be the single most accurate
40
method in normalization by a recent study with
historical English (Hämäläinen et al.,2018).
In this paper, we use NMT in a similar character
level fashion for finding cognates. Furthermore,
due to the limited availability of training data, we
present an SMT (statistical machine translation)
method for generating more data to boost the per-
formance of the NMT model.
2 Related Work
Automatic identification of cognates has received
a fair share of interest in the past from different
methodological stand points. In this section, we
will go through some of these approaches.
Ciobanu and Dinu (2014) propose a method
based on orthographic alignment. This means a
character level alignment of cognate pairs. After
the alignment, the mismatches around the aligned
pairs are used as features for the machine learning
algorithm.
Another take on cognate detection is that of
Rama (2016). This approach employs Siamese
convolutional networks to learn phoneme level
representation and language relatedness of words.
They based the study on Swadesh lists and used
hand-written phonetic features and 1-hot encoding
for the phonetic representation.
Cognate detection has also been done by look-
ing at features such as semantics, phonetics and
regular sound correspondences (St. Arnaud et al.,
2017). Their approach implements a general
model and language specific models using support
vector machine (SVM).
Rama et al. (2017) present an unsupervised
method for cognate identification. The method
consists of extracting suitable cognate pairs with
normalized Levenshtein distance, aligning the
pairs and counting a point-wise mutual informa-
tion score for the aligned segments. New sets
of alignments are generated and the process of
aligning and scoring is repeated until there are no
changes in the average similarity score.
3 Finding Cognates
In this section, we describe our proposed approach
in finding cognates between North Sami and Skolt
Sami. We present the dataset used for the training
and an SMT approach in generating more training
data.
3.1 The Data
Our training data consists of Álgu (Kotus,2006),
which is an etymological database of the Sami lan-
guages. From this database, we use all the cognate
relations recorded for North Sami to all the other
Finno-Ugric languages in the database. This pro-
duces a parallel dataset of North Sami words and
their cognates in other languages.
The North Sami to other languages parallel
dataset consists of 32905 parallel words, of which
2633 items represent the correlations between
North Sami and Skolt Sami.
We find cognates for nouns, adjectives, verbs
and adverbs recorded in the Giellatekno dictionar-
ies (Moshagen et al.,2013) for North Sami and
Skolt Sami. These dictionaries serve as an input
for the trained NMT model and for filtering the
output produced by the model.
3.2 The NMT Model
For the purpose of our research we use OpenNMT
(Klein et al.,2017) to train a character based NMT
model that will take a Skolt Sami word as its in-
put and produce a potential North Sami cognate as
its output. We use the default settings for Open-
NMT1.
We train a sequence to sequence model with the
list of known cognates in other languages as the
source data and their North Sami counterparts as
the target data. In this way, the system learns a
good representation of the target language, North
Sami, and can learn what kind of changes are
possible between cognates in general. Thus, the
model can learn additional information about cog-
nates that would not be present in the North Sami-
Skolt Sami parallel data.
In order to make the model adapt more to
the North Sami-Skolt Sami pair in particular, we
continue training the model with only the North
Sami-Skolt Sami parallel data for an additional
10 epochs. The idea behind this is to bring the
model closer to the language pair of interest in
this research, while still maintaining the additional
knowledge it has learned about cognates in general
from the larger dataset.
3.3 Using SMT to Generate More Data
Research in machine translation has shown that
generating more synthetic parallel data that can be
1Version from the project’s master branch on the 13 April
of 2018
41
noisy in the source language end but is not noisy
in the target end, can improve the overall transla-
tions of an NMT model (Sennrich et al.,2015). In
light of this finding, we will try a similar idea in
our cognate detection task as well.
Due to the limited amount of North Sami-Skolt
Sami training data available, we use SMT instead
of NMT to train a model that will produce plausi-
ble but slightly irregular Skolt Sami cognates for
the word list of North Sami words obtained from
the Giellatekno dictionaries.
We use Moses (Koehn et al.,2007) baseline2to
train a translation model to the opposite direction
of the NMT model with the same parallel data.
This means translating from North Sami to Skolt
Sami. We use the same parallel data as for the
NMT model, meaning that on the source side, we
have North Sami and on the target side we have all
the possible cognates in other languages. The par-
allel data is aligned with GIZA++ (Och and Ney,
2003).
Since we are training an SMT model, there are
two ways we can make the noisy target of all the
other languages resemble more Skolt Sami. One
is by using a language model. For this, we build
a 10-gram language model with KenLM (Heafield
et al.,2013) from Skolt Sami words recorded in
the Giellatekno dictionaries.
The other way of making the model more aware
of Skolt Sami in particular is to tune the SMT
model after the initial training. For the tuning, we
use the Skolt Sami-North Sami parallel data exclu-
sively so that the SMT model will go more towards
Skolt Sami when producing cognates.
We use the SMT model to translate all of the
words extracted from the North Sami dictionary
into Skolt Sami. This results in a parallel dataset
of real, existing North Sami words and words that
resemble Skolt Sami. We then use this data to
continue the training of the previously explained
NMT model for 10 additional epochs.
3.4 Using the NMT Models
We use both of the NMT models, i.e. the one with-
out SMT generated additional data and the one
with the data separately to assess the difference in
their performance. We feed in the extracted Skolt
Sami words from the dictionary and translate each
word to a North Sami word as it would look like if
2As described in
http://www.statmt.org/moses/?n=moses.baseline
there were a cognate for that word in North Sami.
The approach produces many non-words which
we filter out with the North Sami dictionary. The
resulting list of translated words that are actually
found in the North Sami dictionary are considered
to be potential cognates found by the method.
4 Results and Evaluation
In this section, we present the results of both of
the NMT models, the one without SMT generated
data and the one with generated data. The results
shown in Table 1 indicate that the model with the
additional SMT generated data outperformed the
other model. The evaluation is based on a 200 ran-
domly selected cognate pairs output by the mod-
els. These pairs have then been checked by an
expert linguist according to principles outlined in
(4.1).
NMT NMT + SMT
accuracy 67.5% 83%
Table 1: Percentage of correctly found cognates
Table 2 gives more insight on the number of
cognates found and how they are represented in
the original Álgu database. The results show
that while the models have poor performance in
finding the cognates in the training data, they
work well in extending the cognates outside of the
known cognate list.
NMT NMT + SMT
Same as in Álgu 75 61
North Sami word in
Álgu but no cognates
with Skolt Sami
211 226
North Sami word in
Álgu with other
Skolt Sami cognates
646 577
North Sami word not
in Álgu 848 936
Cognates found in total 1780 1800
Table 2: Distribution of cognates in relation to Álgu
As one of the purposes of our work is to help
evaluate and develop etymological research, we
will conduct a more qualitative analysis of the cor-
rectly and incorrectly identified cognates for the
better working model, i.e. the one with SMT gen-
erated data. This means that the developers should
42
be aware not only of the comparative-linguistic
rules etymologists use when assessing the regular-
ity of cognate candidates but semantics as well.
In an introduction to Samic language history
(Korhonen,1981: 110–114), the proto-language
is divided into 4 separate phases. The first phase
involves vowel changes in the first and second syl-
lables (ä–e »e–e
ˆ,u–e »o
ˆ–e
ˆ,e–ä »e–˙
a,i–e »e
ˆ–e
ˆ,
etc.) followed by vowel rotation in the first sylla-
ble dependent on the quality of the second-syllable
vowel (e–e »e–e
ˆbut e–ä »ε˙
a). The second phase
entails the loss of quantitative distinction for first-
syllable vowels such that high vowels are normal-
ized as short, and non-high first-syllable vowels
are normally long (e–e
ˆ»¯
e–e
ˆ,ε˙
a»¯ε¯
e, etc.).
The third phase involves a slight vowel shift in the
first syllable and vowel split in the second. And
finally, the fourth phase introduces diphthongiza-
tion of non-high vowels in the first syllable (¯
e–e
ˆ»
ie–e
ˆ,¯ε¯
e»eä–¯
e, etc.).
In Table 3, below, we provide an approxima-
tion of a few hypothesized sound changes for three
words that are attested in Balto-Finnic and Mord-
vin, alike. *käte ‘hand; arm’ has true cognates
in Finnish käsi, Northern Sami giehta, Skolt Sami
ˇ
kiõtt, Erzya ked’ and Moksha käd’,*tule ‘fire’ is
represented by Finnish tuli, Northern Sami dolla,
Skolt Sami toll, and Mordvin tol, while *pesä
‘nest’ is attested in Finnish pesä, Northern Sami
beassi, Skolt Sami piess, Erzya pize, and Moksha
piza. The Roman numerals in the table correspond
to four separate phases in Proto-Samic.
I II III IV
‘hand; arm’ käte kete
ˆk¯
ete
ˆkiete
ˆ
‘fire’ tule to
ˆle
ˆto
ˆle
ˆtole
ˆ
‘nest’ pesä pεs˙
a p¯εs¯
e peäs¯
e
Table 3: Illustration of some vowel correlations in 4
phases of Proto-Samic
In the evaluation, our attention was drawn to
the adherence of (143) items to accepted sound
correlations while there were (57) candidates that
failed in this respect (cf. Korhonen,1981;Lehti-
ranta,2001;Aikio,2009). Irregular sound corre-
lation can be exemplified in the North Sami word
bierdna ’bear’ and its counterpart the Skolt Sami
word peärnn (the prime indicates palatalization
in Skolt Sami orthography) ’bear cub’. The for-
mer appears to represent the word type found in
’hand’ North Sami giehta and Skolt Sami ˇ
kiõtt,
whereas the latter represents the word type found
in ’nest’ North Sami beassi and Skolt Sami piess
and ’swamp’ North Sami jeaggi and Skolt Sami
jeäˇ
gˇ
g. Hence, on the basis of the North Sami
word bierdna ’bear’, one would posit a Skolt
Sami form *piõrnn, whereas the Skolt Sami word
peärnn ’bear cub’ would presuppose a North
Sami form *beardni. Both types have firm rep-
resentation in both languages, so it would seem
that these borrowings have entered the languages
at separate points in the spatio-temporal dimen-
sions.
4.1 Analysis of the Correct Cognates
Correct cognates were selected according to two
simples principles of similarity. On the one hand,
there was the principle of conceptual similarity in
their referential denotations (i.e., this refers to fu-
ture work in semantic relations). On the other
hand, a feasible cognate pair candidate should
demonstrate adherence or near adherence to ac-
cepted sound law theory. The question of adher-
ence versus near adherence indicated here can be
directed to concepts of sound law theory, where
conceivable irregularities may further be attributed
to points in spatio-temporal dimensions (i.e., when
and where a particular word was introduced into
the lexica of the two languages involved in the in-
vestigation).
In the investigation of 200 random cognate pair
candidates, 166 cognate pair candidates exhibited
conceptual similarity which in some instances sur-
passed what might have been discovered using a
bilingual dictionary. Of the 166 acceptable cog-
nate pairs 131 candidate pairs demonstrated regu-
lar correlation to received sound law theory.
Adherence to concepts of sound law theory can
be observed in the alignment of the North Sami
words ˇcuoika ’mosquito’ and a¯
da ’marrow’ with
their Skolt Sami counterparts ˇcuõškk and õõ ¯
d, re-
spectively. Although these words may appear
opaque to the layman, and thus this alignment
might be deemed dubious at first, awareness of
cognate candidates in the Erzya Mordvin ´
se´
ske
’mosquito’ and udem ’marrow’ helps to alleviate
initial misgivings.
As may be observed above, North Sami fre-
quently has two-syllable words where Skolt Sami
attests to single-syllable words. This rela-
tive length correlation between North Sami and
Skolt Sami is described through measurement in
43
prosodic feet (cf. Koponen and Rueter,2016).
While North Sami exhibits retention of the the-
oretical stem-final vowel in two-syllable words,
Skolt Sami appears to have lost it. In fact, stem-
final vowels in Skolt Sami are symptomatic of
proper nouns and participles (derivations). In
contrast, two-syllable words with consonant-final
stems appear in both North Sami and Skolt Sami,
which means we can expect a number of basic
verbs attesting to two-syllable infinitives (North
Sami bidjat ’put’ and Skolt Sami piijjâd) and
contract-stem nouns (North Sami guoppar and
Skolt Sami kuõbbâr ’mushroom’). Upon inspec-
tion of longer words, it will appear that Skolt Sami
words are at least one syllable shorter than their
cognates in North Sami, which can be attributed
to variation in foot development.
Cognate word correlations between North Sami
and Skolt Sami can be approached by count-
ing syllables in the lemmas (dictionary forms).
Through this approach, we can attribute some
word lengths automatically to parts-of-speech, i.e.
there is only one single-syllable verb in Skolt Sami
leed’be’, and it correlates to a single-syllable
verb in North Sami leat. Other verbs are therefore
two or more syllables in length in both languages.
While two-syllable verbs in North Sami correlate
with two-syllable verbs in Skolt Sami, multiple-
syllable verbs in North Sami usually correlate to
Skolt Sami counterparts in an X<=>X-1 relation-
ship (number of syllables in the language forms,
repectively), where North Sami is one orthograph-
ical syllable longer.
Short word pairs demonstrate a clear correlation
between two-syllable base words in North Sami
and single-syllable base words in Skolt Sami,
which with the exception of 4 words were all
nouns (54 all told). The high concentration of
noun attestation for single-syllable cognate nouns
in the two languages of investigation is counter-
balanced by the representation of verbs in other
word-length correlation groups.
Cognate pairs where both North Sami and Skolt
Sami attested to two-syllable lemmas were over-
represented by verbs. There were 47 verbs, 22
nouns, 10 adjectives and 1 adverb. This result is
symptomatic of lemma-based research. Surpris-
ingly enough, however, the three-to-two syllable
correlation between North Sami and Skolt Sami
also showed a similar representation: verbs (13),
nouns (2) and adverbs (1).
There was one attested correlation for a 5-
syllable word eŋgelasgiella ’English language’ in
North Sami and its 3-syllable counterpart in Skolt
Sami eŋgglõs ˇ
kiõll. Since we are looking at a com-
pound word with 3-to-2 and 2-to-1 correlations,
we can assume that our model is recognizing indi-
vidual adjacent segments within a larger unit.
Correct cognates do not necessarily require et-
ymologically identical source forms or structure.
The recognized cognate pairs represent both re-
cent loan words or possible irregularities in sound
law theory (35) and presumably older mutual lex-
icon (131) (see 4.2, below). They also attest to
differed structure and length (i.e., this may also
include derivation and compounding). While a
majority of the cognate candidate pairs linked
words sharing the same derivational level, 11 rep-
resented instances of additional derivation in ei-
ther the North Sami or Skolt Sami word, and 3
recognized instances where one of the languages
was represented by a compound word.
4.2 Analysis of the Incorrect Cognates
Incorrect cognates often offer vital input for cog-
nate detection development. There are, of course,
words pairs that diverge in regard to both accepted
sound law theory and semantic cohesion. These
pairs have not yet been applied to development. In
contrast, word pairs that appear to adhere to sound
law theory yet are not matched semantically might
be regarded as false friends. These pairs can be
potentially useful in further development.
Of 34 semantically non-feasible candidates, 12
stood out as false friends. One such example pair
is observed in the North Sami álgu ’beginning’
and the Skolt Sami älgg ’piece of firewood’. These
two words, it should be noted, can be associated
with the Finnish cognates alku ’beginning’ and
halko ’piece of split firewood [NB! there is a loss
of the word initial h]’, respectively. Since the theo-
retically expected vowel equivalent of the first syl-
lable ain Finnish is uo and ue in North Sami and
Skolt Sami, respectively, we might assume that
neither word comes from a mutual Samic-Finnic
proto-language.
We do not know to what extent random selec-
tion has affected our results. Had the first North
Sami noun álgu been replaced with its paradig-
matic verb álgit ’begin’, the Skolt Sami älˇ
gˇ
ged,
also translated as ’begin’, would have shown di-
rect correlation for áand äin North Sami and
44
Skolt Sami, respectively. The second noun, mean-
ing ’piece of split firewood’, is actually hálgu in
North Sami, which simply demonstrates hreten-
tion and the possible recognition problems faced
in the absence of semantic knowledge.
4.3 Summary of the Analyses
The cognate candidates were evaluated according
to two criteria: One query checked for concep-
tual similarity (correct vs incorrect), and the other
checked for regularity according to received sound
law theory. While the majority (65%) of the word
pairs evaluated were both conceptually similar and
correlated to received sound law theory, an addi-
tional 17% of the candidates represented irregular
sound correlation, as indicated by the figures 131
and 35 below, respectively.
Similar Dissimilar
Regular 131 12
Irregular 35 22
Table 4: Cognate candidate evaluation
The presence of an 11% negative score for both
sound law regularity and conceptual similarity in-
dicates an improvement requirement of at least
6% before the machine can be considered relevant
(95%). The 6% attestation of false friend discov-
ery, however, displays an already existing accu-
racy in our algorithm.
5 Conclusions and Future Work
In this paper, we have shown that using a
character-based NMT is a feasible way of expand-
ing a list of cognates by training the model mostly
on the cognate pairs for North Sami words in lan-
guages other than Skolt Sami. Furthermore, we
have shown that an SMT model can be used to
generate synthetic parallel data by pushing the
model more towards the direction of Skolt Sami
by introducing a Skolt Sami language model and
tuning the model with Skolt Sami - North Sami
parallel data.
In our evaluation, we have only considered the
best cognate produced by the NMT model with the
idea of one-to-one mapping. However, it is possi-
ble to make the NMT model output more than one
possible translation. In the future, we can conduct
more evaluation for a list of top candidates to see
whether the model is able to find more than one
cognate for a given word and whether the overall
recall can be improved for the words where the
top candidate has been rejected by the dictionary
check as a non-word.
We have currently limited our research in cog-
nates between Skolt Sami and North Sami where
the translation direction of the NMT model has
been towards North Sami. An interesting future
direction would be to change the translation direc-
tion. In addition to that, we are also interested in
trying this method out on other languages recorded
in the Álgu database.
We are also interested in conducting research
that is more linguistic in its nature based on the
cognate list produced in this paper. This will shed
more light in the current linguistic knowledge of
cognates in the Sami languages. The current re-
sults of the better working NMT model are re-
leased in the Online Dictionary for Uralic Lan-
guages3.
References
Ante Aikio. 2009. The Saami Loan Words in Finnish
and Karelian, first edition. Faculty of Humanities of
the University of Oulu, Oulu. Dissertation.
Alina Maria Ciobanu and Liviu P Dinu. 2014. Au-
tomatic detection of cognates using orthographic
alignment. In Proceedings of the 52nd Annual Meet-
ing of the Association for Computational Linguistics
(Volume 2: Short Papers), volume 2, pages 99–105.
Mika Hämäläinen and Jack Rueter. 2018. Advances
in Synchronized XML-MediaWiki Dictionary De-
velopment in the Context of Endangered Uralic Lan-
guages. In Proceedings of the Eighteenth EURALEX
International Congress, pages 967–978.
Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg
Tiedemann, and Eetu Mäkelä. 2018. Normalizing
early english letters to present-day english spelling.
In Proceedings of the Second Joint SIGHUM Work-
shop on Computational Linguistics for Cultural
Heritage, Social Sciences, Humanities and Litera-
ture, pages 87–96.
Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H
Clark, and Philipp Koehn. 2013. Scalable modi-
fied Kneser-Ney language model estimation. In Pro-
ceedings of the 51st Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 2: Short
Papers), volume 2, pages 690–696.
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean
Senellart, and Alexander M. Rush. 2017. Open-
NMT: Open-Source Toolkit for Neural Machine
Translation. In Proc. ACL.
3http://akusanat.com/
45
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
Callison-Burch, Marcello Federico, Nicola Bertoldi,
Brooke Cowan, Wade Shen, Christine Moran,
Richard Zens, et al. 2007. Moses: Open source
toolkit for statistical machine translation. In Pro-
ceedings of the 45th annual meeting of the ACL on
interactive poster and demonstration sessions, pages
177–180. Association for Computational Linguis-
tics.
Eino Koponen and Jack Rueter. 2016. The first com-
plete scientific skolt sami grammar in english. FUF
63: 254344, pages 254–266. Pp. 261–265.
Mikko Korhonen. 1981. Johdatus lapin kielen his-
toriaan, volume 370 of Suomalaisen Kirjallisuu-
den Seuran toimituksia. Suomalaisen Kirjallisuuden
Seura, Helsinki.
Kotus. 2006. Álgu-tietokanta. Saamelaiskielten ety-
mologinen tietokanta. http://kaino.kotus.fi/algu/.
Juhani Lehtiranta. 2001. Yhteissaamelainen sanasto,
second edition, volume 200 of Suomalais-
Ugrilaisen Seuran toimituksia. Suomalais-
Ugrilainen Seura, Helsinki.
Sjur N. Moshagen, Tommi A. Pirinen, and Trond
Trosterud. 2013. Building an open-source de-
velopment infrastructure for language technology
projects. In Proceedings of the 19th Nordic Con-
ference of Computational Linguistics (NODALIDA
2013); May 22-24; 2013; Oslo University; Norway.
NEALT Proceedings Series 16, 85, pages 343–352.
Linköping University Electronic Press.
Franz Josef Och and Hermann Ney. 2003. A systematic
comparison of various statistical alignment models.
Computational Linguistics, 29(1):19–51.
Yves Peirsman and Sebastian Padó. 2010. Cross-
lingual induction of selectional preferences with
bilingual vector spaces. In Human Language Tech-
nologies: The 2010 Annual Conference of the North
American Chapter of the Association for Computa-
tional Linguistics, pages 921–929. Association for
Computational Linguistics.
Taraka Rama. 2016. Siamese convolutional networks
for cognate identification. In Proceedings of COL-
ING 2016, the 26th International Conference on
Computational Linguistics: Technical Papers, pages
1018–1027.
Taraka Rama, Johannes Wahle, Pavel Sofroniev, and
Gerhard Jäger. 2017. Fast and unsupervised meth-
ods for multilingual cognate clustering. arXiv
preprint arXiv:1702.04938.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2015. Improving neural machine translation
models with monolingual data. arXiv preprint
arXiv:1511.06709.
Adam St. Arnaud, David Beck, and Grzegorz Kondrak.
2017. Identifying cognate sets across dictionaries of
related languages. In Proceedings of the 2017 Con-
ference on Empirical Methods in Natural Language
Processing, pages 2519–2528.
... In the last few years, it has been approached as an MT task, as it can be seen as modelling sequence-to-sequence correspondences. Using neural networks has been promising (Beinborn et al., 2013;Wu and Yarowsky, 2018;Dekker, 2018;Hämäläinen and Rueter, 2019;Fourrier and Sagot, 2020a), although in most works the hyper-parameters of the neural models were not optimised. Moreover, the differences between MT and cognate prediction have not been studied. ...
... It is a lexical task that models regular, word-internal sound changes that transform words over time. It has been approached with phylogenetic trees combined with stochastic sound change models (Bouchard et al., 2007;Bouchard-Côté et al., 2009;Bouchard-Côté et al., 2013), purely statistical methods (Bodt et al., 2018), neural networks (Mulloni, 2007), language models (Hauer et al., 2019) andcharacter-level MT techniques (Beinborn et al., 2013;Wu and Yarowsky, 2018;Dekker, 2018;Hämäläinen and Rueter, 2019;Fourrier and Sagot, 2020a;Meloni et al., 2021), because of its similarity to a translation task (modelling sequence-to-sequence crosslingual correspondences between words). ...
... Cognates have for example been defined as words sharing spelling and meaning, regardless of their etymology (Frunza andInkpen, 2006Inkpen, , 2009, or words etymologically related no matter the relation(Hämäläinen and Rueter, 2019). ...
... Skolt Sami has received a moderate amount of NLP research interest as a result of Dr Jack Rueter's amazing work on building the fundamental NLP building blocks for Skolt Sami 2 as a result, Skolt Sami has an FST an online dictionary (see Hämäläinen et al. 2021a), a Universal Dependencies treebank (Nivre et al., 2022) and some neural models to identify cognates (Hämäläinen and Rueter, 2019). An empirical study by Wu et al. 2020 reveals that the transformer's performance on character-level transduction tasks, such as morphological inflection generation, is significantly influenced by batch size, unlike in recurrent models. ...
Conference Paper
Full-text available
This paper presents a methodology for training a transformer-based model to classify lexical and morphosyntactic features of Skolt Sami, an endangered Uralic language characterized by complex morphology. The goal of our approach is to create an effective system for understanding and analyzing Skolt Sami, given the limited data availability and linguistic intricacies inherent to the language. Our end-to-end pipeline includes data extraction, augmentation , and training a transformer-based model capable of predicting inflection classes. The motivation behind this work is to support language preservation and revitalization efforts for minority languages like Skolt Sami. Accurate classification not only helps improve the state of Finite-State Transducers (FSTs) by providing greater lexical coverage but also contributes to systematic linguistic documentation for researchers working with newly discovered words from literature and native speakers. Our model achieves an average weighted F1 score of 1.00 for POS classification and 0.81 for inflection class classification. The trained model and code will be released publicly to facilitate future research in endangered NLP.
... Skolt Sami has received a moderate amount of NLP research interest as a result of Dr Jack Rueter's amazing work on building the fundamental NLP building blocks for Skolt Sami 2 as a result, Skolt Sami has an FST an online dictionary (see Hämäläinen et al. 2021a), a Universal Dependencies treebank (Nivre et al., 2022) and some neural models to identify cognates (Hämäläinen and Rueter, 2019). An empirical study by Wu et al. 2020 reveals that the transformer's performance on character-level transduction tasks, such as morphological inflection generation, is significantly influenced by batch size, unlike in recurrent models. ...
Preprint
Full-text available
This paper presents a methodology for training a transformer-based model to classify lexical and morphosyntactic features of Skolt Sami, an endangered Uralic language characterized by complex morphology. The goal of our approach is to create an effective system for understanding and analyzing Skolt Sami, given the limited data availability and linguistic intricacies inherent to the language. Our end-to-end pipeline includes data extraction, augmentation, and training a transformer-based model capable of predicting inflection classes. The motivation behind this work is to support language preservation and revitalization efforts for minority languages like Skolt Sami. Accurate classification not only helps improve the state of Finite-State Transducers (FSTs) by providing greater lexical coverage but also contributes to systematic linguistic documentation for researchers working with newly discovered words from literature and native speakers. Our model achieves an average weighted F1 score of 1.00 for POS classification and 0.81 for inflection class classification. The trained model and code will be released publicly to facilitate future research in endangered NLP.
... We have also been able to use the neural networks to increase etymological relationships in the Skolt Sami dictionary (Hämäläinen and Reuter, 2019). The method was based on a character level LSTM model that was enhanced with synthetic data generated with a character-level statistical machine translation tool. ...
Conference Paper
Full-text available
We present our work towards building an infrastructure for documenting endangered languages with the focus on Uralic languages in particular. Our infrastructure consists of tools to write dictionaries so that entries are struc-tured in XML format. These dictionaries are the foundation for rule-based NLP tools such as FSTs. We also work actively towards enhancing these dictionaries and tools by using the latest state-of-the-art neural models by generating training data through rules and lexica.
... This paper addresses dictionary-resource development for endangered, under-resourced languages in collaboration with an open-source infrastructure with a rule-based orientation as described in Moshagen et al. (2014). It then outlines advances in XML-MediaWiki synchronization of multilingual dictionaries and enhanced features for etymological and cognate resource work (Hämäläinen & Rueter, 2019) and automatic combination of concepts in multilingual dictionaries (Hämäläinen, Tarvainen & Rueter, 2018). Work with XML introduces a need for a standardized TEI (Text Encoding Initiative) formalism. ...
... One also has to ask whether there are specific ways on how Permyak and Zyrian lexical resources should be connected to each other. This might be solved with cognate searching analogical to what has been used for Northern Sami and Skolt Sami cognates for establishing initial etymological associations (Hämäläinen and Rueter, 2019a). Russian loanwords, although differently adapted are largely shared. ...
Article
Full-text available
There are two main written Komi varieties , Permyak and Zyrian. These are mutually intelligible but derive from different parts of the same Komi dialect continuum, representing the varieties prominent in the vicinity and in the cities of Syktyvkar and Kudymkar, respectively. Hence, they share a vast number of features, as well as the majority of their lexicon, yet the overlap in their dialects is very complex. This paper evaluates the degree of difference in these written varieties based on changes required for computational resources in the description of these languages when adapted from the Komi-Zyrian original. Primarily these changes include the FST architecture, but we are also looking at its application to the Universal Dependencies annotation scheme in the morphologies of the two languages. Дженыта висьталӧм Коми кылын кык гижан кыв: пермяцкӧй да зырянскӧӥ. Öтамӧд коласын нія вежӧртанаӧсь, но аркмисӧ нія разнӧй коми диалекттэзісь. Пермяцкӧй кыв олӧ Кудымкар лапӧлын, а зырянскӧӥ-Сыктывкар ладорын. Пермяцкӧй да зырянскӧй литературнӧй кыввезын эм уна ӧткодьыс, ӧткодьӧн лоӧ и ыджыт тор лексикаын, но ны диалектнӧй чертаэзлӧн пантасьӧмыс ӧддьӧн гардчӧм. Эта статьяын мийӧ видзӧтам эна кык кывлісь ассямасӧ сы ладорсянь, мый ковсяс вежны лӧсьӧтӧм зырянскӧй вычислительнӧй ресурсісь, медбы керны сыись пермяцкӧйӧ. Медодз энӧ вежсьӧммесӧ колӧ керны FST-ын, но мийӧ сідзжӧ видзӧтам, кыдз FST лӧсялӧ Быдкодь Йитсьӧммезлӧн схемаӧ морфология ладорсянь.
... Despite this, researchers have employed neural networks in a low-resource setting by producing synthetic data. For instance, Hämäläinen and Rueter have built a neural network to detect cognates for between two endangered languages [18], Skolt Sami and North Sami. Their approach reached to a better accuracy when they combined data, synthetically produced by a statistical model, with real data. ...
Preprint
Full-text available
Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages as such resources are so scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and universal dependencies. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the universal dependencies and aligned to match the semantic spaces of the big languages; resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing cross-lingual word embeddings. The evaluation conducted shows that our word embeddings for endangered languages are well-aligned with the resource-rich languages, and they are suitable for training task-specific models as demonstrated by our sentiment analysis model which achieved a high accuracy. All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
Thesis
In historical linguistics, cognates are words that descend in direct line from a common ancestor, called their proto-form, andtherefore are representative of their respective languages evolutions through time, as well as of the relations between theselanguages synchronically. As they reflect the phonetic history of the languages they belong to, they allow linguists to betterdetermine all manners of synchronic and diachronic linguistic relations (etymology, phylogeny, sound correspondences).Cognates of related languages tend to be linked through systematic phonetic correspondence patterns, which neuralnetworks could well learn to model, being especially good at learning latent patterns. In this dissertation, we seek tomethodically study the applicability of machine translation inspired neural networks to historical word prediction, relyingon the surface similarity of both tasks. We first create an artificial dataset inspired by the phonetic and phonotactic rules ofRomance languages, which allow us to vary task complexity and data size in a controlled environment, therefore identifyingif and under which conditions neural networks were applicable. We then extend our work to real datasets (after havingupdated an etymological database to gather a correct amount of data), study the transferability of our conclusions toreal data, then the applicability of a number of data augmentation techniques to the task, to try to mitigate low-resourcesituations. We finally investigat in more detail our best models, multilingual neural networks. We first confirm that, onthe surface, they seem to capture language relatedness information and phonetic similarity, confirming prior work. Wethen discover, by probing them, that the information they store is actually more complex: our multilingual models actuallyencode a phonetic language model, and learn enough latent historical information to allow decoders to reconstruct the(unseen) proto-form of the studied languages as well or better than bilingual models trained specifically on the task. Thislatent information is likely the explanation for the success of multilingual methods in the previous works
Article
Full-text available
We present a new approach to inducing a bilingual dictionary between the endangered Erzya and Moksha languages automatically based on existing dictionaries in other languages. This work is, for the most part, complementary to the Mordvin research done by PhD László Keresztes, who has demonstrated the alignment of the two language forms in morphology, lexicon and syntax (1990), and then gone on to contemplate syncretisms of the individual languages (1999). In this paper, we present an automatic data-driven method for deriving new lexicographic links between Erzya and Moksha vocabularies. We strive to describe steps in Mordvin studies involving semantic alignment of the Erzya and Moksha lexicon. We briefly remind ourselves of previous dictionary and vocabulary work with the Mordvin languages, dating back to the 17th century. The semantic alignment of this lexicon is important in measuring linguistic distance between the closely related but distinct literary languages of Erzya and Moksha. This alignment fits into a larger scheme including etymological, morphological and syntactic alignment of the two languages.
Article
Full-text available
Presentamos nuestra infraestructura para la documentación de lenguas urálicas, que consiste en herramientas para redactar diccionarios de tal forma que las entradas sean estructuradas en el formato XML (Extensible Markup Language). Desde los diccionarios en XML podemos generar código para analizadores morfológicos que son útiles para todo tipo de actividades de PLN. En este artículo mostramos las ventajas que una documentación digital y legible por máquina tiene. Describimos, también, el sistema en el contexto de lenguas urálicas amenazadas.
Conference Paper
Full-text available
We present our ongoing development of a synchronized XML-MediaWiki dictionary to solve the problem of XML dictionaries in the context of small Uralic languages. XML is good at representing structured data, but it does not fare well in a situation where multiple users are editing the dictionary simultaneously. Furthermore, XML is overly complicated for non-technical users due to its strict syntax that has to be maintained valid at all times. Our system solves these problems by making a synchronized editing of the same dictionary data possible both in a MediaWiki environment and XML files in an easy fashion. In addition, we describe how the dictionary knowledge in the MediaWiki-based dictionary can be enhanced by an additional Semantic Me-diaWiki layer for more effective searches in the data. In addition, an API access to the lexical information in the dictionary and morphological tools in the form of an open source Python library is presented.
Conference Paper
Full-text available
This paper presents multiple methods for normalizing the most deviant and infrequent historical spellings in a corpus consisting of personal correspondence from the 15th to the 19th century. The methods include machine translation (neural and statistical), edit distance and rule-based FST. Different normalization methods are compared and evaluated. All of the methods have their own strengths in word normalization. This calls for finding ways of combining the results from these methods to leverage their individual strengths.
Article
Full-text available
In this paper we explore the use of unsupervised methods for detecting cognates in multilingual word lists. We use online EM to train sound segment similarity weights for computing similarity between two words. We tested our online systems on geographically spread sixteen different language groups of the world and show that the Online PMI system (Pointwise Mutual Information) outperforms a HMM based system and two linguistically motivated systems: LexStat and ALINE. Our results suggest that a PMI system trained in an online fashion can be used by historical linguists for fast and accurate identification of cognates in not so well-studied language families.
Conference Paper
Full-text available
Words undergo various changes when entering new languages. Based on the assumption that these linguistic changes follow certain rules, we propose a method for automatically detecting pairs of cognates employing an orthographic alignment method which proved relevant for sequence alignment in computational biology. We use aligned subsequences as features for machine learning algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates. Given a list of known cognates, our approach does not require any other linguistic information. However, it can be customized to integrate historical information regarding language evolution.
Article
Timothy Feist: A Grammar of Skolt Saami. Mémoires de la Société Finno-Ougrienne 273. Finno-Ugrian Society. Helsinki 2015. 414 p. https://doi.org/10.33339/fuf.86126 This is an assessment of the merits of the English-language Skolt Sami Grammar written by Timothy Feist with respect to existing scholarship already available in English, Finnish and German. Here the writers use their knowledge in comparative Sami research and finite-state morphological descriptions of the language.
Article
We introduce an open-source toolkit for neural machine translation (NMT) to support research into model architectures, feature representations, and source modalities, while maintaining competitive performance, modularity and reasonable training requirements.
Conference Paper
In this paper, we present phoneme level Siamese convolutional networks for the task of pair-wise cognate identification. We represent a word as a two-dimensional matrix and employ a siamese convolutional network for learning deep representations. We present siamese architectures that jointly learn phoneme level feature representations and language relatedness from raw words for cognate identification. Compared to previous works, we train and test on larger and realistic datasets; and, show that siamese architectures consistently perform better than traditional linear classifier approach.
Conference Paper
We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enables the algorithm to scale to much larger models by using a fixed amount of RAM and variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.