Leveraging Transformer-Based Models for Predicting Inflection Classes of
Words in an Endangered Sami Language
Khalid Alnajjar
Rootroo Ltd
first@rootroo.com
Mika Hämäläinen
Metropolia University
of Applied Sciences
first.last@metropolia.fi
Jack Rueter
University of Helsinki
first.last@helsinki.fi
Abstract
This paper presents a methodology for training a transformer-based model to classify lexical and morphosyntactic features of Skolt Sami, an endangered Uralic language characterized by complex morphology. The goal of our approach is to create an effective system for understanding and analyzing Skolt Sami, given the limited data availability and linguistic intricacies inherent to the language. Our end-to-end pipeline includes data extraction, augmentation, and training a transformer-based model capable of predicting inflection classes. The motivation behind this work is to support language preservation and revitalization efforts for minority languages like Skolt Sami. Accurate classification not only helps improve the state of Finite-State Transducers (FSTs) by providing greater lexical coverage but also contributes to systematic linguistic documentation for researchers working with newly discovered words from literature and native speakers. Our model achieves an average weighted F1 score of 1.00 for POS classification and 0.81 for inflection class classification. The trained model and code will be released publicly to facilitate future research in NLP for endangered languages.
1 Introduction
Skolt Sami is a minority language in the Uralic
family, spoken primarily in Finland, and is charac-
terized by complex morphosyntactic properties and
rich morphological forms (see Koponen and Rueter
2016). Minority languages like Skolt Sami face sig-
nificant challenges in the field of natural language
processing (NLP) due to their endangered nature,
including a lack of extensive annotated datasets
and linguistic resources. This scarcity compli-
cates the development of computational models
capable of effectively understanding and analyzing
the language. Moreover, the morphology of Skolt
Sami is highly intricate, with numerous inflections
and derivations that present additional challenges
for automated processing (Rueter and Hämäläinen,
2020). Despite these challenges, developing NLP
models for minority languages is essential to pre-
serve linguistic diversity and support language re-
vitalization.
Accurate part-of-speech (POS) and inflection
class classification are fundamental steps in un-
derstanding the grammatical and semantic struc-
ture of a language. Such classifications enable
downstream NLP applications like machine trans-
lation, morphological analysis, and syntactic pars-
ing, which are particularly important for languages
with rich morphology. Additionally, effective clas-
sifiers can assist in improving the current state of
FSTs by providing greater lexical coverage, ulti-
mately enhancing their ability to handle the full
range of morphological variations found in Skolt
Sami. Classifiers can also aid researchers in sys-
tematically documenting new words collected from
literature and native speakers, which is crucial for
tracking linguistic evolution in endangered con-
texts. For Skolt Sami, POS and inflection class
classification can contribute to building digital re-
sources and educational tools, making the language
more accessible to both linguists and speakers.
To address these challenges, we propose a
transformer-based model designed to automate the
analysis of Skolt Sami, specifically for POS and
inflection class classification. Our approach in-
cludes data extraction, preprocessing, augmenta-
tion, model training and evaluation. We employed
advanced transformer architectures to learn the lin-
guistic features of Skolt Sami effectively. Addi-
tionally, we provide both the trained model and
the accompanying code publicly to support future
research efforts on endangered languages¹.
The contributions of this work are as follows:
1. Data Augmentation Using Miniparadigms: We employed data augmentation techniques, including the generation of morphological forms, to mitigate data scarcity and improve model robustness.

2. Transformer-Based Model: We designed a transformer-based model for POS and inflection class classification in Skolt Sami, utilizing shared embedding layers and task-specific output heads.

¹ https://github.com/mokha/predict-inflection-class
2 Related work
Skolt Sami has received a moderate amount of NLP research interest as a result of Dr. Jack Rueter's foundational work on building the fundamental NLP building blocks for the language². As a result, Skolt Sami has an FST (Rueter and Hämäläinen, 2020), an online dictionary (see Hämäläinen et al. 2021a), a Universal Dependencies treebank (Nivre et al., 2022) and neural models to identify cognates (Hämäläinen and Rueter, 2019).
An empirical study by Wu et al. (2020) reveals that
the transformer’s performance on character-level
transduction tasks, such as morphological inflec-
tion generation, is significantly influenced by batch
size, unlike in recurrent models. By optimizing
batch size and introducing feature-guided transduc-
tion techniques, the transformer can outperform
RNN-based models, achieving state-of-the-art re-
sults on tasks such as grapheme-to-phoneme con-
version, transliteration, and morphological inflec-
tion. This study demonstrates that, with appropriate
modifications, transformers are highly effective for
character-level tasks as well.
Recent research (Abudouwaili et al., 2023) has
introduced a joint morphological tagger specifi-
cally designed for low-resource agglutinative lan-
guages. By leveraging multi-dimensional contex-
tual features of agglutinative words and employing
joint training, the proposed model mitigates the
error propagation typically seen in part-of-speech
tagging while enhancing the interaction between
part-of-speech and morphological labels. Further-
more, the model predicts part-of-speech and mor-
phological features separately, using a graph convo-
lution network to capture higher-order label inter-
actions. Experimental results demonstrate that this
approach outperforms existing models, showcasing
its effectiveness in low-resource language settings.
² https://researchportal.helsinki.fi/en/projects/koltansaamen-elvytys-kieliteknologia-avusteisen-kielenoppimisohje
One notable contribution in this area is a
transformer-based inflection system that enhances
the standard transformer architecture by incorpo-
rating reverse positional encoding and type embed-
dings proposed by Yang et al. (2022). To address
data scarcity, the model also leverages data aug-
mentation techniques such as data hallucination
and lemma copying. The training process is con-
ducted in two stages: initial training on augmented
data using standard backpropagation and teacher
forcing, followed by further training with a modi-
fied version of scheduled sampling, termed student
forcing. Experimental results demonstrate that this
system achieves competitive performance across
both small and large data settings, highlighting its
efficacy in diverse morphological inflection tasks.
Recent work (Hämäläinen et al., 2021b) on mor-
phological analysis, generation, and lemmatization
for morphologically rich languages has focused
on training recurrent neural network (RNN)-based
models. A notable contribution in this area is the
development of a method for automatically extract-
ing large amounts of training data from finite-state
transducers (FSTs) for 22 languages, including 17
endangered ones. These neural models are de-
signed to follow the same tagset as the FSTs, ensur-
ing compatibility and allowing the neural models to
serve as fallback systems when used in conjunction
with the FSTs. This approach enhances the acces-
sibility and preservation of endangered languages
by leveraging both neural and rule-based systems.
3 Methodology
3.1 Data Collection and Preparation
Data extraction and preprocessing are particularly
critical when working with an endangered language
like Skolt Sami. This phase involved extracting lin-
guistic data from available resources and transform-
ing it into a structured format suitable for further
processing.
We extracted a total of 28,984 lexemes from Ve'rdd (Alnajjar et al., 2020), an online tool designed for editing and managing dictionaries for endangered languages. Ve'rdd offers a structured and efficient way to curate linguistic resources, making it an invaluable asset for our dataset creation process. The extracted lexemes included diverse entries from the dictionary, which were parsed and transformed into a tabular format for further analysis and training. This structured dataset stored each lexeme along with its POS and contextual lexical information, ensuring consistency and accessibility for subsequent processing.
3.2 Data Cleaning and Filtering
Data cleaning and filtering are crucial in the context of endangered languages to ensure data quality and improve model performance. We filtered the dataset to include only nouns (N) and verbs (V), as these categories were the most frequent and useful for subsequent morphological analysis. These POS categories were selected due to their high occurrence and significance in understanding the linguistic structure of Skolt Sami.
We further filtered lexemes based on specific
patterns using regular expressions, removing non-
standard or infrequent forms to enhance the
model’s ability to generalize to common usage pat-
terns.
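The exact regular expressions used in this filtering pass are not listed in the paper; the following is a minimal Python sketch of what such a pass could look like, assuming a hypothetical rule that keeps only single-token lexemes written entirely in letters. The example entries are illustrative, not taken from the dataset.

```python
import re

# Hypothetical filtering pass: the exact patterns used in the paper are not
# given, so this sketch keeps only single-word lexemes written entirely in
# letters (including Skolt Sami modifier letters) and drops everything else.
LEXEME_PATTERN = re.compile(r"^[^\W\d_]+$")

def keep_lexeme(entry):
    """Return True if a lexeme entry should be kept for training."""
    return entry["pos"] in {"N", "V"} and bool(LEXEME_PATTERN.match(entry["lemma"]))

lexemes = [
    {"lemma": "kueʹll", "pos": "N"},        # kept
    {"lemma": "jieʹlled 2", "pos": "V"},    # dropped: contains a space and a digit
]
filtered = [entry for entry in lexemes if keep_lexeme(entry)]
```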
3.3 Data Augmentation Using Miniparadigms
To mitigate data scarcity, we employed data augmentation using "miniparadigms." For each verb and noun, specific morphological forms (e.g., present tense, singular form, imperative) were generated. We employed UralicNLP (Hämäläinen, 2019) with PyHFST (Alnajjar and Hämäläinen, 2023) as the backend and used the Skolt Sami FST (Rueter and Hämäläinen, 2020) to generate the forms. This approach added multiple derived forms for each lexeme, thereby significantly increasing the size of the dataset. The use of miniparadigms allowed the model to learn morphological variations more effectively, compensating for the limited data available.

Table 1 lists the miniparadigms used for data augmentation. These generated forms helped increase the robustness and generalization capability of the model.
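A minimal sketch of how such miniparadigm generation can be done with UralicNLP and the Skolt Sami transducer. The noun tags are taken from Table 1; the lemma and the surrounding code are illustrative rather than the authors' actual pipeline.

```python
from uralicNLP import uralicApi

# Download the Skolt Sami (ISO 639-3: sms) transducer models once.
# uralicApi.download("sms")

# Miniparadigm tags from Table 1 (nouns shown here; verbs are analogous).
NOUN_FORMS = [
    "N+Sg+Loc", "N+Sg+Ill", "N+Pl+Gen", "N+Sg+Nom", "N+Sg+Gen",
    "N+Sg+Loc+PxSg3", "N+Ess", "N+Der/Dimin+N+Sg+Nom",
    "N+Der/Dimin+N+Sg+Gen", "N+Sg+Ill+PxSg1",
]

def generate_miniparadigm(lemma, tags):
    """Generate surface forms for a lemma with the Skolt Sami FST."""
    forms = {}
    for tag in tags:
        results = uralicApi.generate(f"{lemma}+{tag}", "sms")
        # Each result is a (surface form, weight) pair; keep the surface forms.
        forms[tag] = [r[0] for r in results]
    return forms

# Example with a hypothetical lemma: augmented forms for one noun.
print(generate_miniparadigm("kueʹll", NOUN_FORMS))
```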
3.4 Contlex Cleaning and Filtering
In total, there were 939 unique continuation lexica (Contlex) for nouns (N) and verbs (V). Contlexes are the FST's way of indicating that a word belongs to a certain inflection class. Many of these Contlex labels included additional information, such as V_JOAQTTED_ERRORTH. To standardize the dataset, we removed any additional information following the second underscore (_). This process reduced the number of unique Contlex labels to 514.

However, a large portion of these Contlex categories had very few lexemes. To improve data quality and model robustness, we filtered out any Contlex category that had fewer than 50 lexemes as part of the data cleaning phase. After this filtering, we ended up with 73 Contlex categories: 52 for nouns and 21 for verbs. Table 2 lists the supported Contlexes for each part of speech.
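As a sketch (assuming the dataset is held as simple (lemma, Contlex) pairs, which is an assumption about the data layout), the label normalization and frequency filtering described above can be implemented as follows.

```python
from collections import Counter

def normalize_contlex(label):
    """Keep only the POS prefix and the base Contlex name, e.g.
    'V_JOAQTTED_ERRORTH' -> 'V_JOAQTTED' (drop everything after the
    second underscore)."""
    return "_".join(label.split("_")[:2])

def filter_rare_classes(entries, min_count=50):
    """Drop entries whose normalized Contlex has fewer than min_count lexemes."""
    normalized = [(lemma, normalize_contlex(contlex)) for lemma, contlex in entries]
    counts = Counter(contlex for _, contlex in normalized)
    return [(lemma, contlex) for lemma, contlex in normalized
            if counts[contlex] >= min_count]
```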
3.5 Tokenization
To handle the morphological complexity of Skolt
Sami, we employed Byte-Pair Encoding (BPE)
(Gage, 1994) as a tokenization method. BPE is
particularly effective for morphologically rich lan-
guages as it provides subword tokenization that
allows the model to understand both frequent mor-
phemes and unique words. We trained a BPE
model on the concatenated lexeme and all the form
data generated, using a vocabulary size of 2000
to capture the most relevant subword units for the
language.
This tokenization approach helped the model
deal with highly inflected forms of lexemes by
breaking them into smaller, more manageable units,
allowing for improved learning over the entire lexi-
con. The tokenized output was then integrated back
into the dataset for model training.
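The paper does not name the BPE implementation used; a minimal sketch with the Hugging Face tokenizers library, using the vocabulary size of 2,000 mentioned above and an illustrative toy corpus, could look like this.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Train a subword tokenizer on lemmas plus all generated word forms.
corpus = ["kueʹll", "kueʹlest", "kueʹlid", "mainsted", "mainstam"]  # illustrative

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=2000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer)

# Tokenize one lexeme and its generated forms into subword units for the model.
encoding = tokenizer.encode("kueʹll kueʹlest kueʹlid")
print(encoding.tokens, encoding.ids)
```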
3.6 Label Encoding
The dataset involved categorical features such as
parts of speech and contextual lexical categories,
which needed to be converted into numerical form.
We designed a custom label encoder that used one
encoder for parts of speech and a separate encoder
for each POS-specific lexical category. This hierar-
chical encoding strategy preserved the information
about POS categories while ensuring flexibility for
lexical predictions.
The encoded labels were split into training and
testing sets, ensuring stratified sampling was used
to maintain the distribution of labels, especially
given the limited dataset size.
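A minimal sketch of the hierarchical label encoding and stratified split using scikit-learn. The toy rows, the field layout, and the 50/50 split proportion are assumptions for illustration, not the authors' code.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Illustrative rows (hypothetical data): (concatenated word forms, POS, Contlex).
data = [
    ("kueʹll kueʹl kueʹlid", "N", "N_KUEQLL"),
    ("kueʹllaž kueʹlid", "N", "N_KUEQLL"),
    ("siltteed silttad silttee", "V", "V_SILTTEED"),
    ("mainsted mainstam", "V", "V_SILTTEED"),
]
texts, pos_labels, contlex_labels = map(list, zip(*data))

# Hierarchical encoding: one encoder for POS and a separate Contlex encoder
# per POS, so inflection classes are only ever predicted within their own POS.
pos_encoder = LabelEncoder().fit(pos_labels)
contlex_encoders = {
    pos: LabelEncoder().fit([c for p, c in zip(pos_labels, contlex_labels) if p == pos])
    for pos in pos_encoder.classes_
}

y_pos = pos_encoder.transform(pos_labels)
y_contlex = [contlex_encoders[p].transform([c])[0]
             for p, c in zip(pos_labels, contlex_labels)]

# Stratified split on the Contlex label keeps its distribution similar in the
# training and test sets (the 50/50 split here is only for the toy data).
X_train, X_test, yp_train, yp_test, yc_train, yc_test = train_test_split(
    texts, y_pos, y_contlex, test_size=0.5, stratify=contlex_labels, random_state=42)
```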
3.7 Transformer Model Architecture
We designed a transformer-based (Vaswani et al., 2017)
neural network where we employed a shared em-
bedding layer followed by a transformer encoder
to learn generalized representations for both tasks:
POS prediction and Contlex prediction. The model
architecture involved a sequence of well-justified
choices aimed at optimizing learning while main-
taining simplicity and efficiency.
The input tokens, which were first processed using Byte-Pair Encoding, were then passed through a shared embedding layer.
POS Morphological Forms Generated
V (Verbs) V+Ind+Prs+ConNeg, V+Ind+Prs+Sg3, V+Ind+Prt+Sg1, V+Ind+Prt+Sg3,
V+Inf, V+Ind+Prs+Sg1, V+Pass+PrfPrc, V+Ind+Prs+Pl3, V+Imprt+Sg3, V+Imprt+Pl3
N (Nouns) N+Sg+Loc, N+Sg+Ill, N+Pl+Gen, N+Sg+Nom, N+Sg+Gen, N+Sg+Loc+PxSg3,
N+Ess, N+Der/Dimin+N+Sg+Nom, N+Der/Dimin+N+Sg+Gen, N+Sg+Ill+PxSg1
Table 1: Selected morphological forms used in the data augmentation phase
POS Contlex Supported
N
SAAQMM, SAJOS, MAINSTUMMUSH,
AELDD, CHAAQCC, VUYRR, ALGG,
TUYJJ, CHUOSHKK, CHUAQRVV,
KAADHNEKH, MUORYZH, TAQHTT,
PAPP, JEAQRMM, AANAR, MUORR,
VOONYS, TAALKYS, AUTT, LOAQDD,
BIOLOGIA, PAIQKHKH, KUEQLL,
PIEAQSS, KAQLBB, PLAAN, NEAVVV,
JAEUQRR, PAARR, PESS, JUQVJJ,
PEAELDD, HOQPPI, KUEAQTT,
KUYLAZH, MIYRKK, MEERSAZH,
AACCIKH, TOLL, JEAQNNN, ATOM,
JUURD, PEIQVV, SIJDD, KHEQRJJ,
MIEAQRR, MUEQRJJ, PAAQJJ,
SIYKKK, SHOOMM, OOUMAZH
V
LAUKKOOLLYD, SILTTEED,
TEEQMEED, ILAUKKOOLLYD,
VOQLLJED, KAEQTTED, SOLLEED,
KHIORGGNED, SARNNAD, AALGXTED,
SHORRNED, KUYDHDHDHJED,
KHEEQRJTED, TVOQLLJED, VIIKKYD,
JEAELSTED, CEQPCCED, POOLLYD,
SHKUEAQTTED, TOBDDYD, ROVVYD
Table 2: List of supported Contlex for each POS
This embedding layer learned a consistent representation for all input
data, regardless of the specific task. We opted for
a shared embedding layer to leverage common lin-
guistic features across POS and Contlex prediction
tasks, ensuring that the model’s parameters were
efficiently utilized. By sharing these embeddings,
we aimed to capture general patterns in Skolt Sami
morphology that were common to both POS tag-
ging and inflection class categorization.
The transformer encoder consisted of two en-
coder layers with four attention heads each. This
configuration was chosen to balance the need for
model depth and computational efficiency. The at-
tention mechanism allowed the model to capture
dependencies between tokens effectively, which
is crucial for understanding the morphosyntactic
structure of Skolt Sami. The use of multiple atten-
tion heads enabled the model to focus on different
aspects of token relationships, allowing for a more
nuanced understanding of linguistic features.
At the end of the architecture, we imple-
mented separate output heads for each classifica-
tion task—one for POS classification and one for
Contlex classification. These output heads ensured
that the model optimized separately for each task,
while still sharing the underlying representations
learned through the shared embedding and trans-
former layers. This approach allowed the model to
benefit from multi-task learning, where the training
process for one task could enhance learning for the
other due to shared morphological features.
We applied Xavier uniform initialization (Glorot and Bengio, 2010) to the weights of the embedding and classification layers. This ensures that the variance of the activations stays consistent across layers, which is particularly important in deep networks like transformers to prevent vanishing or exploding gradients during training.
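Putting the pieces together, a PyTorch sketch of such an architecture might look like the following. The mean pooling over subword positions and the padding handling are assumptions not specified in the paper, while the embedding size, feed-forward size, layer and head counts, output head sizes, and Xavier initialization follow the description above.

```python
import torch
import torch.nn as nn

class SkoltSamiClassifier(nn.Module):
    """Shared BPE-token embedding, a transformer encoder and two task-specific
    output heads (POS and Contlex), initialized with Xavier uniform weights."""

    def __init__(self, vocab_size=2000, emb_size=128, hidden_size=512,
                 n_layers=2, n_heads=4, n_pos=2, n_contlex=73, dropout=0.2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size, padding_idx=0)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=emb_size, nhead=n_heads, dim_feedforward=hidden_size,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.pos_head = nn.Linear(emb_size, n_pos)          # POS classification head
        self.contlex_head = nn.Linear(emb_size, n_contlex)  # inflection class head
        # Xavier uniform initialization of embedding and classification weights.
        for module in (self.embedding, self.pos_head, self.contlex_head):
            nn.init.xavier_uniform_(module.weight)

    def forward(self, token_ids, padding_mask=None):
        x = self.encoder(self.embedding(token_ids),
                         src_key_padding_mask=padding_mask)
        pooled = x.mean(dim=1)  # assumption: average over subword positions
        return self.pos_head(pooled), self.contlex_head(pooled)

model = SkoltSamiClassifier()
pos_logits, contlex_logits = model(torch.randint(1, 2000, (4, 16)))
```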
3.8 Training
We employed the following training strategies to
improve the model’s performance and optimize
resource usage. The transformer model was trained
with a consistent set of hyperparameters throughout
the experiments. The embedding size was set to
128, the hidden layer size to 512, and a learning rate
of 0.003 was used. A batch size of 512 ensured that
the training was efficient while reducing overfitting
risk. Hyperparameter optimization was conducted
using grid search to identify the optimal settings
for dropout rates, the number of layers, and the
type of learning rate scheduler.
We employed the AdamW optimizer (Loshchilov and Hutter, 2017) because it combines the benefits of adaptive learning rates with weight decay, which helps in better generalization by decoupling the weight decay from the learning rate schedule. Moreover, we experimented with different schedulers, namely Cosine Annealing, which gradually decreases the learning rate following a cosine curve to allow for fine-tuning near the end of training (Loshchilov and Hutter, 2016); Exponential, which reduces the learning rate by a fixed factor after every epoch for steady decay (Li and Arora, 2020); and ReduceLROnPlateau, which lowers the learning rate when the performance of the model stops improving.
The model was trained for 100 epochs without
early stopping. At epoch 80, the learning rate
scheduler was replaced with SWALR (Stochastic
Weight Averaging Learning Rate) to further refine
the model parameters during the final phase of train-
ing. SWA has been demonstrated to improve model
generalization by allowing the model to converge
to a wider minimum in the loss landscape (Izmailov
et al., 2018). This approach helps reduce overfitting
and often results in better generalization on the test
set, particularly for complex neural architectures
like transformers.
We did not use mixed-precision training; instead,
we kept the precision consistent throughout the
experiments to ensure model stability and repro-
ducibility. During training, checkpoints were pe-
riodically saved based on the validation metrics
to ensure the optimal version of the model was
retained for further evaluation.
The loss function combined cross-entropy losses
from both POS and Contlex output heads, with
adjustable weights for each loss to balance the im-
portance of both tasks. We gave both an equal
weight of 1.0. This multi-task learning approach
allowed the model to leverage shared morphologi-
cal and syntactic information while optimizing for
distinct objectives.
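A condensed sketch of this training strategy in PyTorch. The model class comes from the architecture sketch in Section 3.7, the dummy batches, SWA learning rate, and loader layout are assumptions, and the remaining hyperparameters (learning rate 0.003, batch handling, cosine annealing with T_max=25, 100 epochs, SWA switch at epoch 80, equal task weights) follow this section and Table 3.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.optim.swa_utils import AveragedModel, SWALR
from torch.utils.data import DataLoader, TensorDataset

# Dummy batches for the sketch; in practice these are the BPE-encoded lexeme
# forms with their POS and Contlex labels.
train_loader = DataLoader(
    TensorDataset(torch.randint(1, 2000, (64, 16)),    # token ids
                  torch.randint(0, 2, (64,)),           # POS targets
                  torch.randint(0, 73, (64,))),         # Contlex targets
    batch_size=32, shuffle=True)

model = SkoltSamiClassifier()  # the architecture sketched in Section 3.7
optimizer = AdamW(model.parameters(), lr=0.003)
scheduler = CosineAnnealingLR(optimizer, T_max=25)
swa_model = AveragedModel(model)
swa_scheduler = SWALR(optimizer, swa_lr=0.0005)  # SWA learning rate is an assumption
criterion = nn.CrossEntropyLoss()
pos_weight, contlex_weight = 1.0, 1.0  # equal task weights, as in the paper

for epoch in range(100):
    model.train()
    for token_ids, pos_targets, contlex_targets in train_loader:
        optimizer.zero_grad()
        pos_logits, contlex_logits = model(token_ids)
        # Combined multi-task loss over both output heads.
        loss = (pos_weight * criterion(pos_logits, pos_targets)
                + contlex_weight * criterion(contlex_logits, contlex_targets))
        loss.backward()
        optimizer.step()
    if epoch < 80:
        scheduler.step()
    else:  # final phase: switch to stochastic weight averaging
        swa_model.update_parameters(model)
        swa_scheduler.step()
```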
4 Results
We conducted six different training experiments
to determine the optimal hyperparameter settings
for POS and Contlex classification. The batch size,
embedding size, and hidden layer size were con-
sistent across all experiments, set to 512, 128, and
512, respectively. Table 3 summarizes
the different setups and their corresponding perfor-
mance metrics for both tasks:
The reported results are based on the best-
performing model from these six training exper-
iments.
The proposed transformer-based model, when
trained on the Skolt Sami dataset, performed well
on both POS and Contlex classification tasks. The
best-performing model (Exp 3) achieved an aver-
age weighted F1 score of 1.00 for POS prediction
and 0.81 for Contlex classification. The hierar-
chical label encoding strategy and the use of BPE
tokenization enabled the model to effectively han-
dle data sparsity and morphological richness. The
shared transformer layers provided an efficient way
to learn the underlying linguistic structure, while
the separate output heads allowed for precise clas-
sification for each task.
4.1 POS Classification Results
The POS classification results from the best-performing model (Exp 3) indicate exceptional performance, achieving 100% precision, recall, and F1 score for nouns (N) and verbs (V). Table 4 presents the detailed metrics.
The weighted average metrics for all POS labels
showed perfect scores across all evaluation crite-
ria. Specifically, the precision, recall, F1-score,
and accuracy metrics were all measured at 1.00,
indicating that the model correctly classified ev-
ery instance without any errors for both nouns and
verbs. This level of performance suggests that the
model has successfully learned to distinguish be-
tween the different parts of speech in the dataset
with complete reliability.
4.2 Contlex Classification Results
For Contlex classification, the best model (Exp 3)
performed well overall, although there were no-
table differences in performance across various cat-
egories. The macro-averaged F1 score was 0.84,
indicating that while the model performed well for
many categories, some rare categories were chal-
lenging to predict accurately. Below are notable
results for selected Contlex categories:
N_SAJOS: Precision = 0.82, Recall = 0.83,
F1-Score = 0.82 (Support = 597)
N_MAINSTUMMUSH: Precision = 0.33,
Recall = 0.32, F1-Score = 0.33 (Support =
156)
V_LAUKKOOLLYD: Precision = 0.91, Re-
call = 0.80, F1-Score = 0.85 (Support = 61)
The detailed metrics show that for frequent cat-
egories like
N_SAJOS
, the model performs well,
achieving an F1 score of 0.82. However, for less
frequent categories like
N_MAINSTUMMUSH
, perfor-
mance drops, reflecting challenges in predicting
low-frequency classes.
Experiment ID Scheduler Type Dropout N_layers N_heads POS F1 Contlex F1
Exp 1 CosineAnnealingLR, T_max=25 0.1 2 4 0.93 0.64
Exp 2 CosineAnnealingLR, T_max=25 0.2 3 4 1.00 0.78
Exp 3 CosineAnnealingLR, T_max=25 0.2 3 8 1.00 0.81
Exp 4 ExponentialLR, gamma=0.95 0.2 3 8 0.96 0.75
Exp 5 ReduceLROnPlateau, patience=10 0.2 3 8 0.82 0.37
Exp 6 CosineAnnealingLR, T_max=25 0.2 10 8 0.82 0.35
Table 3: The multiple experiments run with the scheduler and hyperparameters used, along with their results
Label Precision Recall F1-Score Support
N 1.00 1.00 1.00 1520
V 1.00 1.00 1.00 338
Table 4: Classification results for predicting the POS
using the best model
The precision, recall, F1-score, and accuracy for the continuation lexicon classification were all recorded at approximately 0.81, indicating that the model was able to consistently achieve a balanced level of performance across all metrics. This suggests that the model is reliable in its classification for most categories, although there is still room for improvement, particularly in handling rare classes.
These results indicate that data sparsity affects
performance on less frequent labels. The com-
parison across different experiments further high-
lighted the sensitivity of model performance to hy-
perparameter choices, such as the number of trans-
former layers and dropout rates. The results from
experiments 5 and 6, which achieved lower scores,
underscore the importance of carefully tuning these
parameters to avoid underfitting or overfitting. Data
augmentation using miniparadigms helped mitigate
some of these challenges, but further improvements
could be achieved by expanding the dataset or in-
corporating additional contextual features.
4.3 Accuracy per Number of Word Forms
We also evaluated the model's performance by limiting the maximum number of word forms sent to the model for prediction. Figure 1 illustrates how the accuracy of POS and Contlex classification changes with an increasing number of word forms provided to the model. The results showed that both POS and Contlex accuracy improved as the number of word forms increased, eventually reaching a stable high performance. Specifically, POS accuracy started at 0.973 when the maximum number of word forms was 1 (just the lemma), and steadily improved, reaching 0.999 for 14 or more word forms. Similarly, Contlex accuracy improved from 0.365 at 1 word form to 0.69 for 5 word forms and to above 0.81 for 14 or more word forms. This demonstrates that providing more paradigmatic context significantly enhances the model's ability to make accurate predictions.

Figure 1: POS and Contlex accuracy by maximum number of word forms that are sent to the model for prediction
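As a sketch of this evaluation loop (the `predict` helper and the test-item fields are assumed, not the authors' actual code):

```python
def accuracy_by_max_forms(test_set, predict, max_forms_range=range(1, 15)):
    """Re-run prediction while capping the number of word forms given to the
    model, and return POS/Contlex accuracy for each cap."""
    results = {}
    for max_forms in max_forms_range:
        correct_pos = correct_contlex = 0
        for item in test_set:
            # The lemma is always included; additional forms are truncated.
            text = " ".join([item["lemma"]] + item["forms"][: max_forms - 1])
            pos_pred, contlex_pred = predict(text)
            correct_pos += pos_pred == item["pos"]
            correct_contlex += contlex_pred == item["contlex"]
        results[max_forms] = (correct_pos / len(test_set),
                              correct_contlex / len(test_set))
    return results
```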
5 Discussion and Conclusion
In this paper, we presented a transformer-based ap-
proach for predicting parts of speech and inflection
classes (Contlexes) for the Skolt Sami language.
The success of the model highlights the potential
of combining traditional linguistic tools with mod-
ern NLP techniques, particularly for endangered
languages. Our results demonstrate near-perfect
performance for POS classification and reasonably
good performance for most Contlex categories, al-
though predicting rare categories remains challeng-
ing. The results indicate that the use of shared
embeddings and multi-task learning can be effec-
tive in achieving high accuracy for parts of speech,
while data augmentation and careful hyperparame-
ter tuning help in handling the morphological com-
plexities of Skolt Sami.
The observed variability in Contlex classification performance, especially for infrequent categories, highlights the challenges of data sparsity and suggests the need for additional efforts in data collection and augmentation. Frequent categories like N_SAJOS benefited from the availability of more examples, whereas rare categories such as N_MAINSTUMMUSH showed lower performance, primarily due to limited training data. This underscores the necessity for expanding the training dataset to cover more diverse lexical entries and reduce biases towards common categories. Incorporating additional features, such as syntactic or contextual information, could also enhance the model's understanding of rare categories.
The results from limiting the number of words
used for prediction suggest that context plays a cru-
cial role in improving model performance. When
fewer words were provided to the model, both POS
and Contlex accuracy suffered, indicating the im-
portance of sufficient contextual information for
effective classification. The model showed a con-
sistent improvement in both tasks as more words
were added, and the performance eventually stabi-
lized. This demonstrates that using larger contexts
allows the transformer model to better capture the
linguistic intricacies of Skolt Sami, improving the
reliability of its predictions.
Moreover, we believe that expanding the dataset
to include other related Uralic languages could en-
hance model performance through cross-linguistic
transfer learning, benefiting from shared morpho-
logical features. Another promising direction for
future work is the exploration of semi-supervised
or unsupervised learning techniques, which could
leverage unlabeled data to improve classification
performance without relying solely on manually
annotated resources. This is particularly relevant
given the resource constraints typical for endan-
gered languages like Skolt Sami.
In conclusion, the trained model and code will be
released publicly to support future research and ap-
plication in endangered language processing. We
hope that this contribution will aid in the ongo-
ing efforts to preserve and revitalize minority lan-
guages by providing computational tools that can
be used to automate linguistic analysis, document
new lexical entries, and contribute to the develop-
ment of educational and linguistic resources. Fu-
ture research should continue to focus on enriching
the dataset, exploring multi-lingual training, and
employing innovative learning paradigms to further
advance the field of NLP for endangered languages.
References
Gulinigeer Abudouwaili, Kahaerjiang Abiderexiti, Nian
Yi, and Aishan Wumaier. 2023. Joint learning model
for low-resource agglutinative language morpholog-
ical tagging. In Proceedings of the 20th SIGMOR-
PHON workshop on Computational Research in Pho-
netics, Phonology, and Morphology, pages 27–37,
Toronto, Canada. Association for Computational Lin-
guistics.
Khalid Alnajjar and Mika Hämäläinen. 2023. PyHFST: A pure Python implementation of HFST. In Lightning Proceedings of NLP4DH and IWCLUL 2023, pages 32–35.
Khalid Alnajjar, Mika Hämäläinen, Jack Rueter, and Niko Partanen. 2020. Ve'rdd. Narrowing the gap between paper dictionaries, low-resource NLP and community involvement. arXiv preprint arXiv:2012.02578.
Philip Gage. 1994. A new algorithm for data compres-
sion. The C Users Journal, 12(2):23–38.
Xavier Glorot and Yoshua Bengio. 2010. Understand-
ing the difficulty of training deep feedforward neural
networks. In Proceedings of the Thirteenth Interna-
tional Conference on Artificial Intelligence and Statis-
tics, volume 9 of Proceedings of Machine Learning
Research, pages 249–256, Chia Laguna Resort, Sar-
dinia, Italy. PMLR.
Mika Hämäläinen. 2019. UralicNLP: An NLP library for Uralic languages. Journal of Open Source Software, 4(37):1345.
Mika Hämäläinen, Khalid Alnajjar, Jack Rueter, Miika Lehtinen, and Niko Partanen. 2021a. An online tool developed for post-editing the new Skolt Sami dictionary. In Electronic Lexicography in the 21st Century (eLex 2021), pages 653–664. Lexical Computing CZ s.r.o.
Mika Hämäläinen, Niko Partanen, Jack Rueter, and
Khalid Alnajjar. 2021b. Neural morphology dataset
and models for multiple languages, from the large to
the endangered. In Proceedings of the 23rd Nordic
Conference on Computational Linguistics (NoDaL-
iDa), pages 166–177, Reykjavik, Iceland (Online).
Linköping University Electronic Press, Sweden.
Mika Hämäläinen and Jack Rueter. 2019. Finding Sami cognates with a character-based NMT approach. In Workshop on the Use of Computational Methods in the Study of Endangered Languages, page 39.
Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov,
Dmitry Vetrov, and Andrew Gordon Wilson. 2018.
Averaging weights leads to wider optima and better
generalization. arXiv preprint arXiv:1803.05407.
Eino Koponen and Jack Rueter. 2016. The first complete scientific grammar of Skolt Saami in English. Finnisch-Ugrische Forschungen, (63):254–266.
Zhiyuan Li and Sanjeev Arora. 2020. An exponential
learning rate schedule for deep learning. In Interna-
tional Conference on Learning Representations.
Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Ilya Loshchilov and Frank Hutter. 2016. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
Joakim Nivre, Dan Zeman, Jack Rueter, Markus Juutinen, and Mika Hämäläinen. 2022. UD_Skolt_Sami-Giellagas 2.11.
Jack Rueter and Mika Hämäläinen. 2020. FST morphology for the endangered Skolt Sami language. arXiv preprint arXiv:2004.04803.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.
Shijie Wu, Ryan Cotterell, and Mans Hulden. 2020. Applying the transformer to character-level transduction. arXiv preprint arXiv:2005.10213.
Changbing Yang, Ruixin (Ray) Yang, Garrett Nicolai,
and Miikka Silfverberg. 2022. Generalizing mor-
phological inflection systems to unseen lemmas. In
Proceedings of the 19th SIGMORPHON Workshop
on Computational Research in Phonetics, Phonology,
and Morphology, pages 226–235, Seattle, Washing-
ton. Association for Computational Linguistics.