Dialect Text Normalization to Normative Standard Finnish

Niko Partanen
Department of Finnish, Finno-Ugrian and Scandinavian Studies
University of Helsinki

Mika Hämäläinen
Department of Digital Humanities
University of Helsinki

Khalid Alnajjar
Department of Computer Science
University of Helsinki

firstname.lastname@helsinki.fi
Abstract

We compare different LSTM and transformer models in terms of their effectiveness in normalizing dialectal Finnish into normative standard Finnish. As dialect is the common way of communication for people online in Finnish, such normalization is a necessary step for improving the accuracy of the existing Finnish NLP tools that are tailored for normative Finnish text. We work on a corpus consisting of dialectal data from 23 distinct Finnish dialect varieties. The best functioning BRNN approach lowers the initial word error rate of the corpus from 52.89 to 5.73.
1 Introduction
Normalization is one of the possible preprocessing steps that can be applied to various text types in order to increase their compatibility with tools designed for the standard language. This approach can be taken in an essentially similar manner with dialectal texts, historical texts or colloquial written genres, and it can also be a beneficial processing step for many types of spoken language materials.
Our study focuses on the normalization of dialect texts, especially in the format of transcribed dialectal audio recordings published primarily for linguistic research use. However, since the dialectal correspondences in this kind of material are comparable to the phenomena in other texts where dialectal features occur, the results are expected to be generally applicable.
This paper introduces a method for dialect transcript normalization, which makes it possible to use existing NLP tools targeted at normative Finnish on these materials. Previous work conducted on English data indicates that normalization is a viable way of improving the accuracy of NLP methods such as POS tagging (van der Goot et al., 2017). This is an important motivation, as non-standard colloquial Finnish is the de facto language of communication on a multitude of internet platforms ranging from social media to forums and blogs. In its linguistic form, colloquial dialectal Finnish deviates greatly from standard normative Finnish, a fact that lowers the performance of the existing Finnish NLP tools on such text.
2 Related work
Automated normalization has been tackled many times in the past, especially in the case of historical text normalization. A recent meta-analysis on the topic (Bollmann, 2019) divides the contemporary approaches into five categories: substitution lists like VARD (Rayson et al., 2005) and Norma (Bollmann, 2012), rule-based methods (Baron and Rayson, 2008; Porta et al., 2013), edit distance based approaches (Hauser and Schulz, 2007; Amoia and Martinez, 2013), statistical methods and, most recently, neural methods.
Among the statistical methods, the most prominent recent ones have been different statistical machine translation (SMT) based methods. These methods often treat the normalization process as a regular translation process by training an SMT model on a character level. Such methods have been used for historical text (Pettersson et al., 2013; Hämäläinen et al., 2018) and contemporary dialect normalization (Samardzic et al., 2015).
Recently, many normalization methods have utilized neural machine translation (NMT) on a character level, analogously to the previous SMT based approaches, due to its considerable ability in addressing the task. Bollmann and Søgaard (2016) used a bidirectional long short-term memory (bi-LSTM) deep neural network to normalize historical German on a character level. The authors also tested the efficiency of the model when additional auxiliary data is used during the training phase (i.e. multi-task learning). Based on their benchmarks, normalizations using the neural network approach outperformed the ones produced by conditional random fields (CRF) and Norma, and the models trained with the auxiliary data generally had the best accuracy.
Tursun and Cakici (2017) test an LSTM and a noisy channel model (NCM), a method commonly used for spell-checking text, to normalize Uyghur text. In addition to the base dataset (200 sentences obtained from social networks, automatically and manually normalized), the authors generated synthetic data by crawling news websites and introducing noise into it by substituting characters with their corresponding corrupted characters at random. Both of the methods normalized the text with high accuracy, which illustrates their effectiveness. Similarly, Mandal and Nanmaran (2018) employed an LSTM network and successfully normalized code-mixed data with an accuracy of 90.27%.
A recent study on historical English letters (Hämäläinen et al., 2019) compares different LSTM architectures, finding that bi-directional recurrent neural networks (BRNN) work better than one-directional RNNs, whereas different attention models or a deeper architecture do not have a positive effect on the results. Providing additional data such as social metadata or century information also makes the accuracy worse. Their findings suggest that post-processing is the most effective way of improving a character level NMT normalization model. The same method has also been successfully applied to OCR post-correction (Hämäläinen and Hengchen, 2019).
3 Data
Finnish dialect materials have been collected systematically since the late 1950s. These materials are currently stored in the Finnish Dialect Archive within the Institute for the Languages of Finland, and they amount to 24,000 hours in all. The initial goal was to record 30 hours of speech from each pre-war Finnish municipality. This goal was reached in the 1970s, and the work evolved toward making parts of the materials available as published text collections. Another approach, initiated in the 1980s, was to start follow-up recordings in the same municipalities that were the targets of the earlier recording activity.
Later work on these published materials has resulted in multiple electronic corpora that are currently available. Although they represent only a tiny fraction of the entire recorded material, they reach a remarkable coverage of different dialects and varieties of spoken Finnish. Some of these corpora contain various levels of manual annotation, while others are mainly plain text with associated metadata. Materials of this type can be characterized by an explicit attempt to represent dialects in a linguistically accurate manner, having been created primarily by linguists with formal training in the field. These transcriptions are usually written with transcription systems specific to each research tradition. The result of this type of work is not simply a text containing some dialectal features, but a systematic and scientific transcription of the dialectal speech.
The corpus we have used for training and testing is the Samples of Spoken Finnish corpus (Institute for the Languages of Finland, 2014). It is one of the primary traditional Finnish dialect collections, and one that is accompanied by a hand-annotated normalization into standard Finnish. The size of the corpus is 696,376 transcribed words, of which 684,977 have been normalized. The corpus covers 50 municipalities, and each municipality has two dialect samples. The materials were originally published in a series between 1978 and 2000. The goal was to include various dialects systematically and equally in the collection. The modern digital corpus is released under a CC-BY license, and is available with its accompanying materials and documentation in the Language Bank of Finland (http://urn.fi/urn:nbn:fi:lb-201407141).
The data has been tokenized and the normative spellings have been aligned with the dialectal transcriptions on a token level. This makes our normalization task easier, as no preprocessing is required. We randomly shuffle the sentences in the data and split them into training (70% of the sentences), validation (15% of the sentences) and test (15% of the sentences) sets.
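The following is a minimal sketch of this split; the loading step is omitted, and the representation of a sentence as a list of aligned token pairs is an assumption, not the actual corpus format.

```python
import random

# Illustrative 70/15/15 split of the token-aligned sentences. Each sentence is
# assumed to be a list of (dialectal_token, normalized_token) pairs; reading
# the corpus into this structure is left out of the sketch.
def split_corpus(sentences, seed=1234):
    random.seed(seed)
    random.shuffle(sentences)               # random order of the sentences
    n = len(sentences)
    train_end = int(0.70 * n)
    valid_end = train_end + int(0.15 * n)
    train = sentences[:train_end]           # 70% of the sentences
    valid = sentences[train_end:valid_end]  # 15% for validation
    test = sentences[valid_end:]            # remaining ~15% for testing
    return train, valid, test
```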
4 Dialect normalization
Our approach consists of a character level NMT model that learns to translate the dialectal Finnish into the normative spelling. We experiment with two different model types, one being an LSTM based BRNN (bi-directional recurrent neural network) approach, as taken by many in the past, and the other a transformer model, as it has been reported to outperform LSTMs in many other sequence-to-sequence tasks.
For the BRNN model, we use mainly the OpenNMT (Klein et al., 2017) defaults. This means that there are two layers both in the encoder and the decoder, and the attention model is the general global attention presented by Luong et al. (2015). The transformer model is that of Vaswani et al. (2017). Both models are trained for the default 100,000 training steps.
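As a rough illustration of this setup, a BRNN model of this kind could be trained with OpenNMT-py along the following lines. This is a sketch, not the authors' exact command: it assumes an OpenNMT-py 1.x installation that provides the onmt_train entry point, and the data and model paths are placeholders.

```python
import subprocess

# Hedged sketch of training the BRNN model with mostly OpenNMT-py defaults:
# two layers in the encoder and decoder, Luong's general global attention and
# 100,000 training steps. "data/murre" stands for the prefix produced by the
# preprocessing step; both paths are illustrative.
subprocess.run([
    "onmt_train",
    "-data", "data/murre",
    "-save_model", "models/chunk_brnn",
    "-encoder_type", "brnn",         # bi-directional LSTM encoder
    "-layers", "2",                  # two layers in encoder and decoder
    "-global_attention", "general",  # Luong et al. (2015) attention
    "-train_steps", "100000",        # default number of training steps
], check=True)
```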
We experiment with three different ways of training the models. We train one set of models on word level normalization, which means that the source and target consist of single words split into characters separated by white spaces. In order to make the models more aware of the context, we also train a set of models on chunked data. This means that we train the models by feeding in 3 words at a time; the words are split into characters and the word boundaries are indicated with an underscore character (_). Lastly, we train one set of models on a sentence level. In this case the models are trained to normalize full sentences of words split into characters and separated by underscores.
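The sketch below illustrates the chunk-level input format described above; the exact tokenization of the separator, the helper names and the data structure are illustrative assumptions rather than the authors' preprocessing code.

```python
# Illustrative conversion of token-aligned sentences into chunk-level training
# rows: 3 words per row, characters separated by spaces, word boundaries marked
# with an underscore token. The (dialectal, normalized) pair representation is
# an assumption about how the aligned corpus could be stored.
def to_char_seq(words):
    # ["sinn", "ei", "ole"] -> "s i n n _ e i _ o l e"
    return " _ ".join(" ".join(word) for word in words)

def make_chunk_rows(sentence, chunk_size=3):
    """sentence: list of (dialectal_token, normalized_token) pairs."""
    rows = []
    for i in range(0, len(sentence), chunk_size):
        chunk = sentence[i:i + chunk_size]
        src = to_char_seq([dial for dial, _ in chunk])
        tgt = to_char_seq([norm for _, norm in chunk])
        rows.append((src, tgt))
    return rows

# make_chunk_rows([("sinn", "sinne"), ("ei", "ei"), ("ole", "ole")])
# -> [("s i n n _ e i _ o l e", "s i n n e _ e i _ o l e")]
```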
In terms of the size of the training data, the word level data consists of 590k, the chunk level of 208k and the sentence level of 35k parallel rows. All of the models use the same split into training, testing and validation datasets as described earlier. The only difference is in how the data is fed into the models.
5 Results & Evaluation
We evaluate the methods by counting the word error rate (WER) of their output in comparison with the test dataset; we use the WER implementation provided at https://github.com/nsmartinez/WERpp. WER is a commonly used metric for assessing the accuracy of text normalization.
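For clarity, the following is a minimal sketch of the metric itself (word-level Levenshtein distance divided by the reference length); the cited WERpp implementation may differ in its details.

```python
# Illustrative word error rate: the word-level edit distance between a system
# output and the reference normalization, divided by the number of reference
# words. Multiply by 100 to get percentages like those in Table 1.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

# word_error_rate("mennään sinne", "mennäh sinneh") == 1.0 (2 substitutions / 2 words)
```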
Table 1 shows the WERs of the different methods. The initial WER of the non-normalized dialectal text in comparison with the normalized text is shown in the column No normalization. As we can see from this number, the dialectal text is very different from the standardized spelling. Both the word level and chunk level normalization methods achieve a very large drop in WER, meaning that they manage to normalize the text rather well. Out of these, the chunk level BRNN achieves the best results. The performance is the worst in the sentence level models, even to the degree that the transformer model makes the WER higher than the original.
5.1 Error analysis
Table 2 illustrates the general performance of the model, with the erroneous predictions on lines 3 and 19. The example sentence fragments are chosen by the individual features they exhibit, as well as by how well they represent the corpus data.
Since the model accuracy is rather high, errors are not very common in the output. We can also see clearly that the chunk model is able to predict the right form even when the form is reduced to one character, as on line 5. Since the dialectal variants often match standard Finnish, over half of the forms need no changes. The model learns this well. The vast majority of the needed changes are individual insertions, replacements or deletions at the word end, as illustrated in Table 2 at lines 2, 4, 6, 7, 15, 16, 17 and 18. However, word-internal changes are also common, as shown at lines 11 and 12. Some distinct types of common errors can be detected, and they are discussed below.
In some cases the errors are clearly connected to vowel lengthening that does not mark an ordinary phonological contrast. Line 3 shows how the dialectal pronoun variant of he 'he / she', het, is occasionally present in the dialect material as heet, possibly being simply emphasized in a way that surfaces with an unexpected long vowel. This kind of sporadic vowel lengthening is rare, but seems to lead regularly to a wrong prediction, as these processes are highly irregular. This example also illustrates that when the model is presented with a rare or unusual form, it seems to have a tendency to return a prediction that has undergone no changes at all.
The model seems to learn the phonotactics of literary Finnish words relatively well. However, especially with compounds it shows a tendency to place word boundaries incorrectly. A good example of this is ratapölökyntervaauškon 'railroad tie treatment machine', for which the correct analysis would be 'rata#pölkyn#tervaus#kone' (here # is used for illustrative purposes to indicate the word boundaries within the compound), but the model proposes 'rata#pölkyn#terva#uskoinen', which roughly translates as 'railroad tie creosote believer'. The latter variant is semantically quite awkward, but morphologically possible.
        No normalization   Words               Chunks of 3         Sentences
                           BRNN   Transformer  BRNN   Transformer  BRNN    Transformer
WER     52.89              6.44   6.34         5.73   6.1          46.52   53.23

Table 1: The word error rates of the different models in relation to the test set
     source       correct target   prediction
1    joo          joo              joo
2    ettE         että             että
3    heet         he               heet
4    uskovah      uskovat          uskovat
5    n            niin             niin
6    <ettE        että             että
7    sinn         sinne            sinne
8    <ei          ei               ei
9    ole          ole              ole
10   ,            ,                ,
11   kukhaan      kukaan           kukaan
12   ymmärtänny   ymmärtänyt       ymmärtänyt
13   mennä        mennä            mennä
14   .            .                .
15   ArtjärveN    Artjärven        Artjärven
16   kirkolt      kirkolta         kirkolta
17   mennäh       mennään          mennään
18   sinneh       sinne            sinne
19   Hiiteläh     Hiitelään        Hiitelässä

Table 2: Examples from input, output and prediction
This phonotactic accuracy makes the selection of the correct analysis from multiple predicted variants more difficult, as it is not possible to easily distinguish morphologically valid forms from invalid ones. Longer words such as this also have more environments where normalization related changes have to be made, which likely makes their correct prediction increasingly difficult.
In the word level model there are various errors related to morphology that has eroded from the dialectal realizations of the words, or that corresponds to more complicated sequences. Long vowel sequences in standard Finnish often correspond to diphthongs or word-internal -h- characters, and these multiple correspondence patterns may be challenging for the model to learn. The chunk model performs a few percentage points better than the word model in predictions where long vowel sequences are present, which could hint that the model benefits from the wider syntactic window the neighbouring words provide. Line 19 illustrates a case of a wrongly selected spatial case.
There are cases where dialectal wordforms are ambiguous without context, e.g. the standard Finnish adessive (-lla) and allative (-lle) cases are both marked with a single character (-l). Various sandhi phenomena at the word boundary also blur the picture by introducing even more possible interpretations, such as vuoristol laitaa, where the correct underlying form of the first element would be vuoriston 'mountain-GEN'. The decision about the correct form cannot be made with the information provided by single forms in isolation. The chunk level model shows small but consistent improvements in these cases. This is expected, as the word level model simply has no context to make the correct prediction.
It is important to note that since the model is trained on linguistic transcriptions, its performance is also limited to this context. For example, in the transcriptions all numbers, such as years and dates, are always written out as words. Thereby the model has never seen a number, and does not process them either. Improving the model with additional training data that accounts for this phenomenon should, on the other hand, be relatively straightforward. Similarly, the model has had only very limited exposure to upper case characters and to some of the punctuation characters used in ordinary literary language, which should all be taken into account when attempting to use the model with novel datasets.
6 Conclusion & Future work
The normalization method we have proposed reaches a remarkable accuracy on this dialectal transcription dataset of spoken Finnish. The error rate is so low that even if manual normalization were the ultimate target, doing it in combination with our approach would make the work many times faster. We have tested the results on a large enough material that we assume a similar method would work in other settings where the same preconditions are met: a sufficiently large amount of training data and a systematic transcription system used to represent the dialectal speech.
Future work needs to be carried out to evaluate the results on different dialectal Finnish datasets, many of which have been created largely within the activities described earlier, but which are also continuously increasing, as research on Finnish is a very vibrant topic in Finland and elsewhere. This method could also be very efficient in increasing the possibilities for natural language processing of other contemporary spoken Finnish texts. Our method could also easily be used within OCR correction workflows, for example as a step after automatic error correction.
The situation is, to our knowledge, essentially similar in other countries with a comparable history of dialectal text collection. Already within Finnish archives there are large collections of dialectal transcriptions in Swedish, as well as in the endangered Karelian and Sami languages. Applying our method to these resources would also directly improve their usability. However, it has to be kept in mind that our work has been carried out in a situation where the manually annotated training data is exceptionally large. In order to understand how widely applicable our method is in an endangered language setting, it would be important to test further how well the model performs with less data.
The performance with less data is especially crucial for low-resource languages. Many endangered languages around the world have text collections published in the last centuries, which, however, customarily use a linguistic transcription system that deviates systematically from the current standard orthography. Such legacy data can be highly useful in language documentation work and can enrich modern corpora, but there are challenges in the normalization and further processing of this data (Blokland et al., 2019). The approach presented in our paper could be applicable to such data in various language documentation situations, and the recent interest the field has displayed toward language technology creates good conditions for further integration of these methods (Gerstenberger et al., 2016).
We have released the chunk-level BRNN normalization model openly on GitHub as a part of an open-source library called Murre (https://github.com/mikahama/murre). We hope that the normalization models developed in this paper are useful for other researchers dealing with a variety of downstream Finnish NLP tasks.
7 Acknowledgements
Niko Partanen's work has been conducted within the project Language Documentation meets Language Technology: The Next Step in the Description of Komi, funded by the Kone Foundation.
References
Marilisa Amoia and Jose Manuel Martinez. 2013. Using comparable collections of historical texts for building a diachronic dictionary for spelling normalization. In Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 84–89.

Alistair Baron and Paul Rayson. 2008. VARD2: A tool for dealing with spelling variation in historical corpora. In Postgraduate Conference in Corpus Linguistics.

Rogier Blokland, Niko Partanen, Michael Rießler, and Joshua Wilbur. 2019. Using computational approaches to integrate endangered language legacy data into documentation corpora: Past experiences and challenges ahead. In Workshop on Computational Methods for Endangered Languages, Honolulu, Hawai'i, USA, February 26–27, 2019, volume 2, pages 24–30. University of Colorado.

Marcel Bollmann. 2012. (Semi-)automatic normalization of historical texts using distance measures and the Norma tool. In Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2), Lisbon, Portugal.

Marcel Bollmann. 2019. A large-scale comparison of historical text normalization systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3885–3898, Minneapolis, Minnesota. Association for Computational Linguistics.

Marcel Bollmann and Anders Søgaard. 2016. Improving historical spelling normalization with bi-directional LSTMs and multi-task learning. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 131–139, Osaka, Japan. The COLING 2016 Organizing Committee.

Ciprian Gerstenberger, Niko Partanen, Michael Rießler, and Joshua Wilbur. 2016. Utilizing language technology in the documentation of endangered Uralic languages. Northern European Journal of Language Technology, 4:29–47.

Rob van der Goot, Barbara Plank, and Malvina Nissim. 2017. To normalize, or not to normalize: The impact of normalization on Part-of-Speech tagging. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 31–39.

Mika Hämäläinen and Simon Hengchen. 2019. From the Paft to the Fiiture: a fully automatic NMT and Word Embeddings Method for OCR Post-Correction. In Recent Advances in Natural Language Processing, pages 432–437. INCOMA.

Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, and Eetu Mäkelä. 2018. Normalizing early English letters to present-day English spelling. In Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 87–96.

Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, and Eetu Mäkelä. 2019. Revisiting NMT for normalization of early English letters. In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 71–75, Minneapolis, USA. Association for Computational Linguistics.

Andreas W. Hauser and Klaus U. Schulz. 2007. Unsupervised learning of edit distance weights for retrieving historical spelling variations. In Proceedings of the First Workshop on Finite-State Techniques and Approximate Search, pages 1–6.

Institute for the Languages of Finland. 2014. Suomen kielen näytteitä – Samples of Spoken Finnish [online corpus], version 1.0. http://urn.fi/urn:nbn:fi:lb-201407141.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proc. ACL.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Soumil Mandal and Karthick Nanmaran. 2018. Normalization of transliterated words in code-mixed data using Seq2Seq model & Levenshtein distance. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 49–53, Brussels, Belgium. Association for Computational Linguistics.

Eva Pettersson, Beáta Megyesi, and Jörg Tiedemann. 2013. An SMT approach to automatic annotation of historical text. In Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013; May 22-24, 2013; Oslo, Norway. NEALT Proceedings Series 18, pages 54–69. Linköping University Electronic Press.

Jordi Porta, José-Luis Sancho, and Javier Gómez. 2013. Edit transducers for spelling variation in Old Spanish. In Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013; May 22-24, 2013; Oslo, Norway. NEALT Proceedings Series 18, pages 70–79. Linköping University Electronic Press.

Paul Rayson, Dawn Archer, and Nicholas Smith. 2005. VARD versus WORD: a comparison of the UCREL variant detector and modern spellcheckers on English historical corpora. Corpus Linguistics 2005.

Tanja Samardzic, Yves Scherrer, and Elvira Glaser. 2015. Normalising orthographic and dialectal variants for the automatic processing of Swiss German. In Proceedings of the 7th Language and Technology Conference. ID: unige:82397.

Osman Tursun and Ruket Cakici. 2017. Noisy Uyghur text normalization. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 85–93, Copenhagen, Denmark. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.