Normalization of Dierent Swedish Dialects Spoken in Finland
Mika Hämäläinen
mika@rootroo.com
University of Helsinki and Rootroo
Helsinki, Finland
Niko Partanen
niko.partanen@helsinki.fi
University of Helsinki
Helsinki, Finland
Khalid Alnajjar
khalid@rootroo.com
University of Helsinki and Rootroo
Helsinki, Finland
ABSTRACT
Our study presents a dialect normalization method for dierent
Finland Swedish dialects covering six regions. We tested 5 dierent
models, and the best model improved the word error rate from
76.45 to 28.58. Contrary to results reported in earlier research on
Finnish dialects, we found that training the model with one word
at a time gave best results. We believe this is due to the size of the
training data available for the model. Our models are accessible as
a Python package. The study provides important information about
the adaptability of these methods in dierent contexts, and gives
important baselines for further study.
CCS CONCEPTS
• Computing methodologies → Natural language processing;
• Applied computing → Arts and humanities.
KEYWORDS
dialect normalization, regional languages, non-standard language
1 INTRODUCTION
Swedish is a minority language in Finland and it is the second official
language of the country. Finland Swedish is very different from
Swedish spoken in Sweden in terms of pronunciation, vocabulary
and some parts of the grammar. The Swedish dialects in Finland also
differ radically from one another and from one region to another.
Because of the wide geographical span of the Swedish speaking
communities in Finland and low population density of the country,
the dialects have not remained similar. Despite its official status,
Finland Swedish has hardly received any research attention within
the natural language processing community.
This paper introduces a method for dialect transcript normalization,
which makes it possible to use existing NLP tools targeted at
normative Swedish on these materials. Previous work conducted on
English data indicates that normalization is a viable way of
improving the accuracy of NLP methods such as POS tagging [26].
This is an important motivation as non-standard colloquial Swedish
is the language of communication on a multitude of internet
platforms ranging from social media to forums and blogs. In its
linguistic form, colloquial dialectal Finland Swedish deviates
greatly from standard normative Swedish and from the dialectal
variants of the language spoken in Sweden, a fact that lowers the
performance of the existing NLP tools for processing Swedish on such
text. Finland Swedish is also a continuous target of research by the
non-computational linguistics community, among other fields of
research, and better methods for analyzing these texts are also
beneficial in these academic domains.
We train several dialect normalization models to cover dialects
of six different Swedish speaking regions of Finland. In terms of
area, this covers the Åland Islands and the regions along the Baltic
Sea coastline that have the largest Swedish speaking population.
See Figure 1 for a map that shows the geographical extent. The
dialect normalization models have been made available for everyone
through a Python library called Murre¹. This is important so that
people both inside and outside of academia can use the normalization
models easily in their data processing pipelines.
Figure 1: Dialects of Swedish in Finland. © Fenn-O-maniC
[CC-BY-SA], via Wikimedia Commons
(https://commons.wikimedia.org/wiki/File:Svenska_dialekter_i_Finland.svg).
2 RELATED WORK
Text normalization has been studied often in the past. The main
areas of application have been historical text normalization, dialect
text normalization, and noisy user generated text normalization.
All these are important domains, with their own distinct challenges.
Most important for our task is dialectal text normalization, but for
the sake of thoroughness we discuss the related work in a somewhat
wider context.
¹ https://github.com/mikahama/murre
arXiv:2012.05318v1 [cs.CL] 9 Dec 2020
GeoHumanities’20, November 3–6, 2020, Seattle, WA, USA. Hämäläinen, Partanen & Alnajjar
Bollmann [5] has provided a meta-analysis where contemporary
approaches are divided into five categories: substitution lists like
VARD [22] and Norma [4], rule-based methods [2, 21], edit distance
based approaches [1, 12], statistical methods and most recently
neural methods.
Statistical machine translation (SMT) approaches have generally
been at the core of the most commonly used statistical methods.
In these methods, the normalization is often assimilated with the
regular translation process by training an SMT model on a character
level. These methods have been used both for historical text
[9, 10, 20] and contemporary dialect normalization [23].
Many recent normalization methods utilize neural machine
translation (NMT) in a way that is comparable to the earlier SMT
based approaches. Bollmann and Søgaard [6] used a bidirectional
long short-term memory (bi-LSTM) deep neural network in a
character level normalization of historical German. They also
tested the efficiency of multi-task learning, where additional data
is used during the training phase. In these experiments neural
network approaches gave better results than conditional random
fields (CRF) and Norma, whereas multi-task learning provided the
best accuracy.
An approach based on LSTMs and the noisy channel model
(NCM) was tested by Tursun and Cakici [25] to normalize Uyghur
text. They used a small base dataset of 200 sentences, which
were obtained from social networks and normalized. The authors
generated noisy synthetic data by inserting random errors into
online-crawled data. These methods were able to normalize the
text with high accuracy. Mandal and Nanmaran [18] used an LSTM
network to normalize code-mixed data, and achieved an accuracy
of 90.27%, which can be considered successful.
Within historical text normalization, a recent study [11] compared
various LSTM architectures, and found that bi-directional
recurrent neural networks (BRNN) were more accurate than
one-directional RNNs. Different attention models or deeper
architectures did not improve the results further. In the same vein,
additional metadata decreased the accuracy. For both historical and
dialectal data such metadata is often available, which makes these
experiments particularly relevant. These findings suggest that at
the moment post-processing appears to be the best way to improve
a character level NMT normalization model. In this context some
attention needs to be paid to the fact that dialect normalization
is in many ways a distinct task from historical text normalization,
although it shares similarities.
Closely related to the current work, a very effective method
has been proposed for normalization of dialectal Finnish [19]. The
authors trained a character-based normalization model based on a
bi-directional recurrent neural network. In this study the training
corpus was very large, almost 700,000 normalized word tokens.
3 DATA AND PREPROCESSING
As our dataset, we use the collection of recordings of Finland
Swedish collected between 2005 and 2008 [14]. We crawl the version
of this dataset that is hosted online on Finna by the Society of
Swedish Literature in Finland, and is CC-BY licensed². This dataset
consists of interviews with Swedish speaking people from different
parts of Finland together with coordinates indicating where the data
was collected from. The recordings have been transcribed by hand to a
textual format that follows the pronunciation of the participants.
These transcriptions have also been normalized by hand to standard
Finland Swedish writing. These recordings represent dialects of six
regions of Finland: Åland, Åboland, Nyland, Österbotten, Birkaland
and Kymmenedalen.

dialect         lines   words
Nyland           1903   29314
Åland            1949   14546
Åboland           571    9989
Österbotten      1827   30137
Birkaland          64     886
Kymmenedalen       64    1634
Table 1: Number of lines and words from each region

² https://sls.finna.fi/Collection/sls.SLS+2098/
Table 1 shows how many lines of interviews the dataset has from
each region. One line represents a turn when a participant speaks
and it can consist of a single word or multiple sentences. The word
count shows how many words we have for each region. As we can
see by comparing the line count and the word count, some regions
had participants who spoke considerably more than others.
The data itself is not free of noise, and we take several
preprocessing steps. The first step is to go through the entire
dataset manually and find all cases of characters that are not part
of the Swedish alphabet. This revealed that numbers and the word euro
were consistently normalized by using numbers and the euro sign
even though the dialectal transcription had them written out.
For example, nittånhondratrettitvåå was normalized simply as 1932.
We went through all these cases and wrote the numbers out in
standard Swedish, such as nittonhundratrettitvå. This is important
as we want the model to be able to normalize text, not to convert
text into numbers, and any noise in the data would make it harder
for the model to learn the correct normalization.
We lowercased the dataset, removed punctuation as it was
inconsistently marked between the dialectal transcriptions and
their normalizations, and tokenized the data with NLTK [3]. At
this point, most of the lines in the corpus had an equal number
of words in the transcriptions and their normalizations. However,
1253 lines had a different number of words, as sometimes words are
pronounced together but written separately in standard Swedish,
such as in the case of såhäna that was normalized as sådana här.
As we want to train our models to operate on token level rather
than normalizing full sentences of varying lengths, we need to map
the dialectal text to its normalization on a token level.
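The lowercasing, punctuation removal and token count comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: a plain whitespace split stands in for the NLTK tokenizer, only ASCII punctuation is stripped, and the function names preprocess and needs_alignment are hypothetical.

```python
from __future__ import annotations

import string

# Assumption: only ASCII punctuation needs to be stripped from the transcripts.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(line: str) -> list[str]:
    """Lowercase a line, drop punctuation and split it into tokens.

    A whitespace split is used here as a stand-in for the NLTK tokenizer.
    """
    return line.lower().translate(_PUNCT_TABLE).split()


def needs_alignment(dialect: str, normalized: str) -> bool:
    """Return True when the two sides have a different number of tokens,
    i.e. when a word aligner is needed to map them token by token."""
    return len(preprocess(dialect)) != len(preprocess(normalized))


if __name__ == "__main__":
    print(preprocess("Kan jo nåo."))
    print(needs_alignment("såhäna", "sådana här"))
```

Lines for which needs_alignment returns False can be paired token by token directly; only the remaining lines have to go through the aligner.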
For the token level mapping, we train a statistical word aligner
model with a tool called Fast align ng [8], which is an improved
version of the popular fast align tool [7]. Such tools are commonly
used in machine translation contexts. We train this aligner with all
of our data and use it to align the dialectal and normalized sentences
that do not have an equal number of tokens.
We shue the data and split it into training and testing. We use
70% of the data for training the models and 30 % for evaluation.
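A shuffle followed by a 70/30 split can be sketched as below. This is only an illustration: the function name shuffle_and_split is hypothetical, and the seed value 0 is an arbitrary assumption that makes the split repeatable (the paper does not state how the shuffle was seeded).

```python
import random


def shuffle_and_split(pairs, train_fraction=0.7, seed=0):
    """Shuffle (dialectal, normalized) pairs and split them into two lists.

    The seed is an illustrative assumption so that repeated calls give the
    same split; it is not a value taken from the paper.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    data = [(f"src{i}", f"tgt{i}") for i in range(10)]
    train, test = shuffle_and_split(data)
    print(len(train), len(test))
```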
Normalization of Dierent Swedish Dialects Spoken in Finland GeoHumanities’20, November 3–6, 2020, Seale, WA, USA
             dialectal text (source)   normalized text (target)
chunk of 1   huuvuintresse             huvudintressen
chunk of 3   kan_jo_nåo                kan_ju_nog
Table 2: Examples of training data for different models
4 DIALECT NORMALIZATION
We use a character level NMT model, following the encouraging
results achieved with a similar architecture on Finnish dialect data
[19]. The advantage of using a character level model is that the
model can better learn to generalize the dialectal differences than a
word level model, and it can also handle out-of-vocabulary words
since it operates on characters instead of words. In practice,
when the model is trained, the words are split into characters, and
the underscore sign (_) is used to mark the word boundaries.
We trained dierent models by varying the length of the input
chunk that was given to the model. We trained separate models to
predict from one dialectal word at a time to their normalization, two
words at a time all the way to ve words at a time (see Table 2 for
an example). Providing context is important, as in many situations
the correct normalization of a dialectal word cannot be predicted in
isolation. At the same time, longer chunks may become harder to
learn [
19
], and thereby it is important to nd the optimal length that
gives the best performance. We use the same random seed
3
when
training all of the models to make their intercomparison possible.
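The way chunks of different lengths are turned into character level training pairs can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the function names to_char_sequence and make_chunks are hypothetical, the chunks are assumed to be non-overlapping (the text does not say whether they overlap), and characters are separated by spaces on the assumption that the NMT toolkit treats whitespace-separated symbols as tokens.

```python
def to_char_sequence(words):
    """Join words with '_' as the word boundary marker and put a space
    between all characters, so that each character becomes one token."""
    return " ".join("_".join(words))


def make_chunks(source_tokens, target_tokens, size):
    """Build (source, target) training pairs from token-aligned lists,
    taking `size` consecutive tokens at a time (non-overlapping, by assumption)."""
    if len(source_tokens) != len(target_tokens):
        raise ValueError("source and target must be aligned token by token")
    pairs = []
    for start in range(0, len(source_tokens), size):
        src = source_tokens[start:start + size]
        tgt = target_tokens[start:start + size]
        pairs.append((to_char_sequence(src), to_char_sequence(tgt)))
    return pairs


if __name__ == "__main__":
    for src, tgt in make_chunks(["kan", "jo", "nåo"], ["kan", "ju", "nog"], 3):
        print(src, "->", tgt)
```

With size set to 1, every pair holds a single word, which corresponds to the one word at a time setting; larger sizes give the model neighbouring words as context at the cost of longer sequences.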
For all variations in chunk size, we train a character based
bi-directional LSTM model [13] by using OpenNMT-py [15] with the
default settings except for the encoder, where we use a BRNN
(bi-directional recurrent neural network) [24] instead of the default
RNN (recurrent neural network), as BRNN has been shown to provide
a performance gain in a variety of tasks. We use the default
of two layers for both the encoder and the decoder and the default
attention model, which is the general global attention presented by
Luong et al. [17]. The models are trained for the default of 100,000
steps.
5 RESULTS AND EVALUATION
We report the results of the different models based on the accuracy
and WER (word error rate) of their predictions when compared to
the gold standard in the test set. WER is a commonly used metric
for evaluating systems that deal with text normalization, and it is
derived from the Levenshtein edit distance [16] as a better
measurement for calculating word-level errors. It takes into account
the number of deletions $D$, substitutions $S$, insertions $I$ and the
number of correct words $C$, and it is calculated with the following
formula:

\[ \mathrm{WER} = \frac{S + D + I}{S + D + C} \tag{1} \]
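A small Python sketch of this metric is given below. It is an illustration rather than the evaluation script used in the study: the function names are hypothetical, and it relies on two observations, namely that the minimal S + D + I is the word level Levenshtein distance and that S + D + C equals the number of words in the reference. Whether the reported scores were averaged per line or computed over the whole test set is not stated, so the sketch handles a single pair of token lists.

```python
def edit_distance(reference, hypothesis):
    """Word level Levenshtein distance, i.e. the minimal S + D + I."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """WER in percent for one reference/hypothesis pair of token lists."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # S + D + C is the length of the reference.
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(wer("i lågstadiet".split(), "i låstade".split()))
    print(wer(["a"], ["b", "c", "d"]))
```

Because insertions are counted in the numerator but not in the denominator, the value can exceed 100 when a model outputs many spurious words, which is consistent with WER values above 100 for the longer chunk sizes in Table 3.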
The results in Table 3 show that the best working model is the one
that takes only one word into account at a time. This is interesting
as earlier research with Finnish shows that the lowest WER is
achieved by chunks of 3 words [19]. The difference in our results is
probably due to the fact that we had less training data available for
this task; therefore the model worked best in a situation where it did
not need to learn a larger context. Since dialect normalization as
a task is often heavily dependent on the context, it must be assumed
that with enough data the use of larger chunks is beneficial.

                      WER   accuracy
no normalization    76.45      23.5%
chunk of 1          28.58      71.4%
chunk of 2          33.87      66.1%
chunk of 3          93.47      14.3%
chunk of 4         147.24       3.6%
chunk of 5         103.56       4.6%
Table 3: Evaluation results of the different models

³ The seed used is 3435
When looking at the results of the best performing model, most
of the words look right or have a very minor issue. The most
common mistake the model makes is with ä; for instance, teevlingar
gets normalized into tevlingar, even though the correct spelling is
tävlingar (contests). Interestingly, there is some overlap between
how e and ä are pronounced in Swedish, which means that the
model would need more data to learn this phenomenon, which is not
a part of the phonetics of the language but rather a matter of
spelling convention. Another source of problems is long words;
for example, såmmararbeetare is normalized into sommarbetare
instead of sommararbetare (summer worker). This type of problem
could be solved by introducing a word segmentation model that
would split compounds before normalization.
The model trained with chunks of two words at a time has more
severe problems with long words as many of them get heavily
truncated; for instance, teevlingar att becomes tällev att instead
of tävlingar att (contests to), and i låågstaadie turns into i låstade
instead of i lågstadiet (in the elementary school). This model makes
more mistakes with long words, and more severe ones, than the model
trained with one word at a time.
As long words are problematic even for the models of chunks
of 1 and 2, it is not surprising that the models trained with longer
chunks get even more confused as the length of the input increases.
For example i låågstaadie jåå is normalized as iog och då då by the
chunk of 3 model, och då och by the chunk of 4 model and så var
var by the chunk of 5 model. Needless to say, all of these are very
wrong.
6 CONCLUSIONS
Based on previous research, it seemed that having some, but not
too much context in normalization was beneficial for the model
and improved results. However, in our study, we can conclude that
context should be provided for the model only if you can afford it.
This means that the more data you have, the longer sequences can be
used. But with very little data, it is better to ignore the context and
normalize one word at a time, so that the model can learn a better
representation of the normative language. As the model can predict
the top n candidates instead of only the top 1 as we did in this
research, in the future it might be interesting to see if contextual
disambiguation of normalization candidates can be left to a language
model trained only on the normative language.
Our study provides a new important baseline for dialect
normalization as a character level machine translation task. We show
that even training data that is significantly smaller than previously
used can give useful results and decrease the word error rate
dramatically. It remains an important question for future research
what exactly is the ideal amount and type of training data for dialect
normalization. Also the variation in linguistic distance between
the dialects and orthographies must be one factor that influences
the difficulty of the normalization task. We have not attempted to
evaluate this, but for further work this could be another useful
baseline when we evaluate how well the model can perform under
different conditions.
We have published our versions of the training data openly on
Zenodo⁴, and hope they will play a role in the future endeavors
in improving the normalization of Finland Swedish. In the future
attention should also be paid to normalization challenges of Swedish
dialects spoken outside of Finland.
REFERENCES
[1] Marilisa Amoia and Jose Manuel Martinez. 2013. Using comparable collections of historical texts for building a diachronic dictionary for spelling normalization. In Proceedings of the 7th workshop on language technology for cultural heritage, social sciences, and humanities. 84–89.
[2] Alistair Baron and Paul Rayson. 2008. VARD2: A tool for dealing with spelling variation in historical corpora. In Postgraduate conference in corpus linguistics.
[3] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media.
[4] Marcel Bollmann. 2012. (Semi-)Automatic Normalization of Historical Texts Using Distance Measures and the Norma Tool. In Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2). Lisbon, Portugal. https://marcel.bollmann.me/pub/acrh12.pdf
[5] Marcel Bollmann. 2019. A Large-Scale Comparison of Historical Text Normalization Systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 3885–3898. https://doi.org/10.18653/v1/N19-1389
[6] Marcel Bollmann and Anders Søgaard. 2016. Improving historical spelling normalization with bi-directional LSTMs and multi-task learning. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, Osaka, Japan, 131–139. https://www.aclweb.org/anthology/C16-1013
[7] Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Atlanta, Georgia, 644–648. https://www.aclweb.org/anthology/N13-1073
[8] Douwe Gelling and Trevor Cohn. 2014. Simple extensions and POS Tags for a reparameterised IBM Model 2. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, 150–154. https://doi.org/10.3115/v1/P14-2025
[9] Mika Hämäläinen and Simon Hengchen. 2019. From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction. In Recent Advances in Natural Language Processing. INCOMA, 432–437.
[10] Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, and Eetu Mäkelä. 2018. Normalizing early English letters to present-day English spelling. In Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. 87–96.
[11] Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, and Eetu Mäkelä. 2019. Revisiting NMT for Normalization of Early English Letters. In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. Association for Computational Linguistics, Minneapolis, USA, 71–75. https://doi.org/10.18653/v1/W19-2509
[12] Andreas W Hauser and Klaus U Schulz. 2007. Unsupervised learning of edit distance weights for retrieving historical spelling variations. In Proceedings of the First Workshop on Finite-State Techniques and Approximate Search. 1–6.
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
⁴ https://zenodo.org/record/4060296
[14] Ann-Marie Ivars and Lisa Södergård. 2007. Spara det finlandssvenska talet. Nordisk dialektologi og sociolingvistik (2007).
[15] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proc. ACL. https://doi.org/10.18653/v1/P17-4012
[16] Vladimir I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics Doklady 10, 8 (1966), 707–710.
[17] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
[18] Soumil Mandal and Karthick Nanmaran. 2018. Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text. Association for Computational Linguistics, Brussels, Belgium, 49–53. https://doi.org/10.18653/v1/W18-6107
[19] Niko Partanen, Mika Hämäläinen, and Khalid Alnajjar. 2019. Dialect Text Normalization to Normative Standard Finnish. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019). 141–146.
[20] Eva Pettersson, Beáta Megyesi, and Jörg Tiedemann. 2013. An SMT approach to automatic annotation of historical text. In Proceedings of the workshop on computational historical linguistics at NODALIDA 2013; May 22-24; 2013; Oslo; Norway. NEALT Proceedings Series 18. Linköping University Electronic Press, 54–69.
[21] Jordi Porta, José-Luis Sancho, and Javier Gómez. 2013. Edit transducers for spelling variation in Old Spanish. In Proceedings of the workshop on computational historical linguistics at NODALIDA 2013; May 22-24; 2013; Oslo; Norway. NEALT Proceedings Series 18. Linköping University Electronic Press, 70–79.
[22] Paul Rayson, Dawn Archer, and Nicholas Smith. 2005. VARD versus WORD: A comparison of the UCREL variant detector and modern spellcheckers on English historical corpora. Corpus Linguistics 2005 (2005).
[23] Tanja Samardzic, Yves Scherrer, and Elvira Glaser. 2015. Normalising orthographic and dialectal variants for the automatic processing of Swiss German. In Proceedings of the 7th Language and Technology Conference. ID: unige:82397.
[24] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE transactions on Signal Processing 45, 11 (1997), 2673–2681.
[25] Osman Tursun and Ruket Cakici. 2017. Noisy Uyghur Text Normalization. In Proceedings of the 3rd Workshop on Noisy User-generated Text. Association for Computational Linguistics, Copenhagen, Denmark, 85–93. https://doi.org/10.18653/v1/W17-4412
[26] Rob van der Goot, Barbara Plank, and Malvina Nissim. 2017. To normalize, or not to normalize: The impact of normalization on Part-of-Speech tagging. In Proceedings of the 3rd Workshop on Noisy User-generated Text. 31–39.