Normalization of Different Swedish Dialects Spoken in Finland
University of Helsinki and Rootroo
University of Helsinki
University of Helsinki and Rootroo
Our study presents a dialect normalization method for different
Finland Swedish dialects covering six regions. We tested five different
models, and the best model reduced the word error rate from
76.45 to 28.58. Contrary to results reported in earlier research on
Finnish dialects, we found that training the model with one word
at a time gave the best results. We believe this is due to the size of the
training data available for the model. Our models are accessible as
a Python package. The study provides important information about
the adaptability of these methods in different contexts, and gives
important baselines for further study.
• Computing methodologies → Natural language processing;
• Applied computing → Arts and humanities.
dialect normalization, regional languages, non-standard language
Swedish is a minority language in Finland and the second official
language of the country. Finland Swedish differs considerably from
the Swedish spoken in Sweden in terms of pronunciation, vocabulary
and some parts of the grammar. The Swedish dialects in Finland also
differ radically from one region to another.
Because of the wide geographical span of the Swedish speaking
communities in Finland and the low population density of the country,
the dialects have not remained similar. Despite its official status,
Finland Swedish has hardly received any research attention within
the natural language processing community.
This paper introduces a method for dialect transcript normalization,
which makes it possible to use existing NLP tools targeted at
normative Swedish on these materials. Previous work on English data
indicates that normalization is a viable way of improving the
accuracy of NLP methods such as POS tagging. This is an important
motivation, as non-standard colloquial Swedish is the language of
communication on a multitude of internet platforms ranging from
social media to forums and blogs. In its linguistic form, colloquial
dialectal Finland Swedish deviates greatly from standard normative
Swedish and from the dialectal variants of the language spoken in
Sweden, a fact that lowers the performance of existing NLP tools for
processing Swedish on such text. Finland Swedish is also a continuous
target of research in non-computational linguistics, among other
fields of research, and better methods to analyze these texts are
also beneficial in these academic domains.
We train several dialect normalization models to cover the dialects
of six different Swedish speaking regions of Finland. In terms of
area, this covers the Åland Islands and the regions along the Baltic
Sea coastline that have the largest Swedish speaking populations.
See Figure 1 for a map that shows the geographical extent. The
dialect normalization models have been made available to everyone
through a Python library called Murre. This is important so that
people both inside and outside of academia can easily use the
normalization models in their data processing pipelines.
Figure 1: Dialects of Swedish in Finland. & Fenn-O-maniC
[CC-BY-SA], via Wikimedia Commons. (https://commons.
2 RELATED WORK
Text normalization has been studied often in the past. The main
areas of application have been historical text normalization, dialect
text normalization, and noisy user generated text normalization.
All these are important domains, with their own distinct challenges.
Most important for our task is dialectal text normalization, but for
the sake of thoroughness we discuss the related work somewhat more
broadly.
arXiv:2012.05318v1 [cs.CL] 9 Dec 2020
GeoHumanities’20, November 3–6, 2020, Seattle, WA, USA Hämäläinen, Partanen & Alnajjar
Bollmann has provided a meta-analysis where contemporary
approaches are divided into five categories: substitution lists like
VARD and Norma, rule-based methods, edit distance based approaches,
statistical methods and, most recently, neural methods.
Statistical machine translation (SMT) approaches have generally
been at the core of the most commonly used statistical methods.
In these methods, normalization is often treated as a regular
translation process by training an SMT model on the character
level. These methods have been used both for historical text [20]
and for contemporary dialect normalization.
Currently, many normalization methods utilize neural machine
translation (NMT) in a way that is comparable to the earlier
SMT based approaches. Bollmann and Søgaard used a bidirectional
long short-term memory (bi-LSTM) deep neural network for character
level normalization of historical German. They also tested the
efficiency of multi-task learning, in which additional data is used
during the training phase. In these experiments, neural network
approaches gave better results than conditional random fields
(CRF) and Norma, whereas multi-task learning provided the best
results.
An approach based on LSTMs and the noisy channel model
(NCM) was tested by Tursun and Cakici to normalize Uyghur
text. They used a small base dataset of 200 sentences, which
were obtained from social networks and normalized. The authors
generated noisy synthetic data by inserting random errors into
online-crawled data. These methods were able to normalize the
text with high accuracy. Mandal and Nanmaran used an LSTM
network to normalize code-mixed data, and achieved an accuracy
of 90.27%, which can be considered successful.
Within historical text normalization, a recent study compared
various LSTM architectures and found that bi-directional
recurrent neural networks (BRNN) were more accurate than
one-directional RNNs. Different attention models or deeper
architectures did not improve the results further. In the same vein,
additional metadata decreased the accuracy. For both historical and
dialectal data such metadata is often available, which makes these
experiments particularly relevant. These findings suggest that, at
the moment, post-processing appears to be the best way to improve
a character level NMT normalization model. In this context, it
should be noted that dialect normalization is in many ways a distinct
task from historical text normalization, although the two share
similarities.
Closely related to the current work, a very effective method
has been proposed for the normalization of dialectal Finnish. The
authors trained a character-based normalization model based on a
bi-directional recurrent neural network. In that study the training
corpus was very large, almost 700,000 normalized word tokens.
3 DATA AND PREPROCESSING
As our dataset, we use the collection of recordings of Finland
Swedish collected between 2005 and 2008. We crawl the version
of this dataset that is hosted online on Finna by the Society of
Swedish Literature in Finland and is CC-BY licensed. This dataset
consists of interviews with Swedish speaking people from different
parts of Finland together with coordinates indicating where the data
was collected. The recordings have been transcribed by hand to a
textual format that follows the pronunciation of the participants.
These transcriptions have also been normalized by hand to standard
Finland Swedish writing. These recordings represent dialects of six
regions of Finland: Åland, Åboland, Nyland, Österbotten, Birkaland
and Kymmenedalen.

dialect        lines   words
Nyland         1903    29314
Åland          1949    14546
Åboland        571     9989
Österbotten    1827    30137
Birkaland      64      886
Kymmenedalen   64      1634
Table 1: Number of lines and words from each region

Table 1 shows how many lines of interviews the dataset has from
each region. One line represents a turn when a participant speaks
and it can consist of a single word or multiple sentences. The word
count shows how many words we have for each region. As we can
see by comparing the line count and the word count, some regions
had participants who spoke considerably more than others.
The data itself is not free of noise, and we take several
preprocessing steps. The first step is manually going through the
entire dataset and finding all characters that are not part of the
Swedish alphabet. This revealed that numbers and the word euro
were consistently normalized by using digits and the euro sign
even though the dialectal transcription had them written out.
For example, nittånhondratrettitvåå was normalized simply as 1932.
We went through all these cases and wrote the numbers out in
standard Swedish, such as nittonhundratrettitvå. This is important
as we want the model to learn to normalize text, not to convert
text into numbers, and any noise in the data would make it harder
for the model to learn the correct normalization.
We lowercased the dataset, removed punctuation, as it was
inconsistently marked between the dialectal transcriptions and
their normalizations, and tokenized the data with NLTK. At
this point, most of the lines in the corpus had an equal number
of words in the transcriptions and their normalizations. However,
1253 lines had a different number of words, as sometimes words are
pronounced together but written separately in standard Swedish,
as in the case of såhäna, which was normalized as sådana här.
As we want to train our models to operate on the token level rather
than normalizing full sentences of varying lengths, we need to map
the dialectal text to its normalization on the token level.
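
The lowercasing and punctuation removal described above can be sketched roughly as follows. This is only an illustration and not the code used in the study: the function names `preprocess_line` and `same_length` are hypothetical, and a plain whitespace split stands in for the NLTK tokenizer.

```python
import unicodedata


def preprocess_line(line: str) -> list[str]:
    """Lowercase a transcript line, drop punctuation and split it into tokens.

    Hypothetical helper: str.split() is an assumed stand-in for NLTK.
    """
    lowered = line.lower()
    # Replace every Unicode punctuation character (category P*) with a space.
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in lowered
    )
    return cleaned.split()


def same_length(dialectal: str, normalized: str) -> bool:
    """True when a line pair can be mapped token by token without an aligner."""
    return len(preprocess_line(dialectal)) == len(preprocess_line(normalized))


print(preprocess_line("Kan jo, nåo!"))       # ['kan', 'jo', 'nåo']
print(same_length("såhäna", "sådana här"))   # False
```

Line pairs for which `same_length` returns False correspond to the cases that need a separate token level mapping.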
For the token level mapping, we train a statistical word aligner
model with a tool called Fast align ng, which is an improved
version of the popular fast align tool. Such tools are commonly
used in machine translation contexts. We train this aligner with all
of our data and use it to align the dialectal and normalized sentences
that do not have an equal number of tokens.
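
As a rough sketch of how aligner output could be turned into token level pairs, the snippet below assumes the `source ||| target` input lines and the `i-j` index pairs that fast align reads and writes. The helper names are hypothetical, and the alignment string in the example is written by hand rather than produced by an aligner.

```python
from collections import defaultdict


def to_aligner_input(source_tokens, target_tokens):
    """Format one sentence pair as a 'source ||| target' line."""
    return " ".join(source_tokens) + " ||| " + " ".join(target_tokens)


def map_tokens(source_tokens, target_tokens, alignment):
    """Group target tokens under the source token they are aligned to.

    `alignment` holds space-separated 'i-j' pairs, where i indexes the
    source tokens and j the target tokens. Unaligned source tokens are
    paired with an empty string.
    """
    grouped = defaultdict(list)
    for pair in alignment.split():
        i, j = (int(index) for index in pair.split("-"))
        grouped[i].append(j)
    mapped = []
    for i, source in enumerate(source_tokens):
        targets = [target_tokens[j] for j in sorted(grouped[i])]
        mapped.append((source, " ".join(targets)))
    return mapped


print(to_aligner_input(["såhäna"], ["sådana", "här"]))    # såhäna ||| sådana här
print(map_tokens(["såhäna"], ["sådana", "här"], "0-0 0-1"))
# [('såhäna', 'sådana här')]
```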
We shuffle the data and split it into training and testing sets. We use
70% of the data for training the models and 30% for evaluation.
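
A minimal sketch of such a shuffled 70/30 split is given below. The function name and the seed value 0 are illustrative assumptions; the seed used for shuffling in the study is not stated.

```python
import random


def train_test_split(pairs, train_fraction=0.7, seed=0):
    """Shuffle aligned (dialectal, normalized) pairs and split them in two.

    The seed is an arbitrary illustrative value that only makes the
    shuffle reproducible.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(10))
print(len(train), len(test))  # 7 3
```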
             dialectal text (source)   normalized text (target)
chunk of 1   huuvuintresse             huvudintressen
chunk of 3   kan_jo_nåo                kan_ju_nog
Table 2: Examples of training data for different models
4 DIALECT NORMALIZATION
We use a character level NMT model, following the encouraging
results achieved with a similar architecture on Finnish dialect data.
The advantage of a character level model is that it can learn to
generalize over dialectal differences better than a word level model,
and it also works for out-of-vocabulary words, as it operates on
characters instead of words. In practice, when the model is trained,
the words are split into characters, and the underscore sign (_) is
used to mark the word boundaries.
We trained different models by varying the length of the input
chunk that was given to the model. We trained separate models to
predict the normalization from one dialectal word at a time, two
words at a time, and so on up to five words at a time (see Table 2
for an example). Providing context is important, as in many
situations the correct normalization of a dialectal word cannot be
predicted in isolation. At the same time, longer chunks may become
harder to learn, and it is therefore important to find the length that
gives the best performance. We use the same random seed for
training all of the models to make them comparable.
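
The chunking can be illustrated with the short sketch below. It makes several assumptions that the text does not confirm: chunks are taken as consecutive, non-overlapping windows, source and target tokens are already aligned one-to-one, and characters are separated by spaces for the NMT toolkit. The function names are hypothetical.

```python
def to_characters(words):
    """Join words with '_' as the boundary marker and space-separate characters."""
    return " ".join("_".join(words))


def make_chunks(source_tokens, target_tokens, chunk_size):
    """Build (source, target) training pairs of `chunk_size` words each.

    Assumes one-to-one aligned token lists and non-overlapping windows;
    the last chunk of a line may be shorter than `chunk_size`.
    """
    if len(source_tokens) != len(target_tokens):
        raise ValueError("token lists must be aligned one-to-one")
    pairs = []
    for start in range(0, len(source_tokens), chunk_size):
        source = source_tokens[start:start + chunk_size]
        target = target_tokens[start:start + chunk_size]
        pairs.append((to_characters(source), to_characters(target)))
    return pairs


print(make_chunks(["kan", "jo", "nåo"], ["kan", "ju", "nog"], 3))
# [('k a n _ j o _ n å o', 'k a n _ j u _ n o g')]
```

With `chunk_size=1` the same line yields three single-word pairs and no underscores.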
For all variations in chunk size, we train a character based
bi-directional LSTM model by using OpenNMT-py with the
default settings, except for the encoder, where we use a BRNN
(bi-directional recurrent neural network) instead of the default
RNN (recurrent neural network), as BRNN has been shown to
provide a performance gain in a variety of tasks. We use the default
of two layers for both the encoder and the decoder and the default
attention model, which is the general global attention presented by
Luong et al. The models are trained for the default of 100,000
steps.
5 RESULTS AND EVALUATION
We report the results of the different models based on the accuracy
and WER (word error rate) of their predictions when compared to
the gold standard in the test set. WER is a commonly used metric
for evaluating systems that deal with text normalization,
and it is derived from Levenshtein edit distance as a better
measurement for calculating word-level errors. It takes into account
the number of deletions D, substitutions S, insertions I and the
number of correct words C, and it is calculated with the following
formula:

$$\mathrm{WER} = \frac{S + D + I}{S + D + C}$$
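
As an illustration of how such a score can be computed, the sketch below counts substitutions (S), deletions (D), insertions (I) and correct words (C) from a word level Levenshtein alignment and divides the errors by the reference length S + D + C. It is not the evaluation script used in the study; the function names are hypothetical, and the result is scaled by 100 on the assumption that the reported scores are percentages.

```python
def edit_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions, correct) for two token lists."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # cost[i][j]: minimal number of edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        cost[i][0] = i
    for j in range(cols):
        cost[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = reference[i - 1] != hypothesis[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # deletion
                cost[i][j - 1] + 1,             # insertion
            )
    # Walk back through the table to count each kind of operation.
    subs = dels = ins = correct = 0
    i, j = len(reference), len(hypothesis)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            mismatch = reference[i - 1] != hypothesis[j - 1]
            if cost[i][j] == cost[i - 1][j - 1] + mismatch:
                if mismatch:
                    subs += 1
                else:
                    correct += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins, correct


def wer(reference, hypothesis):
    """Word error rate as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    subs, dels, ins, correct = edit_counts(reference, hypothesis)
    return 100.0 * (subs + dels + ins) / (subs + dels + correct)


print(wer(["i", "lågstadiet"], ["i", "låstade"]))  # 50.0
print(wer(["a"], ["a", "b", "c"]))                 # 200.0
```

The second call shows that insertions can push the score above 100, since the denominator counts only the words of the reference.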
The results in Table 3 show that the best working model is the one
that takes only one word into account at a time. This is interesting,
as earlier research with Finnish shows that the lowest WER is
achieved by chunks of 3 words. The difference in our results is
probably due to the fact that we had less training data available for
this task; the model therefore worked best in a situation where it did
not need to learn a larger context. Since dialect normalization as
a task is often heavily dependent on the context, it must be assumed
that with enough data the use of larger chunks is beneficial.

model              WER      accuracy
no normalization   76.45    23.5%
chunk of 1         28.58    71.4%
chunk of 2         33.87    66.1%
chunk of 3         93.47    14.3%
chunk of 4         147.24   3.6%
chunk of 5         103.56   4.6%
Table 3: Evaluation results of the different models

3 The seed used is 3435.

When looking at the results of the best performing model, most
of the words look right or have a very minor issue. The most
common mistake the model makes is with ä; for instance, teevlingar
gets normalized into tevlingar, even though the correct spelling is
tävlingar (contests). Interestingly, there is some overlap between
how e and ä are pronounced in Swedish, which means that the
model would need more data to learn this phenomenon, which is not
a part of the phonetics of the language but rather a matter of
spelling convention. Another source of problems is long words;
for example, såmmararbeetare is normalized into sommarbetare
instead of sommararbetare (summer worker). This type of problem
could be solved by introducing a word segmentation model that
would split compounds before normalization.
The model trained with chunks of two words at a time has more
severe problems with long words, as many of them get heavily
truncated; for instance, teevlingar att becomes tällev att instead
of tävlingar att (contests to), and i låågstaadie turns into i låstade
instead of i lågstadiet (in the elementary school). With long words,
this model makes more mistakes, and more severe ones, than the
model trained with one word at a time.
As long words are problematic even for the models of chunks
of 1 and 2, it is not surprising that the models trained with longer
chunks get even more confused as the length of the input increases.
For example, i låågstaadie jåå is normalized as iog och då då by the
chunk of 3 model, och då och by the chunk of 4 model and så var
var by the chunk of 5 model. Needless to say, all of these are very
far from the correct normalization.
Based on previous research, it seemed that having some, but not
too much, context in normalization was beneficial for the model
and improved results. However, based on our study, we conclude that
context should be provided to the model only if there is enough data
to afford it: the more data there is, the longer the sequences that
can be used. With very little data, it is better to ignore the context
and normalize one word at a time, so that the model can learn a better
representation of the normative language. As the model can predict
the top n candidates instead of only the top 1 used in this research,
it might be interesting in the future to see whether contextual
disambiguation of normalization candidates can be left to a language
model trained only on the normative language.
Our study provides a new important baseline for dialect
normalization as a character level machine translation task. We show
that training data that is significantly smaller than previously
used can also give useful results and decrease the word error rate
dramatically. It remains an important question for future research
what exactly the ideal amount and type of training data for dialect
normalization is. The variation in linguistic distance between
the dialects and orthographies must also be one factor that influences
the difficulty of the normalization task. We have not attempted to
evaluate this, but in further work this could be another useful
baseline when we evaluate how well the model can perform under
different conditions.
We have published our versions of the training data openly, and hope
they will play a role in future endeavors to improve the
normalization of Finland Swedish. In the future, attention should
also be paid to the normalization challenges of Swedish dialects
spoken outside of Finland.
REFERENCES
[1] Marilisa Amoia and Jose Manuel Martinez. 2013. Using comparable collections
    of historical texts for building a diachronic dictionary for spelling normalization.
    In Proceedings of the 7th workshop on language technology for cultural heritage,
    social sciences, and humanities. 84–89.
[2] Alistair Baron and Paul Rayson. 2008. VARD2: A tool for dealing with spelling
    variation in historical corpora. In Postgraduate conference in corpus linguistics.
[3] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing
    with Python. O’Reilly Media.
[4] Marcel Bollmann. 2012. (Semi-)Automatic Normalization of Historical Texts Using
    Distance Measures and the Norma Tool. In Proceedings of the Second Workshop on
    Annotation of Corpora for Research in the Humanities (ACRH-2). Lisbon, Portugal.
[5] Marcel Bollmann. 2019. A Large-Scale Comparison of Historical Text
    Normalization Systems. In Proceedings of the 2019 Conference of the North American
    Chapter of the Association for Computational Linguistics: Human Language
    Technologies, Volume 1 (Long and Short Papers). Association for Computational
    Linguistics, Minneapolis, Minnesota, 3885–3898. https://doi.org/10.18653/v1/N19-1389
[6] Marcel Bollmann and Anders Søgaard. 2016. Improving historical spelling
    normalization with bi-directional LSTMs and multi-task learning. In Proceedings
    of COLING 2016, the 26th International Conference on Computational Linguistics:
    Technical Papers. The COLING 2016 Organizing Committee, Osaka, Japan,
[7] Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A Simple, Fast, and
    Effective Reparameterization of IBM Model 2. In Proceedings of the 2013 Conference
    of the North American Chapter of the Association for Computational Linguistics:
    Human Language Technologies. Association for Computational Linguistics, Atlanta,
    Georgia, 644–648. https://www.aclweb.org/anthology/N13-1073
[8] Douwe Gelling and Trevor Cohn. 2014. Simple extensions and POS Tags for a
    reparameterised IBM Model 2. In Proceedings of the 52nd Annual Meeting of the
    Association for Computational Linguistics (Volume 2: Short Papers). Association
    for Computational Linguistics, Baltimore, Maryland, 150–154. https://doi.org/10.
[9] Mika Hämäläinen and Simon Hengchen. 2019. From the Paft to the Fiiture: a
    Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction.
    In Recent Advances in Natural Language Processing. INCOMA, 432–437.
[10] Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, and Eetu Mäkelä.
    2018. Normalizing early English letters to present-day English spelling. In
    Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for
    Cultural Heritage, Social Sciences, Humanities and Literature. 87–96.
[11] Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, and Eetu Mäkelä.
    2019. Revisiting NMT for Normalization of Early English Letters. In Proceedings
    of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural
    Heritage, Social Sciences, Humanities and Literature. Association for Computational
    Linguistics, Minneapolis, USA, 71–75. https://doi.org/10.18653/v1/W19-2509
[12] Andreas W Hauser and Klaus U Schulz. 2007. Unsupervised learning of edit
    distance weights for retrieving historical spelling variations. In Proceedings of
    the First Workshop on Finite-State Techniques and Approximate Search. 1–6.
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural
    computation 9, 8 (1997), 1735–1780.
[14] Ann-Marie Ivars and Lisa Södergård. 2007. Spara det finlandssvenska talet.
    Nordisk dialektologi og sociolingvistik (2007).
[15] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M.
    Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In
    Proc. ACL. https://doi.org/10.18653/v1/P17-4012
[16] Vladimir I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions,
    Insertions, and Reversals. Soviet Physics Doklady 10, 8 (1966), 707–710.
[17] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective
    approaches to attention-based neural machine translation. arXiv preprint
[18] Soumil Mandal and Karthick Nanmaran. 2018. Normalization of Transliterated
    Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance. In
    Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy
    User-generated Text. Association for Computational Linguistics, Brussels, Belgium,
[19] Niko Partanen, Mika Hämäläinen, and Khalid Alnajjar. 2019. Dialect Text
    Normalization to Normative Standard Finnish. In Proceedings of the 5th Workshop on
    Noisy User-generated Text (W-NUT 2019). 141–146.
[20] Eva Pettersson, Beáta Megyesi, and Jörg Tiedemann. 2013. An SMT approach
    to automatic annotation of historical text. In Proceedings of the workshop on
    computational historical linguistics at NODALIDA 2013; May 22-24; 2013; Oslo;
    Norway. NEALT Proceedings Series 18. Linköping University Electronic Press,
[21] Jordi Porta, José-Luis Sancho, and Javier Gómez. 2013. Edit transducers for
    spelling variation in Old Spanish. In Proceedings of the workshop on computational
    historical linguistics at NODALIDA 2013; May 22-24; 2013; Oslo; Norway. NEALT
    Proceedings Series 18. Linköping University Electronic Press, 70–79.
[22] Paul Rayson, Dawn Archer, and Nicholas Smith. 2005. VARD versus WORD: A
    comparison of the UCREL variant detector and modern spellcheckers on English
    historical corpora. Corpus Linguistics 2005 (2005).
[23] Tanja Samardzic, Yves Scherrer, and Elvira Glaser. 2015. Normalising
    orthographic and dialectal variants for the automatic processing of Swiss German. In
    Proceedings of the 7th Language and Technology Conference. ID: unige:82397.
[24] Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural
    networks. IEEE transactions on Signal Processing 45, 11 (1997), 2673–2681.
[25] Osman Tursun and Ruket Cakici. 2017. Noisy Uyghur Text Normalization. In
    Proceedings of the 3rd Workshop on Noisy User-generated Text. Association for
    Computational Linguistics, Copenhagen, Denmark, 85–93. https://doi.org/10.
[26] Rob van der Goot, Barbara Plank, and Malvina Nissim. 2017. To normalize, or
    not to normalize: The impact of normalization on Part-of-Speech tagging. In
    Proceedings of the 3rd Workshop on Noisy User-generated Text. 31–39.