arXiv:2009.02685v1 [cs.CL] 6 Sep 2020
Automatic Dialect Adaptation in Finnish and its Effect on Perceived Creativity
Mika Hämäläinen1
Niko Partanen2
Khalid Alnajjar1
Jack Rueter2
Thierry Poibeau3
1Rootroo Ltd & University of Helsinki, 2University of Helsinki,
3Lab. LATTICE, ENS/PSL & CNRS & Univ. Sorbonne nouvelle, FR
Abstract

We present a novel approach for adapting text written in standard Finnish to different dialects. We experiment with character-level NMT models using both multi-dialectal and transfer learning approaches. The models are tested with over 20 different dialects. The results seem to favor transfer learning, although not strongly over the multi-dialectal approach. We study the influence dialectal adaptation has on the perceived creativity of computer-generated poetry. Our results suggest that the more a dialect deviates from standard Finnish, the lower the scores people tend to give on an existing evaluation metric. However, on a word association test, people associate creativity and originality more with dialect, and fluency more with standard Finnish.

Introduction
We present a novel method for adapting text written in stan-
dard Finnish to different Finnish dialects. The models de-
veloped in this paper have been released in an open-source
Python library1 to boost the limited Finnish NLP resources, and to encourage both replication of the current study and further research on this topic. In addition to the new methodological contribution, we use our models to test the effect
they have on perceived creativity of poems authored by a
computationally creative system.
The Finnish language exhibits numerous differences between colloquial spoken regional varieties and the written standard. This situation is the result of a long historical development. The literary Finnish variety known as Modern Finnish developed into its current form in the late 19th century, after which the changes have been mainly in the details (Häkkinen 1994, 16). Many of the changes have been lexical, due to technical innovations and the modernization of society; orthographic spelling conventions have largely remained the same. Spoken Finnish, on the other hand, traditionally represents an
areally divided dialect continuum, with several sharp bound-
aries, and many regions of gradual differentiation from one
municipality to another municipality.
Especially in the latter part of the 20th century, the spoken varieties have been leveling away from very specific local dialects, and although regional varieties still exist, most of the local varieties have certainly become endangered. Similar processes of dialect convergence have been reported from different regions in Europe, although with substantial variation (Auer 2018). In the case of Finnish this has not, however, resulted in a merging of the written and spoken standards; spoken Finnish has remained, to this day, very distinct from the written standard. In the late 1950s, a program was set up to document extant spoken dialects, with
the goal of recording 30 hours of speech from each municipality. This work resulted in very large collections of dialectal recordings (Lyytikäinen 1984, 448-449). Many of these have been published, and some portion has also been manually normalized. The dataset used is described in more detail in Section Data and Preprocessing.
Finnish orthography is largely phonemic within the language variety used in that representation, although, as discussed above, the relationship to actual spoken Finnish is complicated. The phonemicity of the orthography is still a very important factor here, as the differences between varieties mainly display historically developed differences, not orthographic particularities that would be essentially random from a contemporary point of view. Thus the differences between Finnish dialects, spoken Finnish and standard Finnish are highly systematic and based on historical sound correspondences and sound changes, instead of the more random adoption of historical spelling conventions that is typical of many languages.
Due to the phonemicity of the Finnish writing system,
dialectal differences are also reflected in informal writing.
People speaking a dialect oftentimes also write it as they
would speak it when communicating with friends and fam-
ily members. This is different from English in that, for ex-
ample, although Australians and Americans pronounce the
word today differently, they would still write the word in
the same way. In Finnish, such a dialectal difference would
result in a different written form as well.
We hypothesize that dialect increases the perceived value
of computationally created artefacts. Dialectal text is some-
thing that people are not expecting from a machine as much
as they would expect standard Finnish. The effect dialect has
on results can be revealing of the shortcomings of evaluation
methods used in the field.
Related Work
Text adaptation has received some research attention in the
past. The task consists of adapting or transferring a text to
a new form that follows a certain style or domain. As the
particular task of dialect adaptation has not received a wide
research interest, we dedicate this section to describing different text adaptation systems in a broader sense.
Adaptation of written language to a more spoken lan-
guage style has previously been tackled as a lexical adap-
tation problem (Kaji and Kurohashi 2005). They use style
and topic classification to gather data representing written and spoken language styles; thereafter, they learn the probabilities of lexemes occurring in both categories. This way
they can learn the differences between the spoken and the
written on a lexical level and use this information for style
adaptation. The difference from our approach is that we approach the problem on the character level rather than the lexical level. This makes it possible for our approach to deal with
out-of-vocabulary words and to learn inflectional differences
as well without additional modeling.
Poem translation has been tackled from
the point of view of adaptation as well
(Ghazvininejad, Choi, and Knight 2018). The authors
train a neural model to translate French poetry into English
while making the output adapt to specified rhythm and
rhyme patterns. They use an FSA (finite-state acceptor) to
enforce a desired rhythm and rhyme.
Back-translation is also a viable starting point for style
adaptation (Prabhumoye et al. 2018). They propose a
method consisting of two neural machine translation sys-
tems and style generators. They first translate the English in-
put into French and then back again to English in the hopes
of reducing the characteristics of the initial style. A style
specific bi-LSTM model is then used to adapt the back trans-
lated sentence to a given style based on gender, political ori-
entation and sentiment.
A recent line of work within the paradigm of computa-
tional creativity presents a creative contextual style adapta-
tion in video game dialogs (Hämäläinen and Alnajjar 2019).
They adapt video game dialog to better suit the state of the
video game character. Their approach works in two steps:
first, they use a machine translation model to paraphrase the
syntax of the sentences in the dialog to increase the variety
of the output. After this, they refill the new syntax with the
words from the dialog and adapt some of the content words
with a word embedding model to fit better the domain dic-
tated by the player’s condition.
A recent style adaptation (Li et al. 2019) learns to sepa-
rate stylistic information from content information, so that it
can maximize the preservation of the content while adapting
the text to a new style. They propose an encoder-decoder ar-
chitecture for solving this task and evaluate it on two tasks: sentiment transfer and formality transfer.
Earlier work on Finnish dialect normalization to stan-
dard Finnish has shown that the relationship between spo-
ken Finnish varieties and literary standard language can
be modeled as a character level machine translation task
(Partanen, Hämäläinen, and Alnajjar 2019).
Data and Preprocessing
We use a corpus called Samples of Spoken Finnish
(Institute for the Languages of Finland 2014) for dialect
adaptation. This corpus consists of over 51,000 hand-annotated sentences of dialectal Finnish. These sentences
have been normalized on a word level to standard Finnish.
This provides us with an ideal parallel data set consisting of
dialectal text and their standard Finnish counterparts.
The corpus was designed so that all main dialects and
the transition varieties would be represented. The last di-
alect booklet in the series of 50 items was published in
2000, and the creation process was summarised there by
Rekunen (2000). For each location there is one hour of tran-
scribed text from two different speakers. Almost all speakers were born in the 19th century. Transcriptions are done in
semi-narrow transcription that captures well the dialect spe-
cific particularities, without being phonetically unnecessar-
ily narrow.
The digitally available version of the corpus has a man-
ual normalization for 684,977 tokens. The entire normalized
corpus was used in our experiments.
Dialect Short Sentences
Etelä-Häme EH 1860
Etelä-Karjala EK 813
Etelä-Pohjanmaa EP 2684
Etelä-Satakunta ES 848
Etelä-Savo ESa 1744
Eteläinen Keski-Suomi EKS 2168
Inkerinsuomalaismurteet IS 4035
Kaakkois-Häme KH 8026
Kainuu K 3995
Keski-Karjala KK 1640
Keski-Pohjanmaa KP 900
Länsi-Satakunta LS 1288
Länsi-Uusimaa LU 1171
Länsipohja LP 1026
Läntinen Keski-Suomi LKS 857
Peräpohjola P 1913
Pohjoinen Keski-Suomi PKS 733
Pohjoinen Varsinais-Suomi PVS 3885
Pohjois-Häme PH 859
Pohjois-Karjala PK 4292
Pohjois-Pohjanmaa PP 1801
Pohjois-Satakunta PS 2371
Pohjois-Savo PSa 2344
Table 1: Dialects and the number of sentences in each dialect
in the corpus
Despite the attempts of the authors of the corpus to in-
clude all dialects, the dialects are not equally represented in
the corpus. One reason for this is certainly the different sizes
of the dialect areas, and the variation introduced by different
speech rates of individual speakers. The difference in the
number of sentences per dialect can be seen in Table 1. We
do not consider this uneven distribution to be a problem, as
it is mainly a feature of this dataset, but we have paid at-
tention to these differences in data splitting. In order to get
proportionally even numbers of each dialect in the different
data sets, we split the sentences of each dialect into training (70%), validation (15%) and testing (15%); the split is done after shuffling the data. The same split is used throughout
this paper.
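The per-dialect split described above can be sketched as follows (a minimal illustration; the function name and the fixed seed are our own, not from the released library):

```python
import random

def split_dialect(pairs, seed=0):
    """Shuffle one dialect's sentence pairs, then split them
    into 70% training, 15% validation and 15% testing."""
    rng = random.Random(seed)
    data = list(pairs)
    rng.shuffle(data)
    n_train = int(len(data) * 0.70)
    n_val = int(len(data) * 0.15)
    return (data[:n_train],                 # training
            data[n_train:n_train + n_val],  # validation
            data[n_train + n_val:])         # testing

# Splitting each dialect separately keeps the proportions even
# across dialects of very different sizes.
train, val, test = split_dialect(range(100))
```

Applying the same seeded split to every dialect yields proportionally even train/validation/test sets regardless of how many sentences a dialect has.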
The dialectal data contains non-standard annotations that
are meant to capture phonetic and prosodic features that are
usually not represented in the writing. These include the use
of the acute accent to represent stress, superscripted charac-
ters, IPA characters and others. We go through all characters
in the dialectal sentences that do not occur in the normal-
izations, i.e. all characters that are not part of the Finnish alphabet and ordinary punctuation characters. We remove
all annotations that mark prosodic features as these are not
usually expressed in writing. This is done entirely manually
as sometimes the annotations are additional characters that
can be entirely removed and sometimes the annotations are
added to vowels and consonants, in which case they form
new Unicode characters and need to be replaced with their
non-annotated counterparts.
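The paper does this cleanup manually; a partial automatic sketch of the second case (diacritics fused into new Unicode characters) could use Unicode decomposition, dropping combining marks except those that rebuild legitimate Finnish letters such as ä and ö. The function below is our own illustration, not the procedure used in the paper:

```python
import unicodedata

def strip_prosodic_marks(text):
    """Remove combining diacritics that mark prosody, keeping ä, ö, å.

    NFD splits e.g. a stress-accented vowel into its base letter plus
    combining marks; we drop combining marks except the diaeresis
    (U+0308) and ring (U+030A) that form Finnish letters."""
    keep = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if ch in ("\u0308", "\u030a") and keep and keep[-1] in "aoAO":
                keep.append(ch)
            continue
        keep.append(ch)
    return unicodedata.normalize("NFC", "".join(keep))
```

For example, an acute accent marking stress on ä would be stripped while the ä itself survives; fully manual inspection remains necessary for annotation characters that carry segmental rather than prosodic information.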
Automatic Dialect Adaptation
In order to adapt text written in standard Finnish
to dialects, we train several different models on the
data set. As a character level sequence-to-sequence
neural machine translation (NMT) approach has been
proven successful in the past for the opposite prob-
lem of normalization of dialectal or historical language
variant to the standard language (see (Bollmann 2019;
Hämäläinen et al. 2019; Veliz, De Clercq, and Hoste 2019;
Hämäläinen and Hengchen 2019)), we approach the problem from a similar character-based methodology. The advantage of character-level models over word-level models is
their adaptability to out of vocabulary words; a requirement
which needs to be satisfied for our experiments to be successful. In practice, this means splitting the words into characters separated by white-spaces and marking word boundaries with a special character, which is an underscore (_) in our case.
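This character-level formatting, and its inverse, can be sketched as (helper names are our own):

```python
def to_char_sequence(sentence):
    """Split words into space-separated characters and mark
    word boundaries with an underscore token."""
    return " _ ".join(" ".join(word) for word in sentence.split())

def from_char_sequence(seq):
    """Invert the character-level formatting back to normal text."""
    return "".join(" " if tok == "_" else tok for tok in seq.split())

to_char_sequence("minä kun näin")
# "m i n ä _ k u n _ n ä i n"
```

The inverse function is what turns the model's character-level output back into ordinary words after prediction.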
In NMT, language flags have been used in the past to train
multi-lingual models (Johnson et al. 2017). The idea is that
the model can benefit from the information in multiple languages when predicting the translation for a particular language, as expressed by a language-specific flag given to the system. We train one model with all the dialect data, appending a dialect flag to the source side. The model will then learn to use the flag when adapting standard Finnish text to the desired dialect.
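Adding the flag is a simple prefix on the source side only (a sketch; the exact token format in our released models may differ):

```python
def add_dialect_flag(source_seq, dialect):
    """Prepend a dialect flag token to a character-level source
    sequence; the target side is left untouched."""
    return f"{dialect} {source_seq}"

add_dialect_flag("m i n ä _ k u n _ n ä i n", "Inkerinsuomalaismurteet")
# "Inkerinsuomalaismurteet m i n ä _ k u n _ n ä i n"
```

At inference time, changing this single token is all that is needed to steer the multi-dialectal model towards a different dialect.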
Additionally, we train one model without any flags or di-
alectal cues. This model is trained to predict from standard
Finnish to dialectal text (without any specification in terms
of the dialect). This model serves two purposes: firstly, if it
performs poorly on individual dialects, it means that there is
a considerable distance between each dialect so that a single
model that adapts text to a generic dialect cannot sufficiently
capture all of the dialects. Secondly, this model is used as a
starting point for dialect specific transfer learning.
We use the generic model without flags for training dialect-specific models. We do this by freezing the first layer of the encoder; as the encoder only sees standard Finnish, it does not require any further training. Then we train the dialect-specific models from the generic model by continuing the training with only the training and validation data specific to a given dialect. We train each dialect-specific model in the described transfer learning fashion for an additional 20,000 steps.
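Since OpenNMT-py models are PyTorch modules, freezing amounts to setting requires_grad to False on the first encoder layer's parameters before continuing training. The sketch below demonstrates the selection logic on a stub parameter set; the Param class is a stand-in, and the parameter names mimic PyTorch's RNN naming convention (weight_ih_l0 etc.), so the real OpenNMT internals may differ:

```python
class Param:
    """Minimal stand-in for a framework parameter object."""
    def __init__(self):
        self.requires_grad = True

def freeze_first_encoder_layer(named_params):
    """Freeze layer-0 parameters; PyTorch names them with an
    '_l0' suffix (plus '_l0_reverse' for bidirectional RNNs)."""
    for name, param in named_params:
        if name.endswith("_l0") or "_l0_" in name:
            param.requires_grad = False

# Stub parameter dict mimicking a two-layer RNN encoder.
params = {n: Param() for n in
          ["weight_ih_l0", "weight_hh_l0", "bias_ih_l0", "bias_hh_l0",
           "weight_ih_l1", "weight_hh_l1", "bias_ih_l1", "bias_hh_l1"]}
freeze_first_encoder_layer(params.items())
```

Only the layer-1 (and decoder) parameters then receive gradient updates during the 20,000 dialect-specific steps.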
Our models are recurrent neural networks. The ar-
chitecture consists of two encoding layers and two de-
coding layers and the general global attention model
(Luong, Pham, and Manning 2015). We train the models
by using the OpenNMT Python package (Klein et al. 2017)
with otherwise the default settings. The model with flags
and the generic model are trained for 100,000 steps. We
train the models by providing chunks of three words at a time, as opposed to training on one word or a whole sentence at a time, as a chunk of three words has been suggested to be more effective in a character-level text normalization task (Partanen, Hämäläinen, and Alnajjar 2019).
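The chunking can be sketched as follows (our own helper, shown on plain words before the character-level formatting is applied):

```python
def three_word_chunks(sentence):
    """Split a sentence into consecutive chunks of at most three words."""
    words = sentence.split()
    return [" ".join(words[i:i + 3]) for i in range(0, len(words), 3)]

three_word_chunks("himo on palo se syttyy herkästi taas")
# ["himo on palo", "se syttyy herkästi", "taas"]
```

Each chunk is then converted to a character sequence and used as one training example, giving the model local context without the length of a full sentence.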
Table 2 shows an example of the sequences used for train-
ing. The model receiving the dialect flag has the name of the
dialect appended to the beginning of the source data, whereas the generic model has no additional information apart from the character sequences. The dialect-specific transfer learning models are also trained without an additional flag; rather, exposure solely to the dialect-specific data is considered sufficient for the model to better learn the desired dialect.
Results and Evaluation
In this section, we present the results of the dialect adapta-
tion models on different dialects. We use the commonly used word error rate (WER) metric and compare the dialect adaptations of the test sets of each dialect to the gold standard. WER is calculated for each sentence by using the
following formula:
WER = (S + D + I) / (S + D + C)
WER is derived from Levenshtein edit distance
(Levenshtein 1966) as a better measurement for calcu-
lating word-level errors. It takes into account the number of
deletions D, substitutions S, insertions I and the number of
correct words C.
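WER is thus a word-level Levenshtein distance normalized by the reference length (S + D + C). A minimal dynamic-programming sketch (the tables in this paper report the same quantity scaled to percentages):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance from the hypothesis
    to the reference, divided by the reference length (assumed
    non-empty)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

For instance, wer("olen vanha", "olev vanha") gives 0.5: one substitution over a two-word reference.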
The results are shown in Tables 3 and 4. On the vertical
axis are the models. Flags represents the results of the model
that was trained with initial tokens indicating the desired dialect the text should be adapted to. No flags is the model
trained without any dialectal information, and the rest of the
models are dialect specific transfer learning models trained
on the no flags model.
The results are to be interpreted as the lower the better,
i.e. the lower the WER, the closer the output is to the gold
dialect data in a given dialect. These results indicate that the no-flag model does not achieve the best results for any of the dialects, which is to be expected: if it reached good results, that would indicate that the dialects do not differ from
Source | Target
Flags: Inkerinsuomalaismurteet m i n ä _ k u n _ n ä i n | m i e _ k o _ n ä i n
No flags: m i n ä _ k u n _ n ä i n | m i e _ k o _ n ä i n
Table 2: Example of the training data. The sentence reads "when I saw" in English.
      EH    EK    EP    ES    ESa   EKS   IS    KH    K     KK    KP    LS
Flags 24.37 19.8 25.13 28.09 27.22 25.19 21.09 28.73 25.56 24.59 22.51 30.49
No flags 38.87 36.21 41.98 42.16 37.71 37.35 39.38 39.03 37.05 42.43 39.08 42.3
EH 24.21 43.6 37.64 35.77 46.83 42.98 51.51 41.05 42.38 53.26 38.95 37.53
EK 48.65 19.28 52.63 47.57 35.69 39.94 31.86 42.97 47.14 33.13 49.76 45.51
EP 38.8 50.37 24.9 42.3 49.2 46.3 54.47 46.39 44.71 55.68 39.21 44.24
ES 34.36 44.81 41.49 29.03 49.35 47.8 50.05 45.56 47.74 51.16 38.02 37.12
ESa 46.06 32.28 49.5 50.38 26.81 32.43 42.01 44.26 38.4 40.32 45.88 47.9
EKS 44.3 37.3 47.06 51.05 34.15 25.07 45.56 42.97 36.5 42.84 42.65 47.86
IS 52.09 28.4 55.13 49.53 41.52 44.57 19.69 41.13 50.24 29.14 52.26 46.65
KH 43.98 38.34 47.75 47.66 45.46 43.23 41.16 28.43 47.88 44.36 47.9 45.76
K 42.59 45.05 45.11 50.11 39.79 35.97 50.56 48.17 25.56 49.34 40.89 49.63
KK 54.1 30 55.59 51.52 40.52 43.12 29.21 43.65 49.74 24.87 53.52 50.21
KP 35.58 43.94 38.58 40.2 44.54 41.53 51.03 45.84 39.26 52.04 22.51 44.32
LS 36.05 39.56 42.77 35.73 46.21 45.34 47.7 43.4 46.73 48.19 40.76 29.71
LU 38.45 45.07 44.24 39.17 51.68 51.03 47.35 41.04 51.14 49.54 46.74 38.97
LP 40.58 44.55 42.07 41.94 46.1 44.94 49.32 46.35 44.42 50.71 35 44.57
LKS 33.25 40.03 37.48 39.88 39.42 35.24 49.09 42.59 33.99 49.82 32.79 42.64
P 39.05 44.38 40.83 42.72 45.09 42.25 50.06 46.11 41.1 51.14 35.12 44.34
PKS 45.73 43.03 48.96 51.9 36.41 33.39 48.55 47.2 33.37 46.73 43.46 52.63
PVS 50.34 41.51 52.91 44.13 50.96 53.29 44.48 46.03 55.99 46.38 53.35 43.09
PH 31.26 44.72 38.38 37.56 44.61 39.4 52.07 42.19 38.51 52.73 35.43 40.82
PK 44.14 44.33 47.18 50.83 36.98 37.08 46.76 46.09 33.51 46.5 42.58 51.05
PP 34.73 44.38 37.87 41.85 43.24 39.46 52 45.12 36.91 52.84 27.12 43.25
PS 28.42 46.29 35.51 36.62 46.96 42.46 53.15 41.84 42.31 53.84 36.63 38.6
PSa 43.12 40.86 47.81 49.71 34.74 33.12 46.47 44.95 32.01 45.44 45.28 51.15
Table 3: WER for the different models for dialects from Etelä-Häme to Länsi-Satakunta
each other. Interestingly, we can observe that the transfer learning method gives the best scores for almost all the dialects, except for Etelä-Satakunta (ES), Keski-Karjala (KK), Keski-Pohjanmaa (KP), Pohjois-Karjala (PK) and Pohjois-Satakunta (PS), for which the model with flags gives the best results. Both methods are equally good for Pohjois-Häme (PH). All in all, the difference in WER between the two methods is rather small. An example of the dialectal adaptation can be seen in Table 5.
Based on these results it is difficult to recommend one method over the other, as both of them are capable of reaching the best results on different dialects. On the practical side, the model with dialectal flags trains faster and requires less computational resources, as the model is trained only once and works for all the dialects immediately, whereas transfer learning has to be done for each dialect individually after training a generic model.
Evaluation of the models with and without dialectal flags
shows that especially in word forms that are highly diver-
gent in the dialect, it is almost impossible for the model to
predict the correct result that is in the test set. This does not mean that the model's output would necessarily be entirely incorrect, as the result may still be a perfectly valid dialectal representation; it is just in a different variety.
There are also numerous examples of features that vary even within one dialect. In these cases the model may produce a form different from that in the specific row of the test set. These kinds of problems are particularly prominent in examples where the dialectal transcription contains prosodic phenomena at the word boundary level. Since the model starts the prediction from standard Finnish input, it cannot have any knowledge about specific prosodic features of the individual examples in the test data. Some phonological features, such as the assimilation of nasals, seem to be over-generalized by the model, and in this case it would also be impossible for the model to predict the instances where such phenomena do not take place due to particularly careful pronunciation.
Another interesting feature of the model is that it seems
to be able to generalize its predictions into unseen words,
as long as they exhibit morphology common for the training
data. There are, however, instances of clearly contemporary
word types, such as recent international loans, that have gen-
eral shape and phonotactics that are entirely absent from the
training data. The problems caused by this are somewhat mitigated by the fact that in many cases the standard Finnish word can be left intact, and it will pass within the dialectal text relatively well.
A consequence of this is that the scores reported here are possibly slightly worse than the model's true abilities.
The resulting dialectal text can still be very accurate and
closely approximate the actual dialect, although the prediction would slightly differ from the test instances. At the same time, if the model slips some literary Finnish forms into the predicted text, the result is still perfectly understandable; in real use, too, the dialects would rarely be used in complete isolation from the standard language.
It must also be taken into account that only either a native dialect speaker or an advanced specialist in Finnish dialectology can reliably detect minute disfluencies in dialectal predictions, especially when the error is introduced by a form of another dialect. Similarly, it would be very uncommon to have such knowledge about all the Finnish dialects the model operates on. After this careful examination of the
      LU    LP    LKS   P     PKS   PVS   PH    PK    PP    PS    PSa
Flags 27.87 20.02 21.89 27.53 28.73 32.4 20.03 27.15 21.51 21.56 27.6
No flags 43.49 37.1 35.06 38.35 40.54 49.19 34.9 36.44 35.12 38.54 37.86
EH 39.9 35.63 33.65 39.92 48.42 51.05 27.61 41.9 32.54 27.46 43.91
EK 50.59 45.08 46.23 46.75 46.03 47.19 45.62 43.82 45.84 51.85 43.21
EP 47.04 37.78 39.13 41.56 52.06 56.16 33.28 44.64 34.32 35.23 46.35
ES 43.01 36.6 40.35 42.12 52.27 47.34 34.46 46.08 37.09 35.13 48.23
ESa 53.26 40.5 38.89 43.85 37.36 54.04 39.56 35.68 40.02 46.65 35.55
EKS 52.05 40.5 35.72 41.94 36.11 55.63 38.27 37.34 38.33 42.99 35.35
IS 48.72 47.29 49.67 48.39 49.59 45.74 49.89 46.45 48.51 54.29 46.18
KH 44.26 44.17 43.45 46.91 49.09 49.42 42.14 45.54 44.09 43.86 44.09
K 52.71 39.03 34.47 39.75 35.07 58.24 35.52 33.57 35.43 41.77 33.67
KK 51.83 48.19 50.94 49.37 49.41 48.7 50.84 45.66 50.27 55.14 45.86
KP 48.5 27.67 34.21 35.92 43.07 56.07 30.26 40.42 25.17 35.28 42.05
LS 42.57 36.9 39.88 42.58 51.71 47.31 34.74 46.31 38.71 36.28 47.7
LU 25.9 43.04 44.97 45.66 54.87 43.76 40.63 49.41 43.76 40.15 50.93
LP 49.41 19.57 38.2 32.23 47.75 55.87 35.04 43.57 33.96 37.92 45.85
LKS 45.39 33.2 21.41 34.97 40.88 55.13 26.9 36.47 28.22 31.57 36.37
P 47.29 28 35.54 26.81 46.23 56.06 33.45 40.98 32.7 37.82 43.27
PKS 55.67 41.86 39.87 42.62 28.65 57.68 39.28 35.8 38.08 46.01 33.67
PVS 46.42 49.26 54.99 52.31 57.36 31.67 52.13 52.69 51.34 53.44 53.39
PH 44.15 33.2 31.47 36.83 44.44 55.22 20.03 38.14 30.76 28.94 40.5
PK 53.77 41.01 38.61 42.59 38.34 57.18 37.99 27.28 37.87 46.02 34.98
PP 48.43 30.77 32.43 34.85 43.11 57.03 28.9 38.09 21.04 34.2 39.94
PS 42.17 35.18 32.03 38.9 47.14 54.2 26.49 41.34 31.54 22 42.98
PSa 52.19 42.13 36.28 42.45 35.29 56.52 38.29 32.8 37.67 43.96 27.24
Table 4: WER for the different models for dialects from Länsi-Uusimaa to Pohjois-Savo
models, we proceed to the generation of dialectal poems and
their further evaluation by native Finnish speakers.
Effect on Perceived Creativity
In this section, we apply the dialect adaptation trained in
the earlier sections to text written in standard Finnish. We
are interested in seeing what the effect of the automati-
cally adapted dialect is on computer generated text. We use
an existing Finnish poem generator (Hämäläinen 2018) that produces standard Finnish (SF) text, as it relies heavily on hand-defined syntactic structures that are filled with lemmatized words inflected with a normative Finnish morphological generator, using a tool called Syntax Maker (Hämäläinen and Rueter 2018). We use this generator to
generate 10 different poems.
The poems generated by the system are then adapted to
dialects with the models we elaborated in this paper. As
the number of different dialects is extensive and conducting a human questionnaire with such a myriad of dialects is not feasible, we limit our study to three dialects. We pick the Etelä-Karjala (EK) and Inkerinsuomalaismurteet (IS) dialects because they are the best performing ones in terms of WER, and the Pohjoinen Varsinais-Suomi (PVS) dialect as it is the worst performing in terms of WER. For this study, we use the dialect-specific models tuned with transfer learning.
A qualitative look at the predictions revealed that the dialectal models have a tendency of over-generating when a word chunk has fewer than three words. The models tend to predict one or two additional words in such cases; however, if the chunk contains three words, the models neither over- nor under-generate words. Fortunately, this is easy to overcome by ensuring that only as many dialectal words are considered from the prediction as there were in the chunk written in standard Finnish. For instance, olen vanha (I am old) gets predicted in IS as olev vanha a. The first two words are correctly adapted to the dialect, while the third word a is an invention by the model. However, the models do not systematically predict too many words, as seen in the adaptation of pieni ? (small?) to pien ?. For this reason, we only consider as many words as in the original chunks when doing the dialectal adaptation.
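The trimming described above can be sketched as (our own helper, shown on plain words):

```python
def trim_prediction(source_chunk, predicted_chunk):
    """Keep only as many predicted words as the standard-Finnish
    source chunk contains, discarding over-generated extras."""
    n = len(source_chunk.split())
    return " ".join(predicted_chunk.split()[:n])

trim_prediction("olen vanha", "olev vanha a")
# "olev vanha"
```

When the model does not over-generate, the prediction passes through unchanged, so the fix is safe to apply to every chunk.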
Replicating the Poem Generator Evaluation
In our first experiment, we replicate the poem generator
evaluation that was used to evaluate the Finnish poem generator used in this experiment. We are interested in seeing whether dialectal adaptation has an effect on the evaluation results of the creative system. They evaluated their system based on the evaluation questions initially elaborated in a study on an earlier Finnish poem generator (Toivanen et al. 2012). The first evaluation question is a binary one: Is the text a poem? The rest of the evaluation questions are asked on a 5-point Likert scale:
1. How typical is the text as a poem?
2. How understandable is it?
3. How good is the language?
4. Does the text evoke mental images?
5. Does the text evoke emotions?
6. How much do you like the text?
The subjects are not told that they are to read poetry nor
that they are reading fully computer-generated and dialectally adapted text. We conduct dialectal adaptation of the 10 generated poems to the three different dialects; this means that there are altogether four variants of each poem: one in standard Finnish and three in dialects. We produce the questionnaires automatically in such a fashion that each questionnaire has the 10 different poems shuffled in a random order each time. The variant of each poem is picked randomly, so that each questionnaire has a randomly picked variant of each poem. Every questionnaire contains poems from all of the different variant types, but none of them contains the same poem more than once. Each questionnaire
SF:
himo on palo,
se syttyy herkästi
taas intona se kokoaa
milloin into on eloisa?
näemmekö me,
ennen kuin into jää pois?
mikäli innot pysyisivät,
sinä huomaisit innon
minä alan maksamaan innon
olenko liiallinen?

EK:
himo om palos,
se syttyy herkäst
taas intonna se kokovaa
millo into on elosa?
ammekö met,
enne ku into jää pois?
mikäli innot pysysiit,
sie huomasit inno
mie alan maksamaa inno
olenko siialli?

PVS:
himo om palo,
se sytty herkästi
taas inton se kokko
millon innoo on elosa?
amekö me,
ennen ku into jää pois?
mikäl innop pysysivät,
siä huamasit inno
a ala maksaman inno
olenko liialline?

IS:
himo om palloo,
se syttyy herkäst
toas inton se kokohoa
millon into on eloisa?
neämmäks meä,
ennen ku into jää pois?
mikält innot pysysiit,
sie huomaisit inno
mie ala maksamaa inno
olenko liialine?

Translation:
desire is a fire,
it gets easily ignited
again, as an ardor it shall rise
when is ardor vivacious?
will we see
before ardor disappears?
if ardors stayed,
you would notice the ardor
I will start paying for the ardor
Am I extravagant?
Table 5: An example poem generated in standard Finnish and its dialectal adaptations to three different dialects
is unique in the order and combination of the variants. We
introduce all this randomness to reduce constant bias that
might otherwise be present if the poem variants were always
presented in the same order.
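The questionnaire assembly can be sketched as follows (a simplified illustration of the randomization with our own names; it does not enforce the cross-questionnaire balance of five evaluations per variant):

```python
import random

VARIANTS = ["SF", "EK", "PVS", "IS"]

def make_questionnaire(poem_ids, rng):
    """One questionnaire: all poems in a random order, each shown in
    one variant, every variant type present, no poem repeated."""
    order = list(poem_ids)
    rng.shuffle(order)
    # Repeat the variant list to cover all poems, then shuffle, so
    # that each variant type appears at least once (for 10 poems).
    variants = (VARIANTS * (len(order) // len(VARIANTS) + 1))[:len(order)]
    rng.shuffle(variants)
    return list(zip(order, variants))

q = make_questionnaire(range(10), random.Random(1))
```

Shuffling both the poem order and the variant assignment per questionnaire is what removes the constant ordering bias described above.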
We print out the questionnaires and recruit people native
in Finnish on the university campus. We recruit 20 people to evaluate the questionnaires, each of which consists of 10
poems. This means that each variant of a poem is evaluated
by five different people.
Table 6 shows the results from this experiment; note, however, that some evaluators did not complete the task for all poems in their pile2. Interestingly, the results drop on all the parameters when the poems are adapted into the different dialects in question. The best performing dialect in the experiment was the Etelä-Karjala dialect, and the worst performing one was the Pohjoinen Varsinais-Suomi dialect, although it got the exact same average scores as Inkerinsuomalaismurteet on the last three questions. These results are not to be interpreted as meaning that dialectal poems would always get worse results, as we only used a handful of dialects from the possibilities. However, the results indicate an interesting finding: something as superficial as a dialect can affect the results. It is to be noted that the dialectal adaptation only alters the words to be more dialectal; it does not substitute the words with new ones, nor does it alter their order.
In order to better understand why the dialects were ranked in this order, we compare the dialectal poems to the standard Finnish poems automatically by calculating WER. These WERs should not be understood as "error rates", since we are not comparing the dialects to a gold standard, but rather to the standard Finnish poems. The idea is that the higher the WER, the more they differ from the standard. Table 7 shows the results of this experiment. The results seem to be in line with the human evaluation results; the further away the dialect is from the standard Finnish, the lower it scores in the human evaluation. This is a potential indication of familiarity bias: people tend to prefer the more familiar language.
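As a rough sketch, such a word-level distance can be computed as the Levenshtein edit distance over words, normalized by the number of words in the standard Finnish reference. The function below is a minimal illustration; our actual computation may differ in tokenization details:

```python
def word_error_rate(standard, dialectal):
    """Word-level Levenshtein distance between a standard Finnish
    text and its dialectal adaptation, expressed as a percentage of
    the number of words in the standard text."""
    ref, hyp = standard.split(), dialectal.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return 100 * dp[len(ref)][len(hyp)] / len(ref)
```

An identical poem pair yields 0.0, and a pair in which every word has been altered by the dialectal adaptation yields 100.0.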
Word Association Test
In the second experiment, we are interested in seeing how people associate words when they are presented with a standard Finnish version and a dialectally adapted variant of the same poem. The two poems are presented on the same page, labeled as A and B. The order is randomized again, which means that both the order of the poems in the questionnaire and whether the dialectal one is A or B is randomized. This is done, again, to reduce bias in the results that might be caused by always maintaining the same order. The concepts we study are the following:
²The data is based on 47 observations for SF, 46 for EK, 43 for PVS and 49 for IS out of the maximum of 50.
The subjects are asked to associate each concept with A or B, one of which is the dialectal and the other the standard Finnish version of the same poem. We use the same dialects as before, but which dialect gets used is not controlled in this experiment. We divide each questionnaire of 10 poems into piles of two to reduce the workload on each annotator, as each poem is presented in two different variant forms. This way, we recruit altogether 10 different people for this task, again native speakers from the university campus. Each poem with a dialectal variant gets annotated by five different people.
Table 8 shows the results of this experiment. Some of the people did not answer all questions for some poems. This is reflected in the no answer column. The results indicate that the standard Finnish variant poems were considered considerably more fluent than the dialectal poems, and slightly more emotive and artificial. The dialectal poems were considered considerably more original and creative, and slightly more poem-like.
It is interesting that while dialectal poems can get clearly better results on some parameters in this experiment, they still scored lower on all the parameters in the first experiment. This potentially highlights a more general problem with evaluation in the field of computational creativity, as results are heavily dependent on the metric that happened to be chosen. The problems arising from this "ad hoc" evaluation practice are also discussed by Lamb, Brown, and Clarke (2018).
            Typical        Understandable  Language       Mental images  Emotions       Liking
Poem  %     M    Mo  Me    M    Mo   Me    M    Mo  Me    M    Mo  Me    M    Mo  Me    M    Mo  Me
SF    87.2% 2.85 4   3     3.62 4    4     3.51 4   4     3.57 4   4     2.94 2   3     3.02 4   3
EK    82.6% 2.5  2   2     3    4    3     2.87 3   3     3.26 4   3     2.67 2   2     2.70 2   3
IS    77.6% 2.69 2   3     2.90 3,4  3     2.78 2   3     3.27 4   3     2.86 2   3     2.61 3   3
PVS   77.0% 2.51 2   2     2.80 2    3     2.58 2   3     3.27 4   3     2.86 2   3     2.61 3   3

Table 6: Results from the first human evaluation. Mean (M), mode (Mo) and median (Me) are reported for the questions on the Likert scale
       EK     IS     PVS
WER    34.38  43.41  54.69

Table 7: The distance of the dialectal poems from the original poem written in standard Finnish
SF Dialect No answer
emotive 48% 46% 6%
original 40% 60% 0%
creative 32% 64% 4%
poem-like 46% 50% 4%
artificial 50% 46% 4%
fluent 74% 24% 2%
Table 8: Results of the second experiment with human annotators
We have presented our work on automatic dialect adaptation using a character-level NMT approach. Based on our automatic evaluation, both the transfer learning method and a multi-dialectal model with flags can achieve the best results on different dialects. The transfer learning method, however, receives the highest scores on most of the dialects. Nevertheless, the difference in the WERs of the two methods is generally small; therefore it is not possible to clearly recommend one over the other for different character-level data sets. If the decision is based on the computational power used, then the multi-dialectal model with flags should be used, as it only needs to be trained once and it can handle all the dialects.
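The flag mechanism follows the multilingual NMT approach of Johnson et al. (2017): a token indicating the target dialect is prepended to the character-split source sequence. The following is a minimal sketch of such preprocessing; the tag names and the word-boundary separator are illustrative, not necessarily the ones used in our data:

```python
def to_char_input(text, dialect_tag):
    """Split the input into space-separated characters, mark word
    boundaries with an underscore token, and prepend a target-dialect
    flag token in the spirit of Johnson et al. (2017)."""
    chars = " _ ".join(" ".join(word) for word in text.split())
    return f"{dialect_tag} {chars}"
```

For example, `to_char_input("kala ui", "<EK>")` produces `"<EK> k a l a _ u i"`, which a single multi-dialectal model can then translate into the dialect named by the flag.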
The dialect adaptation models elaborated in this paper have been made publicly available as an open-source Python library³. This not only makes the replication of the results easier but also makes it possible to apply these unique Finnish NLP tools in other related research or tasks outside of academia as well.
Our study shows that automatic dialect adaptation has a clear impact on how different attributes of the text are perceived. In the first experiment, which was based on existing evaluation questions, a negative impact was found, as the scores dropped on all the metrics in comparison to the original standard Finnish poem. Furthermore, when inspecting the distance the dialects have from the standard Finnish, we noticed that the further away the dialect is from the standard, the lower it scores.
We believe that the low scores might be an indication of familiarity bias, which means that people have a tendency to prefer things they are more familiar with. This is especially likely since the evaluation was conducted in a region in Finland with a high number of migrants from different parts of the country. This leads to a situation where the most familiar language variety for everyone, regardless of their dialectal background, is the standard Finnish variety. Also, as the dialectal data used in our model originates from Finnish speakers born in the 19th century, it remains possible that the poems were transformed into a variety not entirely familiar to the individuals who participated in our survey. In future research, it is necessary to investigate the perceptions of wider demographics, taking into account larger areal representation.
Based on our results, it is too early to generalize that familiarity bias is a problem in the evaluation of computationally creative systems. However, it is an important aspect to take into consideration in future research. We are interested in testing this particular bias in the future in a more controlled fashion. Nevertheless, the fact that a variable such as dialect, which is never controlled in computational creativity evaluations, has a clear effect on the evaluation results raises a real question about the validity of such evaluation methods. As abstract questions on a 5-point Likert scale are a commonly used evaluation methodology, narrowing down the unexpected variables, such as dialect, that affect the evaluation results positively or negatively is vital for progress in the field in terms of the comparability of results from different systems.
Even though our initial hypothesis that dialects increase the perceived value of computationally created artefacts was proven wrong by the first experiment, the second experiment showed that dialects can indeed have a positive effect on the results as well, in terms of perceived creativity and originality. This finding is also troublesome from the point of view of computational creativity evaluation in a larger context. Our dialect adaptation system is by no means designed to exhibit any creative behavior of its own, yet people are more prone to associating the concept of creativity with dialectally adapted poetry.
The results of the first and second experiment give a very different picture of the impact dialect adaptation has on perceived creativity. This calls for more thorough research on the effect different evaluation practices have on the results of a creative system. Is the difference in results fully attributable to subjectivity in the task, to what was asked, or to how it was asked? Does making people pick between two variants (dialectal and standard Finnish in our case) introduce a bias not present when people rate the poems individually? It is important that these questions be systematically addressed in future research.
Thierry Poibeau is partly supported by a PRAIRIE 3IA In-
stitute fellowship (”Investissements d’avenir” program, ref-
erence ANR-19-P3IA-0001).
[Auer 2018] Auer, P. 2018. Dialect change in Europe – leveling and convergence. The Handbook of Dialectology.
[Bollmann 2019] Bollmann, M. 2019. A large-scale com-
parison of historical text normalization systems. In Proceed-
ings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers),
3885–3898. Minneapolis, Minnesota: Association for Com-
putational Linguistics.
[Ghazvininejad, Choi, and Knight 2018] Ghazvininejad, M.;
Choi, Y.; and Knight, K. 2018. Neural poetry translation. In
Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 2 (Short Papers).
[Häkkinen 1994] Häkkinen, K. 1994. Agricolasta nykykieleen: suomen kirjakielen historia. Söderström.
[Hämäläinen and Alnajjar 2019] Hämäläinen, M., and Alnajjar, K. 2019. Creative contextual dialog adaptation in an open world rpg. In Proceedings of the 14th International Conference on the Foundations of Digital Games, 1–7.
[Hämäläinen and Hengchen 2019] Hämäläinen, M., and Hengchen, S. 2019. From the paft to the fiiture: a fully automatic nmt and word embeddings method for ocr post-correction. In Recent Advances in Natural Language Processing, 432–437. INCOMA.
[Hämäläinen and Rueter 2018] Hämäläinen, M., and Rueter, J. 2018. Development of an open source natural language generation tool for finnish. In Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages, 51–58.
[Hämäläinen et al. 2019] Hämäläinen, M.; Säily, T.; Rueter, J.; Tiedemann, J.; and Mäkelä, E. 2019. Revisiting NMT for normalization of early English letters. In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 71–75. Minneapolis, USA: Association for Computational Linguistics.
[Hämäläinen 2018] Hämäläinen, M. 2018. Harnessing nlg to create finnish poetry automatically. In International Conference on Computational Creativity, 9–15. Association for Computational Creativity (ACC).
[Institute for the Languages of Finland 2014] Institute for the Languages of Finland. 2014. Suomen kielen näytteitä - Samples of Spoken Finnish [online-corpus], version 1.0.
[Johnson et al. 2017] Johnson, M.; Schuster, M.; Le, Q. V.;
Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; Hughes, M.; and Dean, J. 2017.
Google’s multilingual neural machine translation system:
Enabling zero-shot translation. Transactions of the Asso-
ciation for Computational Linguistics 5:339–351.
[Kaji and Kurohashi 2005] Kaji, N., and Kurohashi, S. 2005.
Lexical choice via topic adaptation for paraphrasing written
language to spoken language. In International Conference
on Natural Language Processing, 981–992. Springer.
[Klein et al. 2017] Klein, G.; Kim, Y.; Deng, Y.; Senellart,
J.; and Rush, A. M. 2017. OpenNMT: Open-Source Toolkit
for Neural Machine Translation. In Proc. ACL.
[Lamb, Brown, and Clarke 2018] Lamb, C.; Brown, D. G.;
and Clarke, C. L. 2018. Evaluating computational creativ-
ity: An interdisciplinary tutorial. ACM Computing Surveys
(CSUR) 51(2):1–34.
[Levenshtein 1966] Levenshtein, V. I. 1966. Binary codes
capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8):707–710.
[Li et al. 2019] Li, D.; Zhang, Y.; Gan, Z.; Cheng, Y.; Brock-
ett, C.; Sun, M.-T.; and Dolan, B. 2019. Domain adaptive
text style transfer. arXiv preprint arXiv:1908.09395.
[Luong, Pham, and Manning 2015] Luong, M.-T.; Pham, H.;
and Manning, C. D. 2015. Effective approaches to
attention-based neural machine translation. arXiv preprint
[Lyytikäinen 1984] Lyytikäinen, E. 1984. Suomen kielen nauhoitearkiston neljännesvuosisata. Virittäjä.
[Partanen, Hämäläinen, and Alnajjar 2019] Partanen, N.; Hämäläinen, M.; and Alnajjar, K. 2019. Dialect text
normalization to normative standard finnish. In Proceedings
of the 5th Workshop on Noisy User-generated Text (W-NUT
2019), 141–146.
[Prabhumoye et al. 2018] Prabhumoye, S.; Tsvetkov, Y.;
Salakhutdinov, R.; and Black, A. W. 2018. Style transfer
through back-translation. In Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), 866–876. Melbourne, Australia:
Association for Computational Linguistics.
[Rekunen 2000] Rekunen, J. 2000. Suomen kielen näytteitä 50. Kotimaisten kielten tutkimuskeskus.
[Toivanen et al. 2012] Toivanen, J.; Toivonen, H.; Valitutti,
A.; and Gross, O. 2012. Corpus-Based Generation of Con-
tent and Form in Poetry. In Proceedings of the Third Inter-
national Conference on Computational Creativity.
[Veliz, De Clercq, and Hoste 2019] Veliz, C. M.; De Clercq,
O.; and Hoste, V. 2019. Benefits of data augmentation for
nmt-based text normalization of user-generated content. In
Proceedings of the 5th Workshop on Noisy User-generated
Text (W-NUT 2019), 275–285.