All content in this area was uploaded by Mika Hämäläinen on Sep 07, 2020
Automatic Dialect Adaptation in Finnish and its Effect on Perceived Creativity
Mika Hämäläinen1
mika.hamalainen@helsinki.fi
Niko Partanen2
niko.partanen@helsinki.fi
Khalid Alnajjar3
khalid.alnajjar@helsinki.fi
Jack Rueter1
jack.rueter@helsinki.fi
Thierry Poibeau4
thierry.poibeau@ens.fr
1Digital Humanities, 2Finnish, Finno-Ugrian and Scandinavian Studies, 3Computer Science, University of Helsinki, FI
4Lab. LATTICE, ENS/PSL & CNRS & Univ. Sorbonne nouvelle, FR
Abstract
We present a novel approach for adapting text written
in standard Finnish to different dialects. We experiment
with character-level NMT models, using both a multi-dialectal
approach and a transfer learning approach. The models
are tested with over 20 different dialects. The results
seem to favor transfer learning, although not strongly
over the multi-dialectal approach. We study the influ-
ence dialectal adaptation has on perceived creativity of
computer generated poetry. Our results suggest that the
more the dialect deviates from the standard Finnish, the
lower scores people tend to give on an existing evalua-
tion metric. However, on a word association test, peo-
ple associate creativity and originality more with dialect
and fluency more with standard Finnish.
Introduction
We present a novel method for adapting text written in stan-
dard Finnish to different Finnish dialects. The models de-
veloped in this paper have been released in an open-source
Python library¹ to boost the limited Finnish NLP resources,
and to encourage both replication of the current study and
further research in this topic. In addition to the new method-
ological contribution, we use our models to test the effect
they have on perceived creativity of poems authored by a
computationally creative system.
The Finnish language exhibits numerous differences between
colloquial spoken regional varieties and the written standard.
This situation is a result of a long historical development.
The literary variety known as Modern Finnish developed into
its current form in the late 19th century, after which the
changes have mainly been in the details (Häkkinen 1994, 16).
Many of the changes have been lexical, due to technical
innovations and the modernization of society; orthographic
spelling conventions have largely remained the same. Spoken
Finnish, on the other hand, traditionally represents an
areally divided dialect continuum, with several sharp
boundaries and many regions of gradual differentiation from
one municipality to the next.
¹ https://github.com/mikahama/murre
Especially in the latter part of the 20th century, the spoken
varieties have been leveling away from very specific local
dialects, and although regional varieties still exist, most of
the local varieties have certainly become endangered. Similar
processes of dialect convergence have been reported from
different regions in Europe, although with substantial
variation (Auer 2018). In the case of Finnish this has not,
however, resulted in a merger of the written and spoken
standards: spoken Finnish has remained, to this day, very
distinct from the written standard. In the late 1950s, a
program was set up to document extant spoken dialects, with
the goal of recording 30 hours of speech from each
municipality. This work resulted in very large collections of
dialectal recordings (Lyytikäinen 1984, 448-449). Many of these
have been published, and some portion has also been manually
normalized. The dataset used is described in more detail in
Section Data and Preprocessing.
Finnish orthography is largely phonemic within the language
variety it represents, although, as discussed above, its
relationship to actual spoken Finnish is complicated. The
phonemicity of the orthography is still a very important
factor here, as the differences between the varieties mainly
reflect historically developed differences rather than
orthographic particularities that would be essentially random
from a contemporary point of view. The differences between
Finnish dialects, spoken Finnish and Standard Finnish are
therefore highly systematic and based on historical sound
correspondences and sound changes, instead of the more random
adaptation of historical spelling conventions that is typical
of many languages.
Due to the phonemicity of the Finnish writing system,
dialectal differences are also reflected in informal writing.
People speaking a dialect oftentimes also write it as they
would speak it when communicating with friends and fam-
ily members. This is different from English in that, for ex-
ample, although Australians and Americans pronounce the
word today differently, they would still write the word in
the same way. In Finnish, such a dialectal difference would
result in a different written form as well.
We hypothesize that dialect increases the perceived value
of computationally created artefacts. Dialectal text is some-
thing that people are not expecting from a machine as much
as they would expect standard Finnish. The effect dialect has
on results can be revealing of the shortcomings of evaluation
methods used in the field.
Proceedings of the 11th International
Conference on Computational Creativity (ICCC’20)
ISBN: 978-989-54160-2-8
204
Related Work
Text adaptation has received some research attention in the
past. The task consists of adapting or transferring a text to
a new form that follows a certain style or domain. As the
particular task of dialect adaptation has not received a wide
research interest, we dedicate this section to describing different
text adaptation systems in a broader sense.
Adaptation of written language to a more spoken language
style has previously been tackled as a lexical adaptation
problem (Kaji and Kurohashi 2005). The authors use style
and topic classification to gather data representing written
and spoken language styles, and then learn the probabilities
of lexemes occurring in both categories. This way
they can learn the differences between the spoken and the
written on a lexical level and use this information for style
adaptation. The difference to our approach is that we ap-
proach the problem on a character level rather than lexical
level. This makes it possible for our approach to deal with
out-of-vocabulary words and to learn inflectional differences
as well without additional modeling.
Poem translation has been tackled from the point of view
of adaptation as well (Ghazvininejad, Choi, and Knight
2018). The authors train a neural model to translate French
poetry into English while making the output adapt to spec-
ified rhythm and rhyme patterns. They use an FSA (finite-
state acceptor) to enforce a desired rhythm and rhyme.
Back-translation is also a viable starting point for style
adaptation (Prabhumoye et al. 2018). They propose a
method consisting of two neural machine translation sys-
tems and style generators. They first translate the English in-
put into French and then back again to English in the hopes
of reducing the characteristics of the initial style. A style
specific bi-LSTM model is then used to adapt the back trans-
lated sentence to a given style based on gender, political ori-
entation and sentiment.
A recent line of work within the paradigm of computational
creativity presents a creative contextual style adaptation
in video game dialogs (Hämäläinen and Alnajjar 2019).
They adapt video game dialog to better suit the state of the
video game character. Their approach works in two steps:
first, they use a machine translation model to paraphrase the
syntax of the sentences in the dialog to increase the variety
of the output. After this, they refill the new syntax with the
words from the dialog and adapt some of the content words
with a word embedding model to fit better the domain dic-
tated by the player’s condition.
A recent style adaptation method (Li et al. 2019) learns to
separate stylistic information from content information, so that
it can maximize the preservation of the content while adapting
the text to a new style. The authors propose an encoder-decoder
architecture for solving this task and evaluate it on two tasks:
sentiment transfer and formality transfer.
Earlier work on Finnish dialect normalization to standard
Finnish has shown that the relationship between spoken
Finnish varieties and literary standard language can be
modeled as a character level machine translation task (Partanen,
Hämäläinen, and Alnajjar 2019).
Data and Preprocessing
We use a corpus called Samples of Spoken Finnish (Insti-
tute for the Languages of Finland 2014) for dialect adapta-
tion. This corpus consists of over 51,000 hand annotated
sentences of dialectal Finnish. These sentences have been
normalized on a word level to standard Finnish. This pro-
vides us with an ideal parallel data set consisting of dialectal
text and their standard Finnish counterparts.
The corpus was designed so that all main dialects and
the transition varieties would be represented. The last di-
alect booklet in the series of 50 items was published in
2000, and the creation process was summarised there by
Rekunen (2000). For each location there is one hour of
transcribed text from two different speakers. Almost all
speakers were born in the 19th century. The transcriptions use
a semi-narrow transcription that captures the dialect specific
particularities well, without being unnecessarily narrow
phonetically.
The digitally available version of the corpus has a man-
ual normalization for 684,977 tokens. The entire normalized
corpus was used in our experiments.
Dialect                      Short   Sentences
Etelä-Häme                   EH      1860
Etelä-Karjala                EK      813
Etelä-Pohjanmaa              EP      2684
Etelä-Satakunta              ES      848
Etelä-Savo                   ESa     1744
Eteläinen Keski-Suomi        EKS     2168
Inkerinsuomalaismurteet      IS      4035
Kaakkois-Häme                KH      8026
Kainuu                       K       3995
Keski-Karjala                KK      1640
Keski-Pohjanmaa              KP      900
Länsi-Satakunta              LS      1288
Länsi-Uusimaa                LU      1171
Länsipohja                   LP      1026
Läntinen Keski-Suomi         LKS     857
Peräpohjola                  P       1913
Pohjoinen Keski-Suomi        PKS     733
Pohjoinen Varsinais-Suomi    PVS     3885
Pohjois-Häme                 PH      859
Pohjois-Karjala              PK      4292
Pohjois-Pohjanmaa            PP      1801
Pohjois-Satakunta            PS      2371
Pohjois-Savo                 PSa     2344
Table 1: Dialects and the number of sentences in each dialect in the corpus
Despite the attempts of the authors of the corpus to in-
clude all dialects, the dialects are not equally represented in
the corpus. One reason for this is certainly the different sizes
of the dialect areas, and the variation introduced by different
speech rates of individual speakers. The difference in the
number of sentences per dialect can be seen in Table 1. We
do not consider this uneven distribution to be a problem, as
it is mainly a feature of this dataset, but we have paid at-
tention to these differences in data splitting. In order to get
proportionally even numbers of each dialect in the different
data sets, we split the sentences of each dialect into training
(70%), validation (15%) and testing (15%) sets. The split is done
after shuffling the data. The same split is used throughout
this paper.
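The per-dialect split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_dialect`, the tuple layout `(dialect, standard, dialectal)` and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_dialect(examples, train_pct=70, valid_pct=15, seed=0):
    """Split (dialect, standard, dialectal) tuples 70/15/15 within each dialect."""
    by_dialect = defaultdict(list)
    for example in examples:
        by_dialect[example[0]].append(example)

    rng = random.Random(seed)  # fixed seed so the same split can be reused
    splits = {"train": [], "valid": [], "test": []}
    for dialect in sorted(by_dialect):
        items = by_dialect[dialect]
        rng.shuffle(items)  # shuffle before cutting, as in the text
        n_train = len(items) * train_pct // 100
        n_valid = len(items) * valid_pct // 100
        splits["train"] += items[:n_train]
        splits["valid"] += items[n_train:n_train + n_valid]
        splits["test"] += items[n_train + n_valid:]  # remainder goes to test
    return splits
```

Splitting inside each dialect, rather than over the pooled corpus, keeps the proportions of the dialects roughly equal across the three sets even though the dialects differ greatly in size.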
The dialectal data contains non-standard annotations that
are meant to capture phonetic and prosodic features that are
usually not represented in the writing. These include the use
of the acute accent to represent stress, superscripted charac-
ters, IPA characters and others. We go through all characters
in the dialectal sentences that do not occur in the normalizations,
i.e. all characters that are not part of the Finnish
alphabet or ordinary punctuation. We remove
all annotations that mark prosodic features as these are not
usually expressed in writing. This is done entirely manually
as sometimes the annotations are additional characters that
can be entirely removed and sometimes the annotations are
added to vowels and consonants, in which case they form
new Unicode characters and need to be replaced with their
non-annotated counterparts.
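The cleaning above was done manually. Purely as an illustration of the second case (annotations fused into a character), the following sketch strips combining marks while keeping the ones that belong to ordinary Finnish spelling. The function name and the set of kept marks are assumptions, and the sketch does not cover superscripted or IPA characters.

```python
import unicodedata

# Combining marks that are part of ordinary Finnish letters:
# U+0308 (diaeresis, as in ä and ö) and U+030A (ring above, as in å).
KEEP = {"\u0308", "\u030a"}


def strip_annotations(text):
    """Remove combining marks such as the acute stress accent, keeping ä, ö and å."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed
            if not unicodedata.combining(ch) or ch in KEEP]
    # Recompose so that a + U+0308 becomes the single character ä again.
    return unicodedata.normalize("NFC", "".join(kept))
```

Note that this simple rule would also strip the caron from š and ž, so a real pipeline would need a character list checked by hand, which is what the manual procedure amounts to.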
Automatic Dialect Adaptation
In order to adapt text written in standard Finnish to dialects,
we train several different models on the data set. A character
level sequence-to-sequence neural machine translation (NMT)
approach has proven successful in the past for the opposite
problem of normalizing a dialectal or historical language
variant to the standard language (see Bollmann 2019;
Hämäläinen et al. 2019; Veliz, De Clercq, and Hoste 2019;
Hämäläinen and Hengchen 2019), so we approach the problem
with a similar character-based methodology. The advantage of
character level models over word level models is their
adaptability to out-of-vocabulary words, a requirement that
needs to be satisfied for our experiments to be successful.
In practice, this means splitting the words into characters
separated by white-spaces and marking word boundaries with a
special character, which is the underscore (_) in our
approach.
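The character-level representation described above can be sketched in a few lines. The function names are hypothetical, and the sketch assumes the input text contains no literal underscores.

```python
def to_char_sequence(text, boundary="_"):
    """Split words into space-separated characters, marking word boundaries."""
    words = text.split()
    return f" {boundary} ".join(" ".join(word) for word in words)


def from_char_sequence(sequence, boundary="_"):
    """Invert to_char_sequence: drop the spaces, then restore word boundaries."""
    return "".join(sequence.split()).replace(boundary, " ")


print(to_char_sequence("minä kun näin"))  # m i n ä _ k u n _ n ä i n
```

Because the model only ever sees single characters as tokens, its vocabulary is tiny and any new word can be represented, which is what makes the approach robust to out-of-vocabulary words.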
In NMT, language flags have been used in the past to train
multi-lingual models (Johnson et al. 2017). The idea is that
the model can benefit from the information in multiple
languages when predicting the translation for a particular
language, as expressed by a language specific flag given to
the system. We train one model with all the dialect data,
prepending a dialect flag to the source side. The model will
then learn to use the flag when adapting the standard Finnish
text to the desired dialect.
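As a rough sketch of how a source/target pair could be built for the model with flags, assuming the flag is simply the dialect name placed as the first source token (the helper names `to_chars` and `make_pair` are hypothetical):

```python
def to_chars(text, boundary="_"):
    """Space-separated characters with an underscore between words."""
    return f" {boundary} ".join(" ".join(word) for word in text.split())


def make_pair(standard, dialectal, dialect=None):
    """Return (source, target); the dialect flag is optional."""
    source = to_chars(standard)
    if dialect is not None:
        # The flag goes at the very beginning of the source sequence.
        source = f"{dialect} {source}"
    return source, to_chars(dialectal)


src, tgt = make_pair("minä kun näin", "mie ko näin", "Inkerinsuomalaismurteet")
print(src)  # Inkerinsuomalaismurteet m i n ä _ k u n _ n ä i n
print(tgt)  # m i e _ k o _ n ä i n
```

Calling `make_pair` without a dialect gives the input format of a model trained without flags. Dialect names that contain a space would become several tokens under this scheme, so a real setup might use the short codes from Table 1 instead; that choice is not specified in the text.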
Additionally, we train one model without any flags or di-
alectal cues. This model is trained to predict from standard
Finnish to dialectal text (without any specification in terms
of the dialect). This model serves two purposes. Firstly, if it
performs poorly on individual dialects, it means that there is
a considerable distance between the dialects, so that a single
model that adapts text to a generic dialect cannot sufficiently
capture all of them. Secondly, this model is used as a
starting point for dialect specific transfer learning.
We use the generic model without flags for training dialect
specific models. We do this by freezing the first layer of the
encoder: as the encoder only sees standard Finnish, it does
not require any further training. Then we train the dialect
specific models from the generic model by continuing the
training with only the training and validation data specific to
a given dialect. We train each dialect specific model in the
described transfer learning fashion for an additional 20,000
steps.
Our models are recurrent neural networks. The architec-
ture consists of two encoding layers and two decoding layers
and the general global attention model (Luong, Pham, and
Manning 2015). We train the models by using the Open-
NMT Python package (Klein et al. 2017) with otherwise the
default settings. The model with flags and the generic model
are trained for 100,000 steps. We train the models by
providing chunks of three words at a time, as opposed to
training on one word or a whole sentence at a time, as a chunk
of three words has been suggested to be more effective in a
character-level text normalization task (Partanen, Hämäläinen,
and Alnajjar 2019).
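The three-word chunking can be illustrated as below. The text does not say whether the chunks overlap, so non-overlapping chunks are an assumption of this sketch, as is the function name.

```python
def chunk_words(sentence, size=3):
    """Split a sentence into consecutive, non-overlapping chunks of `size` words."""
    words = sentence.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


print(chunk_words("taas intona se kokoaa"))  # ['taas intona se', 'kokoaa']
```

The last chunk of a sentence can be shorter than three words, which matters later: short chunks are exactly the case where the models are reported to overgenerate.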
Table 2 shows an example of the sequences used for training.
The model receiving the dialect flag has the name of the
dialect prepended to the source data, whereas
the generic model has no additional information apart
from the character sequences. The dialect specific transfer
learning models are also trained without an additional flag,
but rather the exposure solely to the dialect specific data is
considered sufficient for the model to better learn the desired
dialect.
Results and Evaluation
In this section, we present the results of the dialect adapta-
tion models on different dialects. We use a commonly used
metric called word error rate (WER) and compare the di-
alect adaptations of the test sets of each dialect to the gold
standard. WER is calculated for each sentence by using the
following formula:
\[ \mathrm{WER} = \frac{S + D + I}{S + D + C} \tag{1} \]
WER is derived from Levenshtein edit distance (Levenshtein
1966) as a better measurement for calculating word-level
errors. It takes into account the number of deletions D,
substitutions S, insertions I and the number of correct
words C.
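A minimal sketch of this computation follows. It relies on two facts: the word-level edit distance equals S + D + I, and every reference word is either substituted, deleted or correct, so S + D + C is the length of the reference. The function name and the string-based interface are choices made for the example.

```python
def wer(reference, hypothesis):
    """Word error rate of `hypothesis` against `reference` (both plain strings)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match

    # Numerator: S + D + I. Denominator: S + D + C = len(ref).
    return dist[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("mie ko näin", "mie kun näin")` has one substitution over three reference words and gives 1/3; multiplying by 100 gives a percentage of the kind reported in the result tables.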
The results are shown in Tables 3 and 4. On the vertical
axis are the models. Flags represents the results of the model
that was trained with initial tokens indicating the desired dialect
the text should be adapted to. No flags is the model
trained without any dialectal information, and the rest of the
models are dialect specific transfer learning models trained
on the no flags model.
The results are to be interpreted as the lower the better,
i.e. the lower the WER, the closer the output is to the gold
dialect data in a given dialect. These results indicate that
the no flags model does not get the best results for any of the
dialects, which is to be expected: if it reached good results,
that would indicate that the dialects do not differ from
each other. Interestingly, we can observe that the transfer
            Source                                              Target
Flags       Inkerinsuomalaismurteet m i n ä _ k u n _ n ä i n   m i e _ k o _ n ä i n
No flags    m i n ä _ k u n _ n ä i n                           m i e _ k o _ n ä i n
Table 2: Example of the training data. The sentence reads "when I saw" in English
model EH EK EP ES ESa EKS IS KH K KK KP LS
Flags 24.37 19.8 25.13 28.09 27.22 25.19 21.09 28.73 25.56 24.59 22.51 30.49
No flags 38.87 36.21 41.98 42.16 37.71 37.35 39.38 39.03 37.05 42.43 39.08 42.3
EH 24.21 43.6 37.64 35.77 46.83 42.98 51.51 41.05 42.38 53.26 38.95 37.53
EK 48.65 19.28 52.63 47.57 35.69 39.94 31.86 42.97 47.14 33.13 49.76 45.51
EP 38.8 50.37 24.9 42.3 49.2 46.3 54.47 46.39 44.71 55.68 39.21 44.24
ES 34.36 44.81 41.49 29.03 49.35 47.8 50.05 45.56 47.74 51.16 38.02 37.12
ESa 46.06 32.28 49.5 50.38 26.81 32.43 42.01 44.26 38.4 40.32 45.88 47.9
EKS 44.3 37.3 47.06 51.05 34.15 25.07 45.56 42.97 36.5 42.84 42.65 47.86
IS 52.09 28.4 55.13 49.53 41.52 44.57 19.69 41.13 50.24 29.14 52.26 46.65
KH 43.98 38.34 47.75 47.66 45.46 43.23 41.16 28.43 47.88 44.36 47.9 45.76
K 42.59 45.05 45.11 50.11 39.79 35.97 50.56 48.17 25.56 49.34 40.89 49.63
KK 54.1 30 55.59 51.52 40.52 43.12 29.21 43.65 49.74 24.87 53.52 50.21
KP 35.58 43.94 38.58 40.2 44.54 41.53 51.03 45.84 39.26 52.04 22.51 44.32
LS 36.05 39.56 42.77 35.73 46.21 45.34 47.7 43.4 46.73 48.19 40.76 29.71
LU 38.45 45.07 44.24 39.17 51.68 51.03 47.35 41.04 51.14 49.54 46.74 38.97
LP 40.58 44.55 42.07 41.94 46.1 44.94 49.32 46.35 44.42 50.71 35 44.57
LKS 33.25 40.03 37.48 39.88 39.42 35.24 49.09 42.59 33.99 49.82 32.79 42.64
P 39.05 44.38 40.83 42.72 45.09 42.25 50.06 46.11 41.1 51.14 35.12 44.34
PKS 45.73 43.03 48.96 51.9 36.41 33.39 48.55 47.2 33.37 46.73 43.46 52.63
PVS 50.34 41.51 52.91 44.13 50.96 53.29 44.48 46.03 55.99 46.38 53.35 43.09
PH 31.26 44.72 38.38 37.56 44.61 39.4 52.07 42.19 38.51 52.73 35.43 40.82
PK 44.14 44.33 47.18 50.83 36.98 37.08 46.76 46.09 33.51 46.5 42.58 51.05
PP 34.73 44.38 37.87 41.85 43.24 39.46 52 45.12 36.91 52.84 27.12 43.25
PS 28.42 46.29 35.51 36.62 46.96 42.46 53.15 41.84 42.31 53.84 36.63 38.6
PSa 43.12 40.86 47.81 49.71 34.74 33.12 46.47 44.95 32.01 45.44 45.28 51.15
Table 3: WER for the different models for dialects from Etelä-Häme to Länsi-Satakunta
learning method gives the best scores for almost all the
dialects, except for Etelä-Satakunta (ES), Keski-Karjala (KK),
Keski-Pohjanmaa (KP), Pohjois-Karjala (PK) and
Pohjois-Satakunta (PS), for which the model with flags gives
the best results. Both methods are equally good for
Pohjois-Häme (PH). All in all, the difference between the two
methods is rather small in terms of WER. An example of the
dialectal adaptation can be seen in Table 5.
Based on these results it is difficult to suggest one method
over the other, as both of them are capable of reaching
the best results on different dialects. On the practical side,
the model with dialectal flags trains faster and requires fewer
computational resources, as the model is trained only once
and it works for all the dialects immediately, whereas transfer
learning has to be done for each dialect individually after
training a generic model.
Evaluation of the models with and without dialectal flags
shows that especially in word forms that are highly diver-
gent in the dialect, it is almost impossible for the model to
predict the correct result that is in the test set. This doesn’t
mean that the model’s output would necessarily be entirely
incorrect, as the result may still be a perfectly valid dialectal
representation, just in a different variety.
There are also numerous examples of features that are in
variation also within one dialect. In these cases the model
may produce a form different from that in the specific row
of a test set. These kinds of problems are particularly
prominent in examples where the dialectal transcription contains
prosodic phenomena at the word boundary level. Since the
model starts the prediction from standard Finnish input, it
cannot have any knowledge about specific prosodic features
of the individual examples in the test data. Some phonological
features, such as the assimilation of nasals, seem to be
overgeneralized by the model, and in this case too it would be
impossible for the model to predict the instances where such
a phenomenon does not take place due to particularly careful
pronunciation.
Another interesting feature of the model is that it seems
to be able to generalize its predictions into unseen words,
as long as they exhibit morphology common for the training
data. There are, however, instances of clearly contemporary
word types, such as recent international loans, that have a
general shape and phonotactics that are entirely absent from the
training data. The problems caused by this are somewhat
mitigated by the fact that in many cases the standard Finnish
word can be left intact, and it will pass within the dialectal
text relatively well.
A consequence of this is that the scores reported here
are possibly slightly worse than the model’s true abilities.
The resulting dialectal text can still be very accurate and
closely approximate the actual dialect, even though the
prediction differs slightly from the test instances. At the
same time, if the model slips some literary Finnish forms into
the predicted text, the result is still perfectly understandable,
and in real use the dialects would also rarely be used in
entire isolation from the standard language.
It must also be taken into account that only a native
dialect speaker or an advanced specialist in Finnish
dialectology can reliably detect minute disfluencies in dialectal
predictions, especially when the error is introduced by
a form from another dialect. It would also be very uncommon
for one person to have such knowledge about all the Finnish
dialects the model operates on. After this careful examination
of the models, we proceed to the generation of dialectal poems and
model LU LP LKS P PKS PVS PH PK PP PS PSa
Flags 27.87 20.02 21.89 27.53 28.73 32.4 20.03 27.15 21.51 21.56 27.6
No flags 43.49 37.1 35.06 38.35 40.54 49.19 34.9 36.44 35.12 38.54 37.86
EH 39.9 35.63 33.65 39.92 48.42 51.05 27.61 41.9 32.54 27.46 43.91
EK 50.59 45.08 46.23 46.75 46.03 47.19 45.62 43.82 45.84 51.85 43.21
EP 47.04 37.78 39.13 41.56 52.06 56.16 33.28 44.64 34.32 35.23 46.35
ES 43.01 36.6 40.35 42.12 52.27 47.34 34.46 46.08 37.09 35.13 48.23
ESa 53.26 40.5 38.89 43.85 37.36 54.04 39.56 35.68 40.02 46.65 35.55
EKS 52.05 40.5 35.72 41.94 36.11 55.63 38.27 37.34 38.33 42.99 35.35
IS 48.72 47.29 49.67 48.39 49.59 45.74 49.89 46.45 48.51 54.29 46.18
KH 44.26 44.17 43.45 46.91 49.09 49.42 42.14 45.54 44.09 43.86 44.09
K 52.71 39.03 34.47 39.75 35.07 58.24 35.52 33.57 35.43 41.77 33.67
KK 51.83 48.19 50.94 49.37 49.41 48.7 50.84 45.66 50.27 55.14 45.86
KP 48.5 27.67 34.21 35.92 43.07 56.07 30.26 40.42 25.17 35.28 42.05
LS 42.57 36.9 39.88 42.58 51.71 47.31 34.74 46.31 38.71 36.28 47.7
LU 25.9 43.04 44.97 45.66 54.87 43.76 40.63 49.41 43.76 40.15 50.93
LP 49.41 19.57 38.2 32.23 47.75 55.87 35.04 43.57 33.96 37.92 45.85
LKS 45.39 33.2 21.41 34.97 40.88 55.13 26.9 36.47 28.22 31.57 36.37
P 47.29 28 35.54 26.81 46.23 56.06 33.45 40.98 32.7 37.82 43.27
PKS 55.67 41.86 39.87 42.62 28.65 57.68 39.28 35.8 38.08 46.01 33.67
PVS 46.42 49.26 54.99 52.31 57.36 31.67 52.13 52.69 51.34 53.44 53.39
PH 44.15 33.2 31.47 36.83 44.44 55.22 20.03 38.14 30.76 28.94 40.5
PK 53.77 41.01 38.61 42.59 38.34 57.18 37.99 27.28 37.87 46.02 34.98
PP 48.43 30.77 32.43 34.85 43.11 57.03 28.9 38.09 21.04 34.2 39.94
PS 42.17 35.18 32.03 38.9 47.14 54.2 26.49 41.34 31.54 22 42.98
PSa 52.19 42.13 36.28 42.45 35.29 56.52 38.29 32.8 37.67 43.96 27.24
Table 4: WER for the different models for dialects from Länsi-Uusimaa to Pohjois-Savo
their further evaluation by native Finnish speakers.
Effect on Perceived Creativity
In this section, we apply the dialect adaptation trained in
the earlier sections to text written in standard Finnish. We
are interested in seeing what the effect of the automatically
adapted dialect is on computer generated text. We use an
existing Finnish poem generator (Hämäläinen 2018) that
produces standard Finnish (SF) text, as it relies heavily on
hand defined syntactic structures that are filled with
lemmatized words, which are inflected with a normative Finnish
morphological generator by using a tool called Syntax Maker
(Hämäläinen and Rueter 2018). We use this generator to
generate 10 different poems.
The poems generated by the system are then adapted to
dialects with the models we elaborated in this paper. As
the number of different dialects is extensive and conducting
a human questionnaire with such a myriad of dialects is not
feasible, we limit our study to three dialects. We pick the
Etelä-Karjala (EK) and Inkerinsuomalaismurteet (IS) dialects
because they are the best performing ones in terms of WER, and
the Pohjoinen Varsinais-Suomi (PVS) dialect as it is the worst
performing in terms of WER. For this study, we use the
dialect specific models tuned with transfer learning.
A qualitative look at the predictions revealed that the
dialectal models have a tendency to overgenerate when a
word chunk has fewer than three words. The models tend to
predict one or two additional words in such cases; however,
if the chunk contains three words, the models neither over-
nor undergenerate words. For instance, olen vanha (I am old)
gets predicted in IS as olev vanha a. The first two words are
correctly adapted to the dialect, while the third word, a, is
an invention by the model. However, the models do not
systematically predict too many words, as the adaptation of
pieni ? (small?) to pien ? shows. Fortunately, overgeneration
is easy to overcome: we only consider as many words from the
prediction as there were in the original chunk written in
standard Finnish when doing the dialectal adaptation.
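The truncation step can be sketched as follows, using the two examples from the text. The function name is hypothetical and the inputs are assumed to be detokenized, whitespace-separated word strings.

```python
def trim_prediction(source_chunk, predicted_chunk):
    """Keep only as many predicted words as the source chunk contains."""
    n_words = len(source_chunk.split())
    return " ".join(predicted_chunk.split()[:n_words])


print(trim_prediction("olen vanha", "olev vanha a"))  # olev vanha
print(trim_prediction("pieni ?", "pien ?"))           # pien ?
```

This works because the adaptation changes the form of words but not their number or order, so the first n predicted words line up with the n source words.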
Replicating the Poem Generator Evaluation
In our first experiment, we replicate the poem generator
evaluation that was used to evaluate the Finnish poem gen-
erator used in this experiment. We are interested in seeing
whether dialectal adaptation has an effect on the evaluation
results of the creative system. The authors of the generator
evaluated their system based on the evaluation questions
initially elaborated in a study on an earlier Finnish poem
generator (Toivanen et al. 2012). The first evaluation question
is a binary one: Is the text a poem? The rest of the evaluation
questions are asked on a 5-point Likert scale:
1. How typical is the text as a poem?
2. How understandable is it?
3. How good is the language?
4. Does the text evoke mental images?
5. Does the text evoke emotions?
6. How much do you like the text?
The subjects are not told that they are to read poetry nor
that they are reading fully computer generated and dialectally
adapted text. We adapt the 10 generated poems to the three
different dialects, which means that there are altogether four
variants of each poem: one in standard Finnish and three in
dialects. We produce the questionnaires automatically in such
a fashion that each questionnaire has the 10 different poems
shuffled in random order each time. The variants of each poem
are picked randomly so that each questionnaire has a randomly
picked variant for each of the poems. Every questionnaire
contains poems from all of the different variant types, but none
of them contains the same poem more than once. Each
questionnaire is unique in the order and combination of the
variants. We
SF                         EK                       PVS                      IS                       Translation
himo on palo,              himo om palos,           himo om palo,            himo om palloo,          desire is a fire,
se syttyy herkästi         se syttyy herkäst        se sytty herkästi        se syttyy herkäst        it gets easily ignited
taas intona se kokoaa      taas intonna se kokovaa  taas inton se kokko      toas inton se kokohoa    again, as an ardor it shall rise
milloin into on eloisa?    millo into on elosa?     millon innoo on elosa?   millon into on eloisa?   when is ardor vivacious?
näemmekö me,               näämmekö met,            näämekö me,              neämmäks meä,            will we see
ennen kuin into jää pois?  enne ku into jää pois?   ennen ku into jää pois?  ennen ku into jää pois?  before ardor disappears?
mikäli innot pysyisivät,   mikäli innot pysysiit,   mikäl innop pysysivät,   mikält innot pysysiit,   if ardors stayed,
sinä huomaisit innon       sie huomasit inno        siä huamasit inno        sie huomaisit inno       you would notice the ardor
minä alan maksamaan innon  mie alan maksamaa inno   mää ala maksaman inno    mie ala maksamaa inno    I will start paying for the ardor
olenko liiallinen?         olenko siialli?          olenko liialline?        olenko liialine?         Am I extravagant?
Table 5: An example poem generated in standard Finnish and its dialectal adaptations to three different dialects
introduce all this randomness to reduce constant bias that
might otherwise be present if the poem variants were always
presented in the same order.
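The constraints stated above (every poem appears once, the order is shuffled, the variant is picked at random, and all four variant types occur in each questionnaire) can be sketched as follows. This is an illustration under assumptions, not the procedure actually used: the function name, the rejection-sampling loop and the variant labels as strings are choices made for the example, and the sketch does not by itself balance how often each variant of a poem is evaluated across questionnaires.

```python
import random

VARIANTS = ["SF", "EK", "PVS", "IS"]


def make_questionnaire(n_poems=10, rng=random):
    """Return a shuffled list of (poem index, variant) pairs for one questionnaire."""
    if n_poems < len(VARIANTS):
        raise ValueError("need at least as many poems as variant types")
    # Redraw the variants until every variant type appears at least once.
    while True:
        picks = [rng.choice(VARIANTS) for _ in range(n_poems)]
        if set(picks) == set(VARIANTS):
            break
    items = list(enumerate(picks))  # each poem index occurs exactly once
    rng.shuffle(items)              # random presentation order
    return items


questionnaire = make_questionnaire(rng=random.Random(1))
```

Passing a seeded `random.Random` instance makes a questionnaire reproducible, which is convenient when the printed sheets later have to be matched back to their variants.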
We print out the questionnaires and recruit native Finnish
speakers on the university campus. We recruit 20 people to
evaluate the questionnaires, each of which consists of 10
poems. This means that each variant of a poem is evaluated
by five different people.
Table 6 shows the results from this experiment; however,
some evaluators did not complete the task for all poems in
their pile². Interestingly, the results drop on all the
parameters when the poems are adapted into the different dialects
in question. The best performing dialect in the experiment
was the Etelä-Karjala dialect, and the worst performing one
was the Pohjoinen Varsinais-Suomi dialect, although it got
the exact same average scores as Inkerinsuomalaismurteet
on the last three questions. These results are not to be
interpreted as meaning that dialectal poems would always get
worse results, as we only used a handful of the possible
dialects. However, the results indicate an interesting finding:
something as superficial as a dialect can affect the
results. It is to be noted that the dialectal adaptation only
alters the words to be more dialectal; it does not substitute
the words with new ones, nor does it alter their order.
In order to better understand why the dialects were ranked in this order, we compare the dialectal poems to the standard Finnish poems automatically by calculating the word error rate (WER). These WERs should not be understood as "error rates", since we are not comparing the dialects to a gold standard, but rather to the standard Finnish poems. The idea is that the higher the WER, the more the dialectal poems differ from the standard. Table 7 shows the results of this experiment. The results seem to be in line with the human evaluation results: the further away the dialect is from standard Finnish, the lower it scores in the human evaluation. This is a potential indication of familiarity bias; people tend to prefer the more familiar language variety.
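As an illustration of how such a distance can be obtained, the sketch below computes WER as the word-level Levenshtein distance divided by the number of words in the standard Finnish text, expressed as a percentage. Whitespace tokenization, the function names and the example sentences are assumptions made for this sketch; the exact tokenization and aggregation used for Table 7 are not specified here.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference_text, hypothesis_text):
    """WER in percent, using the standard-language text as the reference."""
    reference = reference_text.split()
    hypothesis = hypothesis_text.split()
    if not reference:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_edit_distance(reference, hypothesis) / len(reference)


# Made-up sentence pair (not taken from the evaluated poems).
standard = "minä olen kotona tänään"
dialectal = "mie oon kotona tänään"
print(wer(standard, dialectal))
```

Because the adaptation neither adds nor reorders words, the distance here is driven almost entirely by substitutions, so the WER is close to the share of words whose form changed.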
2 The data is based on 47 observations for SF, 46 for EK, 43 for PVS and 49 for IS out of the maximum of 50.
Word Association Test
In the second experiment, we are interested in seeing how people associate words when they are presented with a standard Finnish version and a dialectally adapted variant of the same poem. The two poems are presented on the same page, labeled as A and B. The order is randomized again, which means that both the order of the poems in the questionnaire and whether the dialectal one is A or B are randomized. This is done again to reduce bias in the results that might be caused by always maintaining the same order. The concepts we study are the following:
• emotive
• original
• creative
• poem-like
• artificial
• fluent
The subjects are asked to associate each concept with A or B, one of which is the dialectal and the other the standard Finnish version of the same poem. We use the same dialects as before, but which dialect gets used is not controlled in this experiment. We divide each questionnaire of 10 poems into two piles to reduce the workload on each annotator, as each poem is presented in two different variant forms. This way, we recruit altogether 10 different people for this task, again native speakers from the university campus. Each poem with a dialectal variant gets annotated by five different people.
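The two levels of randomization described for this questionnaire (the order of the poems, and which variant is labeled A or B) can be sketched as follows. This is only a minimal illustration under assumed names: `build_questionnaire`, the page structure and the placeholder texts are hypothetical, and the seed is fixed only to make the sketch reproducible.

```python
import random


def build_questionnaire(poem_pairs, rng):
    """Shuffle poem order and, per poem, which variant is labeled A or B.

    poem_pairs is a list of (standard_text, dialect_text) tuples.
    """
    shuffled = list(poem_pairs)
    rng.shuffle(shuffled)  # randomize the order of the poems
    pages = []
    for standard_text, dialect_text in shuffled:
        variants = [("standard", standard_text), ("dialect", dialect_text)]
        rng.shuffle(variants)  # randomize which variant becomes A
        pages.append({"A": variants[0], "B": variants[1]})
    return pages


# Placeholder texts, not poems from the study.
pairs = [(f"standard poem {i}", f"dialect poem {i}") for i in range(1, 6)]
pages = build_questionnaire(pairs, random.Random(2020))
for page in pages:
    print(page["A"][0], "|", page["B"][0])
```

Keeping the variant type alongside each text makes it possible to map an annotator's A/B choices back to "standard" or "dialect" when tallying the associations.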
Table 8 shows the results of this experiment. Some of the people did not answer all questions for some poems; this is reflected in the "No answer" column. The results indicate that the standard Finnish poems were considered considerably more fluent than the dialectal poems (74% vs. 24%), and slightly more emotive and artificial. The dialectal poems were considered clearly more original and creative, and slightly more poem-like.
It is interesting that while dialectal poems can get clearly better results on some parameters in this experiment, they still scored lower on all the parameters in the first experiment. This potentially highlights a more general problem of evaluation in the field of computational creativity, as results are heavily dependent on the metric that happens to be chosen. The problems arising from this "ad hoc" evaluation practice are also discussed by Lamb, Brown, and Clarke (2018).
Proceedings of the 11th International
Conference on Computational Creativity (ICCC’20)
ISBN: 978-989-54160-2-8
209
      Poem    Typical        Understandable   Language       Mental images  Emotions       Liking
      %       M     Mo  Me   M     Mo    Me   M     Mo  Me   M     Mo  Me   M     Mo  Me   M     Mo  Me
SF    87.2%   2.85  4   3    3.62  4     4    3.51  4   4    3.57  4   4    2.94  2   3    3.02  4   3
EK    82.6%   2.5   2   2    3     4     3    2.87  3   3    3.26  4   3    2.67  2   2    2.70  2   3
IS    77.6%   2.69  2   3    2.90  3, 4  3    2.78  2   3    3.27  4   3    2.86  2   3    2.61  3   3
PVS   77.0%   2.51  2   2    2.80  2     3    2.58  2   3    3.27  4   3    2.86  2   3    2.61  3   3
Table 6: Results from the first human evaluation. Mean (M), mode (Mo) and median (Me) are reported for the questions on a Likert scale
      EK     IS     PVS
WER   34.38  43.41  54.69
Table 7: The distance of the dialectal poems from the original poem written in standard Finnish
            SF    Dialect  No answer
emotive     48%   46%      6%
original    40%   60%      0%
creative    32%   64%      4%
poem-like   46%   50%      4%
artificial  50%   46%      4%
fluent      74%   24%      2%
Table 8: Results of the second experiment with human annotators
Conclusions
We have presented our work on automatic dialect adaptation using a character-level NMT approach. Based on our automatic evaluation, both the transfer learning method and the multi-dialectal model with flags achieve the best results on some dialects. The transfer learning method, however, gives the best results on most of the dialects. Nevertheless, the difference in WERs between the two methods is generally small, so it is not possible to clearly recommend one over the other for different character-level data sets. If the decision is based on the computational power used, then the multi-dialectal model with flags should be used, as it only needs to be trained once and it can handle all the dialects.
The dialect adaptation models elaborated in this paper have been made publicly available as an open-source Python library (see footnote 3). This not only makes the replication of the results easier but also makes it possible to apply these unique Finnish NLP tools to other related research or to tasks outside academia as well.
Our study shows that automatic dialect adaptation has a clear impact on how different attributes of the text are perceived. In the first experiment, which was based on existing evaluation questions, a negative impact was found, as the scores dropped on all the metrics in comparison to the original standard Finnish poem. However, when inspecting the distance of the dialects from standard Finnish, we noticed that the further away the dialect is from the standard, the lower it scores.
3 https://github.com/mikahama/murre
We believe that the low scores might be an indication of familiarity bias, which means that people have a tendency to prefer things they are more familiar with. This is especially plausible since the evaluation was conducted in a region of Finland with a high number of migrants from different parts of the country. This leads to a situation where the most familiar language variety for everyone, regardless of their dialectal background, is the standard Finnish variety. Also, as the dialectal data used in our model originates from Finnish speakers born in the 19th century, it remains possible that the poems were transformed into a variety not entirely familiar to the individuals who participated in our survey. In future research, it is necessary to investigate the perceptions of wider demographics, taking into account a larger areal representation.
Based on our results, it is too early to generalize that familiarity bias is a problem in the evaluation of computationally creative systems. However, it is an important aspect to take into consideration in future research. We are interested in testing this particular bias in the future in a more controlled fashion. Nevertheless, the fact that a variable such as dialect, which is never controlled in computational creativity evaluations, has a clear effect on the evaluation results raises a real question about the validity of such evaluation methods. As abstract questions on a 5-point Likert scale are a commonly used evaluation methodology, narrowing down the unexpected variables, such as dialect, that affect the evaluation results positively or negatively is vital for progress in the field in terms of the comparability of results from different systems.
Even though our initial hypothesis that dialects would increase the perceived value of computationally created artefacts was proven wrong by the first experiment, the second experiment showed that dialects can indeed have a positive effect on the results as well, in terms of perceived creativity and originality. This finding is also troublesome from the point of view of computational creativity evaluation in a larger context. Our dialect adaptation system is by no means designed to exhibit any creative behavior of its own, yet people are more prone to associating the concept of creativity with dialectally adapted poetry.
The results of the first and second experiments give a very different picture of the impact dialect adaptation has on perceived creativity. This calls for more thorough research on the effect different evaluation practices have on the results of a creative system. Is the difference in results fully attributable to subjectivity in the task, to what was asked, or to how it was asked? Does making people pick between two options (dialectal and standard Finnish in our case) introduce a bias not present when people rate the poems individually? It is
important that these questions be systematically addressed in future research.
Acknowledgments
Thierry Poibeau is partly supported by a PRAIRIE 3IA Institute fellowship ("Investissements d'avenir" program, reference ANR-19-P3IA-0001).
References
Auer, P. 2018. Dialect change in Europe–leveling and convergence. The Handbook of Dialectology 159–76.
Bollmann, M. 2019. A large-scale comparison of historical text normalization systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 3885–3898. Minneapolis, Minnesota: Association for Computational Linguistics.
Ghazvininejad, M.; Choi, Y.; and Knight, K. 2018. Neural poetry translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 67–71.
Häkkinen, K. 1994. Agricolasta nykykieleen: suomen kirjakielen historia. Söderström.
Hämäläinen, M., and Alnajjar, K. 2019. Creative contextual dialog adaptation in an open world RPG. In Proceedings of the 14th International Conference on the Foundations of Digital Games, 1–7.
Hämäläinen, M., and Hengchen, S. 2019. From the paft to the fiiture: a fully automatic NMT and word embeddings method for OCR post-correction. In Recent Advances in Natural Language Processing, 432–437. INCOMA.
Hämäläinen, M., and Rueter, J. 2018. Development of an open source natural language generation tool for Finnish. In Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages, 51–58.
Hämäläinen, M.; Säily, T.; Rueter, J.; Tiedemann, J.; and Mäkelä, E. 2019. Revisiting NMT for normalization of early English letters. In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 71–75. Minneapolis, USA: Association for Computational Linguistics.
Hämäläinen, M. 2018. Harnessing NLG to create Finnish poetry automatically. In International Conference on Computational Creativity, 9–15. Association for Computational Creativity (ACC).
Institute for the Languages of Finland. 2014. Suomen kielen näytteitä - Samples of Spoken Finnish [online-corpus], version 1.0. http://urn.fi/urn:nbn:fi:lb-201407141.
Johnson, M.; Schuster, M.; Le, Q. V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.; Corrado, G.; Hughes, M.; and Dean, J. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5:339–351.
Kaji, N., and Kurohashi, S. 2005. Lexical choice via topic adaptation for paraphrasing written language to spoken language. In International Conference on Natural Language Processing, 981–992. Springer.
Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; and Rush, A. M. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proc. ACL.
Lamb, C.; Brown, D. G.; and Clarke, C. L. 2018. Evaluating computational creativity: An interdisciplinary tutorial. ACM Computing Surveys (CSUR) 51(2):1–34.
Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8):707–710.
Li, D.; Zhang, Y.; Gan, Z.; Cheng, Y.; Brockett, C.; Sun, M.-T.; and Dolan, B. 2019. Domain adaptive text style transfer. arXiv preprint arXiv:1908.09395.
Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Lyytikäinen, E. 1984. Suomen kielen nauhoitearkiston neljännesvuosisata. Virittäjä 88(4):448–448.
Partanen, N.; Hämäläinen, M.; and Alnajjar, K. 2019. Dialect text normalization to normative standard Finnish. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), 141–146.
Prabhumoye, S.; Tsvetkov, Y.; Salakhutdinov, R.; and Black, A. W. 2018. Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 866–876. Melbourne, Australia: Association for Computational Linguistics.
Rekunen, J. 2000. Suomen kielen näytteitä 50. Kotimaisten kielten tutkimuskeskus.
Toivanen, J.; Toivonen, H.; Valitutti, A.; and Gross, O. 2012. Corpus-Based Generation of Content and Form in Poetry. In Proceedings of the Third International Conference on Computational Creativity.
Veliz, C. M.; De Clercq, O.; and Hoste, V. 2019. Benefits of data augmentation for NMT-based text normalization of user-generated content. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), 275–285.