
Revisiting NMT for Normalization of Early English Letters

Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann and Eetu Mäkelä
Department of Digital Humanities
University of Helsinki
firstname.lastname@helsinki.fi

Proc. of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 71–75
Minneapolis, MN, USA, June 7, 2019. © 2019 Association for Computational Linguistics
Abstract

This paper studies the use of NMT (neural machine translation) as a normalization method for an early English letter corpus. The corpus has previously been normalized so that only the less frequent deviant forms are left without normalization. This paper discusses different approaches to improving the normalization of these deviant forms. Adding features to the training data is found to be unhelpful, but using a lexicographical resource to filter the top candidates produced by the NMT model, together with lemmatization, improves the results.
1 Introduction

Natural language processing of historical data is not a trivial task. A great deal of NLP tools and resources work out of the box with modern data, whereas they can be of little use with historical data. The lack of a written standard in the early days, and the fact that the language has changed over the centuries, need to be addressed before higher-level NLP tasks can be tackled.
The end goal of our project is to identify neologisms and study their spread in the CEEC (Corpora of Early English Correspondence) (Nevalainen et al., 1998–2006), a letter corpus consisting of texts ranging from the 15th century all the way to the 19th century. In order to achieve a higher recall of neologisms, the corpus needs to be normalized to present-day spelling.

A regular-expression based study of neologisms (Säily et al., in press) in the same corpus suggested the use of the Oxford English Dictionary (OED, n.d.) as a viable way of detecting neologism candidates. Words occurring in the corpus before their earliest attestation in the OED would thus be considered potential neologism candidates. However, in order to achieve this, the words in the corpus need to be mappable to the OED, in other words, normalized to their modern spelling. As we are dealing with historical data, the fact that a neologism exists in the OED is a way of ensuring that the new word has become established in the language.
A previous study on automatic normalization of the CEEC comparing different methods (Hämäläinen et al., 2018) suggested NMT (neural machine translation) as the single most effective method. This finding is the motivation for us to continue this work and focus only on the NMT approach, expanding on what was proposed in the earlier work by using different training and post-processing methods.

In this paper, we present different NMT models and evaluate their effectiveness in normalizing the CEEC. As a result of the previous study, all the easily normalizable historical forms have been filtered out, and we focus solely on the historical spellings that are difficult to normalize with existing methods.
2 Related Work

Using character-level machine translation for the normalization of historical text is not a new idea. Research in this vein existed even before the dawn of neural machine translation (NMT), during the era of statistical machine translation (SMT).

Pettersson et al. (2013) present an SMT approach for normalizing historical text as part of a pipeline in which NLP tools for the modern variant of the language are then used for tagging and parsing. The normalization is conducted at the character level, and the parallel data is aligned at both the word and character level.

SMT has also been used to normalize contemporary dialectal language to the standardized normative form (Samardzic et al., 2015). They test normalization with word-by-word translation and with character-level SMT. The character-level SMT improves the normalization of unseen and ambiguous words.
Korchagina (2017) proposes an NMT-based normalization for medieval German. It is supposedly one of the first attempts to use NMT for historical normalization. The study reports NMT outperforming the existing rule-based and SMT methods.

A recent study by Tang et al. (2018) compared different NMT models for historical text normalization in five different languages. They report that NMT outperforms SMT in four of the five languages. In terms of performance, vanilla RNNs are comparable to LSTMs and GRUs, and the difference between attention and no attention is also small.
3 The Corpus

We use the CEEC as our corpus. It consists of written letters from the 15th century all the way to the 19th. The letters have been digitized by hand by editors who have aimed to keep the linguistic form as close to the original as possible. This means that while our data is free of OCR errors, words are spelled in their historical forms.

The corpus has been annotated with social metadata. This means that for each author in the corpus we can retrieve various kinds of social information, such as the rank and gender of the author, their dates of birth and death, and so on. The corpus also records additional information on a per-letter basis, such as the year the letter was written, the relationship between the sender and the recipient, and so on.
4 The NMT Approach

We use OpenNMT (Klein et al., 2017), specifically version 0.2.1 of opennmt-py, to train the NMT models discussed in this paper. The models are trained at the character level. This means that the model is supplied with parallel lists of historical spellings and their modern counterparts, where the words have been split into individual characters separated by white spaces.

The training is done on pairs of words, i.e. the normalization is conducted without context. The NMT model then treats individual characters as though they were words in a sentence and "translates" them into the corresponding modernized spelling.
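The character-level data preparation described above can be sketched as follows. This is a minimal illustration, not the authors' actual preprocessing script, and the historical/modern word pairs are invented for the example:

```python
def to_char_seq(word):
    """Split a word into space-separated characters, so that the NMT model
    treats each character as a 'word' in a 'sentence'."""
    return " ".join(word)

def make_parallel(pairs):
    """Turn (historical, modern) word pairs into source/target training lines."""
    src = [to_char_seq(hist) for hist, _ in pairs]
    tgt = [to_char_seq(mod) for _, mod in pairs]
    return src, tgt

# Hypothetical examples of historical -> modern spellings
src, tgt = make_parallel([("vnto", "unto"), ("cryeng", "crying")])
print(src[0])  # v n t o
print(tgt[1])  # c r y i n g
```

The resulting source and target lines would then be written to plain-text files for OpenNMT's preprocessing step.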
4.1 The Parallel Data

We use different sources of historical–modern English parallel data. These include the normalized words from the CEEC, the historical forms provided in the OED and the historical lemmas in the Middle English Dictionary (MED, n.d.) that have been linked to the OED lemmas with modern spelling. This parallel data of 183,505 words is the same as compiled and used in Hämäläinen et al. (2018).
For testing the accuracy of the models, we prepare gold standards by hand, taking sets of 100 of the previously non-normalized words in the CEEC. Accuracy is measured as an exact match to the gold standard. We prepare one generic test set and four century-specific test sets of 15th, 16th, 17th and 18th century words. Each of these five gold-annotated test sets consists of 100 words normalized by a linguist knowledgeable in historical English. The reason why we choose to prepare our own gold standard is that we are interested in the applicability of our approach to the study of the CEEC as a step in our neologism identification pipeline.
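Exact-match accuracy over such a test set amounts to a straightforward comparison; a sketch, with word pairs invented for illustration:

```python
def exact_match_accuracy(predictions, gold):
    """Fraction of predicted normalizations identical to the gold standard."""
    if len(predictions) != len(gold):
        raise ValueError("prediction and gold lists must be aligned")
    correct = sum(pred == ref for pred, ref in zip(predictions, gold))
    return correct / len(gold)

# Two of the three hypothetical predictions match the gold normalization
acc = exact_match_accuracy(["unto", "hath", "speake"], ["unto", "hath", "speak"])
print(round(acc, 2))  # 0.67
```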
4.2 Different NMT Models

The previous work (Hämäläinen et al., 2018) on the normalization of the CEEC used the default settings of OpenNMT. This means that the encoder is a simple recurrent neural network (RNN), there are two layers in both the encoder and the decoder, and the attention model is the general global attention presented by Luong et al. (2015).

In this section, we train the model with different parameters to see their effect on the accuracy of the model. The accuracy is evaluated and reported over a concatenated test set of all five gold standards.

At first, we change one parameter at a time and compare the results to the default settings. We try two different encoder types: bi-directional recurrent neural networks (BRNNs) and mean, an encoder applying mean pooling. A BRNN uses two independent encoders to encode the sequence both reversed and without reversal. The default RNN, in contrast, only encodes the sequence normally, without reversing it.

In addition to the default attention model, we also try out the MLP (multi-layer perceptron) attention model proposed by Bahdanau et al. (2014). We also change the number of layers used by the encoder and decoder, running the training with four and six layers for both encoding and decoding.
          default  mlp    mean  brnn   4 layers  6 layers
accuracy  35.6%    36.6%  13%   39.8%  37.2%     36.6%

Table 1: Accuracy of each method
Table 1 shows the accuracy of the model trained with the different parameters. BRNNs seem to produce the best results, while the MLP attention model and additional layers can be beneficial over the default attention and number of layers. Next, we try out different combinations with the BRNN encoder to see whether we can increase the overall accuracy.
          brnn   brnn+mlp  brnn+4 layers  brnn+mlp+4 layers
accuracy  39.8%  36%       35.8%          38.2%

Table 2: Accuracy of BRNN models
We can see in Table 2 that the BRNN with the default attention and the default number of layers works better than the other combinations. This means that for our subsequent models, we pick the BRNN encoder with otherwise default settings.
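With opennmt-py, the variants above correspond roughly to command-line flags such as the following. The flag names reflect our reading of OpenNMT-py's options around the 0.2.x era and should be checked against the installed version; the file names are placeholders:

```shell
# Preprocess the character-level parallel files and train with a BRNN encoder,
# keeping the other settings at their defaults.
python preprocess.py -train_src hist.char.src -train_tgt mod.char.tgt \
                     -valid_src dev.char.src -valid_tgt dev.char.tgt \
                     -save_data norm_data
python train.py -data norm_data -save_model norm_brnn -encoder_type brnn

# Variants tested in this section:
#   -global_attention mlp   # Bahdanau-style MLP attention
#   -layers 4               # four encoder/decoder layers (likewise 6)
#   -encoder_type mean      # mean-pooling encoder
```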
4.3 Additional Information

The previous study (Hämäläinen et al., 2018) showed that using information about the centuries of the historical forms in training the NMT and SMT models was not beneficial. However, there might still be other additional information that could potentially boost the performance of the NMT model. In this part, we show the results of models trained with different additional data.
In addition to the century, the CEEC comes with social metadata on both the letters and the authors. We use the sender ID, sender rank, relationship code and recipient rank as additional information for the model. The sender ID uniquely identifies different senders in the CEEC, the ranks indicate the person's social status at the time of the letter (such as nobility or upper gentry) and the relationship code indicates whether the sender and recipient were friends, had a formal relationship and so on.
The social information is included in the parallel data in such a way that for each historical form, the social metadata is added if the form has appeared in the CEEC. If the form has not appeared in the CEEC, generic placeholders are added instead of real values. The metadata is appended as a list separated by white spaces to the beginning of each historical form.

                           15th  16th  17th  18th  generic
eSpeak IPA with graphemes  22%   25%   31%   14%   20%
Only eSpeak IPA            43%   35%   52%   20%   36%
Metaphone                  22%   23%   25%   12%   23%
Bigram                     16%   9%    11%   3%    9%
No feature                 45%   35%   48%   25%   42%

Table 3: Results with additional information
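This metadata-prepending scheme can be illustrated as follows. The field order, placeholder token and example values here are our own invention for the sketch, not the exact tokens used with the corpus:

```python
def with_metadata(hist_word, meta=None):
    """Prepend sender ID, sender rank, relationship code and recipient rank
    (or generic placeholders) to a character-split historical form."""
    fields = ("sender_id", "sender_rank", "rel_code", "recipient_rank")
    meta = meta or {}
    tokens = [str(meta.get(f, "<none>")) for f in fields]
    return " ".join(tokens + list(hist_word))

# Form attested in the CEEC: real metadata values (hypothetical here)
print(with_metadata("vnto", {"sender_id": "S123", "sender_rank": "nobility"}))
# S123 nobility <none> <none> v n t o

# Form not attested in the CEEC: generic placeholders only
print(with_metadata("vnto"))
# <none> <none> <none> <none> v n t o
```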
When reading the historical letters, what helps a human reader to understand the historical forms is reading them out loud. Motivated by this observation, we add pronunciation information to the parallel data: an estimation of the pronunciation is added to the beginning of each historical form as an individual token. This estimation is produced by the Metaphone algorithm (Philips, 1990). Metaphone yields an approximation of the pronunciation of a word, not an exact phonetic representation, which could be useful for the NMT model.
In addition to the Metaphone approximation, we use eSpeak NG2 to produce an IPA transcription of the historical forms. For the transcription, we use British English as the language variant, as the letters in our corpus are mainly from different parts of England. We use the transcription to train two different models: one where the transcription is appended character by character to the beginning of the historical form, and another where we substitute the transcription for the historical form.

The final alteration to the training data we try in this section is that, instead of providing more information, we train the model on character bigrams rather than the unigrams used in all the other models.
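Overlapping character bigrams are one common way to realize this (the paper does not spell out the exact bigram scheme used); they can be produced like so:

```python
def char_bigrams(word):
    """Represent a word as overlapping character bigrams instead of unigrams."""
    if len(word) < 2:
        return word
    return " ".join(word[i:i + 2] for i in range(len(word) - 1))

print(char_bigrams("vnto"))   # vn nt to
print(char_bigrams("hathe"))  # ha at th he
```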
The results for the different approaches discussed in this section are shown in Table 3. As we can see, only the eSpeak-produced IPA, when it no longer includes the original written form, comes close to using the character unigrams from the parallel data. Training with just the IPA transcription outperforms the character approach only on the 17th-century test set.
2 https://github.com/espeak-ng/espeak-ng/
4.4 Picking the Normalization Candidate

Looking at the results of the NMT model, we can see that more often than not, when the normalization is not correct, the resulting word form is not a word of the English language. Therefore, it makes sense to explore whether the model can reach a correct normalization if, instead of considering only the best normalization candidate produced by the NMT model, we look at multiple top candidates.

During the translation step, we make the NMT model output the 10 best candidates. We go through these candidates starting from the best one and compare them against the OED. If the produced modern form exists in the OED, or exists in the OED after lemmatization with spaCy (Honnibal and Montani, 2017)3, we pick that form as the final normalization. In other words, we use a dictionary to pick the best normalization candidate that exists in the English language.
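The candidate-filtering step can be sketched as below. Here `oed_words` and `lemmatize` stand in for the OED lookup and the spaCy lemmatizer, and the toy word list and lemmatizer are invented for the example:

```python
def pick_candidate(candidates, oed_words, lemmatize):
    """Return the best-ranked candidate attested in the dictionary, either
    directly or after lemmatization; fall back to the top candidate."""
    for cand in candidates:
        if cand in oed_words or lemmatize(cand) in oed_words:
            return cand
    return candidates[0]

oed_words = {"speak", "unto"}
lemmatize = lambda w: w[:-1] if w.endswith("s") else w  # toy lemmatizer

# Shortened, hypothetical 10-best list from the NMT model: the top candidate
# "speake" is not attested, but "speaks" lemmatizes to "speak", which is.
print(pick_candidate(["speake", "speaks", "spak"], oed_words, lemmatize))
# speaks
```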
           15th  16th  17th  18th  generic
OED+Lemma  49%   42%   51%   19%   43%
Lemma      45%   35%   48%   25%   42%

Table 4: Results with picking the best candidate with the OED
Table 4 shows the results when we pick the first candidate that is found in the OED, compared with using only the top candidate of the BRNN model. We can see an improvement on all the test sets except for the 18th century.
           15th  16th  17th  18th  generic
OED+Lemma  69%   78%   71%   50%   61%
Lemma      61%   67%   63%   45%   53%

Table 5: Results with OED and lemmatization
If we lemmatize both the input of the NMT model and the correct modernized form in the gold standard with spaCy before the evaluation, we can assess the overall accuracy of OED mapping with the normalization strategies. The results shown in Table 5 indicate a performance boost in the mapping task; however, this type of normalization does not match the actual inflectional forms. Nevertheless, in our case, lemmatization is acceptable, as we are ultimately interested in mapping words to the OED rather than in their exact form in a sentence.

3 With the model en_core_web_md.
5 Conclusions

Improving the NMT model for normalization is a difficult task. A different sequence-to-sequence model can improve the results to a degree, but the gains are not big. Adding more features, no matter how useful they might sound intuitively, does not bring any performance boost. At least that is the case for the corpus used in this study, as the great deal of social variety and the time-span of multiple centuries represented in the CEEC are reflected in the non-standard spelling.

Using a lexicographical resource and a good lemmatizer, simplistic as this approach is, is a good way to improve the normalization results. However, as squeezing even more performance out of the NMT model seems tricky, probably the best direction for the future is to improve the method for picking the contextually most suitable normalization out of the results of multiple different normalization methods, as originally explored in Hämäläinen et al. (2018). Thus, the small improvement of this paper can be brought back to the original setting as one of the normalization methods.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing. To appear.

Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, and Eetu Mäkelä. 2018. Normalizing early English letters to Present-day English spelling. In Proceedings of the 2nd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 87–96.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proc. ACL.

Natalia Korchagina. 2017. Normalizing medieval German texts: from rules to deep learning. In Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, pages 12–17.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

MED. n.d. Middle English Dictionary. University of Michigan. https://quod.lib.umich.edu/m/med/.

Terttu Nevalainen, Helena Raumolin-Brunberg, Jukka Keränen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Samuli Kaislaniemi, Mikko Laitinen, Tanja Säily, and Anni Sairio. 1998–2006. CEEC, Corpora of Early English Correspondence. Department of Modern Languages, University of Helsinki. http://www.helsinki.fi/varieng/CoRD/corpora/CEEC/.

OED. n.d. OED Online. Oxford University Press. http://www.oed.com/.

Eva Pettersson, Beáta Megyesi, and Jörg Tiedemann. 2013. An SMT approach to automatic annotation of historical text. In Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013, NEALT Proceedings Series 18, pages 54–69. Linköping University Electronic Press.

Lawrence Philips. 1990. Hanging on the Metaphone. Computer Language, 7(12).

Tanja Samardzic, Yves Scherrer, and Elvira Glaser. 2015. Normalising orthographic and dialectal variants for the automatic processing of Swiss German. In Proceedings of the 7th Language and Technology Conference.

Tanja Säily, Eetu Mäkelä, and Mika Hämäläinen. In press. Explorations into the social contexts of neologism use in early English correspondence. Pragmatics & Cognition.

Gongbo Tang, Fabienne Cap, Eva Pettersson, and Joakim Nivre. 2018. An evaluation of neural machine translation models on historical spelling normalization. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1320–1331.