
# Reconstructing language ancestry by performing word prediction (presentation at the workshop "Phylogenetic methods in historical linguistics", Tübingen, 29 March 2017)

## Abstract

In this presentation, I show how machine learning can be applied to historical linguistics. The presentation describes the progress of my thesis as of March 2017.
Reconstructing language ancestry by performing word prediction
Peter Dekker
University of Amsterdam
peter@peterdekker.eu
March 29, 2017
Workshop Phylogenetic methods in historical linguistics
Tübingen, March 27-30, 2017
Introduction
MSc student Artificial Intelligence, University of Amsterdam
MSc thesis, supervised by:
Gerhard Jäger, SfS, University of Tübingen
Jelle Zuidema, ILLC, University of Amsterdam
Overview
Introduction
Method
Word prediction
Phylogenetic tree reconstruction
Identification of sound correspondences
Results
Word prediction
Phylogenetic tree reconstruction
Identification of sound correspondences
Future work and conclusion
Research question
How are languages related?
Which sounds change regularly between languages?
Many methods depend on manual cognate judgments
Method: word prediction, based on phonetic basic vocabulary lists
Predict words in language A from words in language B
Use the regularity of sound change
Serves as basis for:
Phylogenetic tree reconstruction, without manual cognate judgments
Identification of sound correspondences
Related work
Cognate production: Beinborn et al. (2013); Ciobanu (2016)
Only on cognates, orthographic input
Our method: both cognates and non-cognates, phonetic input
Cognate detection: Inkpen et al. (2005); List (2012); Jäger et al. (2017); Rama (2016)
Our method performs generation rather than detection
Method
Method overview
Word prediction algorithm
Applications:
Phylogenetic tree reconstruction
Identification of sound correspondences
Method
Word prediction
Machine learning
Supervised learning
Train a model on pairs (x, y)
Predict y for an unseen x
E.g. image classification
Linguistic motivation
Regular sound change is predictable
Word prediction
1. Train a model on pairs (w_{c,A}, w_{c,B}): words for concept c in languages A and B
2. For a new concept d, give w_{d,A} and predict w_{d,B}
NL       DE
ɣeːst    ɡaɪst
nɛt      nɛt͡s
ɑprɪl    ʔapʁiːl
vaːdə    faːtɐ
kɔrt     kʊʁt͡s
vreːmt   fʁɛmt
Analogy to machine translation (Kondrak, 2002)
Learn to detect cognates, partial cognates and loanwords
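The training setup above can be sketched as pairing words by concept. The word lists below are toy Dutch-German examples taken from the slide; the function name is illustrative, not from the thesis code.

```python
# Toy Dutch (NL) and German (DE) word lists keyed by concept,
# taken from the example table above.
nl = {"ghost": "ɣeːst", "net": "nɛt", "april": "ɑprɪl", "father": "vaːdə"}
de = {"ghost": "ɡaɪst", "net": "nɛt͡s", "april": "ʔapʁiːl", "father": "faːtɐ"}

def training_pairs(src, tgt):
    """Pair source and target words that realise the same concept."""
    return [(src[c], tgt[c]) for c in sorted(src) if c in tgt]

pairs = training_pairs(nl, de)
```

Held-out concepts would then be used as test pairs, as in step 2 above.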
Recurrent neural network
Neural network that models sequential data
Recurrent connections
Weights (= information) are shared between time steps
Unfold one node over time:
[Figure: a single node with input, hidden and output layers, unfolded into one copy per time step]
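A minimal sketch of this unfolding, assuming an Elman-style recurrent node with random toy weights (the actual model uses GRUs; see the model details slide):

```python
import numpy as np

# The same weight matrices are reused at every time step, which is what
# "weights are shared between time steps" means. Sizes and weights here
# are toy values, not the thesis configuration.
rng = np.random.default_rng(0)
n_in, n_hid = 5, 4
W_xh = rng.normal(size=(n_hid, n_in))   # input -> hidden
W_hh = rng.normal(size=(n_hid, n_hid))  # hidden -> hidden (recurrent)

def rnn_unfold(xs):
    """Return the hidden state after each input symbol."""
    h = np.zeros(n_hid)
    states = []
    for x in xs:                          # one step per character
        h = np.tanh(W_xh @ x + W_hh @ h)  # same W_xh, W_hh at every step
        states.append(h)
    return states

# three one-hot "characters" from a 5-symbol alphabet
seq = [np.eye(n_in)[i] for i in (0, 2, 3)]
states = rnn_unfold(seq)
```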
Recurrent Neural Network (RNN)
Sequence-to-sequence in Cho et al. (2014), not in our approach
Direct output of the encoder assumes the same input and output length. Solution:
Encoder-decoder structure
Successful in machine translation: Cho et al. (2014), Sutskever et al. (2014)
Our model:
[Figure: the encoder (weights W) reads the source word ɣ eː s t through hidden states h_t0 ... h_t3 into a fixed-size vector C; the decoder (weights V) unfolds from C, and its outputs are compared against the target ɡ aɪ s t (targets T_t0 ... T_t3) to compute the loss]
Loss function
Network learns from the loss between target and prediction
General loss function: cross-entropy
To discount non-cognates: weight the loss by the error between target and prediction, with a sharp decline once the error exceeds θ
θ based on mean error in training history

L = CE(t, p) · p(cog)
p(cog) = 1 / (1 + e^(E(t,p) − θ))
θ = E_history + v·σ

[Plot: p(cog) as a function of E(t, p), dropping off around θ]
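The gating scheme above can be sketched numerically. All values, including the hyperparameter v and the history statistics, are illustrative assumptions; CE and E are taken as given scalars rather than computed from network outputs.

```python
import math

# The cross-entropy CE(t, p) is multiplied by p(cog), a sigmoid gate that
# falls off once the prediction error E(t, p) exceeds the threshold
# theta = E_history + v * sigma.

def p_cog(error, theta):
    """Close to 1 when the error is below theta, close to 0 above it."""
    return 1.0 / (1.0 + math.exp(error - theta))

def gated_loss(cross_entropy, error, history_mean, sigma, v=1.0):
    theta = history_mean + v * sigma
    return cross_entropy * p_cog(error, theta)

# A likely cognate (low error) keeps most of its loss; a likely
# non-cognate (high error) contributes almost nothing.
low = gated_loss(cross_entropy=2.0, error=0.3, history_mean=0.5, sigma=0.1)
high = gated_loss(cross_entropy=2.0, error=5.0, history_mean=0.5, sigma=0.1)
```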
Model details
400 hidden units for encoder and decoder
Gated Recurrent Units (GRU) (Cho et al., 2014) as recurrent nodes: remember long-distance dependencies
Bidirectional encoder
Encoder input is read until the exact word length
Decoder input has a standard length; empty positions are encoded by a special character
Implemented using the Lasagne framework, based on Theano
Input format
Input alphabet:
Phonetic representation of words: usually IPA
ASJPcode: 41 sound classes, many-to-one mapping from IPA to ASJP (Brown et al., 2008)
Input encoding:
One-hot character encoding:
Vector of length n_characters, with a 1 at the position of the character and 0 everywhere else
Single-label classification: softmax output layer, categorical cross-entropy loss
    p: 1 0 0 0 0
    b: 0 0 0 1 0
Phonetic encoding:
Vector of length n_features, with a 1 at every feature that applies
Multi-label classification: sigmoid output layer, binary cross-entropy loss
       Voiced  Labial  Dental  ...
    p    0       1       0     ...
    b    1       1       0     ...
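The two encodings can be sketched with a toy alphabet and feature set; these are illustrative, not the actual ASJP inventory or the feature system used in the thesis.

```python
# Toy alphabet and phonetic feature inventory (assumptions for illustration)
ALPHABET = ["p", "b", "t", "d", "a"]
FEATURES = ["voiced", "labial", "dental", "vowel"]
PHONE_FEATURES = {"p": {"labial"}, "b": {"voiced", "labial"},
                  "t": {"dental"}, "d": {"voiced", "dental"},
                  "a": {"voiced", "vowel"}}

def one_hot(ch):
    """Single-label encoding: 1 at the character's index, 0 elsewhere."""
    return [1 if c == ch else 0 for c in ALPHABET]

def phonetic(ch):
    """Multi-label encoding: 1 for every feature that applies to the sound."""
    return [1 if f in PHONE_FEATURES[ch] else 0 for f in FEATURES]
```

The one-hot vector always sums to 1 (hence softmax), while the feature vector can have several active positions (hence sigmoid outputs).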
Method
Phylogenetic tree reconstruction
Intuition: performance on prediction corresponds to the genetic relationship between languages
Cognate pairs are predictable through regular sound correspondences
Language pairs with a high prediction score share more cognates
Hierarchical clustering of languages based on the edit distance matrix between target and prediction
For a language pair, the distance is the mean of the distances in both directions
UPGMA (Sokal and Michener, 1958), neighbor joining (Saitou and Nei, 1987) or another phylogenetic algorithm
Baseline: clustering based on edit distances between source and target
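A hand-rolled sketch of the UPGMA-style clustering step, using a five-language subset of the word prediction distances reported in the results section; in practice an established phylogenetics library would be used instead.

```python
# Pairwise language distances (subset of the results matrix in this deck)
langs = ["NL", "DE", "EN", "FR", "IT"]
D = {("NL", "DE"): 0.57, ("NL", "EN"): 0.75, ("NL", "FR"): 0.82,
     ("NL", "IT"): 0.80, ("DE", "EN"): 0.78, ("DE", "FR"): 0.81,
     ("DE", "IT"): 0.76, ("EN", "FR"): 0.83, ("EN", "IT"): 0.82,
     ("FR", "IT"): 0.66}

def dist(a, b):
    """Mean pairwise distance between two clusters (tuples of languages)."""
    pairs = [D.get((x, y), D.get((y, x))) for x in a for y in b]
    return sum(pairs) / len(pairs)

def upgma(leaves):
    clusters = [(l,) for l in leaves]
    merges = []
    while len(clusters) > 1:
        # fuse the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = upgma(langs)  # first merges: NL+DE (0.57), then FR+IT (0.66)
```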
Method
Identification of sound correspondences
Which sound correspondences occur regularly, and in which contexts?
Perform Needleman-Wunsch alignment (Needleman and Wunsch, 1970) between source and target, and between source and prediction
Look at the frequencies of substitutions
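The alignment step can be sketched as follows. Match/mismatch/gap scores of +1/-1/-1 are assumptions, since the slides do not specify the scoring scheme; the example word pair is taken from the NL-DE results table.

```python
from collections import Counter

# Needleman-Wunsch global alignment, used here to tabulate substitutions.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # traceback: recover one optimal alignment as (src, tgt) pairs, "-" = gap
    aligned, i, j = [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            aligned.append((a[i - 1], b[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned.append((a[i - 1], "-")); i -= 1
        else:
            aligned.append(("-", b[j - 1])); j -= 1
    return aligned[::-1]

# Count non-identical aligned pairs for one NL-DE item from the results
aln = needleman_wunsch("xras", "gGas")
subs = Counter(p for p in aln if p[0] != p[1])
```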
Results
Word prediction
NorthEuraLex dataset (Dellert, 2015)
Indo-European portion: 1016 concepts for 31 languages (still unpublished)
Settings in the following experiments:
15 epochs over a training set of 800
Test set: 100
Word prediction: NL-DE
Source Target Prediction
rur3 GiG3n GGG33n
vorst fGost fuGst
vErbet3r3 fEabEsan fEaaeG3nn
Eiverix flaisiS EegiitS
zom3r zoma zum
spor Spua SaaG3
xras gGas gGas
ent Ent3 aint
sprek3 SpGES3n S3EG3nn
blEiv3 blaib3n blib3nn
Distance 0.54
Word prediction: EN-FR
Source Target Prediction
spred apate prE
stik bato afose
haumoC kobyE mumi
wil3u sol aaa
3j puse arot
bend kurbe bE
mir3 glas mrry
swon siN s3
teibl tabl3 traoe
fo katr3 poaS
be3 urs bE
8en pyi lE
Distance 0.87
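The "Distance" scores under the tables are consistent with a length-normalized edit distance averaged over the test set. A sketch, where normalizing by the longer string is an assumption:

```python
# Plain dynamic-programming Levenshtein distance, then normalized by the
# length of the longer string so that scores fall in [0, 1].
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalized_distance(target, prediction):
    if not target and not prediction:
        return 0.0
    return edit_distance(target, prediction) / max(len(target), len(prediction))
```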
Prediction performance
Baselines: the source word used as prediction, and prediction of sounds using PMI (Jäger et al., 2017; Church and Hanks, 1990; Wieling et al., 2009)
Lang pair   Prediction   Source   PMI   (all columns: distances)
ES IT 0.55 0.51 0.50
DE NL 0.56 0.62 0.54
NL DE 0.58 0.62 0.51
IT ES 0.59 0.51 0.54
FR IT 0.63 0.72 0.65
CZ RU 0.67 0.60 0.57
CZ PL 0.68 0.62 0.58
FR ES 0.68 0.76 0.67
PL CZ 0.68 0.62 0.52
PL RU 0.69 0.68 0.60
Lang pair   Prediction   Source   PMI   (all columns: distances)
ES EN 0.84 0.87 0.98
DE PL 0.84 0.91 0.87
RU FR 0.85 0.90 0.96
CZ FR 0.85 0.88 0.95
EN PL 0.85 0.89 0.88
EN RU 0.85 0.88 0.83
PL EN 0.86 0.89 1.00
EN CZ 0.86 0.89 0.87
RU EN 0.88 0.88 1.00
CZ EN 0.89 0.89 0.98
Results
Phylogenetic tree reconstruction
Perform word prediction for 9 Indo-European languages; resulting distance matrix:
NL DE EN RU PL CZ FR IT ES
NL 0.00 0.57 0.75 0.81 0.82 0.83 0.82 0.80 0.78
DE 0.57 0.00 0.78 0.80 0.81 0.82 0.81 0.76 0.78
EN 0.75 0.78 0.00 0.87 0.86 0.88 0.83 0.82 0.82
RU 0.81 0.80 0.87 0.00 0.69 0.68 0.83 0.76 0.78
PL 0.82 0.81 0.86 0.69 0.00 0.68 0.83 0.79 0.79
CZ 0.83 0.82 0.88 0.68 0.68 0.00 0.85 0.77 0.80
FR 0.82 0.81 0.83 0.83 0.83 0.85 0.00 0.66 0.69
IT 0.80 0.76 0.82 0.76 0.79 0.77 0.66 0.00 0.57
ES 0.78 0.78 0.82 0.78 0.79 0.80 0.69 0.57 0.00
Phylogenetic tree: word prediction
Neighbor joining on the word prediction distance matrix:
[Tree figure; leaf order: Dutch, German, English, French, Italian, Spanish, Russian, Polish, Czech]
Phylogenetic tree: word prediction
UPGMA on the word prediction distance matrix:
[Tree figure; leaf order: English, Dutch, German, French, Italian, Spanish, Russian, Polish, Czech]
Phylogenetic tree: source-target baseline
Neighbor joining on the source-target baseline distance matrix:
[Tree figure; leaf order: Dutch, German, English, Russian, Czech, Polish, French, Italian, Spanish]
Phylogenetic tree: source-target baseline
UPGMA on the source-target baseline distance matrix:
[Tree figure; leaf order: Polish, Russian, Czech, French, Italian, Spanish, English, Dutch, German]
Results
Identification of sound correspondences: NL-DE
Most frequent substitutions, using Needleman-Wunsch alignment (first table: source vs. prediction; second table: source vs. target):
Source Prediction Number
- n 32
- 3 23
r G 17
r - 10
r a 9
v f 9
s S 9
w v 8
E a 8
- a 7
Source Target Number
- n 31
r G 21
- 3 15
E a 10
- a 10
r a 10
w v 8
s S 8
e E 7
t c 7
Future work and conclusion
Future work
Word prediction algorithm:
Attention: use a weighted sum of all encoder output steps (Bahdanau et al., 2014)
Obscured sound correspondences
Use more input languages, so the model learns a kind of proto-language
Multiple encoders, or even the use of multiple languages in one encoder (Ha et al., 2016; Johnson et al., 2016)
Encode input as character embeddings
Perform evaluation only on cognates
Cognate detection
Use Bayesian MCMC for tree reconstruction
Sound correspondences:
Extract sound correspondences from the neural network
Conclusion
Word prediction can serve as a basis for several tasks in historical linguistics:
Phylogenetic tree reconstruction
Identification of sound correspondences
Cognate detection
Results of these applications can become even more meaningful by improving prediction performance
Future work and conclusion
Related work: cognate production
Beinborn et al. (2013):
Ciobanu (2016):
Cognate detection
Perform word prediction for all pairs of languages
For every concept:
From the prediction results of all language pairs, take into account the word pairs for this concept
Cluster the words into cognate clusters based on the prediction distances
Compute B-Cubed F (Bagga and Baldwin, 1998) and compare to other approaches
References
Bagga, A. and Baldwin, B. (1998). Algorithms for scoring coreference chains. In The
first international conference on language resources and evaluation workshop on
linguistics coreference, volume 1, pages 563–566.
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473.
Beinborn, L., Zesch, T., and Gurevych, I. (2013). Cognate production using
character-based machine translation. In IJCNLP, pages 883–891.
Brown, C. H., Holman, E. W., Wichmann, S., and Velupillai, V. (2008). Automated classification of the world's languages: a description of the method and preliminary results. STUF - Language Typology and Universals / Sprachtypologie und Universalienforschung, 61(4):285-308.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Church, K. W. and Hanks, P. (1990). Word association norms, mutual information,
and lexicography. Computational linguistics, 16(1):22–29.
Ciobanu, A. M. (2016). Sequence labeling for cognate production. Procedia
Computer Science, 96:1391–1399.
Dellert, J. (2015). Compiling the Uralic dataset for NorthEuraLex, a lexicostatistical database of Northern Eurasia. In Septentrio Conference Series, number 2, pages 34-44.
Ha, T.-L., Niehues, J., and Waibel, A. (2016). Toward multilingual neural machine
translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798.
Inkpen, D., Frunza, O., and Kondrak, G. (2005). Automatic identification of cognates and false friends in French and English. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pages 251-257.
Jäger, G., List, J.-M., and Sofroniev, P. (2017). Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists. In Proceedings of EACL 2017.
Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., et al. (2016). Google's multilingual neural machine translation system: Enabling zero-shot translation. arXiv preprint arXiv:1611.04558.
Kondrak, G. (2002). Determining recurrent sound correspondences by inducing
translation models. In Proceedings of the 19th international conference on
Computational linguistics-Volume 1, pages 1–7. Association for Computational
Linguistics.
List, J.-M. (2012). LexStat: Automatic detection of cognates in multilingual wordlists. In EACL 2012, page 117.
Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443-453.
Rama, T. (2016). Siamese convolutional networks based on phonetic features for
cognate identification. arXiv preprint arXiv:1605.05172.
Saitou, N. and Nei, M. (1987). The neighbor-joining method: A new method for
reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425.
Sokal, R. R. and Michener, C. D. (1958). A statistical method for evaluating
systematic relationships. University of Kansas Scientific Bulletin, 28:1409–1438.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with
neural networks. In Advances in neural information processing systems, pages
3104–3112.
Wieling, M., Prokić, J., and Nerbonne, J. (2009). Evaluating the pairwise string
alignment of pronunciations. In Proceedings of the EACL 2009 workshop on
language technology and resources for cultural heritage, social sciences, humanities,
and education, pages 26–34. Association for Computational Linguistics.
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.7 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a strong phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which beats the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.