Reconstructing language ancestry by
performing word prediction with neural
networks
Peter Dekker
January 25, 2018
Thesis defense
MSc Artificial Intelligence
University of Amsterdam
1
Introduction
Introduction
MSc student Artificial Intelligence, University of Amsterdam
MSc thesis, supervised by:
Gerhard Jäger, SfS, University of Tübingen
Jelle Zuidema, ILLC, University of Amsterdam
2
Overview
Introduction
Word prediction
Applications
Phylogenetic word prediction
Conclusion and discussion
3
Machine learning algorithms can be used to predict words between languages and serve as a model of sound change, in order to reconstruct the ancestry of languages.
4
Historical linguistics
What is the ancestry of languages?
5
Historical linguistics
What is the ancestry of languages?
[Tree figure, labeled False: Icelandic, Dutch, Danish, German, Norwegian, Swedish, English]
6
Historical linguistics
What is the ancestry of languages?
[Tree figure, labeled True: Icelandic, Norwegian, Danish, Swedish, German, Dutch, English]
7
Comparative method
Use current languages as evidence for the past
8
Regular sound correspondences
Sound change takes place according to laws that admit no exception
Osthoff and Brugmann (1880)
Crucial in the comparative method

        Dutch   German
out     uit     aus
brown   bruin   braun
house   huis    Haus
mouse   muis    Maus
louse   luis    Laus
9
Comparative method
Use current languages as evidence for the past:
1. Establish regular sound correspondences between words in different languages
2. Infer protoforms in the ancestor of the current languages
3. Determine if words in different languages are ancestrally related (cognates)
4. Reconstruct a phylogenetic tree
(adapted from Jäger and List (2016))
10
Computational methods
11
Computational methods
Bouckaert et al. (2012): Mapping the origins and expansion of the Indo-European language family
12
Machine learning
Predict by learning patterns from past examples
Generalize over training examples
Example: image classification
Train a model on pairs (x, y)
Predict y for an unseen x
Successful in areas such as:
Machine translation (e.g. Google Translate)
Pedestrian detection in cars (Sermanet et al., 2013)
Analysis of medical images (Avendi et al., 2016)
[Figure: cat and dog images with their predicted labels]
14
Thesis
Historical linguistics
Determine the ancestry of languages
Use regular sound correspondences
Machine learning
Predict based on patterns in data
Generalize over regularities in data
Word prediction
15
Word prediction
Word prediction
For two languages A and B, we have lists of words for the same concepts.
1. Train a model on pairs (w_{c,A}, w_{c,B}): words for concept c in languages A and B
2. For a new concept d, give w_{d,A} and predict w_{d,B}
3. Calculate the edit distance (Levenshtein, 1966) between prediction and target

NL     DE     Prediction   Distance
ku     ku
spits  Spic
lerar  leGa
dot    das
brot   bGot   bGat         1/4 = 0.25
16
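To make the evaluation step concrete, here is a minimal Python sketch of the normalized edit distance used in the table above; the exact normalization (here: dividing by the length of the longer word) and the function names are assumptions for illustration.

```python
# Minimal sketch of the evaluation step: normalized Levenshtein distance
# between a predicted word and the target word.
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def normalized_distance(prediction: str, target: str) -> float:
    """Edit distance divided by the length of the longer word (assumed normalization)."""
    return edit_distance(prediction, target) / max(len(prediction), len(target))

print(normalized_distance("bGat", "bGot"))   # 1 edit over 4 symbols = 0.25
```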
One task, many applications
Use output of word prediction to perform:
Phylogenetic tree reconstruction
Identification of sound correspondences
Cognate detection
17
Machine learning models
Two machine learning models
Recurrent Neural Network (RNN) encoder-decoder
Structured perceptron
Both neural networks
18
Neural networks
Every node receives input from previous nodes and sends output to the next nodes
Weights between nodes are updated during training
Nodes have a non-linear activation function
The neural network is the combination of all those non-linear functions
Ability to model complex relationships between input and output
1. Feed data
2. Generate an output value
3. Compare output and target values, using a loss function
4. Update the weights, by backpropagation of the derivative of the loss
[Diagram: Input → Hidden → Output layers; the output is compared to the target via the loss]
19
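The four training steps can be illustrated with a small self-contained sketch; this toy network (random data, mean squared error, tanh activations) only mirrors the loop above and is not one of the models used in the thesis.

```python
# Toy illustration of the four training steps: feed data, generate output,
# compute the loss, backpropagate the derivative of the loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                # 8 training examples, 4 input features
Y = rng.normal(size=(8, 2))                # 2 target values per example

W1 = rng.normal(scale=0.1, size=(4, 5))    # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(5, 2))    # hidden -> output weights
lr = 0.1

for step in range(100):
    # 1. Feed data, 2. Generate output value
    hidden = np.tanh(X @ W1)               # non-linear activation
    output = hidden @ W2
    # 3. Compare output and target values, using a loss function (mean squared error)
    loss = np.mean((output - Y) ** 2)
    # 4. Update weights by backpropagating the derivative of the loss
    d_output = 2 * (output - Y) / Y.size
    d_W2 = hidden.T @ d_output
    d_hidden = (d_output @ W2.T) * (1 - hidden ** 2)   # derivative of tanh
    d_W1 = X.T @ d_hidden
    W1 -= lr * d_W1
    W2 -= lr * d_W2
```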
Recurrent Neural Network (RNN)
[Diagram: the source phonemes ɣ eː s t are fed one by one into hidden states h_t0 … h_t3, connected by recurrent weights W; the output is ɡ aɪ s t]
Direct output of the encoder assumes the same input and output length. Solution:
Encoder-decoder structure
Successful in machine translation: Cho et al. (2014), Sutskever et al. (2014)
[Diagram: the encoder (weights W) reads ɣ eː s t into a fixed-size context vector C; the decoder (weights V) generates the target, which is compared to ɡ aɪ s t via the loss]
20
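A compact PyTorch sketch of such an encoder-decoder is given below. It illustrates the architecture family only; the thesis's actual implementation, hyperparameters, and decoding scheme may differ, and the inventory size of 41 is borrowed from the ASJP encoding introduced later.

```python
# Character-level GRU encoder-decoder sketch: the encoder compresses the source word
# into a fixed-size context vector C, which the decoder unrolls into the target word.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, n_symbols: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_symbols)

    def forward(self, source: torch.Tensor, target_len: int) -> torch.Tensor:
        # Encoder: read the source phoneme sequence into a fixed-size context vector C.
        _, context = self.encoder(self.embed(source))          # (1, batch, hidden)
        # Decoder: unroll the context for target_len steps (no input feeding, for brevity).
        dec_in = context.transpose(0, 1).repeat(1, target_len, 1)
        dec_out, _ = self.decoder(dec_in, context)
        return self.out(dec_out)     # per-position scores over the phoneme inventory

# Toy usage: predict a 4-phoneme target word from a 4-phoneme source word.
model = EncoderDecoder(n_symbols=41)                 # e.g. the 41 ASJP sound classes
source = torch.randint(0, 41, (1, 4))                # one encoded source word
logits = model(source, target_len=4)                 # shape: (1, 4, 41)
target = torch.randint(0, 41, (1, 4))
loss = nn.functional.cross_entropy(logits.reshape(-1, 41), target.reshape(-1))
```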
Structured perceptron
Extension of the perceptron (Rosenblatt, 1958) (a one-layer neural network) to sequences
At every iteration, find the output sequence for which the perceptron gives the highest score
Efficiently estimated using the Viterbi algorithm
21
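The following is a generic structured-perceptron sketch for sequence labeling, with per-position emission features, label-transition weights, Viterbi decoding, and the standard perceptron update. It illustrates the model family only and does not reproduce the source-to-target features the thesis uses for word prediction.

```python
# Generic structured perceptron: score sequences with emission + transition weights,
# decode with Viterbi, then update towards the gold sequence and away from the prediction.
import numpy as np

def viterbi(emit: np.ndarray, trans: np.ndarray) -> list[int]:
    """emit: (T, L) scores per position/label; trans: (L, L) label-transition scores."""
    T, L = emit.shape
    score = emit[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans + emit[t][None, :]   # (prev_label, label)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def perceptron_update(emit_w, trans_w, features, gold):
    """One step: predict with Viterbi, then apply the perceptron weight update."""
    emit = features @ emit_w                               # (T, L) emission scores
    pred = viterbi(emit, trans_w)
    for t, (g, p) in enumerate(zip(gold, pred)):
        if g != p:
            emit_w[:, g] += features[t]
            emit_w[:, p] -= features[t]
    for t in range(1, len(gold)):
        trans_w[gold[t - 1], gold[t]] += 1.0
        trans_w[pred[t - 1], pred[t]] -= 1.0
    return pred
```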
Data
NorthEuraLex dataset (Dellert and Jäger, 2017)
1016 concepts for 107 languages in Northern Eurasia
More than basic vocabulary
Phonetic encoding
ASJPcode: 41 sound classes, a many-to-one mapping from IPA to ASJP (Brown et al., 2008)
22
Data
Source      Target
blut        blut
inslap3     ainSlaf3n
blot        blat
wExan       vEge3n
xlot        glat
warhEit     vaahait
orbEit      aabait
mElk3       mElk3n
vostbind3   anbind3n
hak         hak3n
stEl3       StEl3n
hust3       hust3n
xord3l      giat3l
23
Data
[Diagram: the source word b l u t is encoded as binary vectors (00101 01010 01001 01101) and fed into the encoder (weights W), which produces the fixed-size context vector C; the decoder (weights V) produces the target word, compared to ɡ aɪ s t via the loss]
24
Input encoding: existing
One-hot encoding: vector of length n_characters, with a 1 at the position of the right character and 0 for all other characters
Phonetic encoding: vector of length n_features, with a 1 at every feature that applies

    Voiced   Labial   Dental   ...
p   0        1        0        ...
b   1        1        0        ...
25
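A small sketch of these two encodings follows; the character inventory and feature values are toy examples, not the feature set used in the thesis.

```python
# One-hot encoding over the character inventory vs. binary phonetic features per phoneme.
import numpy as np

characters = ["p", "b", "t", "d", "a"]                  # toy inventory
features = ["voiced", "labial", "dental", "vowel"]      # toy feature set
feature_table = {
    "p": [0, 1, 0, 0],
    "b": [1, 1, 0, 0],
    "t": [0, 0, 1, 0],
    "d": [1, 0, 1, 0],
    "a": [1, 0, 0, 1],
}

def one_hot(ch: str) -> np.ndarray:
    vec = np.zeros(len(characters))
    vec[characters.index(ch)] = 1.0     # 1 at the position of the right character
    return vec

def phonetic(ch: str) -> np.ndarray:
    return np.array(feature_table[ch], dtype=float)     # 1 at every feature that applies

print(one_hot("b"))    # [0. 1. 0. 0. 0.]
print(phonetic("b"))   # [1. 1. 0. 0.]
```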
Embedding encoding
"You shall know a word by the company it keeps" (Firth, 1957)
Successful in many machine learning tasks, at the word level (Mikolov et al., 2013; Pennington et al., 2014)
Use this for phonemes: define a phoneme by all surrounding phonemes in the dataset for that language pair
Learn phonotactics from the data

ASJP phoneme   START    i (left)   S (left)   p (right)   ...
3              0.004    0.003      0.001      0.002       ...
E              0.024    0.000      0.000      0.003       ...
a              0.050    0.002      0.000      0.012       ...
b              0.388    0.000      0.000      0.004       ...
p              0.152    0.039      0.000      0.000       ...
26
Embedding encoding
[Figure: visualization comparing the embedding encoding and the phonetic encoding]
27
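Under the description above, one way to build such phoneme embeddings is to count, for every phoneme, which phonemes occur to its left and right (plus a word-start marker) and normalize the counts. The sketch below is an interpretation for illustration, not the thesis's exact procedure.

```python
# Context-count phoneme embeddings: each phoneme is represented by the relative
# frequencies of its left/right neighbours (and word-start position) in the word list.
from collections import Counter, defaultdict

def context_embeddings(words):
    counts = defaultdict(Counter)
    for word in words:
        for i, ph in enumerate(word):
            if i == 0:
                counts[ph]["START"] += 1
            else:
                counts[ph][word[i - 1] + "_LEFT"] += 1
            if i + 1 < len(word):
                counts[ph][word[i + 1] + "_RIGHT"] += 1
    # Normalize each phoneme's counts into relative frequencies.
    embeddings = {}
    for ph, c in counts.items():
        total = sum(c.values())
        embeddings[ph] = {ctx: n / total for ctx, n in c.items()}
    return embeddings

vectors = context_embeddings(["blut", "blat", "glat", "hak3n"])
print(vectors["l"])   # e.g. {'b_LEFT': ..., 'u_RIGHT': ..., 'g_LEFT': ..., 'a_RIGHT': ...}
```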
Training examples
Training on all word pairs (cognate and non-cognate): wide applicability
But only cognates contain the regularities that should be learned!

        Dutch   German
slope   hEliN   abhaN
eight   oxt     aGt
28
Loss function
The network learns from the loss between target and prediction
General loss function: cross-entropy
Cognacy prior: learn more from probable cognates, less from non-cognates
Based on the error between target and prediction, with a sharp decline if the error exceeds θ
θ based on the mean error in the training history

L = CE(t, p) · p(cog)
p(cog) = 1 / (1 + e^{E(t,p) − θ})
θ = E_history + v·σ

[Plot: p(cog) as a function of E(t, p)]
29
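The formulas above can be read as follows; this is a sketch with assumed parameter names, where v is the number of standard deviations added to the historical mean error.

```python
# Cognacy prior: down-weight the cross-entropy loss for word pairs whose prediction
# error E(t, p) exceeds a threshold theta derived from the training history.
import numpy as np

def cognacy_prior(error: float, history: np.ndarray, v: float = 1.0) -> float:
    """p(cog) = 1 / (1 + exp(E(t, p) - theta)), with theta = mean(history) + v * std(history)."""
    theta = history.mean() + v * history.std()
    return 1.0 / (1.0 + np.exp(error - theta))

def weighted_loss(cross_entropy: float, error: float, history: np.ndarray) -> float:
    """L = CE(t, p) * p(cog): probable non-cognates contribute little to the update."""
    return cross_entropy * cognacy_prior(error, history)

history = np.array([0.2, 0.3, 0.25, 0.4])                # errors seen so far in training
print(weighted_loss(1.5, error=0.3, history=history))    # error near theta: weight around 0.5
print(weighted_loss(1.5, error=2.0, history=history))    # error far above theta: strongly down-weighted
```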
Word prediction: NL-DE (structured perceptron)
Source Target Prediction Distance
blut blut blut 0.00
inslap3 ainSlaf3n inSlaun 0.33
blot blat blat 0.00
wExan vEge3n vag3n 0.33
xlot glat glat 0.00
warhEit vaahait vaahait 0.00
orbEit aabait oabait 0.17
mElk3 mElk3n mEl3n 0.17
vostbind3 anbind3n fostaiN3n 0.78
hak hak3n hak 0.40
stEl3 StEl3n Staln 0.33
hust3 hust3n hiSta 0.67
xord3l giat3l goad3l 0.33
30
Word prediction performance: Slavic
Baseline: prediction of sounds using pointwise mutual information (PMI) (Jäger et al., 2017; Church and Hanks, 1990; Wieling et al., 2009)
31
Applications
Phylogenetic tree reconstruction
Prediction distance per language pair as measure of relatedness
Clustering algorithms (UPGMA (Sokal and Michener, 1958) and
Neighbor joining (Saitou and Nei, 1987)) on distance matrix
Comparison to Gloolog (Hammarström et al., 2017)
32
Phylogenetic tree reconstruction
bel slv hrv ukr pol ces bul slk rus
bel 0.00 0.36 0.37 0.19 0.33 0.35 0.41 0.32 0.20
slv 0.36 0.00 0.20 0.33 0.42 0.34 0.29 0.31 0.42
hrv 0.37 0.20 0.00 0.34 0.40 0.34 0.31 0.33 0.39
ukr 0.19 0.33 0.34 0.00 0.37 0.35 0.43 0.38 0.27
pol 0.33 0.42 0.40 0.37 0.00 0.31 0.49 0.29 0.37
ces 0.35 0.34 0.34 0.35 0.31 0.00 0.40 0.17 0.40
bul 0.41 0.29 0.31 0.43 0.49 0.40 0.00 0.36 0.39
slk 0.32 0.31 0.33 0.38 0.29 0.17 0.36 0.00 0.36
rus 0.20 0.42 0.39 0.27 0.37 0.40 0.39 0.36 0.00
33
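As an illustration of the clustering step, the distance matrix above can be fed directly to an off-the-shelf UPGMA (average-linkage) implementation; neighbor joining requires a dedicated phylogenetics library and is not shown here.

```python
# UPGMA clustering over the pairwise prediction distances (scipy 'average' linkage).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

languages = ["bel", "slv", "hrv", "ukr", "pol", "ces", "bul", "slk", "rus"]
distances = np.array([
    [0.00, 0.36, 0.37, 0.19, 0.33, 0.35, 0.41, 0.32, 0.20],
    [0.36, 0.00, 0.20, 0.33, 0.42, 0.34, 0.29, 0.31, 0.42],
    [0.37, 0.20, 0.00, 0.34, 0.40, 0.34, 0.31, 0.33, 0.39],
    [0.19, 0.33, 0.34, 0.00, 0.37, 0.35, 0.43, 0.38, 0.27],
    [0.33, 0.42, 0.40, 0.37, 0.00, 0.31, 0.49, 0.29, 0.37],
    [0.35, 0.34, 0.34, 0.35, 0.31, 0.00, 0.40, 0.17, 0.40],
    [0.41, 0.29, 0.31, 0.43, 0.49, 0.40, 0.00, 0.36, 0.39],
    [0.32, 0.31, 0.33, 0.38, 0.29, 0.17, 0.36, 0.00, 0.36],
    [0.20, 0.42, 0.39, 0.27, 0.37, 0.40, 0.39, 0.36, 0.00],
])

# UPGMA corresponds to 'average' linkage on the condensed distance matrix.
tree = linkage(squareform(distances), method="average")
dendrogram(tree, labels=languages, no_plot=True)   # or plot with matplotlib
```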
Phylogenetic tree reconstruction: best-performing
[Inferred tree: bul, slv, hrv, rus, bel, ukr, pol, ces, slk]
Structured perceptron, one-hot encoding, UPGMA clustering. Quartet distance: 0.047619.
[Reference tree: bel, rus, ukr, hrv, slv, bul, ces, slk, pol]
Glottolog reference tree
34
Phylogenetic tree reconstruction: worst-performing
[Inferred tree: bul, ces, slk, slv, hrv, pol, rus, bel, ukr]
RNN encoder-decoder, one-hot encoding, UPGMA clustering. Quartet distance: 0.31746.
[Reference tree: bel, rus, ukr, hrv, slv, bul, ces, slk, pol]
Glottolog reference tree
35
Identification of sound correspondences: NL-DE
Source       Target      Prediction   Distance
blut         blut        blut         0.00
inslap3-     ainSlaf3n   inSlaun      0.33
blot         blat        blat         0.00
wExan        vEge3n      vag3n        0.33
xlot         glat        glat         0.00
warhEit      vaahait     vaahait      0.00
orbEit       aabait      oabait       0.17
mElk3-       mElk3n      mEl3n        0.17
vostbind3-   anbind3n    fostaiN3n    0.78
hak          hak3n       hak          0.40
stEl3-       StEl3n      Staln        0.33
hust3-       hust3n      hiSta-       0.67
xord3l       giat3l      goad3l       0.33
36
Identification of sound correspondences: NL-DE
Most frequent substitutions, using Needleman-Wunsch alignment (Needleman and Wunsch, 1970)

Substitution   Source-prediction frequency   Source-target frequency
o a            21                            13
r a            14                            13
s S            14                            8
v f            12                            10
E a            12                            9
3 n            10                            1
r G            9                             12
x g            9                             10
w v            8                             9
- n            7                             28
3 a            4                             1
i n            4
37
Cognate detection
Determine if words are ancestrally related (cognate)
Create a clustering per concept (letters denote cognate classes)

          hand      tail         tree       flower
English   hand B    tail L       tree C     flower B
German    Hand B    Schwanz N    Baum B     Blume B
Dutch     hand B    staart B     boom B     bloem B
French    main E    queue F      arbre F    fleur B
38
Cognate detection
Perform cognate detection by clustering on prediction
distances per word
Clustering algorithms: flat UPGMA (Sokal and Michener, 1958),
MCL (van Dongen, 2000)
Ground truth: IELex cognate judgments (Dunn, 2012)
Similarity measure between clusterings: B-Cubed F (Bagga and
Baldwin, 1998)
39
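For reference, a small sketch of the B-Cubed F computation: per-item precision and recall over cluster/class overlaps, averaged, then combined into an F score. The word labels in the usage example are invented.

```python
# B-Cubed F score between a predicted cognate clustering and the gold cognate classes.
def b_cubed_f(predicted, gold):
    """predicted, gold: dicts mapping each word to a cluster/class label."""
    items = list(predicted)
    precisions, recalls = [], []
    for w in items:
        same_cluster = [v for v in items if predicted[v] == predicted[w]]
        same_class = [v for v in items if gold[v] == gold[w]]
        overlap = len([v for v in same_cluster if gold[v] == gold[w]])
        precisions.append(overlap / len(same_cluster))
        recalls.append(overlap / len(same_class))
    p = sum(precisions) / len(items)
    r = sum(recalls) / len(items)
    return 2 * p * r / (p + r)

pred = {"hand_en": 1, "hand_de": 1, "hand_nl": 1, "main_fr": 2}
gold = {"hand_en": "B", "hand_de": "B", "hand_nl": "B", "main_fr": "E"}
print(b_cubed_f(pred, gold))   # 1.0: the clustering matches the gold classes exactly
```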
Cognate detection
Model                        Clustering algorithm   B-Cubed F (Slavic)   B-Cubed F (Germanic)
Encoder-decoder              MCL                    0.8000               0.8954
Encoder-decoder              fUPGMA                 0.8983               0.8611
Structured perceptron        MCL                    0.8775               0.9321
Structured perceptron        fUPGMA                 0.9197               0.8898
Source prediction baseline   MCL                    0.9208               0.8518
Source prediction baseline   fUPGMA                 0.9298               0.8787
40
Phylogenetic word prediction
Phylogenetic word prediction
Until now: language pairs predicted separately
For the languages Dutch, German, English:
• Dutch → German
• Dutch → English
• German → Dutch
• German → English
• English → Dutch
• English → German
But no information is shared between language pairs!
41
Phylogenetic word prediction
Phylogenetic word prediction: assume a phylogenetic tree
Neural network architecture where weights are shared; see the sketch below
[Diagram: encoders for languages A and B (Enc A, Enc B) feeding into a shared decoder (Dec C)]
42
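One way to realize this weight sharing is sketched below in PyTorch: separate encoders per input language, but a single decoder whose parameters are shared across language pairs. This is an interpretation of the diagram, not the thesis's exact architecture.

```python
# Weight sharing for phylogenetic word prediction: language-specific encoders,
# one shared decoder, so information flows between pairs through shared parameters.
import torch
import torch.nn as nn

class PhyloWordPredictor(nn.Module):
    def __init__(self, languages, n_symbols=41, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, hidden)
        # One encoder per input language...
        self.encoders = nn.ModuleDict({lang: nn.GRU(hidden, hidden, batch_first=True)
                                       for lang in languages})
        # ...but a single shared decoder and output layer (the "Dec C" of the diagram).
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_symbols)

    def forward(self, language, source, target_len):
        _, context = self.encoders[language](self.embed(source))
        dec_in = context.transpose(0, 1).repeat(1, target_len, 1)
        dec_out, _ = self.decoder(dec_in, context)
        return self.out(dec_out)

model = PhyloWordPredictor(["nld", "deu", "eng"])
logits_from_nld = model("nld", torch.randint(0, 41, (1, 4)), target_len=4)
logits_from_eng = model("eng", torch.randint(0, 41, (1, 4)), target_len=4)
```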
Phylogenetic word prediction
Ultimate goal: integrated word prediction and phylogenetic tree reconstruction.
Pick the tree with the highest prediction score.
[Tree figure: Icelandic, Dutch, Danish, German, Norwegian, Swedish, English. False. Prediction score: 0.8]
43
Phylogenetic word prediction
[Tree figure: Icelandic, Norwegian, Danish, Swedish, German, Dutch, English. True. Prediction score: 0.4]
44
Experiments
Pilot study on Dutch, German, English
Does the right tree give better performance than a false tree?
Protoform reconstruction
[Diagram: encoders for languages A and B feeding into a shared decoder C]
45
Experiments
Model                                       Pearson correlation
PhylNet, reference tree ((nld,deu),eng)     0.8945
PhylNet, false tree ((nld,eng),deu)         0.7546
PhylNet, false tree ((deu,eng),nld)         0.8899
46
Protoform reconstruction
Input (Dutch) Protoform
blut oi
inslap3 i33n
blot no
wExan bol
xlot no
warhEit fol
orbEit st3
mElk3 oi
vostbind3 s3rn
hak po3
(a) Protoforms for
Proto-Franconian
(ancestor of Dutch and German).
Input (Dutch) Protoform
blut kik
inslap3 op
blot silEm
wExan doi
xlot Ekkom
warhEit sE
orbEit sk3r3
mElk3 kutaa
vostbind3 sorl
hak on
(b) Protoforms for Proto-West
Germanic
(ancestor of Dutch, German and
English).
47
Conclusion and discussion
Conclusion
Machine learning paradigm applicable to several tasks in
historical linguistics:
Phylogenetic tree reconstruction
Identification of sound correspondences
Cognate detection
Results on par with baseline models
Phylogenetic word prediction opens up new possibilities:
Protoform reconstruction
Optimize tree during prediction
Proposed solutions for key issues:
Numerical representation of phonemes: embedding encoding
Learning less from non-cognates: cognacy prior
48
References
Avendi, M., Kheradvar, A., and Jafarkhani, H. (2016). A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Medical Image Analysis, 30:108–119.
Bagga, A. and Baldwin, B. (1998). Entity-based cross-document coreferencing using
the vector space model. In Proceedings of the 17th international conference on
Computational linguistics-Volume 1, pages 79–85. Association for Computational
Linguistics.
Beinborn, L., Zesch, T., and Gurevych, I. (2013). Cognate production using
character-based machine translation. In IJCNLP, pages 883–891.
Bouckaert, R., Lemey, P., Dunn, M., Greenhill, S. J., Alekseyenko, A. V., Drummond,
A. J., Gray, R. D., Suchard, M. A., and Atkinson, Q. D. (2012). Mapping the
origins and expansion of the indo-european language family. Science,
337(6097):957–960.
49
Brown, C. H., Holman, E. W., Wichmann, S., and Velupillai, V. (2008). Automated classification of the world's languages: a description of the method and preliminary results. STUF – Language Typology and Universals / Sprachtypologie und Universalienforschung, 61(4):285–308.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
Church, K. W. and Hanks, P. (1990). Word association norms, mutual information,
and lexicography. Computational linguistics, 16(1):22–29.
Ciobanu, A. M. (2016). Sequence labeling for cognate production. Procedia
Computer Science, 96:1391–1399.
Dellert, J. and Jäger, G., editors (2017). NorthEuraLex (version 0.9).
Dunn, M. (2012). Indo-European lexical cognacy database (IELex). URL: http://ielex.mpi.nl.
Firth, J. R. (1957). A synopsis of linguistic theory, 1930-1955. Studies in linguistic
analysis.
50
Hammarström, H., Bank, S., Forkel, R., and Haspelmath, M. (2017). Glottolog 3.1.
Max Planck Institute for the Science of Human History.
Jäger, G. and List, J.-M. (2016). Statistical and computational elaborations of the
classical comparative method.
Jäger, G., List, J.-M., and Sofroniev, P. (2017). Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions,
and reversals. In Soviet physics doklady, volume 10, pages 707–710.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed
representations of words and phrases and their compositionality. In Advances in
neural information processing systems, pages 3111–3119.
Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453.
Osthoff, H. and Brugmann, K. (1880). Morphologische Untersuchungen auf dem Gebiete der indogermanischen Sprachen, volume 3. S. Hirzel.
51
Pennington, J., Socher, R., and Manning, C. (2014). Glove: Global vectors for word
representation. In Proceedings of the 2014 conference on Empirical Methods in
Natural Language Processing (EMNLP), pages 1532–1543.
Rosenbla, F. (1958). The perceptron: A probabilistic model for information storage
and organization in the brain. Psychological review, 65(6):386.
Saitou, N. and Nei, M. (1987). The neighbor-joining method: A new method for
reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425.
Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian
detection with unsupervised multi-stage feature learning. In Proceedings of the
IEEE Conference on Computer Vision and Paern Recognition, pages 3626–3633.
Sokal, R. R. and Michener, C. D. (1958). A statistical method for evaluating
systematic relationships. University of Kansas Scientific Bulletin, 28:1409–1438.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with
neural networks. In Advances in neural information processing systems, pages
3104–3112.
van Dongen, S. M. (2000). Graph clustering by flow simulation. PhD thesis, University of Utrecht.
52
Wieling, M., Prokić, J., and Nerbonne, J. (2009). Evaluating the pairwise string
alignment of pronunciations. In Proceedings of the EACL 2009 workshop on
language technology and resources for cultural heritage, social sciences, humanities,
and education, pages 26–34. Association for Computational Linguistics.
53
Related work: cognate production
Beinborn et al. (2013): cognate production using character-based machine translation
Ciobanu (2016): sequence labeling for cognate production
54
Context vector analysis
[Figure: context vectors]
55
Context vector analysis
[Figure: input words, one-hot encoding]
56
Context vector analysis
[Figure: target words, one-hot encoding]
57