Fast and unsupervised methods for multilingual
cognate clustering
Taraka Rama, Johannes Wahle, Pavel Sofroniev, and Gerhard Jäger
University of Tübingen
Abstract
In this paper we explore the use of unsupervised methods for detecting cognates in multilingual word lists. We use online EM to train sound segment similarity weights for computing the similarity between two words. We tested our online systems on sixteen geographically spread language groups of the world and show that the Online PMI (Pointwise Mutual Information) system outperforms an HMM-based system and two linguistically motivated systems, LexStat and ALINE. Our results suggest that a PMI system trained in an online fashion can be used by historical linguists for fast and accurate identification of cognates in not-so-well-studied language families.
1 Introduction
Cognates are genetically related words that can be traced back to a common word in a language that is no longer spoken. For example, English nail and German Nagel are cognates that can be traced back to Proto-Indo-European *h3enogh. Accurate identification of cognates is important for inferring the internal structure of a language family.
Recent years have seen a surge in the number of publications in the field of computational historical linguistics, due to the availability of word lists for a large number of the world's languages [Brown et al., 2013]¹ and of cognate databases for Austronesian [Greenhill and Gray, 2009] and Indo-European [Bouckaert et al., 2012].

¹ Known as the Automated Similarity Judgment Program (ASJP). http://asjp.clld.org/
The availability of word lists (without cognate judgments) has allowed scholars like Rama and Borin [2015] and Jäger [2015] to experiment with different weighted string similarity measures for the purpose of inferring the family trees of the world's languages, without explicit cognate identification. On the other hand, List [2012] proposed a cognate clustering system that combines handcrafted weighted string similarity measures and permutation tests for the purpose of automated cognate identification. In a different approach, Hauer and Kondrak [2011] experimented
with linear classifiers like SVMs for the purpose of identifying cognate clusters. Finally, Rama [2015] uses string-kernel-inspired features for training a linear SVM classifier for pairwise cognate identification. As noted by Hauer and Kondrak [2011], a reliable multilingual cognate identification system could be used to supply cognate judgments as input to the phylogenetic inference algorithms introduced by Gray and Atkinson [2003] and the reconstruction methods of Bouchard-Côté et al. [2013].²
The phylogenetic inference methods require cognate judgments, which are only available for a small number of well-studied language families such as Indo-European and Austronesian. For instance, the ASJP database provides Swadesh word lists (of length 40, covering meanings resistant to lexical replacement and borrowing) transcribed in a uniform format for more than 60% of the world's languages. However, the cognacy judgments are only available for a subset of language families. An example of such a word list is given in table 1.
Language   ALL    AND   ANIMAL   ...
English    ol     End   Enim3l   ...
German     al3    unt   tia      ...
French     tu     e     animal   ...
Spanish    to8o   i     animal   ...
Swedish    ala    ok    y3r      ...

Table 1: Example of a word list for five languages belonging to the Germanic (English, German, and Swedish) and Romance (Spanish and French) subfamilies, transcribed in the ASJP alphabet.
The task at hand is to automatically cluster words that show a genealogical relationship. This is achieved by computing similarities between all the word pairs belonging to a meaning and then supplying the resulting distance matrix as input to a clustering algorithm. The clustering algorithm groups the words into clusters by optimizing a similarity criterion. The similarity between a word pair can be computed using supervised approaches [Hauer and Kondrak, 2011] or by using sequence alignment algorithms such as Needleman-Wunsch [Needleman and Wunsch, 1970] or Levenshtein distance [Levenshtein, 1966].
In dialectometry, Wieling et al. [2007] compared the Pair Hidden Markov Model (PHMM) [Mackay and Kondrak, 2005] and pointwise mutual information (PMI) [Church and Hanks, 1990] weighted Levenshtein distance for Dutch dialect comparison. In historical linguistics, Jäger [2013] developed a PMI-based method for computing string similarity using the ASJP database. In this paper, we apply online algorithms to train our PMI and PHMM systems for the purpose of computing word similarity.
² The cognate clustering system in Bouchard-Côté et al. [2013] requires the tree structure of the language family to be known beforehand. This is not a practical assumption, since the tree structure of many language families of the world is not known.
We train our PHMM and PMI systems in different settings and test them on sixteen different language groups of the world. Our results show that online training can perform better than a linguistically well-informed system known as LexStat [List, 2012]. Moreover, the online algorithms allow our systems to be trained in a few minutes while giving accuracies similar to the batch-trained systems of Jäger [2013].
The paper is organized as follows. We discuss the relevant work in section 2.
We describe the PMI and PHMM models in section 3. The Online EM procedure
is described in section 4. We describe the clustering algorithm in section 5. We
discuss the experimental settings and motivation behind our choices in section 6.
We present and discuss the results of our experiments in section 7. We discuss the
effect of different model parameters in section 8. Finally, we conclude the paper in
section 9.
2 Related work
Kondrak [2000] introduced a dynamic programming algorithm for computing the similarity between two sequences based on the articulatory phonetic features determined by Ladefoged [1975], and evaluated it on a list of English-Latin cognates. In this paper, we evaluate on the Indo-European dataset, which includes English and Latin.
Hauer and Kondrak [2011] trained a linear SVM on word similarity features and used the model to assign a similarity score to each word pair. For each meaning, a word pair distance matrix is computed and supplied to the average linkage clustering algorithm for inferring cognate clusters. The authors observe that the SVM-trained system performs better than a baseline that judges the similarity of two words based on the identity of their first two consonants.
List [2012] introduced a system known as LexStat (described in section 6) that is sensitive both to segment similarities and to chance similarities due to borrowing or semantic shift. The author tests this system on a number of small datasets (fewer than 20 languages each) for the purpose of cognate identification and reports that the system performs better than Levenshtein distance.
In a recent paper, List et al. [2016] explore the use of InfoMap [Rosvall and Bergstrom, 2008] for the detection of partial cognates in subgroups of the Sino-Tibetan language family. The authors compare the performance of average linkage clustering against InfoMap and find that InfoMap performs better.
The works listed above test similar datasets under different experimental settings. For instance, Hauer and Kondrak [2011] trained and tested on a subset of the language families provided by Wichmann and Holman [2013]. At the same time, to the best of our knowledge, the LexStat system has not been evaluated on all the available language families. Moreover, PMI-LANG [Jäger, 2013] has not been evaluated on the task of unsupervised cognate clustering.
3 Models
In this section, we briefly describe the PMI-weighted Needleman-Wunsch algorithm and the Pair Hidden Markov Model (PHMM).
3.1 PMI-weighted alignment
The vanilla Needleman-Wunsch (VNW) algorithm is the similarity counterpart of the Levenshtein distance: it maximizes similarity whereas Levenshtein distance minimizes distance. In VNW, a character or sound segment match increases the similarity by +1 and a character mismatch has a weight of −1. In contrast to Levenshtein distance, which treats insertion, deletion, and substitution equally, VNW introduces a gap opening (deletion operation) penalty parameter that has to be set separately. A second parameter, known as the gap extension penalty, has a lesser or equal penalty than the gap opening parameter and models the fact that deletions occur in chunks [Jäger, 2013].

VNW is not sensitive to segment pairs, but a realistic algorithm should assign a higher similarity score to sound correspondences such as /l/ ~ /r/ than to correspondences such as /p/ ~ /r/. The weighted Needleman-Wunsch algorithm requires a similarity score for each pair of segments, and it finds the alignment(s) between two input strings maximizing the sum of the pairwise similarities of matched segment pairs.
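For concreteness, here is a minimal sketch of a similarity-weighted Needleman-Wunsch alignment in Python. It is our own illustration rather than the authors' implementation, and it simplifies the gap model to a single linear penalty, whereas the systems described in this paper distinguish gap opening from gap extension:

    from itertools import product

    def needleman_wunsch(w1, w2, sim, gap=-1.0):
        """Global alignment score of two segment sequences under a
        segment-pair similarity `sim` (a dict mapping (a, b) to a weight).
        Simplified sketch: one linear gap penalty, whereas the systems in
        this paper use separate gap opening and extension penalties."""
        n, m = len(w1), len(w2)
        # dp[i][j] = best score for aligning w1[:i] with w2[:j]
        dp = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            dp[i][0] = dp[i - 1][0] + gap
        for j in range(1, m + 1):
            dp[0][j] = dp[0][j - 1] + gap
        for i, j in product(range(1, n + 1), range(1, m + 1)):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim.get((w1[i - 1], w2[j - 1]), 0.0),
                dp[i - 1][j] + gap,  # gap in w2
                dp[i][j - 1] + gap,  # gap in w1
            )
        return dp[n][m]

Passing a similarity of +1 for identical segments and −1 for all others recovers the vanilla algorithm.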
In computational historical linguistics, the similarity between two segments is estimated using PMI. The PMI score for two sounds i and j is defined as follows:

    \mathrm{PMI}(i, j) = \log \frac{p(i, j)}{q(i)\, q(j)}    (1)

where p(i, j) is the probability of i and j being matched in a pair of cognate words, and q(i) is the probability that an arbitrarily chosen segment in an arbitrarily chosen word equals i. A positive PMI value between i and j indicates that the probability of i being aligned with j in a pair of cognates is higher than what would be expected by chance. Conversely, a negative PMI value indicates that an alignment of i with j is more likely the result of chance than of shared inheritance.

We estimated PMI scores from raw data, largely following the method described in Jäger [2013].
The whole training procedure can be described as follows:

1. Extract a set of word pairs that are probably cognate using a suitable heuristic. In this paper, we treat all word pairs belonging to the same meaning with a length-normalized Levenshtein distance (LDN) below 0.5 as probable cognates.³

2. Align the list of probable cognates using the vanilla Needleman-Wunsch algorithm.

3. Extract the aligned segment pairs and compute the PMI value for each segment pair using equation 1, estimating probabilities as relative frequencies.

4. Generate a new set of alignments using the Needleman-Wunsch algorithm and the segment weights learned in step 3. For the gap penalties we used the values proposed in Jäger [2013].

5. Iterate steps 3 and 4 until the average similarity between two successive iterations stops changing.

³ We experimented with LDN cutoffs of 0.25 and 0.75 and found that the results are best for a cutoff of 0.5.
This procedure yields a PMI-based similarity score for each word pair. We convert the similarity score x into a distance score using the sigmoid transformation 1.0 − (1 + exp(−x))^{−1}, which maps the PMI similarity score into the range [0, 1].
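A sketch of steps 3–5 and of the distance conversion, reusing the needleman_wunsch function sketched above, might look as follows. It is illustrative only: recovering the matched segment pairs requires an alignment traceback that the score-only sketch omits, and estimating q(i) from the aligned material rather than from arbitrary word positions is a simplification of equation 1:

    import math
    from collections import Counter

    def pmi_scores(alignments):
        """Estimate PMI(i, j) from a list of alignments, each given as a
        list of matched segment pairs (gaps excluded), with probabilities
        taken as relative frequencies (step 3)."""
        pair_counts, seg_counts = Counter(), Counter()
        for alignment in alignments:
            for a, b in alignment:
                pair_counts[(a, b)] += 1
                seg_counts[a] += 1
                seg_counts[b] += 1
        n_pairs = sum(pair_counts.values())
        n_segs = sum(seg_counts.values())
        return {(a, b): math.log((c / n_pairs) /
                                 ((seg_counts[a] / n_segs) *
                                  (seg_counts[b] / n_segs)))
                for (a, b), c in pair_counts.items()}

    def pmi_distance(word1, word2, sim):
        """Sigmoid transform of the PMI similarity into a [0, 1] distance:
        strongly similar pairs end up near 0, dissimilar ones near 1."""
        x = needleman_wunsch(word1, word2, sim)  # sketch above
        return 1.0 - 1.0 / (1.0 + math.exp(-x))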
3.2 Pair Hidden Markov Model
The Pair Hidden Markov Model was first proposed in the context of computational biology as a tool for the comparison of DNA or protein sequences [Durbin et al., 2001].
A Pair Hidden Markov Model (PHMM) uses two output streams instead of a single one, one for each of the two sequences being aligned. In its simplest version, a PHMM consists of five states: a begin state, an end state, a match state (M) that emits pairs of symbols, a deletion state (X) that emits a symbol in the first string and a gap in the second string, and an insertion state (Y) that emits a gap in the first string and a symbol in the second string (cf. figure 1).
The PHMMs, as used in historical linguistics, differ from their biological counterpart in the following regards:

The historical linguistic PHMM allows a transition between the states X and Y (dashed line, figure 1). An alignment such as that of Italian due and Spanish dos 'two' cannot be generated by a PHMM without the transition between X and Y [Mackay and Kondrak, 2005]:
due-
do-s
Another difference between the biological and the linguistic PHMM is the split of the parameter for the transition into the end state. Whilst the original version has only one parameter for this purpose, the linguistic PHMM makes use of two different probabilities, τM and τXY. This split of parameters enables the model to distinguish between the match state (M) and the gap states (X, Y) as the final emitting state (see figure 1). This modification preserves the symmetry of the model, while allowing a little more freedom.

[Figure 1: Pair Hidden Markov Model as proposed by Mackay and Kondrak [2005]. The states of the model are depicted as circles; the arrows show the possible transitions between the states. ⟨δ, ε, λ, τM, τXY⟩ represent the transition probabilities.]
The PHMMs are trained using the Baum-Welch expectation maximization algorithm [Durbin et al., 2001]. The best alignment between two sequences x and y is determined using the Viterbi algorithm.
The probability of two sequences x and y, of lengths m and n respectively, evolving independently under a null model R is given by equation 2:

    P(x, y \mid R) = \iota^2 (1 - \iota)^{n+m} \prod_{i=1}^{m} f_{x_i} \prod_{j=1}^{n} f_{y_j}    (2)

where f_{x_i} is the equilibrium frequency of the sound at position i in sequence x, and \iota = \frac{1}{(m+n)/2 + 1}.
The probability of relatedness between x and y is computed as the logarithmic ratio of the probability scores P(x, y | μ) and P(x, y | R), where μ is the trained model and R is the null model.

We employ the same sigmoid transformation as for PMI to convert the similarity score (computed under a PHMM) into a distance score.
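As an illustration, equation 2 transcribes directly into Python, working in log space to avoid numerical underflow (the function and argument names are ours):

    import math

    def null_log_prob(x, y, freq):
        """Log probability of sequences x and y being generated
        independently under the null model R of equation 2; `freq`
        maps each sound to its equilibrium frequency."""
        m, n = len(x), len(y)
        iota = 1.0 / ((m + n) / 2.0 + 1.0)
        log_p = 2.0 * math.log(iota) + (n + m) * math.log(1.0 - iota)
        log_p += sum(math.log(freq[s]) for s in x)
        log_p += sum(math.log(freq[s]) for s in y)
        return log_p

The relatedness score is then log P(x, y | μ) − log P(x, y | R), with the numerator computed under the trained PHMM μ.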
4 Online EM
The Expectation Maximization (EM) algorithm is widely used in computational linguistics for the purposes of word alignment, document classification, and word segmentation. The EM algorithm starts with an initial setting of the model parameters and uses those parameters to realign the words in a sentence pair. The model parameters are then reestimated using the word alignments obtained from the previous iteration. The EM algorithm reestimates the model parameters after each full scan of the training data.
Liang and Klein [2009] observe that the batch training procedure can lead to slow convergence. As a matter of fact, Jäger [2013] trains his PMI system using standard EM (also known as batch EM), which updates the parameters in a PMI scoring matrix only after aligning all the word pairs. In contrast, Online EM [Liang and Klein, 2009] updates the model parameters after aligning a subset of the word pairs (known as a minibatch in the online learning literature).

The Online EM algorithm combines the parameters s estimated in the current update step k with the previous parameters θ_{k−1} using the following equation:

    \theta_k = (1 - \eta_k)\,\theta_{k-1} + \eta_k s    (3)

where η_k is defined as η_k = (k + 2)^{−α}.

In the case of PMI, θ consists of the PMI scores for all segment pairs. The parameter η_k determines how fast the updates from the previous steps are forgotten or remembered. The parameter α lies in the range 0.5 ≤ α ≤ 1; a smaller α implies a larger update to the model parameters. The parameter k is related to the minibatch size m (m = ⌈D/k⌉, where D is the size of the training data) and determines the number of updates to be performed. The setting k = 1 recovers batch EM, whereas k = D implies an update for each sample in the training data.
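The update of equation 3 is a per-parameter interpolation. A minimal sketch, assuming the parameters are kept in a dictionary keyed by segment pair (as the PMI scores naturally are):

    def online_em_update(theta, stats, k, alpha):
        """Combine the previous parameters `theta` with the sufficient
        statistics `stats` estimated on the k-th minibatch (equation 3),
        using the stepsize eta_k = (k + 2) ** -alpha."""
        eta = (k + 2) ** (-alpha)
        keys = set(theta) | set(stats)
        return {key: (1.0 - eta) * theta.get(key, 0.0)
                     + eta * stats.get(key, 0.0)
                for key in keys}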
5 Clustering algorithm
The InfoMap clustering method is an information-theoretic approach to detecting community structure within a connected network. The method uses random walks on a network as a proxy for information flow to detect communities, i.e., clusters, without the need for a threshold. A community is a group of nodes with more edges connecting the nodes within the community than connecting them with nodes outside the community [Newman and Girvan, 2004].
In our case, a community corresponds to a set of words which are cognate and have high edge weights between them. The idea behind the algorithm is that a random walk is statistically more likely to spend a long period of time within a community than to switch between communities, due to the structure of the network.

A pair-wise distance matrix is a complete weighted graph, and any edge with a weight above 0.5 corresponds to a PMI score below 0 (due to the sigmoid-based distance transformation). By the definition of PMI, a score below 0 implies that the words might not be cognate. We use this property to prune such edges, yielding a non-complete graph that we supply as input to the InfoMap algorithm.
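A sketch of this construction using the python-igraph package is given below; the choice of 1 − distance as the edge weight is our assumption, as the text only specifies which edges are dropped:

    import igraph

    def infomap_clusters(words, distance):
        """Cluster the words of one meaning: keep only edges whose
        distance is below 0.5 (i.e. whose PMI score is positive),
        weight them by similarity, and run InfoMap on the resulting
        non-complete graph."""
        g = igraph.Graph()
        g.add_vertices(len(words))
        edges, weights = [], []
        for i in range(len(words)):
            for j in range(i + 1, len(words)):
                d = distance(words[i], words[j])
                if d < 0.5:  # prune likely non-cognates
                    edges.append((i, j))
                    weights.append(1.0 - d)
        g.add_edges(edges)
        g.es["weight"] = weights
        return list(g.community_infomap(edge_weights="weight"))

Each community returned by InfoMap is read off as one cognate cluster for the meaning.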
6 Experiments
In this section, we describe the experimental settings, the datasets, the evaluation measures, and the systems we compare against: the baseline, ALINE, PMI-LANG, and LexStat.
6.1 Hyperparameters of Online EM
We determine the best settings of the m and α parameters by searching for m in the range m = 2^s with s ∈ [5, 15], and for α ∈ [0.5, 1.0] with a step size of 0.05. We fix the gap opening and gap extension penalties to −2.5 and −1.75.
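Spelled out, the search is a plain grid loop. In the sketch below, train and evaluate are hypothetical stand-ins for training an online system with the given hyperparameters and scoring its clusters:

    import numpy as np

    def grid_search(train, evaluate, word_pairs):
        """Search m = 2^s for s in [5, 15] and alpha in [0.5, 1.0]
        (step size 0.05), with the gap penalties fixed at -2.5/-1.75;
        returns the best (m, alpha) pair and its score."""
        best = (None, None, float("-inf"))
        for m in (2 ** s for s in range(5, 16)):
            for alpha in np.arange(0.5, 1.001, 0.05):
                system = train(word_pairs, m=m, alpha=alpha,
                               gap_open=-2.5, gap_extend=-1.75)
                score = evaluate(system)
                if score > best[2]:
                    best = (m, alpha, score)
        return best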
6.2 Datasets
6.2.1 Indo-European database
The Indo-European Lexical Database (IELex) was created by Dyen et al. [1992] and is curated by Michael Dunn.⁴ The IELex database is not transcribed in uniform IPA and retains many forms transcribed in the Romanized IPA format of Dyen et al. [1992]. We cleaned the IELex database of any non-IPA-like transcriptions and converted the cleaned subset of the database into the ASJP format.
6.2.2 Austronesian vocabulary database
The Austronesian Vocabulary Database (ABVD) [Greenhill and Gray, 2009] has word lists for 210 Swadesh concepts and 378 languages.⁵ The database does not have transcriptions in a uniform IPA format. We removed all symbols that do not appear in standard IPA and converted the lexical items to the ASJP format. For comparison purposes, we use a randomly selected subset of 100 languages in this paper.⁶
6.2.3 Short word lists with cognacy judgments

Wichmann and Holman [2013] and List [2014a] compiled cognacy wordlists for subsets of families from various scholarly sources such as comparative handbooks and historical linguistics articles. The details of the different databases are given in table 2.
6.3 Evaluation Measures
We evaluate the results of the clustering analysis using the B-cubed F-score [Amigó et al., 2009]. The B-cubed scores are defined for each word belonging to a meaning as follows. The precision for a word is defined as the ratio between the number of cognates in its cluster and the total number of words in its cluster. The recall for a word is defined as the ratio between the number of cognates in its cluster and the total number of expert-labeled cognates. The B-cubed precision and recall are defined as the average of the words' precision and recall across all the clusters. The B-cubed F-score for a meaning is then computed as the harmonic mean of the average precision and recall. Finally, the averaged B-cubed F-score for the whole dataset is computed as the average of the B-cubed F-scores across all the meanings.

⁴ http://ielex.mpi.nl/
⁵ http://language.psy.auckland.ac.nz/austronesian/
⁶ LexStat takes many hours to run on a dataset of 100 languages.

Family             NOM   NOL   AveCC     AveWC
Austronesian       210   100   20.2142   4.1143
Afrasian            40    21    9.5      2.6868
Bai dialects       110     9    2.5909   6.0166
Chinese dialects   179    18    6.8771   5.2635
Huon                84    14    6.3929   2.7672
Indo-European      207    52   12.2126   7.3461
Japanese dialects  200    10    2.3      6.1373
Kadai               40    12    3.225    5.0027
Kamasau             36     8    1.6667   5.3981
Lolo-Burmese        40    15    2.625    7.3121
Mayan              100    30    8.58     6.1521
Miao-Yao            39     6    1.8974   3.9667
Mixe-Zoque         100    10    3        4.6535
Mon-Khmer          100    16    7.75     2.7956
ObUgrian           110    21    2.2     11.8162
Tujia              109     5    1.6422   3.3792

Table 2: Number of meanings (NOM), number of languages (NOL), average number of cognate classes per meaning (AveCC), and average number of words per cognate class (AveWC).
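The B-cubed definitions above translate directly into code. A minimal sketch for a single meaning, assuming dictionaries that map each word to its predicted cluster id and to its expert cognate class id:

    def b_cubed(pred, gold):
        """B-cubed precision, recall, and F-score for one meaning.
        `pred` maps each word to its predicted cluster id; `gold` maps
        each word to its expert cognate class id."""
        precisions, recalls = [], []
        for w in pred:
            cluster = {v for v in pred if pred[v] == pred[w]}
            cognates = {v for v in gold if gold[v] == gold[w]}
            overlap = len(cluster & cognates)
            precisions.append(overlap / len(cluster))
            recalls.append(overlap / len(cognates))
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        return p, r, 2.0 * p * r / (p + r)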
Amigó et al. [2009] show that the B-cubed F-score satisfies four formal constraints: cluster homogeneity, cluster completeness, rag bag (robustness to the misplacement of a true singleton item), and robustness to variation in cluster size. The authors show that cluster evaluation measures based on entropy, such as mutual information and the V-measure [Rosenberg and Hirschberg, 2007], as well as the Rand index, do not satisfy all four constraints. Both Hauer and Kondrak [2011] and List et al. [2016] use B-cubed F-scores to evaluate their cognate clustering systems.
6.4 Comparing systems
Baseline. We adopt the length-normalized Levenshtein distance as the baseline in our experiments.
6.4.1 ALINE
ALINE is a sequence alignment system designed by Kondrak [2000] for computing the similarity between two words by decomposing phonemes into multivalued and binary phonetic features. Each phoneme is decomposed into multivalued features such as place and manner for consonants, and height and backness for vowels. Multivalued features take values on a continuous scale in [0, 1], where the values represent the distance between the places of articulation. The binary features are nasal, voicing, aspirated, and retroflex.

Each feature is weighted by a salience value that is determined manually. The similarity score between two sequences is computed as the sum over the aligned sound segments. Following Downey et al. [2008], we convert ALINE's similarity score s_ab between two words a, b into a distance score using the formula 1.0 − 2 s_ab / (s_aa + s_bb).⁷

⁷ We use the Python implementation provided by Huff and Lonsdale [2011], which is available at https://sourceforge.net/projects/pyaline/.
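In code, this conversion is a one-line normalization by the self-similarities of the two words (identical words map to 0):

    def aline_distance(s_ab, s_aa, s_bb):
        """Downey et al. [2008] normalization of ALINE's similarity
        s(a, b) by the self-similarities of the two words a and b."""
        return 1.0 - 2.0 * s_ab / (s_aa + s_bb)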
6.4.2 PMI-LANG
Jäger [2013] developed a system that learns PMI sound matrices by optimizing a criterion designed to capture language relatedness. The core idea is to tie word similarity to language similarity, such that closely related languages such as English/German tend to show higher similarity than, say, English/Hindi. The language similarity function amounts to maximizing the similarity between probable cognates in order to learn a PMI score matrix. Jäger [2013] applies the learned PMI score matrix to infer phylogenetic trees of language families. However, the learned PMI score matrix has not previously been applied to cognate clustering.
6.4.3 LexStat
LexStat [List, 2012] is part of the LingPy library [List and Forkel, 2016], which offers state-of-the-art alignment algorithms for aligning word pairs and clustering them into cognate sets. We describe the workflow of the LexStat system below:
1. LexStat uses a hand-crafted sound segment matrix, h, to align and score the word pairs for each meaning. Let the similarity of a segment pair i, j be given as h_ij.

2. For each language pair l1, l2, the word pairs belonging to the same meaning are aligned. The frequency with which a segment pair i, j is aligned within the same meaning is given as a_ij.

3. For l1, l2, the words belonging to one of the languages are shuffled and re-aligned using the Needleman-Wunsch algorithm. This procedure is repeated 100 times for each language pair. The average frequency of a segment pair i, j in the reshuffling step is given as e_ij.

4. The parameters h, a, and e are combined according to the following formula to give a new segment similarity score s_ij, where w1 + w2 = 1 (a code sketch follows at the end of this subsection):

    s_{ij} = 2 w_1 \log \frac{a_{ij}}{e_{ij}} + w_2 h_{ij}    (4)

5. The weights s_ij are then used to score word pairs and to cluster the words within a meaning.
The intuition behind step 3 is to reduce the effect of chance similarities between sound segments that can obscure genuine genetic sound correspondences.⁸ We supply the word distances from all the above systems as input to InfoMap to infer cognate clusters.

⁸ We obtained the code from https://github.com/lingpy. We convert the LexStat similarity scores into distance scores using the same formula as for ALINE.
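For concreteness, the combination in equation 4 might be sketched as follows; the default weights are our assumption and, in practice, the counts would need smoothing so that a_ij and e_ij are nonzero:

    import math

    def lexstat_score(a_ij, e_ij, h_ij, w1=0.5, w2=0.5):
        """Combined segment similarity of equation 4: an empirical
        log-odds term (attested vs. shuffled alignment frequencies)
        plus the hand-crafted score h_ij, with w1 + w2 = 1. The
        default weights are illustrative, not LexStat's."""
        return 2.0 * w1 * math.log(a_ij / e_ij) + w2 * h_ij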
7 Results
In this section, we present the results of our experiments. We perform two sets of experiments, training with different datasets, as described below.
Family LDN PMI-LANG Batch PMI Online PMI Batch PHMM Online PHMM LexStat ALINE
Austronesian 0.7175 0.7355 0.6539 0.7364 0.6224 0.6709 0.7173 0.5321
Afrasian 0.7993 0.8133 0.7496 0.8392 0.7213 0.7044 – 0.6442
Bai dialects 0.8348 0.8766 0.8716 0.8774 0.8741 0.8639 0.8417 0.8462
Chinese dialects 0.7687 0.7521 0.7217 0.7803 0.7455 0.7396 0.7815 0.6651
Huon 0.8536 0.8556 0.7518 0.8775 0.7612 0.7437 – 0.6413
Indo-European 0.7367 0.7752 0.7337 0.7812 0.715 0.7126 0.7316 0.6583
Japanese dialects 0.893 0.9031 0.8943 0.9051 0.9006 0.9083 0.8875 0.8699
Kadai 0.7581 0.8175 0.8139 0.8309 0.8 0.8159 – 0.7647
Kamasau 0.9561 0.9850 0.9543 0.9823 0.9605 0.9674 – 0.9479
Lolo-Burmese 0.6469 0.713 0.7862 0.7805 0.7846 0.8218 – 0.8027
Mayan 0.8198 0.7798 0.6958 0.8074 0.6804 0.6797 0.7931 0.627
Miao-Yao 0.6412 0.7003 0.7679 0.7801 0.7411 0.7879 – 0.8426
Mixe-Zoque 0.9055 0.9149 0.8528 0.9209 0.8521 0.8599 0.8656 0.8298
Mon-Khmer 0.7883 0.8209 0.7054 0.8302 0.6921 0.7008 0.7925 0.6472
ObUgrian 0.8623 0.911 0.8987 0.9214 0.8951 0.8874 0.8837 0.8826
Tujia 0.8882 0.9091 0.9018 0.9105 0.895 0.9027 0.8905 0.8757
Average 0.8044 0.8289 0.7971 0.8415 0.7901 0.7955 0.8185 0.7548
Table 3: The B-cubed F-scores of different models on sixteen language groups.
The last row reports the average of the B-cubed F-scores across all the datasets.
The numbers in bold show the highest scores across columns.
7.1 Out-of-family training
In this experiment, we train our PHMM and PMI systems on word lists from the ASJP database belonging to families other than the language groups present in table 2. We made sure that there is no overlap between the languages present in the test dataset and the training dataset. We extracted a list of probable cognates and trained our PMI and PHMM models on this list. We trained all the batch and online systems on 1,151,178 word pairs. The results of our experiments are given in table 3. We report the InfoMap clustering results at a threshold of 0.5 for all the systems. We expect LexStat to perform better in the case of Chinese, since LexStat handles tones internally, whereas the ASJP representation does not. For the online systems, we report the best results over m and α. Following List [2014b], we do not report LexStat results for the language groups whose word lists are shorter than 100 meanings.
Family             PMI m    PMI α    PHMM m    PHMM α
Austronesian 64 0.75 32 0.5
Afrasian 256 0.65 32 0.8
Bai dialects 8192 0.75 32 0.55
Chinese dialects 128 0.95 512 0.6
Huon 32 1 32 0.65
Indo-European 512 0.55 1024 0.5
Japanese dialects 512 0.55 32 0.6
Kadai 2048 0.7 32 0.7
Kamasau 512 0.5 128 0.55
Lolo-Burmese 16384 0.5 32 0.75
Mayan 64 0.5 32 0.55
Miao-Yao 8192 0.95 128 0.7
Mixe-Zoque 256 0.7 32 0.7
Mon-Khmer 256 0.7 32 0.5
ObUgrian 512 0.75 32768 0.5
Tujia 1024 0.65 32 0.5
Table 4: Best settings of m and α for the online variants of PMI and PHMM.
The Online PMI system performs better than the rest of the systems on nine out of the sixteen families. On average, the Online PMI system ranks best, followed by PMI-LANG and LexStat. ALINE performs best on the Miao-Yao language group. The Online PMI system performs better than Batch PMI on all the datasets. As expected, the LexStat system performs best on the Chinese dialect dataset. Surprisingly, despite their complexity, the PHMM systems do not perform as well as the simpler PMI systems.

We now comment on the results for the Austronesian and Indo-European language families. Greenhill [2011] applied Levenshtein distance to the classification of Austronesian languages and argued that Levenshtein distance does not perform well at the task of detecting language relationships. Our experiment shows that Levenshtein distance comes close to LexStat in the case of the Austronesian language family. Both PMI-LANG and Online PMI are two points better than Levenshtein distance at the task of cognate identification.

The results are much clearer in the case of the Indo-European language family. The PMI-LANG and Online PMI systems perform better than the rest of the systems. Levenshtein distance performs better than LexStat for the Indo-European language family.
Family              Training word pairs   Online PHMM (m, α, F-score)   Online PMI (m, α, F-score)   Batch PHMM F-score   Batch PMI F-score
ASJP Indo-European 380769 128 0.60 0.7646 4096 0.60 0.7868 0.7656 0.7704
Indo-European 25386 64 0.50 0.7901 1024 0.85 0.7971 0.7797 0.7914
ASJP Mayan 91665 256 0.55 0.7765 128 0.90 0.8250 0.7814 0.7677
Mayan 11889 32 0.55 0.7952 64 0.70 0.7997 0.7888 0.7544
ASJP Austronesian 1000000 32 0.65 0.6190 128 0.80 0.7453 0.6239 0.6429
Austronesian 84311 32 0.5 0.6709 128 0.80 0.7460 0.6517 0.6509
Table 5: The results of training the PMI and PHMM systems on the ASJP 40 word
lists and the full word lists of Indo-European, Mayan, and Austronesian.
On average, ALINE shows the lowest performance of all the systems. We report the corresponding settings of m and α for all the online systems in table 4. The value of m is quite variable across language families, whereas α tends to lie in the range 0.5–0.75. We investigate the effect of m and α for the Indo-European and Austronesian languages by plotting the results of the Online PMI system in figure 2. The B-cubed F-scores are stable across the range of α but vary with the value of m. The top-3 F-scores for Indo-European are at m = 256, 512, 1024, and at m = 64, 128, 256 for the Austronesian language family. These results suggest that online training helps cognate clustering more than batch training. The plots (cf. figure 2) suggest that a small batch size improves the performance, whereas a large batch size (e.g., 32768) hurts the performance on the Indo-European and Austronesian language families.
[Figure 2: Plots of m and α against B-cubed F-scores for out-of-family training. Panel (a): Indo-European; panel (b): Austronesian. Each panel plots the B-cubed F-score against α ∈ [0.5, 1.0], with one curve per mini-batch size m ∈ {32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768}.]
7.2 Within-family training
In this experiment, we train our PMI and PHMM systems on the three largest language families in our dataset: the Mayan, Indo-European, and Austronesian language families. We train our systems on word pairs extracted from two different sources.
1. The ASJP database has 40-item word lists for roughly three times as many languages as the cognate databases of the Mayan, Indo-European, and Austronesian language families. The database thus gives us access to more word pairs than any other database in existence.

2. We extract lists of probable cognate pairs from the IELex, ABVD, and Mayan language databases.
The motivation behind these experiments is to investigate the performance of the PMI and PHMM systems when trained on word lists belonging to the same language family but compiled by a different group of annotators. A successful experiment would indicate that this approach of training a PMI matrix on ASJP 40-item word lists can be applied to language families that have longer word lists but no cognate judgments. The number of training word pairs and the results of our experiments are given in table 5.
The online variants perform better than the batch systems across all the language families and settings. Online PMI performs better than Batch PMI across all the language families. An Online PMI system trained on the ASJP word lists of a language family comes close in performance to an Online PMI system trained within the language family in the case of Indo-European and Austronesian. The performance of the batch PMI system comes close to that of the Online PMI system in the case of Indo-European, but falls behind in the case of the other language families. Training the online system on ASJP word lists improves the performance in the case of the Mayan language family; this is not observed for the Indo-European and Austronesian language families. The reason could lie in the sources from which the datasets originate.

The batch PMI/PHMM systems perform better than LexStat on the Indo-European and Mayan language families. The Online PHMM system comes close in performance to the Online PMI system in the case of the Indo-European and Mayan language families. The PHMM systems show the lowest performance on the Austronesian language family. Except for Indo-European, the best batch sizes for the Online PMI system are small, typically at most 256.
8 Discussion
In this section, we discuss the effects of various parameters on our results.
8.1 Effect of m and α
Throughout our experiments, we observe that a low minibatch size gives better results than a large minibatch size. We also observe that an intermediate value of α is usually sufficient for obtaining the best results.
Figure 2 shows that small values of m yield stable F-scores across the range of α. Small values of m typically give better results than large values of m. In contrast to other NLP tasks, which require a large m and a smaller α, the task of aligning two words requires smaller values of m. A small value of m implies a large number of updates, which is important for a task where the average sequence length is about 5 and the number of word pairs is less than 100,000. Further, an intermediate value of α controls the amount of memory retained at each update.
8.2 Speed
One advantage of our online systems (either PMI or PHMM) is that the training time is typically in the range of 10 minutes on a single thread of an i7-6700 processor. In the case of the PHMM, online training speeds up convergence and typically yields better results than the batch variant. In comparison, the PMI-LANG system takes days to train. Finally, our results show that the online algorithms can yield better performance than LexStat. LexStat and PHMM take more than 5 hours to test on the 100-language subset of the Austronesian language family; in contrast, PMI (both online and batch) takes less than 10 minutes for each value of m and α in the case of out-of-family training. We also observe that 5 scans over the full data were sufficient for convergence.
8.3 Analyzing PHMM’s performance
Although PHMMs are the most complex among the tested models, their performance is not as good as that of the conceptually simpler PMI models. This lack of performance could be due to a structural characteristic of the PHMM: the transition probability from the begin state to the match or gap states is the same as the transition probability from the match state to either gap state or to itself (figure 1). Although desirable for biological purposes, this poses a big problem for linguistic applications. Starting an alignment with a match is more likely than starting with a gap.⁹ Therefore, the alignments generated by PHMMs are more likely to show gaps at the end of a string than at the beginning. This causes problems for datasets in which word lengths differ a lot: the PHMM performs worst on those datasets that show a huge difference in word length. On the other hand, for Kamasau and Tujia – the two datasets with the best performance – the difference in word length is much less pronounced (cf. figure 3).

Based on the results of these experiments, we propose that PMI-based segment scores trained in an online fashion and supplied to InfoMap clustering can yield reliable cognate judgments.

⁹ 1 − 2δ − τM is larger than δ in all models (cf. figure 1).
[Figure 3: Distribution of the average word-length differences across concepts.]
9 Conclusion
In this paper, we evaluated the performance of various sequence alignment algorithms – both learned and linguistically designed – for the task of cognate detection across different language families. We find that training PMI and PHMM systems in an online fashion speeds up convergence and yields comparable or better results than the batch variants and the state-of-the-art LexStat system. The Online PMI system shows the best performance across different language families. In conclusion, PMI systems can be trained quickly in an online fashion and yield better accuracies than the current state-of-the-art systems.
References
Enrique Amigó, Julio Gonzalo, Javier Artiles, and Felisa Verdejo. A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval, 12(4):461–486, 2009.
Alexandre Bouchard-Côté, David Hall, Thomas L. Griffiths, and Dan Klein. Automated reconstruction of ancient languages using probabilistic models of sound change. Proceedings of the National Academy of Sciences, 110(11):4224–4229, 2013. doi: 10.1073/pnas.1204678110. URL http://www.pnas.org/content/early/2013/02/05/1204678110.abstract.
Remco Bouckaert, Philippe Lemey, Michael Dunn, Simon J. Greenhill, Alexan-
der V. Alekseyenko, Alexei J. Drummond, Russell D. Gray, Marc A. Suchard,
and Quentin D. Atkinson. Mapping the origins and expansion of the Indo-
European language family. Science, 337(6097):957–960, 2012.
Cecil H. Brown, Eric W. Holman, and Søren Wichmann. Sound correspondences
in the world’s languages. Language, 89(1):4–29, 2013.
Kenneth Ward Church and Patrick Hanks. Word association norms, mutual infor-
mation, and lexicography. Computational Linguistics, 16(1):22–29, 1990. ISSN
0891-2017.
Sean S. Downey, Brian Hallmark, Murray P. Cox, Peter Norquest, and J. Stephen Lansing. Computational feature-sensitive reconstruction of language relationships: Developing the ALINE distance for comparative historical linguistic reconstruction. Journal of Quantitative Linguistics, 15(4):340–369, 2008.
Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, repr. edition, 2001.
Isidore Dyen, Joseph B. Kruskal, and Paul Black. An Indo-European classification: A lexicostatistical experiment. Transactions of the American Philosophical Society, 82(5):1–132, 1992.
Russell D. Gray and Quentin D. Atkinson. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426(6965):435–439, 2003.
Simon J Greenhill. Levenshtein distances fail to identify language relationships
accurately. Computational Linguistics, 37(4):689–698, 2011.
Simon J. Greenhill and Russell D. Gray. Austronesian language phylogenies: Myths and misconceptions about Bayesian computational methods. In Austronesian Historical Linguistics and Culture History: A Festschrift for Robert Blust, pages 375–397, 2009.
Bradley Hauer and Grzegorz Kondrak. Clustering semantically equivalent words
into cognate sets in multilingual lists. In Proceedings of the 5th International
Joint Conference on Natural Language Processing, pages 865–873, 2011.
Paul Huff and Deryle Lonsdale. Positing language relationships using ALINE. Language Dynamics and Change, 1(1):128–162, 2011.
Gerhard Jäger. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. Language Dynamics and Change, 3(2):245–291, 2013.
Gerhard Jäger. Support for linguistic macrofamilies from weighted sequence alignment. Proceedings of the National Academy of Sciences, 112(41):12752–12757, 2015. doi: 10.1073/pnas.1500331112.
Grzegorz Kondrak. A new algorithm for the alignment of phonetic sequences. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 288–295. Association for Computational Linguistics, 2000.
Peter Ladefoged. A Course in Phonetics. Harcourt Brace Jovanovich, New York, 1975.
V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and
reversals. Soviet Physics Doklady, 10(8):707–710, 1966.
Percy Liang and Dan Klein. Online EM for unsupervised models. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, NAACL '09, pages 611–619, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics. ISBN 978-1-932432-41-1. URL http://dl.acm.org/citation.cfm?id=1620754.1620843.
Johann-Mattis List. LexStat: Automatic detection of cognates in multilingual wordlists. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 117–125. Association for Computational Linguistics, 2012.
Johann-Mattis List. Sequence Comparison in Historical Linguistics. Düsseldorf University Press, Düsseldorf, 2014a. URL http://sequencecomparison.github.io/.
Johann-Mattis List. Investigating the impact of sample size on cognate detection.
Journal of Language Relationship, 11:91–101, 2014b.
Johann-Mattis List and Robert Forkel. LingPy: A Python library for historical linguistics, 2016. URL http://lingpy.org.
Johann-Mattis List, Philippe Lopez, and Eric Bapteste. Using sequence similarity networks to identify partial cognates in multilingual wordlists. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 599–605, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://anthology.aclweb.org/P16-2097.
Wesley Mackay and Grzegorz Kondrak. Computing word similarity and identifying cognates with pair hidden Markov models. In CoNLL '05, pages 40–47, Stroudsburg, PA, USA, June 2005. Association for Computational Linguistics.
Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
Mark E. J. Newman and Michelle Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69:026113, February 2004. doi: 10.1103/PhysRevE.69.026113. URL http://link.aps.org/doi/10.1103/PhysRevE.69.026113.
Taraka Rama. Automatic cognate identification with gap-weighted string subsequences. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1227–1231, 2015.
Taraka Rama and Lars Borin. Comparative evaluation of string similarity measures for automatic language classification. In Ján Mačutek and George K. Mikros, editors, Sequences in Language and Text, pages 203–231. Walter de Gruyter, 2015.
Andrew Rosenberg and Julia Hirschberg. V-measure: A conditional entropy-based
external cluster evaluation measure. In EMNLP-CoNLL, volume 7, pages 410–
420, 2007.
Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008. doi: 10.1073/pnas.0706851105. URL http://www.pnas.org/content/105/4/1118.abstract.
Søren Wichmann and Eric W Holman. Languages with longer words have more
lexical change. In Approaches to Measuring Linguistic Differences, pages 249–
281. Mouton de Gruyter, 2013.
Martijn Wieling, Therese Leinonen, and John Nerbonne. Inducing sound segment
differences using Pair Hidden Markov Models. pages 48–56. Association for
Computational Linguistics, 2007.