Procedia Computer Science 96 (2016) 1391–1399
1877-0509 © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
Peer-review under responsibility of KES International
doi: 10.1016/j.procs.2016.08.184
20th International Conference on Knowledge Based and Intelligent Information and Engineering
Systems, KES2016, 5-7 September 2016, York, United Kingdom
Sequence Labeling for Cognate Production
Alina Maria Ciobanu
Faculty of Mathematics and Computer Science, University of Bucharest, Bucharest, Romania
We propose a sequence labeling approach to cognate production based on the orthography of the words. Our approach leverages
the idea that orthographic changes represent sound correspondences to a fairly large extent. Given an input word in language L1,
we seek to determine its cognate pair in language L2. To this end, we employ a sequential model which captures the intuition that
orthographic changes are highly dependent on the context in which they occur. We apply our method on two pairs of languages.
Finally, we investigate how second language learners perceive the orthographic changes from their mother tongue to the language
they learn.
Keywords: cognates, sequence labeling, language similarity, alignment
1. Introduction
Cognates are words in different languages having the same etymology and a common ancestor. For the research
areas in which the common etymology is not essential (such as machine translation), cognates are regarded as words
with high cross-lingual meaning similarity, and with resembling orthographic or phonetic forms. Cognates are relevant
in many research areas, such as phylogenetic inference1, language acquisition, cross-lingual information retrieval2
and machine translation3.
There are two research problems of high interest related to cognates: cognate detection (discriminating between
related and unrelated words) and cognate production (determining the form of a given word’s cognate pair). While
the former task has been actively studied in recent years, the latter has received significantly less attention,
despite its importance in various research fields. We emphasize two research directions that rely on the task of
cognate production: (i) diachronic linguistics, which seeks to reconstruct the relationships between languages, and
(ii) the study of foreign language learning, which focuses on the learning process and on the influence of the learner’s
mother tongue in the process of second language acquisition. Cognate production can also contribute to the task of
lexicon generation for poorly documented languages with scarce resources.
1.1. Related Work
In his Ph.D. thesis, Kondrak4 drew attention to two interesting and challenging research problems in diachronic
linguistics: historical derivation and comparative reconstruction. Historical derivation consists of deriving the modern
forms of the words from the old ones. Comparative reconstruction is the opposite process, in which the old forms of
the words are reconstructed from the modern ones. Most of the previous approaches to word form inference relied
on phonetic transcriptions. They built on the idea that, given the phonological context, sound changes follow certain
regularities across the entire vocabulary of a language. The proposed methods5,6,7 required a list of known sound
correspondences as input, collected from dictionaries or published studies.
Nowadays, given the development of machine learning techniques, computers are able to learn these changes
from pairs of known related words. Beinborn et al.8 proposed such a method for cognate production, using the
orthographic form of the words, and applying a machine translation method based on characters instead of words. The
orthographic approach relies on the idea that sound changes leave traces in the orthography and alphabetic character
correspondences represent, to a fairly large extent, sound correspondences9. Aligning the related words to extract
transitions from one language to another has proven very effective when applied to both the orthographic10 and the
phonetic form of the words11. For the task of cognate production based on the orthography of the words, besides the
character-based machine translation mentioned above, another contribution belongs to Mulloni12, who introduced an
algorithm for cognate production based on edit distance alignment and the identification of orthographic cues when
words enter a new language.
Our goal is to perform cognate production without using any external resources (e.g., a lexicon or a dataset in the
target language). We use sequence labeling, an approach that has proven useful in generating transliterations13,14.
1.2. Problem Formulation
Given a list of words in the source language, we aim at determining their cognate pairs in the target language.
Note that the terms “source” and “target” only denote the direction of the production, not the way words entered one
language from another. We look at this problem both ways: given a pair of languages L1 and L2, we first use L1 as
the source language and L2 as the target language, and then vice versa, L2 as the source language and L1 as the target language.
2. Cognate Production
Words undergo various changes when entering new languages. From the alignment of the cognate pairs in the train-
ing set we learn orthographic cues and patterns for the changes in spelling, and we attempt to infer the orthographic
form in the target language of the cognate pairs of the input words in the test set.
2.1. Orthographic Alignment
String alignment is closely related to the task of sequence alignment in computational biology. To align pairs of
words we employ the Needleman-Wunsch global alignment algorithm15, which has been mainly used for aligning
sequences of proteins or nucleotides. For orthographic alignment, we consider words as input sequences and we use a
very simple substitution matrix, which gives equal scores to all substitutions, disregarding diacritics (e.g., we ensure
that e and è are matched).
2.2. Sequence Labeling
Sequence labeling represents the task of assigning a sequence of labels to a sequence of tokens. It is particularly
well-suited for problems that involve data with significant sequential correlation and require modeling many interde-
pendent variables, such as part-of-speech tagging or named-entity recognition. In our case, the words in the source
language are the sequences, and the characters are the tokens. Our purpose is to obtain, for each input word in the
source language, a sequence of characters that compose its cognate pair in the target language. To this end, we use
Fig. 1. Methodology for cognate production. Our experiments are: Exp. #1 - only the sequential model, Exp. #2 - sequential model with reranking
based on the orthography of the words.
conditional random fields16, a graphical model which can handle a large number of rich features. This framework has
some major advantages over other graphical models, especially the fact that it avoids the label bias problem. The
label bias problem consists in favoring the states with fewer outgoing transitions. It is caused by the fact that “the
transitions leaving a given state compete only against each other, rather than against all other transitions in the model”17
and the transition scores are normalized separately for each state. In the context of cognate production, the label bias
problem would lead to predicting the wrong cognate if, at a certain point in the sequence of characters, the following
transition had a higher score but led in the end to an incorrect production. CRFs solve this problem
by using a global normalization and accounting for the entire sequence at once.
2.3. Example
For the English word discount and its Spanish cognate pair descuento, the alignment is as follows:

d i s c o u - n t -
d e s c - u e n t o
For each character in the source word, the label is the character which occurs on the same position in the target
word. In the case of insertions, because there is no input character in the source language to which we could associate
the inserted character as label, we add it to the previous label. We account for affixes separately: for each input word
we add two more characters B and E, marking the beginning and the end of the word. The characters that are inserted
in the target word at the beginning or at the end of the word are associated to these special characters. In order to
reduce the number of labels, for input tokens that are identical to their labels we replace the label with *. Thus, for
the previous example, the labels are as follows:
B d i s c o u n t E
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
* * e * * - ue * * oE
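One way to implement this labeling scheme is sketched below. The '-' label for deleted characters is our assumption (the text does not spell out how deletions are labeled); insertions are appended to the previous label, insertions at the word boundaries attach to the special B/E tokens, and identical token/label pairs are collapsed to *:

```python
def labels_from_alignment(src_aln, tgt_aln):
    """Token/label sequences from an aligned cognate pair ('-' marks gaps)."""
    pairs = list(zip(src_aln, tgt_aln))
    # Leading/trailing target insertions attach to B and E respectively.
    lead = ""
    while pairs and pairs[0][0] == "-":
        lead += pairs.pop(0)[1]
    trail = ""
    while pairs and pairs[-1][0] == "-":
        trail = pairs.pop()[1] + trail
    tokens = ["B"]
    labels = ["B" + lead if lead else "*"]
    for s, t in pairs:
        if s == "-":
            # Insertion: no source token, so append the character to the
            # previous label (expanding a '*' back to its token first).
            base = tokens[-1] if labels[-1] == "*" else labels[-1]
            labels[-1] = base + t
        else:
            tokens.append(s)
            labels.append("*" if s == t else t)
    tokens.append("E")
    labels.append(trail + "E" if trail else "*")
    return tokens, labels
```

Running it on the aligned (discount, descuento) pair yields the token sequence B d i s c o u n t E with labels * * e * * - ue * * oE.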
As features we use n-grams of characters from the input word around the current token, in a window of size w,
where n∈{1, ..., w}. For example, if the current token is the letter o in the word discount, with w=2, we have the
following features:
f(-2)=s, f(-1)=c, f(0)=o, f(1)=u, f(2)=n,
f(-2,-1)=sc, f(-1,0)=co, f(0,1)=ou, f(1,2)=un.
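This feature extraction can be sketched as follows; for the token o in discount with w=2 it reproduces exactly the features listed above:

```python
def ngram_features(word, pos, w=2):
    """Character n-grams around position pos, window size w, n in {1..w}."""
    feats = {}
    # Unigram features at offsets -w..w around the current token.
    for off in range(-w, w + 1):
        i = pos + off
        if 0 <= i < len(word):
            feats["f(%d)" % off] = word[i]
    # Higher-order n-grams spanning n consecutive offsets within the window.
    for n in range(2, w + 1):
        for start in range(-w, w - n + 2):
            idx = list(range(pos + start, pos + start + n))
            if all(0 <= i < len(word) for i in idx):
                key = "f(%s)" % ",".join(str(start + k) for k in range(n))
                feats[key] = "".join(word[i] for i in idx)
    return feats
```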
2.4. Reranking
We investigate whether the performance of the sequential model can be improved without using additional re-
sources (e.g., a lexicon or a corpus). This would be very useful because, in a real scenario, obtaining a lexicon in the
target language might be problematic (for resource-poor languages, for example). We employ a maximum entropy
classifier to rerank the n-best output lists provided by the sequential model, using n-grams of characters and word
length as features. We evaluate the results with cross-validation on the training set.
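A sketch of such a reranker, using a tiny hand-rolled logistic-regression ("maximum entropy") classifier over character n-gram and word-length features; the toy training words below are illustrative, not the paper's data:

```python
import math

def char_ngrams(word, n_max=2):
    """Character n-gram counts plus a word-length feature."""
    feats = {"len=%d" % len(word): 1.0}
    for n in range(1, n_max + 1):
        for i in range(len(word) - n + 1):
            key = "ng=" + word[i:i+n]
            feats[key] = feats.get(key, 0.0) + 1.0
    return feats

class MaxEnt:
    """Minimal binary logistic regression trained by gradient ascent."""
    def __init__(self):
        self.w = {}
    def score(self, feats):
        z = sum(self.w.get(f, 0.0) * v for f, v in feats.items())
        return 1.0 / (1.0 + math.exp(-z))
    def train(self, data, epochs=50, lr=0.5):
        for _ in range(epochs):
            for feats, y in data:
                g = y - self.score(feats)  # gradient of the log-likelihood
                for f, v in feats.items():
                    self.w[f] = self.w.get(f, 0.0) + lr * g * v

def rerank(clf, nbest):
    """Reorder an n-best list by target-language probability."""
    return sorted(nbest, key=lambda w: -clf.score(char_ngrams(w)))

# Toy data: plausible target-language words vs. malformed productions.
train = [(char_ngrams(w), 1) for w in ["rico", "arte", "sopa"]] + \
        [(char_ngrams(w), 0) for w in ["ric-", "artk", "sopx"]]
clf = MaxEnt()
clf.train(train)
```

After training, `rerank(clf, nbest)` promotes candidates whose character n-grams look like the target language.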
We also investigated several other methods for obtaining good results without additional resources for the target
language, but their performance was lower than that of the maximum entropy reranker. First, we investigated a
measure of reranking based on occurrence probabilities for the n-grams in the produced sequences adapted from
language modeling18, using only the training data to obtain probabilities. This approach did not produce good results,
most probably because of insufficient data. Secondly, we split the training dataset based on the part of speech of the
words. Based on the intuition that certain orthographic patterns are specific to certain parts of speech, we investigated
whether training separate models would produce better results. While for some parts of speech the results improved,
overall the performance of the model was lower than that of the CRF model followed by maximum entropy reranking.
3. Experiments
In this section we describe the experimental setup used to assess the performance and to analyze the proposed
method for cognate production.
3.1. Datasets
We use the datasets proposed by Beinborn et al.8, from which we select lists of English - Spanish cognates (EN-ES)
and English - German cognates (EN-DE). These datasets have been used in previous experiments and allow us a fair
comparison between the current and previous methods. We seek to understand what influences the production process,
and to see how the proposed method is able to deal with production rules specific to closely or more remotely related languages.
For a deeper insight into these experiments, we describe the datasets for cognate production in Table 1. What is
interesting is the difference in length between words from the source and the target language, because it could show
what operations we would expect when aligning the words. For example, for EN-DE length_r is higher than length_l,
so we would expect more insertions. The edit column shows how much words vary from the source language to the
target language. For example, for EN-DE we have the highest average edit distance. This implies more operations
(insertions, deletions, substitutions), which might make the learning more difficult.
Table 1. Statistics for the cognate production datasets. Given a pair of languages L1 - L2, the length_l and length_r columns represent the average
word length of the words in L1 (left) and L2 (right). The edit column represents the average normalized edit distance between the cognate pairs.
The values in the last three columns are computed only on the training data, to keep the test data unseen.

Lang.    # words    length_l    length_r    edit
EN-ES    3,403      8.00        8.49        0.30
EN-DE    1,002      6.57        7.16        0.32
3.2. Task Setup
We split the data into three subsets for training, development and testing with a ratio of 3:1:1. We use the CRF
implementation provided by the Mallet toolkit for machine learning19. For parameter tuning, we perform a grid
search for the number of iterations in {1,5,10,25,50,100}, for the size of the window win {1,2,3}and for the order
of the CRF in {1,2}. We train the classifier on the training set and evaluate its performance on the development set.
For each dataset, we choose the model that obtains the highest instance (word-level) accuracy on the development set
and use it to infer cognate pairs for the words in the test set.
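The grid search itself is a simple exhaustive loop over the hyperparameter grid; `evaluate` below is a hypothetical stand-in for training a Mallet CRF with the given configuration and scoring it on the development set:

```python
from itertools import product

def grid_search(evaluate):
    """Return the configuration with the highest dev-set word accuracy.

    evaluate(iterations, window, order) is assumed to train a CRF with
    those hyperparameters and return its instance-level accuracy.
    """
    grid = product([1, 5, 10, 25, 50, 100],  # number of iterations
                   [1, 2, 3],                # window size w
                   [1, 2])                   # CRF order
    return max(grid, key=lambda cfg: evaluate(*cfg))

# Toy evaluation function, purely for illustration.
best = grid_search(lambda it, w, o: it * 0.001 + w * 0.01 + o * 0.1)
```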
Table 2. Coverage (COV) for n = 5 and mean reciprocal rank (MRR) on cognate production. The Dir. column represents the direction of the
production. The previous method to which we compare our results is COP 8. The Baseline is described in Section 3.2.1. Our experiments are:
Exp. #1 - only the sequential model, Exp. #2 - sequential model with reranking based on the orthography of the words.
Lang.   Dir.   Prev. (COP)   Baseline      Exp. #1       Exp. #2
               COV    MRR    COV    MRR    COV    MRR    COV    MRR
EN-ES   →      .65    .54    .05    .02    .59    .45    .62    .45
EN-ES   ←      .68    .48    .21    .09    .63    .51    .67    .52
EN-DE   →      .55    .46    .17    .12    .38    .26    .40    .28
EN-DE   ←      -      -      .16    .10    .40    .31    .41    .32
3.2.1. Baseline
We use a “majority class” type of baseline. First, we align the pairs of words using the Needleman-Wunsch
algorithm. Then, we extract unigram orthographic cues (aligned subsequences of length 1) from the training set, to
obtain a list of transitions for each character, ranked by their relative frequency. We consider the character on the
right hand side of a transition as being the label of the character on the left hand side of the transition. We aggregate
the labels of a word’s characters to obtain an output word. The score of an output word is the sum of the relative
frequencies of its characters. Finally, we produce an n-best list of outputs for each test instance, by selecting the
sequences with the highest cumulative scores (i.e., for each character we select the label with the highest frequency in
the training data).
For example, if the English word rich is in our test set, and our aim is to obtain its Spanish cognate pair rico, we
first look at the transition rankings obtained from the training set. The top five transitions for each letter are as follows:
Thus, the 5-best list of outputs is {ric-, rich, rico, rici, rica}, having the correct cognate pair of the input word on
the third position.
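This baseline can be sketched as follows, under the simplifying assumption that the training pairs are already aligned with '-' marking gaps, and enumerating label combinations exhaustively for brevity:

```python
from collections import Counter, defaultdict
from itertools import product

def train_baseline(aligned_pairs):
    """Relative frequency of each character-level transition s -> t,
    extracted from aligned training pairs ('-' marks gaps)."""
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        for s, t in zip(src, tgt):
            counts[s][t] += 1
    return {s: {t: c / sum(cnt.values()) for t, c in cnt.items()}
            for s, cnt in counts.items()}

def nbest(word, trans, n=5):
    """n-best outputs: keep each character's top-n labels and score an
    output by the sum of the relative frequencies of its labels."""
    options = []
    for ch in word:
        ranked = sorted(trans.get(ch, {ch: 1.0}).items(), key=lambda x: -x[1])
        options.append(ranked[:n])
    cands = Counter()
    # Exhaustive enumeration; fine for short words in a sketch.
    for combo in product(*options):
        out = "".join(t for t, _ in combo).replace("-", "")
        cands[out] = max(cands[out], sum(p for _, p in combo))
    return [w for w, _ in cands.most_common(n)]
```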
3.2.2. Evaluation Measures
To assess the performance of our method on cognate production and to compare the current results with previous
ones, we use the following evaluation measures:
Coverage. The coverage (also known as top-n accuracy) is a relaxed metric which computes the percentage of
input words for which the n-best output list contains the correct cognate pair (the gold standard). We use n = 5
for a proper comparison with the previous results on cognate production reported by Beinborn et al.8.
Mean reciprocal rank. The mean reciprocal rank is an evaluation measure which applies to systems that
produce an ordered output list for each input instance. Given an input word, the higher the position of its
correct cognate pair in the output list, the higher the mean reciprocal rank value:

MRR = (1/m) · Σ_{i=1}^{m} 1/rank_i

where m is the number of input instances, and rank_i is the position of w_i's cognate pair in the output list. If w_i's
correct cognate pair is not in the output list, we consider the reciprocal rank 0.
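Both measures are straightforward to compute; a minimal sketch (the n-best lists are assumed to be ordered best-first):

```python
def coverage(nbest_lists, gold, n=5):
    """Fraction of inputs whose top-n output list contains the gold cognate."""
    hits = sum(1 for outs, g in zip(nbest_lists, gold) if g in outs[:n])
    return hits / len(gold)

def mean_reciprocal_rank(nbest_lists, gold):
    """MRR = (1/m) * sum(1 / rank_i); ranks start at 1, 0 if absent."""
    total = 0.0
    for outs, g in zip(nbest_lists, gold):
        if g in outs:
            total += 1.0 / (outs.index(g) + 1)
    return total / len(gold)
```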
Fig. 2. The normalized edit distance between the produced cognates and the true cognates (x-axis) for Exp. #2. The arrows in the legend
represent the direction of the production.
3.3. Results and Discussion
In Table 2 we report the results of our experiments. The very poor results of the baseline algorithm show that the
task is not trivial. Since the baseline does not use any additional resources (such as a lexicon of the target language),
we can fairly compare its performance with Exp. #1 and Exp. #2. We observe that the baseline leads to a significantly
lower performance than the sequential model. The baseline behaves worst for EN-ES. Since our system obtains a
fairly good result on this dataset, we are inclined to believe that, for this pair of languages, the transitions are strongly
dependent on the context in which they occur (which is not captured by the baseline, as opposed to the sequential
model). The sequential model obtains the lowest coverage for EN-DE. This might be a consequence of the small size
of the dataset.
Reranking based on the orthography of the words only slightly improves the performance of the sequential model. The
reranking of the n-best list can be seen, in a broad way, as a language identification problem, where a binary classifier
decides whether the output sequence belongs to the target language or not. This is an interesting problem, which
requires further attention in method selection and feature engineering.
To gain more insight into our system's behavior, we further investigate how the coverage varies with the length
of the words. We evaluate the performance of the CRF system with reranking (Exp. #2) on three subsets (for each
dataset) broken down by the length of the English words (1-5, 6-10, 11-15). For all the pairs of languages, the
performance improves as the word length increases, confirming the intuition that longer words – that provide more
context to learn from – lead to cognate production of higher quality. For example, for ES-EN we obtain 0.65 coverage
for length 1-5, 0.66 coverage for length 6-10, and 0.72 coverage for length 11-15.
For the larger dataset EN-ES, the results that we obtain without additional resources (Exp. #2) are comparable
to those previously obtained by the COP system, which uses a lexicon in the target language. For the smaller dataset
EN-DE, our system obtains a lower performance. We conclude that, if the dataset is large enough (EN-ES), the CRF
system obtains results comparable to those previously reported, but without using external resources. This is a major
advantage if the method is applied on resource-poor languages, or for reconstructing extinct languages, for which
resources are not available.
In Figure 2 we plot the normalized edit distance between the true cognates of the input words and the produced
sequences. The edit distance20 counts the minimum number of operations (insertion, deletion and substitution) re-
quired to transform one string into another. We use a normalized version of this metric, dividing the edit distance by
the length of the longest string. For EN-ES we have the highest number of 0 distances, which means that the produced
cognate is correct (identical to the true cognate). We observe that even when the output sequence does not match the
true cognate, it might be a valid word in the target language (for example, a feminine form for nouns, or an inflected
form for verbs). Sometimes, the produced sequences represent older forms of the words used today. In Table 3 we
provide examples of our system’s output for cognate production.
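The normalized edit distance used here can be sketched as follows (a standard dynamic-programming implementation with a single rolling row):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j-1] + 1,        # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def normalized_edit_distance(a, b):
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```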
Table 3. Examples of cognate production output. The Dir. column represents the direction of the production. For example, if we have the cognate
pair word1 - word2 and the direction is →, it means we use word1 as an input word and our goal is to obtain its cognate pair, word2. If the direction
is ←, we use word2 as an input word and our goal is to obtain its cognate pair, word1. In the Output column we report the 5-best productions and
we emphasize the true cognate (in bold).

Lang.   Cognate pair                  Dir.   Output (5-best list)
EN-ES   possibly - posiblemente       →      posiblente, posiblicar, posiblemente, posibamente, posiblir
                                      ←      possibly, posibly, possiblyent, possiblyer, possible
EN-DE   electricity - elektrizität    →      elektrikität, elektrizität, elektrismutät, electricität, …
                                      ←      electricity, electrizity, electricität, electricatty, electricitte
4. Human Perception
Beinborn et al.8 conducted a study on the connections between cognate production rules and human associations.
They compiled a list of Czech words with German roots and asked several German native speakers without knowledge
of Czech to try to guess the German translations of the Czech words. Comparing the associations made by people
and the production rules learned by their system, they reach the conclusion that the automatic system for cognate
production can replicate the human judgment association process.
Nagata21 and Nagata and Whittaker22 studied mother tongue interference (the transfer of linguistic rules from L1 to L2) for
non-native English speakers. They showed that the phenomenon is so strong that the language family relationships
between the mother tongues of the learners are preserved in their texts written in English.
Cognate interference represents the situation in which an L1 native speaker learning L2 either (i) uses a cognate
word instead of a correct L2 translation, or (ii) misspells an L2 word because of the influence of the L1 spelling
conventions. Nicolai et al.23 observed that cognate interference provides useful information in native language identification, reducing misclassification errors by about 4%.
Based on these ideas, we conduct a study which complements the work of Beinborn et al.8: in the context of second
language learning, we investigate how native speakers of L1 perceive the orthographic changes from L1 to L2 (i.e., we
study the second case of cognate interference). We use a learner corpus (a corpus of texts written by people learning
a second language) and we identify the misspellings of the native L1 speakers in L2. We align the misspelled English
words with the translation of their correct forms in Spanish and extract aligned subsequences of length 2, using the
method of Ciobanu and Dinu24. The transitions, in this case, are 2-grams around mismatches in the character-level
orthographic alignment of the words. For example, for the cognate pair (discount, descuento), the features are:
di→de, is→es, co→c-, ou→-u, u-→ue, -n→en, t-→to, -E→oE.
Note that, for this experiment, we ignore straightforward transitions (i.e., we ignore transitions such as is→is,
but we take into account transitions such as is→es). We sort the transitions by frequency and obtain a ranking. In a
similar manner, we extract aligned subsequences of length 2 from the ES-EN dataset presented in Section 3.1. Thus,
we obtain two rankings of orthographic transitions of length 2: one from the L1 native speakers learning L2, and one
from a dataset of L1 - L2 cognates.
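This extraction step can be reproduced as a short sketch (the aligned forms of discount/descuento below, with '-' for gaps, are inferred from the listed features and are our assumption):

```python
def transitions_2gram(src_aln, tgt_aln):
    """2-gram transitions around mismatches in a character alignment.

    Words are padded with the B/E boundary markers; aligned 2-grams whose
    source and target sides are identical (straightforward transitions)
    are ignored.
    """
    s = "B" + src_aln + "E"
    t = "B" + tgt_aln + "E"
    out = []
    for i in range(len(s) - 1):
        sg, tg = s[i:i+2], t[i:i+2]
        if sg != tg:
            out.append((sg, tg))
    return out
```

On the aligned pair (discou-nt-, desc-uento) it yields exactly the eight transitions listed above, di→de through -E→oE.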
We conduct this experiment on Spanish and English, using the CLC FCE dataset released by Yannakoudakis et al.25.
This dataset is extracted from the Cambridge Learner Corpus (CLC) and contains 1,244 exam scripts written by
non-native English speakers. The dataset is annotated, among others, with the native language of the speakers and
with information about the spelling errors committed. We extract scripts written by Spanish native speakers. We
obtain 2,158 orthographic transitions for the learner corpus and 1,805 for the cognate dataset. We rank them by
frequency and we observe that 7 out of the top 10 ranked transitions occur in both rankings. In Table 4 we report the
top 10 ranking of orthographic transitions extracted from the ES-EN dataset of cognates and from the misspellings in
the learner corpus, and we provide an example for each of them. While some transitions are warranted, we notice that
learners use some of these transitions even when it is not necessary. For example, learners use the transition
oE→eE in the consenso - consense word pair (intended English word: consensus), or the transition rE→-E in the
Table 4. Top 10 ranking of orthographic transitions for Spanish - English cognates (left hand side) and misspellings made by Spanish learners of
English (right hand side). The underlined transitions occur in both rankings. The characters B and E mark the beginning and the end of the word.

Rank   Cognates                               Learners
       Transition   Example                   Transition   Example
1      oE→-E        prometido - promised      oE→-E        acceso - acces
2      rE→-E        permitir - permit         eE→-E        desastre - desaster
3      ón→on        canónico - canonical      do→d-        prometido - promed
4      ió→io        noción - notion           rE→-E        familiar - familia
5      eE→-E        arte - art                aE→-E        raqueta - raquet
6      ci→ti        acción - action           -E→eE        cobrar - charche
7      aE→-E        sopa - soup               ón→on        condición - condicion
8      -E→eE        programa - programme      -E→hE        mes - mounth
9      ar→--        asignar - assign          oE→eE        consenso - consense
10     do→d-        enamorado - enamoured     Bq→Bw        quien - whe
familiar - familia word pair (intended English word: familiar). This suggests that learners perceive the most frequent
language-specific transitions between their mother tongue and the foreign language they learn, but they sometimes
have difficulties in correctly identifying the context in which to use them.
5. Conclusions
In this paper we proposed a method for cognate production, based on the orthographic form of the words. We
built on the idea that the orthographic transitions that occur when words enter one language from another follow
systematic patterns, which can be learned and used for cognate form inference. We used a sequential model which
captures the context in which the transitions occur. We employed a maximum entropy classifier to rerank the n-best
output list of the sequential model, but the results showed only slight variations. Our main contribution is being
able, for a large enough dataset (EN-ES), to obtain results comparable to those previously reported, but without using
external resources. We investigated how second language learners perceive the orthographic transitions from their
mother tongue to the learned language, and the results of our experiment show that learners identify the most common
transitions, but do not always correctly perceive the context in which to use them.
1. Atkinson, Q.D., Gray, R.D., Nicholls, G.K., Welch, D.J.. From Words to Dates: Water into Wine, Mathemagic or Phylogenetic Inference?
Transactions of the Philological Society 2005;103:193–219.
2. Buckley, C., Mitra, M., Walz, J.A., Cardie, C.. Using Clustering and SuperConcepts Within SMART: TREC 6. In: The 6th Text Retrieval
Conference, TREC 1997. 1997, p. 107–124.
3. Kondrak, G., Marcu, D., Knight, K.. Cognates Can Improve Statistical Translation Models. In: Proceedings of the Conference of the North
American Chapter of the Association for Computational Linguistics on Human Language Technology, volume 2: Short Papers, NAACL-HLT
2003. 2003, p. 46–48.
4. Kondrak, G.. Algorithms for Language Reconstruction. Ph.D. thesis; University of Toronto; 2002.
5. Eastlack, C.L.. Iberochange: a program to simulate systematic sound change in Ibero-Romance. Computers and the Humanities 1977;
6. Hartman, S.L.. A universal alphabet for experiments in comparative phonology. Computers and the Humanities 1981;15:75–82.
7. Hewson, J.. Comparative reconstruction on the computer. In: Proceedings of the 1st International Conference on Historical Linguistics.
1974, p. 191–197.
8. Beinborn, L., Zesch, T., Gurevych, I.. Cognate Production using Character-based Machine Translation. In: Proceedings of the 6th
International Joint Conference on Natural Language Processing, IJCNLP 2013. 2013, p. 883–891.
9. Delmestri, A., Cristianini, N.. String Similarity Measures and PAM-like Matrices for Cognate Identification. Bucharest Working Papers in
Linguistics 2010;12(2):71–82.
10. Gomes, L., Lopes, J.G.P.. Measuring Spelling Similarity for Cognate Identification. In: Proceedings of the 15th Portugese Conference on
Progress in Artificial Intelligence, EPIA 2011. 2011, p. 624–633.
11. Kondrak, G.. A New Algorithm for the Alignment of Phonetic Sequences. In: Proceedings of the 1st North American Chapter of the
Association for Computational Linguistics Conference, NAACL 2000. 2000, p. 288–295.
12. Mulloni, A.. Automatic Prediction of Cognate Orthography Using Support Vector Machines. In: Proceedings of the 45th Annual Meeting of
the ACL: Student Research Workshop, ACL 2007. 2007, p. 25–30.
13. Ganesh, S., Harsha, S., Pingali, P., Verma, V.. Statistical transliteration for cross language information retrieval using hmm alignment
model and crf. In: Proceedings of the 2nd Workshop on Cross Lingual Information Access. 2008, .
14. Ammar, W., Dyer, C., Smith, N.A.. Transliteration by sequence labeling with lattice encodings and reranking. In: Proceedings of the 4th
Named Entity Workshop. 2012, p. 66–70.
15. Needleman, S.B., Wunsch, C.D.. A general method applicable to the search for similarities in the amino acid sequence of two proteins.
Journal of Molecular Biology 1970;48(3):443–453.
16. Sutton, C.A., McCallum, A.. An Introduction to Conditional Random Fields. Foundations and Trends in Machine Learning 2012;4(4):267–
17. Lafferty, J.D., McCallum, A., Pereira, F.C.N.. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence
Data. In: Proceedings of the Eighteenth International Conference on Machine Learning, ICML 2001. 2001, p. 282–289.
18. Zhang, Y., Hildebrand, A.S., Vogel, S.. Distributed Language Modeling for N-best List Re-ranking. In: Proceedings of the 2006 Conference
on Empirical Methods in Natural Language Processing, EMNLP 2006. 2006, p. 216–223.
19. McCallum, A.K.. MALLET: A Machine Learning for Language Toolkit; 2002. URL:
20. Levenshtein, V.I.. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics Doklady 1965;10:707–710.
21. Nagata, R.. Language Family Relationship Preserved in Non-native English. In: Proceedings of the 25th International Conference on
Computational Linguistics: Technical Papers, COLING 2014. 2014, p. 1940–1949.
22. Nagata, R., Whittaker, E.. Reconstructing an Indo-European Family Tree from Non-native English Texts. In: Proceedings of the 51st
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2013. 2013, p. 1137–1147.
23. Nicolai, G., Hauer, B., Salameh, M., Yao, L., Kondrak, G.. Cognate and Misspelling Features for Natural Language Identification. In:
Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications. 2013, p. 140–145.
24. Ciobanu, A.M., Dinu, L.P.. Automatic Detection of Cognates Using Orthographic Alignment. In: Proceedings of the 52nd Annual Meeting
of the Association for Computational Linguistics, volume 2: Short Papers, ACL 2014. 2014, p. 99–105.
25. Yannakoudakis, H., Briscoe, T., Medlock, B.. A New Dataset and Method for Automatically Grading ESOL Texts. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, HLT 2011. 2011, p. 180–189.