Abstract
In this presentation, I show how machine learning can be applied to historical linguistics. This presentation describes the progress of my thesis until March 2017.
In this paper, we present our first attempts at building a multilingual Neural Machine Translation framework under a unified approach. We are then able to employ attention-based NMT for many-to-many multilingual translation tasks. Our approach does not require any special treatment of the network architecture, and it allows us to learn a minimal number of free parameters with a standard training procedure. Our approach has shown its effectiveness in an under-resourced translation scenario, with considerable improvements of up to 2.6 BLEU points. In addition, the approach has achieved promising results when applied to translation tasks for which there is no direct parallel corpus between the source and target languages.
We propose a sequence labeling approach to cognate production based on the orthography of the words. Our approach leverages the idea that orthographic changes represent sound correspondences to a fairly large extent. Given an input word in language L1, we seek to determine its cognate pair in language L2. To this end, we employ a sequential model which captures the intuition that orthographic changes are highly dependent on the context in which they occur. We apply our method to two pairs of languages. Finally, we investigate how second language learners perceive the orthographic changes from their mother tongue to the language they learn.
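The sequential intuition above, that an orthographic change depends on its context, can be illustrated with a deliberately simplified stand-in for the paper's model: count, over character-aligned training pairs, the most frequent output character for each (previous character, current character) context, then apply those context rules to new words. The alignment format and the Romanian-to-Italian-style toy pairs below are invented for illustration:

```python
from collections import Counter, defaultdict

def train(aligned_pairs):
    """aligned_pairs: (source_chars, target_chars) of equal length,
    where '-' marks a gap in the character alignment."""
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        prev = '^'  # word-start marker
        for s, t in zip(src, tgt):
            counts[(prev, s)][t] += 1  # context = previous source char
            prev = s
    # keep the most frequent output for each (context, char) pair
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}

def produce(word, rules):
    out, prev = [], '^'
    for ch in word:
        out.append(rules.get((prev, ch), ch))  # fall back to identity
        prev = ch
    return ''.join(out).replace('-', '')  # drop deletion gaps

# Toy Romanian->Italian-style alignments: "pt" becomes "tt" after a vowel.
pairs = [("noapte", "no-tte"), ("lapte", "latte")]
rules = train(pairs)
print(produce("apt", rules))  # att
```

The same context-dependence is what the real sequential model captures, with learned features instead of a frequency table.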
Cognates are words in different languages that are associated with each other by language learners. Thus, cognates are important indicators for the prediction of the perceived difficulty of a text. We introduce a method for automatic cognate production using character-based machine translation. We show that our approach is able to learn production patterns from noisy training data and that it works for a wide range of language pairs. It even works across different alphabets, e.g. we obtain good results on the tested language pairs English-Russian, English-Greek, and English-Farsi. Our method performs significantly better than similarity measures used in previous work on cognates.
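The core preprocessing trick behind character-based machine translation is to present each word to a standard MT toolkit as a "sentence" whose tokens are single characters, so the system learns character correspondences instead of word correspondences. A minimal sketch of that representation (the function names are my own):

```python
# Character-level preprocessing for cognate production with a standard MT
# toolkit: each word becomes a token sequence of single characters.
def to_char_tokens(word):
    return ' '.join(word)

def from_char_tokens(tokens):
    return tokens.replace(' ', '')

print(to_char_tokens("night"))        # n i g h t
print(from_char_tokens("n a c h t"))  # nacht
```

Because the "vocabulary" is now an alphabet, this representation also carries over directly to pairs with different scripts, as in the English-Russian and English-Greek settings mentioned above.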
In this paper, we explore the use of convolutional networks (ConvNets) for the purpose of cognate identification. We compare our architecture with binary classifiers based on string similarity measures on different language families. Our experiments show that convolutional networks achieve competitive results across concepts and across language families at the task of cognate identification.
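One simple way to hand a word pair to a convolutional network is as a two-dimensional grid over the two character sequences; the paper's actual input encoding may differ, but a binary character-match matrix conveys the idea:

```python
# Illustrative ConvNet input for pairwise cognate classification: a binary
# match matrix over the two character sequences. Cognates tend to produce
# diagonal runs of 1s that convolution filters can pick up.
def match_matrix(w1, w2):
    return [[1 if a == b else 0 for b in w2] for a in w1]

for row in match_matrix("hand", "hund"):
    print(row)
```

A binary classifier based on a string similarity measure, by contrast, would collapse this whole grid into a single number before classification.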
Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, neural machine translation aims at building a single neural network that can be jointly tuned to maximize translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
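The (soft-)search described above reduces, at each decoding step, to scoring every encoder state against the decoder's current state, normalizing the scores with a softmax, and averaging the encoder states under those weights. A minimal numeric sketch, using dot-product scores for brevity (the paper scores with a small learned alignment network, but the weighting and averaging steps are the same):

```python
import math

def attention(query, encoder_states):
    """Soft-search: score each source state, normalize with softmax,
    and return the weights plus the weighted average (context vector)."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(len(query))]
    return weights, context

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention([1.0, 0.0], states)
print([round(w, 3) for w in weights])  # [0.422, 0.155, 0.422]
```

The weights are exactly the (soft-)alignments mentioned in the abstract: source positions that score higher against the current decoder state contribute more to the context vector.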
In this paper, we propose a novel neural network model called RNN Encoder--Decoder that consists of two recurrent neural networks (RNNs). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder--Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
In this paper, a new method for automatic cognate detection in multilingual wordlists will be presented. The main idea behind the method is to combine different approaches to sequence comparison in historical linguistics and evolutionary biology into a new framework which closely models the most important aspects of the comparative method. The method is implemented as a Python program and provides a convenient tool which is publicly available, easily applicable, and open for further testing and improvement. Testing the method on a large gold standard of IPA-encoded wordlists showed that its results are highly consistent and outperform previous methods.
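The clustering stage common to such cognate detection methods can be sketched with a deliberately simplified stand-in: compute pairwise word distances within a concept and link words whose normalized edit distance falls below a threshold. The actual method aligns sound classes and derives language-pair-specific scores rather than using plain edit distance, so the code below only illustrates the overall shape of the pipeline:

```python
def edit_distance(a, b):
    # standard dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cognate_clusters(words, threshold=0.5):
    """Single-linkage flat clustering: merge a word into every cluster
    containing a word within the normalized-distance threshold."""
    clusters = []
    for w in words:
        linked = [c for c in clusters
                  if any(edit_distance(w, v) / max(len(w), len(v)) < threshold
                         for v in c)]
        merged = [w] + [v for c in linked for v in c]
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters

# toy words for one concept across four (hypothetical) doculects
clusters = cognate_clusters(["hand", "hant", "mano", "manus"])
print(clusters)  # Germanic and Latin forms end up in separate clusters
```

Replacing the plain edit distance with alignment-based, language-specific scores is precisely what separates the framework described above from such a naive baseline.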
An approach to the classification of languages through automated lexical comparison is described. This method produces near-expert classifications. At the core of the approach is the Automated Similarity Judgment Program (ASJP). ASJP is applied to 100-item lists of core vocabulary from 245 globally distributed languages. The output is 29,890 lexical similarity percentages for the same number of paired languages. The percentages are used as a database in a program designed originally for generating phylogenetic trees in biology. This program yields branching structures (ASJP trees) reflecting the lexical similarity of languages. ASJP trees for languages of the sample spoken in Middle America and South America show that the method is capable of grouping together on distinct branches languages of non-controversial genetic groups. In addition, ASJP sub-branching for each of nine genetic groups (Mayan, Mixe-Zoque, Otomanguean, Huitotoan-Ocaina, Tacanan, Chocoan, Muskogean, Indo-European, and Austro-Asiatic) agrees substantially with the subgrouping for those groups produced by expert historical linguists. ASJP can be applied, among many other uses, to search for possible relationships among languages heretofore not observed or only provisionally recognized. Preliminary ASJP analysis reveals several such possible relationships for languages of Middle America and South America. Expanding the ASJP database to all of the world's languages for which 100-word lists can be assembled is a realistic goal that could be achieved in a relatively short period of time, maybe one year or even less.
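A drastically simplified stand-in for the lexical similarity percentages described above: for two aligned concept lists, count the share of concepts whose words are judged "similar". Here a normalized-edit-distance threshold serves as a placeholder for ASJP's actual similarity judgment, and the four-item lists are invented (real ASJP lists have 100 concepts):

```python
def edit_distance(a, b):
    # standard dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity_percentage(list1, list2):
    """Percentage of aligned concepts whose words count as 'similar'."""
    similar = sum(
        edit_distance(a, b) / max(len(a), len(b)) < 0.5
        for a, b in zip(list1, list2))
    return 100.0 * similar / len(list1)

# toy transcriptions for four concepts in two related doculects
lang_a = ["hand", "naxt", "dog", "fis"]
lang_b = ["hant", "naxt", "hunt", "fis"]
print(similarity_percentage(lang_a, lang_b))  # 75.0
```

One such percentage per language pair is exactly what feeds the biological tree-building program mentioned in the abstract.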
We propose a simple, elegant solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages. Our solution requires no change in the model architecture from our base system but instead introduces an artificial token at the beginning of the input sentence to specify the required target language. The rest of the model, which includes encoder, decoder and attention, remains unchanged and is shared across all languages. Using a shared wordpiece vocabulary, our approach enables Multilingual NMT using a single model without any increase in parameters, which is significantly simpler than previous proposals for Multilingual NMT. Our method often improves the translation quality of all involved language pairs, even while keeping the total number of model parameters constant. On the WMT'14 benchmarks, a single multilingual model achieves comparable performance for English→French and surpasses state-of-the-art results for English→German. Similarly, a single multilingual model surpasses state-of-the-art results for French→English and German→English on the WMT'14 and WMT'15 benchmarks respectively. On production corpora, multilingual models of up to twelve language pairs allow for better translation of many individual pairs. In addition to improving the translation quality of language pairs that the model was trained with, our models can also learn to perform implicit bridging between language pairs never seen explicitly during training, showing that transfer learning and zero-shot translation are possible for neural translation. Finally, we show analyses that hint at a universal interlingua representation in our models and show some interesting examples when mixing languages.
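The artificial-token mechanism is a pure preprocessing step, which a short sketch makes concrete (the <2es>-style token spelling follows the paper's examples):

```python
# Multilingual NMT via an artificial target-language token: the model
# architecture is untouched; only the training/inference data changes.
def add_target_token(source_sentence, target_lang):
    return f"<2{target_lang}> {source_sentence}"

print(add_target_token("How are you?", "es"))  # <2es> How are you?
```

Training one shared model on a mixed corpus prepared this way is what makes zero-shot combinations possible: at inference time, nothing stops you from requesting a target language the source language was never paired with in training.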
A new method called the neighbor-joining method is proposed for reconstructing phylogenetic trees from evolutionary distance data. The principle of this method is to find pairs of operational taxonomic units (OTUs [= neighbors]) that minimize the total branch length at each stage of clustering of OTUs starting with a starlike tree. The branch lengths as well as the topology of a parsimonious tree can quickly be obtained by using this method. Using computer simulation, we studied the efficiency of this method in obtaining the correct unrooted tree in comparison with that of five other tree-making methods: the unweighted pair group method of analysis, Farris's method, Sattath and Tversky's method, Li's method, and Tateno et al.'s modified Farris method. The new, neighbor-joining method and Sattath and Tversky's method are shown to be generally better than the other methods.
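The pair-selection criterion at the heart of each clustering step can be sketched directly: compute Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j), where r(i) is the sum of distances from OTU i to all others, and join the pair minimizing Q. A minimal sketch of that single step (a full neighbor-joining run would then replace the joined pair with a new node, update the distance matrix, and recurse):

```python
# One pair-selection step of neighbor joining over a distance matrix d.
def nj_pick_pair(d):
    n = len(d)
    r = [sum(row) for row in d]  # r(i): total distance from OTU i to all others
    best, pair = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if best is None or q < best:
                best, pair = q, (i, j)
    return pair

# toy 4-taxon matrix: taxa 0/1 and 2/3 are mutual nearest neighbors
d = [[0, 2, 7, 7],
     [2, 0, 7, 7],
     [7, 7, 0, 2],
     [7, 7, 2, 0]]
print(nj_pick_pair(d))  # (0, 1) -- tied with (2, 3); the first minimum is kept
```

Subtracting the r terms is what distinguishes neighbor joining from naive closest-pair clustering: it corrects for taxa that are far from everything, which is why the method copes with unequal evolutionary rates.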
Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English-to-French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.7 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a strong phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increased to 36.5, which beats the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.
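The source-reversal trick in the last sentence is a one-line preprocessing step: reverse the token order of each source sentence while leaving the target untouched, so the beginning of the source ends up adjacent to the beginning of the target it must predict. A sketch (the example sentence pair is invented):

```python
# Source reversal for sequence-to-sequence training: only the source side
# of each (source, target) pair is reversed.
def reverse_source(pair):
    source, target = pair
    return (list(reversed(source)), target)

src, tgt = reverse_source((["je", "suis", "étudiant"], ["i", "am", "a", "student"]))
print(src)  # ['étudiant', 'suis', 'je']
print(tgt)  # ['i', 'am', 'a', 'student']
```

With the source reversed, "je" sits right next to the decoder's first prediction "i", creating the short-term dependencies the abstract credits with easing optimization.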
Bagga, A. and Baldwin, B. (1998). Algorithms for scoring coreference chains. In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference, volume 1, pages 563-566.
Dellert, J. (2015). Compiling the Uralic dataset for NorthEuraLex, a lexicostatistical database of Northern Eurasia. In Septentrio Conference Series, number 2, pages 34-44.
Inkpen, D., Frunza, O., and Kondrak, G. (2005). Automatic identification of cognates and false friends in French and English. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pages 251-257.
Jäger, G., List, J.-M., and Sofroniev, P. (2017). Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Volume 1, Long Papers. Association for Computational Linguistics.
Kondrak, G. (2002). Determining recurrent sound correspondences by inducing translation models. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), Volume 1, pages 1-7. Association for Computational Linguistics.