Conference Paper

Cognate Identification to improve Phylogenetic trees for Indian Languages

Abstract

Cognates are present as variants of the same word across different languages. Computational phylogenetics uses algorithms to analyze these variants and infer phylogenetic trees that serve as a hypothesized representation of how the languages are related. In our work, we detect cognates among a few Indian languages, namely Hindi, Marathi, Punjabi, and Sanskrit, to help build cognate sets for phylogenetic inference. Cognate detection aids phylogenetic inference by helping isolate diachronic sound changes and thus detect words of a common origin. A cognate set manually annotated with the help of a lexicographer is generally used to infer phylogenetic trees automatically. Our work creates cognate sets for each language pair and infers phylogenetic trees in a Bayesian framework using the maximum likelihood method. We also implement our work in an online interface and infer phylogenetic trees based on automatically detected cognate sets. The online interface creates phylogenetic trees from the textual data provided as input. It lets a lexicographer enter data manually, edit the data based on their expert opinion, and eventually create phylogenetic trees using various algorithms, including our work on automatically creating cognate sets. We go on to discuss the nuances of detecting cognates in these Indian languages and also discuss the categorization of cognate words into "Tatsama" and "Tadbhava" words.
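
As a rough illustration of how such cognate sets feed into tree inference, the sketch below (not the authors' implementation; the cognate sets and language list are toy examples) encodes per-concept cognate classes as the kind of binary character matrix that Bayesian and maximum-likelihood phylogenetic tools usually consume.

```python
# Minimal sketch: turn cognate judgements into a binary character matrix,
# the usual input format for likelihood-based phylogenetic inference.
# Illustrative only -- not the authors' implementation.

from collections import defaultdict

# Hypothetical cognate sets: each inner dict groups words of common origin,
# keyed by the language they belong to.
cognate_sets = {
    "hand": [{"Hindi": "हाथ", "Marathi": "हात", "Punjabi": "ਹੱਥ", "Sanskrit": "हस्त"}],
    "water": [{"Hindi": "पानी", "Punjabi": "ਪਾਣੀ"}, {"Sanskrit": "जल", "Marathi": "जळ"}],
}

languages = ["Hindi", "Marathi", "Punjabi", "Sanskrit"]

def character_matrix(cognate_sets, languages):
    """One binary character per cognate class: 1 if the language has a
    reflex in that class, 0 otherwise."""
    matrix = defaultdict(list)
    for concept, classes in sorted(cognate_sets.items()):
        for cls in classes:
            for lang in languages:
                matrix[lang].append(1 if lang in cls else 0)
    return dict(matrix)

if __name__ == "__main__":
    for lang, row in character_matrix(cognate_sets, languages).items():
        print(lang, "".join(map(str, row)))
```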


... Distance-measurement-based scores have become the feature set used to identify cognates in these cases (Mann and Yarowsky, 2001; Tiedemann, 1999). Kanojia et al. (2019a) performed a cognate detection task on Indian languages, which included a large amount of manual intervention during identification. Kanojia et al. (2019b) introduced a character sequence-based recurrent neural network for identifying cognates between Indian language pairs. ...
... These cognates can be used to challenge the previously established cognate detection approaches further. Kanojia et al. (2019a) perform cognate detection for some Indian languages, but a prominent part of their work includes manual verification and segregation of their output into cognates and non-cognates. Identification of cognates for improving IR has already been explored for Indian languages (Makin et al., 2007). ...
Preprint
Full-text available
Cognates are present in multiple variants of the same text across different languages (e.g., "hund" in German and "hound" in English both mean "dog"). They pose a challenge to various Natural Language Processing (NLP) applications such as Machine Translation, Cross-lingual Sense Disambiguation, Computational Phylogenetics, and Information Retrieval. A possible solution to address this challenge is to identify cognates across language pairs. In this paper, we describe the creation of two cognate datasets for twelve Indian languages, namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. We digitize the cognate data from an Indian language cognate dictionary and utilize linked Indian language Wordnets to generate cognate sets. Additionally, we use the Wordnet data to create a False Friends' dataset for eleven language pairs. We also evaluate the efficacy of our dataset using previously available baseline cognate detection approaches. We also perform a manual evaluation with the help of lexicographers and release the curated gold-standard dataset with this paper.
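
A minimal sketch of how linked wordnets could yield candidate cognate pairs, assuming the two wordnets share synset IDs (as in IndoWordNet); the synset dictionaries, threshold, and use of difflib similarity are illustrative stand-ins, not the paper's actual pipeline.

```python
# Hedged sketch: candidate cognate pairs from linked wordnets.
# Assumes the two languages' wordnets share synset IDs; the data
# structures below are toy examples, not the paper's released files.

from difflib import SequenceMatcher

synsets_hi = {1001: ["पानी", "जल"], 1002: ["हाथ"]}   # Hindi synset id -> lemmas
synsets_mr = {1001: ["पाणी", "जळ"], 1002: ["हात"]}   # Marathi, same ids

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def candidate_cognates(wn_a, wn_b, threshold=0.7):
    pairs = []
    for syn_id in set(wn_a) & set(wn_b):            # linked synsets only
        for w_a in wn_a[syn_id]:
            for w_b in wn_b[syn_id]:
                score = similarity(w_a, w_b)
                if score >= threshold:
                    pairs.append((syn_id, w_a, w_b, round(score, 2)))
    return pairs

print(candidate_cognates(synsets_hi, synsets_mr))
```
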
... Ciobanu and Dinu (2014) employ dynamic programming based methods for sequence alignment. Kanojia et al. (2019a) perform cognate detection for some Indian languages, but a prominent part of their work includes manual verification and segregation of their output into cognates and non-cognates. Kanojia et al. (2019b) utilize recurrent neural networks to harness the character sequence among cognates and non-cognates for Indian languages, but employ monolingual embeddings for the task. ...
Preprint
Full-text available
Cognates are variants of the same lexical form across different languages; for example, 'fonema' in Spanish and 'phoneme' in English are cognates, both of which mean 'a unit of sound'. The task of automatic detection of cognates among any two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages. Our approach introduces the use of context from a knowledge graph to generate improved feature representations for cognate detection. We then evaluate the impact of our cognate detection mechanism on neural machine translation (NMT), as a downstream task. We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages, namely, Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. Additionally, we create evaluation datasets for two more Indian languages, Konkani and Nepali. We observe an improvement of up to 18 percentage points in F-score for cognate detection. Furthermore, we observe that cognates extracted using our method help improve NMT quality by up to 2.76 BLEU. We also release our code, newly constructed datasets and cross-lingual models publicly.
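
The sketch below illustrates only the embedding-side feature: cosine similarity between cross-lingually aligned vectors for a candidate word pair. The toy random vectors stand in for real aligned embeddings and the names are hypothetical; the paper's full method additionally uses knowledge-graph context.

```python
# Hedged sketch: cosine similarity over cross-lingual word embeddings as a
# cognate-detection feature.  Real systems would load aligned embedding
# vectors; the random vectors here are placeholders only.

import numpy as np

rng = np.random.default_rng(0)
emb_hi = {"पानी": rng.normal(size=300)}   # Hindi embeddings (placeholder)
emb_mr = {"पाणी": rng.normal(size=300)}   # Marathi embeddings (placeholder)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def embedding_feature(word_hi, word_mr):
    """Semantic-similarity feature; in embedding-based approaches it is
    combined with orthographic/phonetic features before classification."""
    return cosine(emb_hi[word_hi], emb_mr[word_mr])

print(embedding_feature("पानी", "पाणी"))
```
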
... Using false friends as data points with negative labels restricts us to the use of semantic similarity based features, as orthographic or phonetic similarity-based measures would fail to detect sufficient distinction between them. Hence, we use the features proposed by Rama (2016) and Kanojia et al. (2019a) as baseline features for a comparative evaluation. ...
Preprint
Full-text available
Automatic detection of cognates helps downstream NLP tasks of Machine Translation, Cross-lingual Information Retrieval, Computational Phylogenetics and Cross-lingual Named Entity Recognition. Previous approaches for the task of cognate detection use orthographic, phonetic and semantic similarity-based feature sets. In this paper, we propose a novel method for enriching the feature sets with cognitive features extracted from human readers' gaze behaviour. We collect gaze behaviour data for a small sample of cognates and show that the extracted cognitive features help the task of cognate detection. However, gaze data collection and annotation is a costly task. We use the collected gaze behaviour data to predict cognitive features for a larger sample and show that the predicted cognitive features also significantly improve task performance. We report improvements of 10% with the collected gaze features, and 12% using the predicted gaze features, over the previously proposed approaches. Furthermore, we release the collected gaze behaviour data along with our code and cross-lingual models.
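
A hedged sketch of the feature-enrichment idea: gaze-derived measures (here, made-up fixation and regression values) concatenated with similarity features and fed to an off-the-shelf classifier. The feature layout and numbers are illustrative, not the released dataset or the paper's exact model.

```python
# Hedged sketch: enriching a cognate classifier's feature set with gaze
# measures (e.g., average fixation duration), as the abstract describes.
# Toy values only -- not the released gaze behaviour data.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: [orthographic_sim, semantic_sim, avg_fixation_ms, regression_count]
X = np.array([
    [0.91, 0.80, 180, 1],   # cognate pair
    [0.15, 0.22, 340, 4],   # non-cognate pair
    [0.85, 0.75, 200, 1],
    [0.20, 0.30, 310, 3],
])
y = np.array([1, 0, 1, 0])  # 1 = cognate

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.88, 0.70, 190, 1]]))
```
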
... As orthographic or phonetic similarity-based measures would fail to detect sufficient distinction between them, we use the features proposed by Rama (2016) and Kanojia et al. (2019a) as baseline features for a comparative evaluation. ...
... There are different studies on lexical similarity: [20] worked on dialectal differences between pairs of texts using cosine similarity, Hamming distance, and Levenshtein distance, and [21] worked on cognate identification among different languages based on inter-related vocabulary. This shows that lexical similarity can be computed using the phonetic-level features of words rather than orthographic features. ...
Article
Full-text available
Semantic coexistence is one reason people adopt a language spoken by others. In such human habitats, languages share loan words, which serve not only as a principal means of enriching vocabulary but also as a mutual influence that builds stronger relationships and fosters multilingualism. In this context, the spoken words are often common to both languages while their writing scripts differ, or the language may have become a digraphia. In this paper, we present the similarities and relatedness between Hindi and Urdu, which are mutually intelligible and are major languages of the Indian subcontinent. In general, the method modifies edit distance: instead of operating on the alphabetic characters of the words, it uses articulatory features from the International Phonetic Alphabet (IPA) to compute a phonetic edit distance. The paper also reports results for the two languages' consonants under this method, quantifying the evidence that Urdu and Hindi are, on average, 67.8% similar despite the script differences.
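
A minimal sketch of a phonetic edit distance in this spirit: the substitution cost between two sounds decreases with the number of shared articulatory features. The tiny feature table is an assumption for illustration; the paper uses the full IPA articulatory descriptions.

```python
# Hedged sketch of a phonetic edit distance: substitution cost shrinks as
# two sounds share more articulatory features.  The feature table below is
# illustrative, not the paper's full IPA feature inventory.

FEATURES = {
    "p": {"place": "bilabial",    "manner": "stop",      "voice": 0},
    "b": {"place": "bilabial",    "manner": "stop",      "voice": 1},
    "f": {"place": "labiodental", "manner": "fricative", "voice": 0},
    "a": {"place": "open",        "manner": "vowel",     "voice": 1},
    "i": {"place": "close",       "manner": "vowel",     "voice": 1},
}

def sub_cost(x, y):
    if x == y:
        return 0.0
    fx, fy = FEATURES[x], FEATURES[y]
    shared = sum(fx[k] == fy[k] for k in fx)
    return 1.0 - shared / len(fx)            # more shared features -> cheaper

def phonetic_edit_distance(s, t):
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                        # deletion
                          d[i][j - 1] + 1,                        # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))
    return d[m][n]

print(phonetic_edit_distance("pa", "ba"))   # cheaper than one full edit
```
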
Article
Full-text available
Identifying the type of relationship between words provides a deeper insight into the history of a language and allows a better characterization of language relatedness. In this paper, we propose a computational approach for discriminating between cognates and borrowings. We show that orthographic features have discriminative power and we analyze the underlying linguistic factors that prove relevant in the classification task. To our knowledge, this is the first attempt of this kind.
Conference Paper
Full-text available
Words undergo various changes when entering new languages. Based on the assumption that these linguistic changes follow certain rules, we propose a method for automatically detecting pairs of cognates employing an orthographic alignment method which proved relevant for sequence alignment in computational biology. We use aligned subsequences as features for machine learning algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates. Given a list of known cognates, our approach does not require any other linguistic information. However, it can be customized to integrate historical information regarding language evolution.
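
A simplified sketch of the pipeline described: globally align two spellings (Needleman-Wunsch style) and emit aligned character pairs as sparse features for a downstream classifier. The scoring parameters and feature encoding are illustrative assumptions, not the authors' exact setup.

```python
# Hedged sketch of the orthographic-alignment idea: globally align two
# spellings, then use aligned character pairs as classifier features.

def align(s, t, gap=-1, match=1, mismatch=-1):
    """Needleman-Wunsch global alignment over characters."""
    m, n = len(s), len(t)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Traceback to recover the aligned strings.
    a, b, i, j = [], [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if s[i - 1] == t[j - 1] else mismatch):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return "".join(reversed(a)), "".join(reversed(b))

def aligned_features(s, t):
    """Aligned character pairs, usable as sparse features for a classifier."""
    a, b = align(s, t)
    return [f"{x}:{y}" for x, y in zip(a, b)]

print(aligned_features("phoneme", "fonema"))
```
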
Conference Paper
Full-text available
In this paper, a new method for automatic cognate detection in multilingual wordlists will be presented. The main idea behind the method is to combine different approaches to sequence comparison in historical linguistics and evolutionary biology into a new framework which closely models the most important aspects of the comparative method. The method is implemented as a Python program and provides a convenient tool which is publicly available, easily applicable, and open for further testing and improvement. Testing the method on a large gold standard of IPA-encoded wordlists showed that its results are highly consistent and outperform previous methods.
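
The published tool implements a considerably richer model (language-specific sound-correspondence scores); the sketch below only illustrates the surrounding wordlist-clustering framework with a plain string distance and a flat threshold, and its data and threshold are toy values.

```python
# Hedged sketch of the wordlist-clustering idea: for each concept, compute
# pairwise string distances across languages and group words whose distance
# falls under a threshold (single-linkage flat clustering).  The real method
# replaces the plain distance with sound-correspondence-based scoring.

from difflib import SequenceMatcher

wordlist = {  # concept -> {language: form}, toy data
    "dog": {"German": "hund", "English": "hound", "Spanish": "perro"},
    "sound_unit": {"Spanish": "fonema", "English": "phoneme"},
}

def distance(a, b):
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def cluster_concept(forms, threshold=0.45):
    clusters = []
    for lang, word in forms.items():
        for cl in clusters:
            if any(distance(word, w) <= threshold for _, w in cl):
                cl.append((lang, word))
                break
        else:
            clusters.append([(lang, word)])
    return clusters

for concept, forms in wordlist.items():
    print(concept, cluster_concept(forms))
```
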
Article
Full-text available
A novel probabilistic retrieval model is presented. It forms a basis to interpret the TF-IDF term weights as making relevance decisions. It simulates the local relevance decision-making for every location of a document, and combines all of these "local" relevance decisions as the "document-wide" relevance decision for the document. The significance of interpreting TF-IDF in this way is the potential to: (1) establish a unifying perspective about information retrieval as relevance decision-making; and (2) develop advanced TF-IDF-related term weights for future elaborate retrieval models. Our novel retrieval model is simplified to a basic ranking formula that directly corresponds to the TF-IDF term weights. In general, we show that the term-frequency factor of the ranking formula can be rendered into different term-frequency factors of existing retrieval systems. In the basic ranking formula, the remaining quantity −log p(r̄ | t ∈ d) is interpreted as the probability of randomly picking a nonrelevant usage (denoted by r̄) of term t. Mathematically, we show that this quantity can be approximated by the inverse document frequency (IDF). Empirically, we show that this quantity is related to IDF, using four reference TREC ad hoc retrieval data collections.
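
For reference, a minimal sketch of the TF-IDF weighting the abstract discusses; the quantity −log p(r̄ | t ∈ d) is argued to be approximated by the IDF factor computed here. The toy documents are illustrative only.

```python
# Minimal TF-IDF sketch.  The abstract's -log p(r_bar | t in d) is shown
# to be approximated by the inverse document frequency (idf) below.

import math
from collections import Counter

docs = [
    "cognates help machine translation",
    "phylogenetic trees for indian languages",
    "cognates across indian languages",
]

def tf_idf(term, doc, docs):
    tokens = doc.split()
    tf = Counter(tokens)[term] / len(tokens)
    df = sum(1 for d in docs if term in d.split())
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

print(tf_idf("cognates", docs[2], docs))
```
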
Article
Background: The association between gait speed and cognition has been reported; however, there is limited knowledge about the temporal associations between gait slowing and cognitive decline among cognitively normal individuals. Methods: The Mayo Clinic Study of Aging is a population-based study of Olmsted County, Minnesota, United States, residents aged 70-89 years. This analysis included 1,478 cognitively normal participants who were evaluated every 15 months with a nurse visit, neurologic evaluation, and neuropsychological testing. The neuropsychological battery used nine tests to compute domain-specific (memory, language, executive function, and visuospatial skills) and global cognitive z-scores. Timed gait speed (m/s) was assessed over 25 feet (7.6 meters) at a usual pace. Using mixed models, we examined baseline gait speed (continuous and in quartiles) as a predictor of cognitive decline and baseline cognition as a predictor of gait speed changes controlling for demographics and medical conditions. Results: Cross-sectionally, faster gait speed was associated with better performance in memory, executive function, and global cognition. Both cognitive scores and gait speed declined over time. A faster gait speed at baseline was associated with less cognitive decline across all domain-specific and global scores. These results were slightly attenuated after excluding persons with incident mild cognitive impairment or dementia. By contrast, baseline cognition was not associated with changes in gait speed. Conclusions: Our study suggests that slow gait precedes cognitive decline. Gait speed may be useful as a reliable, easily attainable, and noninvasive risk factor for cognitive decline.
Article
n-grams have been used widely and successfully for approximate string matching in many areas. s-grams have been introduced recently as an n-gram based matching technique, where di-grams are formed of both adjacent and non-adjacent characters. s-grams have proved successful in approximate string matching across language boundaries in Information Retrieval (IR). s-grams however lack precise definitions. Also their similarity comparison lacks precise definition. In this paper, we give precise definitions for both. Our definitions are developed in a bottom-up manner, only assuming character strings and elementary mathematical concepts. Extending established practices, we provide novel definitions of s-gram profiles and the L1 distance metric for them. This is a stronger string proximity measure than the popular Jaccard similarity measure because Jaccard is insensitive to the counts of each n-gram in the strings to be compared. However, due to the popularity of Jaccard in IR experiments, we define the reduction of s-gram profiles to binary profiles in order to precisely define the (extended) Jaccard similarity function for s-grams. We also show that n-gram similarity/distance computations are special cases of our generalized definitions.
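
A hedged sketch of the s-gram idea: character di-grams formed from adjacent and non-adjacent characters (skip distances 0 and 1 here), compared with Jaccard similarity over binary profiles. This simplifies the paper's formal definitions and omits its L1 distance over count profiles.

```python
# Hedged sketch of s-grams: di-grams over adjacent and non-adjacent
# characters, compared with Jaccard similarity on binary profiles.

def s_grams(word, skips=(0, 1)):
    grams = set()
    for skip in skips:
        for i in range(len(word) - skip - 1):
            grams.add(word[i] + word[i + skip + 1])
    return grams

def jaccard(a, b):
    ga, gb = s_grams(a), s_grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

print(jaccard("colour", "color"))
```
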
Conference Paper
Commonly used vocabulary in Indian language documents found on the web contains a number of words that have Sanskrit, Persian or English origin. However, such words may be written in different scripts with slight variations in spelling and morphology. In this paper we explore approximate string matching techniques to exploit the relatively large number of cognates among Indian languages, which is higher than between an Indian language and a non-Indian language. We present an approach to identify cognates and make use of them for improving dictionary-based CLIR when the query and documents belong to two different Indian languages. We conduct experiments using a Hindi document collection and a set of Telugu queries and report the improvement due to cognate recognition and translation.
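
A rough sketch of the cognate fallback for dictionary-based CLIR: query terms missing from the bilingual dictionary are mapped to the closest document-language vocabulary entry by approximate string matching. The romanized forms, dictionary, and cutoff below are all illustrative assumptions, not the paper's resources.

```python
# Hedged sketch: cognate fallback for dictionary-based CLIR.  Query terms
# absent from the bilingual dictionary are matched approximately against
# the document-language vocabulary.  Toy romanized data only.

from difflib import get_close_matches

bilingual_dict = {"pustak": "pustakam"}            # query-lang -> doc-lang
doc_vocabulary = ["pustakam", "vidyalayam", "jalam", "hastam"]

def translate_query(query_terms, cutoff=0.7):
    translated = []
    for term in query_terms:
        if term in bilingual_dict:
            translated.append(bilingual_dict[term])
        else:
            # Cognate fallback: nearest vocabulary item, if close enough.
            translated.extend(get_close_matches(term, doc_vocabulary,
                                                n=1, cutoff=cutoff))
    return translated

print(translate_query(["pustak", "vidyalay", "jal"]))
```
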
Conference Paper
Profile hidden Markov models (Profile HMMs) are specific types of hidden Markov models used in biological sequence analysis. We propose the use of Profile HMMs for word-related tasks. We test their applicability to the tasks of multiple cognate alignment and cognate set matching, and find that they work well in general for both tasks. On the latter task, the Profile HMM method outperforms average and minimum edit distance. Given the success for these two tasks, we further discuss the potential applications of Profile HMMs to any task where consideration of a set of words is necessary.
Article
Alignment of phonetic sequences is a necessary step in many applications in computational phonology. After discussing various approaches to phonetic alignment, I present a new algorithm that combines a number of techniques developed for sequence comparison with a scoring scheme for computing phonetic similarity on the basis of multivalued features. The algorithm performs better on cognate alignment, in terms of accuracy and efficiency, than other algorithms reported in the literature.
Article
This article advances the state of the art of bitext mapping by formulating the problem in terms of pattern recognition. From this point of view, the success of a bitext mapping algorithm hinges on how well it performs three tasks: signal generation, noise filtering, and search. The Smooth Injective Map Recognizer (SIMR) algorithm presented here integrates innovative approaches to each of these tasks. Objective evaluation has shown that SIMR's accuracy is consistently high for language pairs as diverse as French/English and Korean/English. If necessary, SIMR's bitext maps can be efficiently converted into segment alignments using the Geometric Segment Alignment (GSA) algorithm, which is also presented here. SIMR has produced bitext maps for over 200 megabytes of French-English bitexts. GSA has converted these maps into alignments. Both the maps and the alignments are available from the Linguistic Data Consortium.
Article
We describe an application of sentence alignment techniques and approximate string matching to the problem of extracting lexicographically interesting word-word pairs from multilingual corpora. Since our interest is in support systems for lexicographers rather than in fully automatic construction of lexicons, we would like to provide access to parameters allowing a tunable trade-off between precision and recall. We evaluate two techniques for doing this. Since sentence alignment tends to associate semantically similar words while approximate string matching draws attention to orthographic similarities, the two techniques can serve different lexicographic purposes, as can their combination, which amounts, inter alia, to a tool for uncovering faux amis. We conclude by sketching a simple and flexible means for allowing lexicographers to provide information which has the potential to improve system performance.
Alignment Analysis of Sequential Segmentation of Lexicons to Improve Automatic Cognate Detection
  • Pranav A
Assessing the temporal relationship between cognition and gait: slow gait predicts cognitive decline in the Mayo Clinic Study of Aging
  • Michelle M Mielke
  • Rosebud O Roberts
  • Rodolfo Savica
  • Ruth Cha
  • Dina I Drubach
  • Teresa Christianson
  • Vernon S Pankratz
  • Yonas E Geda
  • Mary M Machulda
  • Robert J Ivnik
Approximate string matching techniques for effective CLIR among Indian languages
  • Ranbeer Makin
  • Nikita Pandey
  • Prasad Pingali
  • Vasudeva Varma