Alexa Ai’s scientific contributions


Publications (1)


Fig. 1. TTR at character level of 1-7 grams over the fixed vocabulary of each language
Fig. 2. Zipf's curve for Bhojpuri, Maithili, Magahi, and Hindi: The plot on the left is for full corpora, while the one on the right is for minimum (Maithili) size
Fig. 3. Word length analysis in terms of characters: The plot on the left is for full corpora, while the one on the right is for minimum (Maithili) size
The top ten most frequent words for the three languages in decreasing order of their frequency (Freq) with their Relative frequency (RF) and Cumulative Coverage (CC) values specified, compared with Hindi
Frequency of orthographic syllables based on positions


Basic Linguistic Resources and Baselines for Bhojpuri, Magahi and Maithili for Natural Language Processing
  • Preprint

April 2020 · 889 Reads · Rajesh Kumar Mundotiya · Alexa Ai · [...]

Preparing corpora for low-resource languages, and developing human language technology to analyze or computationally process them, is a laborious task, primarily because expert linguists who are native speakers of these languages are rarely available and because of the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the northeastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, a relatively high-resource language, which is why we make our comparisons with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible, without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at the character, word, syllable, and morpheme levels. These corpora were also annotated with parts-of-speech (POS) and chunk tags. The basic statistical measures were both absolute and relative and were meant to give an indication of linguistic properties such as morphological, lexical, phonological, and syntactic complexity (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to keep the size of the corpus the same across the languages so as to avoid the effect of corpus size, but in some cases it turned out that using the full corpus was better, even if the sizes were very different. Although the results are not very clear, we try to draw some conclusions about the languages and the corpora. For POS tagging and chunking, the BIS tagset was used to manually annotate the data. The sizes of the POS-tagged data are 16067, 14669, and 12310 sentences for Bhojpuri, Magahi, and Maithili, respectively. The sizes for chunking are 9695 and 1954 sentences for Bhojpuri and Maithili, respectively. The inter-annotator agreement for these annotations, measured with Cohen's kappa, was 0.92, 0.64, and 0.74, respectively, for the three languages. These annotated corpora have been used to develop automated tools, which include a POS tagger, a chunker, and a language identifier. We have also developed a bilingual dictionary (Purvanchal languages to Hindi) and a synset (that can be integrated later into the Indo-WordNet) as additional resources. The main contribution of the work is the creation of basic resources to facilitate further language processing research for these languages, along with some quantitative measures of the languages and of their similarities among themselves and with Hindi. For similarities, we use a somewhat novel measure of language similarity based on an n-gram based language identification algorithm. Finally, we provide baselines for some of the basic NLP tools, such as POS tagging, chunking, and language identification, for closely related languages.
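
The abstract reports inter-annotator agreement as Cohen's kappa. As a rough illustration of what that statistic measures, the sketch below computes kappa for two annotators who tagged the same tokens. The function name `cohen_kappa` and the toy tag sequences are assumptions made for this example; they are not taken from the paper's data or code, and the paper does not say how its own kappa values were computed.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative helper (hypothetical, not from the paper).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[tag] * counts_b[tag] for tag in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example with BIS-style POS tags (made-up annotations).
annotator_1 = ["N_NN", "V_VM", "N_NN", "PSP", "N_NN", "V_VM"]
annotator_2 = ["N_NN", "V_VM", "JJ", "PSP", "N_NN", "V_VM"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so two annotators who agree on five of six tokens score lower than the raw 0.83 would suggest.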
