Conference Paper

Kannpos-Kannada Parts of Speech Tagger Using Conditional Random Fields

Authors: K. P. Pallavi, Anitha S. Pillai

Abstract

Parts of Speech (POS) tagging is one of the basic text-processing tasks of Natural Language Processing (NLP). Developing a POS tagger for Indian languages, especially Kannada, is a great challenge because of the language's rich morphology and highly agglutinative nature. This paper discusses a Kannada POS tagger developed using Conditional Random Fields (CRFs), a supervised machine learning technique. The results presented are based on experiments conducted on a corpus of 80,000 words, of which 64,000 are used for training and 16,000 for testing. These words were collected from Kannada Wikipedia and annotated with POS tags. The tagset from Technology Development for Indian Languages (TDIL), containing 36 tags, is used to assign the POS labels. The n-gram CRF model gave a maximum accuracy of 92.94%. This work is an extension of “Parts of Speech (POS) Tagger for Kannada Using Conditional Random Fields (CRFs)”.
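The paper does not include code, but the recipe it describes (token-level context features fed to a CRF, trained on 80% of the corpus and evaluated on the rest) can be sketched roughly as follows. This is only a minimal illustration using the third-party sklearn-crfsuite package rather than the authors' own pipeline; the feature names, the toy sentence and its tags are assumptions.

# Minimal sketch of an n-gram CRF POS tagger (illustrative only; not the
# authors' pipeline, feature set or corpus).
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def word_features(sent, i):
    """Unigram/bigram-style context features for the i-th token."""
    word = sent[i]
    return {
        "word": word,
        "suffix3": word[-3:],  # crude morphological cue for agglutinative text
        "prev_word": sent[i - 1] if i > 0 else "<BOS>",
        "next_word": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }

def sent_to_features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

# Toy placeholder data: (tokens, tags) sentence pairs; in practice the
# 64,000-word training portion and 16,000-word test portion would go here.
train = [(["ನಾನು", "ಶಾಲೆಗೆ", "ಹೋದೆ"], ["PRP", "NN", "VM"])]
test = train

X_train = [sent_to_features(toks) for toks, _ in train]
y_train = [tags for _, tags in train]
X_test = [sent_to_features(toks) for toks, _ in test]
y_test = [tags for _, tags in test]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print("token accuracy:", metrics.flat_accuracy_score(y_test, crf.predict(X_test)))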


... Currently there are no freely available large online corpora or gazetteer lists for Kannada. Only a few NERs, POS taggers and chunk taggers have been reported (Amarappa and Sathyanarayana, 2013a; Bhuvaneshwari, 2014; Pallavi and Pillai, 2015). ...
... A POS tagger using CRFs has been developed. There are three noun categories (common noun, proper noun, location) in the POS tagset (Pallavi and Pillai, 2015), which helped to identify named entities, since an NE is typically represented as a proper noun. ...
... The POS tagger developed by the authors, which achieved an accuracy of 92.8% (Pallavi and Pillai, 2015), and a new NP chunker with 95.32% accuracy developed for the task were used to improve the Kannada NER system. The system was trained using POS and chunk information in the beginning. ...
Article
Full-text available
Named Entities (NEs) that occur in sentences are essential for building Natural Language Processing (NLP) applications for Information Extraction (IE) from large corpora. However, generating a large corpus is challenging for resource-poor languages such as Kannada, and no annotated corpus is available online. The challenges in annotating NEs with pre-defined classes are that NEs are morphologically joined with other words and that spelling variations are frequent in Kannada. Sentence structure varies according to the morphology, parts of speech (POS) and chunking of a language, and these parameters differ from one language to another. To address these challenges, a novel application system is proposed to identify NEs in Kannada using a large corpus of 73,676 tokens. The Named Entity Recognition (NER) system consists of a robust POS tagger and a Noun Phrase (NP) chunker developed for generic data. Five gazetteer lists were created from many orthographic patterns for each word. Context information such as the previous two words, the next two words, word morphology and gazetteer lists was added to the feature lists. A unigram-bigram template was designed and incorporated into Conditional Random Fields (CRFs) to generate conditional feature functions. The proposed system achieved 86.85% and 71.01% f-measure on gold test data and newspaper data respectively.
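A unigram-bigram template of the kind mentioned above is typically expressed, in tools such as CRF++, as a feature-template file. The sketch below is a purely hypothetical illustration of such a template, assuming a training file whose first column is the token, second column a gazetteer flag, and last column the NE label; it is not the template actually used by the authors.

# Hypothetical CRF++-style feature template (not the authors' actual file).
# Assumed column layout: 0 = token, 1 = gazetteer flag, last column = NE label.

# Unigram features over a +/-2 word window
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]

# Gazetteer flag of the current token
U05:%x[0,1]

# Word bigrams around the current position
U06:%x[-1,0]/%x[0,0]
U07:%x[0,0]/%x[1,0]

# Output-label bigram (transition feature)
B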
... Using the corpus, the systems produce accuracies of 84.54% for CRF and 79.9% for HMM. Another paper (Pallavi & Pillai, 2016) also discussed a POS tagging system for the Kannada language using CRF. For that work, 36 tags from Technology Development for Indian Languages (TDIL) were used to annotate the corpus. ...
... The advantage of using CRF in POS tagging is that, as an undirected graph-based model, it can examine the words both before and after the entity. The ability to incorporate local features in a log-linear model is one of the best characteristics of the CRF model (Pallavi & Pillai, 2016; Khan et al., 2019; Pandian & Geetha, 2009; Zhang et al., 2008). ...
Article
Full-text available
Khasi is a language that belongs to the Mon-Khmer branch of the Austroasiatic family and is spoken by the indigenous people of the state of Meghalaya in India. This paper presents work on Part-of-Speech (POS) tagging for the Khasi language using the Conditional Random Field (CRF) method. The main significance of this work is to experiment with the CRF model for POS tagging in Khasi; the method produces reliable agreement on the features of the language. POS tagging for Khasi is essential for creating lemmatizers, which reduce a word to its root form, and the POS corpus or dataset can be used in other NLP applications. In this research work, we have designed a tagset and a POS tagging corpus. Khasi does not have any standard POS corpus, so we built a Khasi corpus consisting of around 71,000 tokens. After feeding the Khasi corpus to the CRF model for learning, the system yields a testing accuracy of 92.12% and an F1-score of 0.91. The result is compared with a few other state-of-the-art techniques, and it is observed that our approach produces promising results in comparison with them. In future, we will increase the size of the Khasi POS corpus.
... Twitter data are predominantly extracted and monitored by public and private organizations for analysis of various industry trends and for opinion mining. For efficient NLP, the corpus should be from the same domain as the NLP application [1]. NLP involves a set of computational linguistic tools for interaction between computers and natural languages. ...
... Conditional Random Fields are a probabilistic framework for segmenting and labelling sequence data. From the literature it was understood that CRF performs better than other models such as the Hidden Markov Model, by providing conditional probability, and Maximum Entropy Markov Models, by conditioning on the observation and label sequence [1]. Finally, indexing was done and the data was presented in the prescribed ESM-IL format. ...
... Pallavi and Pillai in 2016 [9] experimented with a POS tagger for the Kannada language by adopting an n-gram CRF approach. The dataset consists of 80,000 words, of which 64,000 words are used as the training dataset and 16,000 words as the test dataset. ...
Chapter
Full-text available
Computational Linguistics is necessary for understanding language, which gives human beings insight into thinking and intelligence, and it is an inspiring research area in the Natural Language Processing (NLP) domain. Parts of Speech (POS) labelling is a crucial phase in NLP, since most other tasks, such as syntactic parsing, semantic parsing, sentiment analysis and sentence-level classification, are carried out based on it. This paper presents a POS tagger model developed on Kannada texts, one of the South Indian languages, employing a deep neural network methodology. The deep learning model adopted in this work combines word embeddings with a Recurrent Neural Network (RNN) using Long Short-Term Memory (LSTM) units. The total dataset used in this implementation is 10,000 annotated Kannada sentences (190,000 Kannada words) from five different domains: Agriculture, Sports, Literature, Tourism, and Science and Technology. Most of the sentences in the dataset are compound and range up to 10–11 words. The dataset is taken from the Technology Development for Indian Languages (TDIL) website and has been divided into 8,000 Kannada sentences as a training dataset and 2,000 Kannada sentences as a test dataset. The BIS (Bureau of Indian Standards) tagset is adopted for POS tags, of which we considered 27 prime POS tags. The average POS tagging accuracy of the trained model on an unseen dataset is 81%. The results are presented with plotted graphs in this paper.
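As a rough illustration of the embedding-plus-LSTM architecture described above (not the chapter's exact model; the vocabulary size, dimensions and toy tensors below are assumptions), a PyTorch sketch might look like this:

# Minimal sketch of an embedding + LSTM sequence tagger (illustrative only).
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tagset_size)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))
        return self.out(h)                   # (batch, seq_len, tagset_size)

# Toy usage with assumed sizes (a 5,000-word vocabulary and 27 BIS tags).
model = LSTMTagger(vocab_size=5000, tagset_size=27)
tokens = torch.randint(0, 5000, (1, 11))     # one 11-word sentence
gold = torch.randint(0, 27, (1, 11))
loss = nn.CrossEntropyLoss()(model(tokens).view(-1, 27), gold.view(-1))
loss.backward()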
... They built a CRF model and an RNN model separately and compared the accuracy results. Pallavi et al. [2] proposed a POS tagger for Kannada with 1,000 words, trained the CRF model and obtained an average accuracy of 55% using the 10-fold cross-validation technique. The authors of [3] have also implemented a deep learning technique, but for sentiment analysis, to predict sentiments at the sentence level. ...
Article
Full-text available
Computational Linguistics is one of the interesting topics in the research field of Computer Science. This paper presents training for Part of Speech (POS) tagging of Kannada words using two techniques. The first approach is the supervised machine learning technique CRF++ 0.50 (Conditional Random Fields). The second approach is a combination of word embeddings and deep learning techniques. The total dataset used for this implementation is 1,200 tagged Kannada sentences downloaded from Technology Development for Indian Languages (TDIL). We divided the dataset into 1,100 sentences (13,600 words) as training data and 100 sentences (1,053 words) as test data. The BIS (Bureau of Indian Standards) tagset is used in this work, in which 27 major POS tags have been considered. The accuracy obtained with the CRF++ 0.50 tool is 76% and that with the deep learning technique is 71%. The precision, recall and f-score of each tag obtained with both techniques are considerable.
... It is also inflexible to the evolution of the language. K. P. Pallavi and Anitha S. Pillai [4] also use CRFs for POS tagging. They develop a tagger using an 80,000-word corpus created from Kannada Wikipedia. ...
Article
Full-text available
This paper proposes a system for part-of-speech tagging of the South Indian language Kannada using supervised machine learning. POS tagging is an important step in Natural Language Processing and has varied applications such as word sense disambiguation and natural language understanding. Based on extensive research into methods used for POS tagging, Conditional Random Fields have been chosen as our algorithm. CRFs are used for sequence modelling in POS tagging and named entity recognition, and as an alternative to Hidden Markov Models. Three very large corpora are used and their results are compared. The feature sets for all three corpora are also varied. The best method for the task is determined using these results.
Article
Full-text available
Compilation of a tag set is an important task in all NLP and is the initial stage of all NLP applications. In this paper we focus on improvements over IL-POST (the Indian Language Part-of-Speech Tagset). Our tag set is fine-grained and captures detailed information; we have developed it keeping higher-level NLP applications in mind, since a fine-grained tag set is useful for applications such as chunking, parsing, morphological analysis and machine translation. We follow the EAGLES (Expert Advisory Group on Language Engineering Standards) guidelines, with modifications as required for the Kannada language. The morphology of Kannada is complex, comparable to Turkish and Finnish, and this tag set can be adopted for the whole Dravidian language family. It is a hierarchical tag set and is largely based on computational needs. We have compiled a tag set of 170 tags. Compiling a tag set is quite challenging for languages like Kannada, and this paper looks at solving the open issues left unresolved in Microsoft's IL-POST tag set, such as clitics, auxiliaries and modal auxiliaries. The tagging efficiency rate with our tag set is more than 90%, compared with existing ones.
Chapter
Full-text available
This paper presents a challenging task: POS tagging for the Odia language using Artificial Neural Networks. A single neural-network-based POS tagger with a fixed context length, chosen empirically, is presented first. Then a multi-neuron tagger, which consists of multiple single-neuron taggers with fixed but different context lengths, is presented; it performs tagging by voting on the output of all single-neuron taggers. The experiments carried out are discussed: errors were corrected through forward propagation and the rectified neuron values were transmitted by the feed-forward method through the layers of the network, i.e. the input layer, the hidden layers and the output layer. Neural networks are one of the most efficient techniques for identifying the correct data. When only a small labelled training set is provided, an HMM-based approach does not yield very good results, so in this work a morphological analyzer is used to improve the performance of the tagger. The tagger has an accuracy of about 81% on the test data provided.
Conference Paper
Full-text available
Part of Speech tagging for Indian languages is still an open problem, and we still lack a clear approach to implementing a POS tagger for them. In this paper we describe our efforts to build a Hidden Markov Model based Part of Speech tagger. We have used the IL POS tag set for the development of this tagger and achieved an accuracy of 92%.
Article
Full-text available
Parts of Speech (POS) tagging is the task of assigning to each word of a text the proper POS tag for its context of appearance in a sentence. Chunking is the process of identifying and assigning the different types of phrases in sentences. In this paper, a statistical approach using a Hidden Markov Model with the Viterbi algorithm is described. The corpus, both tagged and untagged, used for training and testing the system is in the Unicode UTF-8 format.
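As a reference for the Viterbi decoding step mentioned above, here is a minimal, self-contained sketch of the algorithm for a first-order HMM. The paper's actual transition and emission estimates, smoothing and tagset are not reproduced; the probability tables are assumed to be given.

# Minimal Viterbi decoder for a first-order HMM tagger (illustrative only).
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words`.
    start_p[t], trans_p[prev][t] and emit_p[t][w] are assumed probabilities."""
    # V[i][t] = (best score of any tag path ending in tag t at position i, predecessor tag)
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-9), None) for t in tags}]
    for w in words[1:]:
        V.append({
            t: max(
                (V[-1][prev][0] * trans_p[prev][t] * emit_p[t].get(w, 1e-9), prev)
                for prev in tags
            )
            for t in tags
        })
    # Backtrace from the highest-scoring final tag.
    best = max(tags, key=lambda t: V[-1][t][0])
    path = [best]
    for step in range(len(V) - 1, 0, -1):
        best = V[step][best][1]
        path.append(best)
    return list(reversed(path))

In a real tagger, the probability tables would be estimated from the tagged training corpus and smoothed to handle unseen words.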
Article
Full-text available
We present a new part-of-speech tagger that demonstrates the following ideas: (i) explicit use of both preceding and following tag contexts via a dependency network representation, (ii) broad use of lexical features, including jointly conditioning on multiple consecutive words, (iii) effective use of priors in conditional loglinear models, and (iv) fine-grained modeling of unknown word features. Using these ideas together, the resulting tagger gives a 97.24% accuracy on the Penn Treebank WSJ, an error reduction of 4.4% on the best previous single automatically learned tagging result.
Article
Different languages contain complementary cues about entities, which can be used to improve Named Entity Recognition (NER) systems. We propose a method that formulates the problem of exploring such signals on unannotated bilingual text as a simple Integer Linear Program, which encourages entity tags to agree via bilingual constraints. Bilingual NER experiments on the large OntoNotes 4.0 Chinese-English corpus show that the proposed method can improve strong baselines for both Chinese and English. In particular, Chinese performance improves by over 5% absolute F1 score. We can then annotate a large amount of bilingual text (80k sentence pairs) using our method, and add it as uptraining data to the original monolingual NER training corpus. The Chinese model retrained on this new combined dataset outperforms the strong baseline by over 3% F1 score.
Article
Part-Of-Speech (POS) tagging is defined as the Natural Language Processing (NLP) task in which each word in a sentence is labeled with a tag indicating its appropriate part of speech. Of all the supervised machine learning classification algorithms, a second-order Hidden Markov Model (HMM) and Conditional Random Fields (CRF) were chosen in this work for POS tagging of the Kannada language. The training data includes 51,269 words and the test data consists of around 2,932 tokens, both sets being disjoint and taken from the EMILLE corpus. Experiments show that the accuracy of the tools based on HMM and CRF is 79.9% and 84.58% respectively.
Article
Part of speech (POS) tagging is one of the basic preprocessing techniques for any text-processing NLP application. It is a difficult task for morphologically rich and partially free word order languages. This paper describes a POS tagger for one such morphologically rich language, Tamil. The main issue in POS tagging is the ambiguity that arises because different POS tags can have the same inflections and have to be disambiguated using the context. This paper presents a pattern-based bootstrapping approach using only a small set of POS-labeled suffix context patterns. A pattern consists of a stem and a sequence of suffixes, obtained by segmentation using a suffix list. The bootstrapping technique generates new patterns by iteratively masking suffixes with a low probability of occurrence in the suffix context and replacing them with other co-occurring suffixes. We have tested our system with a corpus containing 20,000 Tamil documents having 271,933 unique words. Our system achieves a precision of 87.74%.
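The segmentation step this approach builds on (splitting a word into a stem plus a suffix sequence using a suffix list) can be sketched as follows. The suffix list, transliteration and example word are purely hypothetical, and the bootstrapping/masking stage of the paper is not reproduced.

# Minimal sketch of suffix-list segmentation (illustrative only).
def segment(word, suffixes=("kal", "il", "ai", "ku")):
    """Greedily strip known suffixes from the right; return (stem, suffix sequence)."""
    found = []
    stripped = True
    while stripped:
        stripped = False
        for s in sorted(suffixes, key=len, reverse=True):
            if word.endswith(s) and len(word) > len(s):
                found.insert(0, s)
                word = word[:-len(s)]
                stripped = True
                break
    return word, found

print(segment("maramkalil"))  # hypothetical transliteration -> ('maram', ['kal', 'il'])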
Article
Despite significant recent work, purely unsupervised techniques for part-of-speech (POS) tagging have not achieved useful accuracies required by many language processing tasks. Use of parallel text between resource-rich and resource-poor languages is one source of weak supervision that significantly improves accuracy. However, parallel text is not always available and techniques for using it require multiple complex algorithmic steps. In this paper we show that we can build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary. Across eight languages for which we have labeled data to evaluate results, we achieve accuracy that significantly exceeds best unsupervised and parallel text methods. We achieve highest accuracy reported for several languages and show that our approach yields better out-of-domain taggers than those trained using fully supervised Penn Treebank.
Article
We consider the construction of part-of-speech taggers for resource-poor languages. Recently, manually constructed tag dictionaries from Wiktionary and dictionaries projected via bitext have been used as type constraints to overcome the scarcity of annotated data in this setting. In this paper, we show that additional token constraints can be projected from a resource-rich source language to a resource-poor target language via word-aligned bitext. We present several models to this end; in particular a partially observed conditional random field model, where coupled token and type constraints provide a partial signal for training. Averaged across eight previously studied Indo-European languages, our model achieves a 25% relative error reduction over the prior state of the art. We further present successful results on seven additional languages from different families, empirically demonstrating the applicability of coupled token and type constraints across a diverse set of languages.
Conference Paper
We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
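For reference, the conditional distribution that a linear-chain CRF defines over a label sequence y given an observation sequence x is commonly written as follows (this is the standard textbook form, not the exact notation of the paper):

p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

Here the f_k are feature functions over adjacent labels and the observation sequence, the lambda_k are learned weights, and Z(x) normalizes over all possible label sequences, which is what lets the model condition on the whole observation without the label-bias problem of MEMMs.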
Conference Paper
This paper presents the development of a part-of-speech tagger for the Kannada language that can be used for analyzing and annotating Kannada texts. POS tagging is considered one of the basic tools and components necessary for many Natural Language Processing (NLP) applications, such as speech recognition, natural language parsing, information retrieval and information extraction. In order to alleviate these problems for Kannada, we propose a new machine learning POS tagging approach. Identifying the ambiguities in Kannada lexical items is the challenging objective in developing an efficient and accurate POS tagger. We have developed our own tagset, which consists of 30 tags, and built a part-of-speech tagger for the Kannada language using a Support Vector Machine (SVM). A corpus of texts, extracted from Kannada newspapers and books, was manually morphologically analyzed and tagged using our tagset. The performance of the system was evaluated and the results obtained were more efficient and accurate compared with earlier methods for Kannada POS tagging.
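In contrast with the sequence models above, an SVM tagger of this kind classifies each token independently from window features. The following is a minimal sketch with scikit-learn, given only as an illustration; the authors' actual feature set, kernel, 30-tag tagset and corpus are not reproduced, and the toy sentence, tags and helper names are assumptions.

# Minimal sketch of a per-token SVM POS tagger (illustrative only).
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def window_feats(sent, i):
    """Per-token features from a +/-1 word window plus a crude suffix cue."""
    return {
        "word": sent[i],
        "prev": sent[i - 1] if i > 0 else "<BOS>",
        "next": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
        "suffix2": sent[i][-2:],
    }

# Toy data: one tagged sentence; each token becomes one training example.
sents = [(["ನಾನು", "ಶಾಲೆಗೆ", "ಹೋದೆ"], ["PRP", "NN", "VM"])]
X = [window_feats(toks, i) for toks, _ in sents for i in range(len(toks))]
y = [tag for _, tags in sents for tag in tags]

tagger = make_pipeline(DictVectorizer(), LinearSVC())
tagger.fit(X, y)
print(tagger.predict([window_feats(["ನಾನು", "ಮನೆಗೆ", "ಹೋದೆ"], 1)]))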
Conference Paper
We show that categories induced by unsupervised word clustering can surpass the performance of gold part-of-speech tags in dependency grammar induction. Unlike classic clustering algorithms, our method allows a word to have different tags in different contexts. In an ablative analysis, we first demonstrate that this context-dependence is crucial to the superior performance of gold tags --- requiring a word to always have the same part-of-speech significantly degrades the performance of manual tags in grammar induction, eliminating the advantage that human annotation has over unsupervised tags. We then introduce a sequence modeling technique that combines the output of a word clustering algorithm with context-colored noise, to allow words to be tagged differently in different contexts. With these new induced tags as input, our state-of-the-art dependency grammar inducer achieves 59.1% directed accuracy on Section 23 (all sentences) of the Wall Street Journal (WSJ) corpus --- 0.7% higher than using gold tags.
Conference Paper
We address the problem of part-of-speech tagging for English data from the popular microblogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related social media data sets.
Designing POS Tagset for Kannada, Linguistic Data Consortium for Indian Languages (LDC-IL), Organized by Central Institute of Indian Languages
  • V F Patil
Rule Based POS Tagger for Marathi Text
  • P Bagul
  • A Mishra
  • P Mahajan
  • M Kulkarni
  • G Dhopavkar
Developing a part of speech tagger for Manipuri
  • Kh R Singha
  • Ksh K B Singha
  • B S Purkayastha
Cross language POS taggers (and other tools) for Indian languages: an experiment with Kannada using Telugu resources
  • S Reddy
  • S Sharoff
POS Tagger for Kannada Sentence Translation
  • M V Reddy
  • M Hanumanthappa
Part-of-speech tagging for twitter: annotation, features, and experiments
  • K Gimpel
  • N Schneider
  • B O'Connor
  • D Das
  • D Mills
  • J Eisenstein
  • M Heilman
  • D Yogatama
  • J Flanigan
  • N A Smith