Article

Suffix stripping based NER in Assamese for location names

Abstract

Named Entity Recognition (NER) is the process of identifying and classifying proper nouns in text documents into pre-defined classes such as person, location and organization. It plays an important role in Natural Language Processing applications. Although NER in Indian languages is a difficult and challenging task and suffers from scarcity of resources, such work has started to appear recently. In highly inflectional languages such as Assamese, NER requires identification of the root forms of words that occur in texts. Our work reports a suffix stripping approach to identify those roots of words which are location named entities.
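The suffix stripping idea described in the abstract can be illustrated with a minimal Python sketch. The suffix list, the gazetteer lookup and the function name `location_root` are all assumptions made for illustration; the abstract does not specify the paper's actual suffix rules or how stripped roots are validated.

```python
# Illustrative Assamese case markers (an assumption, not the paper's list):
# "লৈ" (to), "ত" (in/at), "ৰ" (of).
SUFFIXES = ["লৈ", "ত", "ৰ"]

# Hypothetical gazetteer of known location roots (Guwahati, Jorhat).
GAZETTEER = {"গুৱাহাটী", "যোৰহাট"}


def location_root(word, suffixes=SUFFIXES, gazetteer=GAZETTEER):
    """Return the location root of `word`, or None if none is found."""
    # An uninflected location name is already a root.
    if word in gazetteer:
        return word
    # Try longer suffixes first so a short suffix does not pre-empt a long one.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            root = word[: -len(suffix)]
            # Accept the stripped form only if it is a known location.
            if root in gazetteer:
                return root
    return None


if __name__ == "__main__":
    for token in ["গুৱাহাটীত", "যোৰহাটলৈ", "কিতাপত"]:
        print(token, "->", location_root(token))
```

Checking the stripped form against a word list is one simple way to avoid treating every word that happens to end in a case marker as a location; a real system would need a far larger suffix inventory and lexicon.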

... "Suffix Stripping Based Named Entity Recognition in Assamese for Location Names" mentioned by Padmaja Sharma, Utpal Sharma and Jugal Kalita (2010) [6] used an Assamese Pratidin Corpus containing 300,000 wordforms. A location named entity was produced by generating the root word by removing suffixes. ...
... Approach Used F-measure (%) Suffix stripping based approach [6] 88 Table 6. F-measure achieved in Urdu for different statistical approach. ...
Article
Full-text available
Named Entity Recognition is important in major Natural Language Processing tasks such as information extraction, question answering, machine translation and document summarization. In this paper we present a survey of Named Entity Recognition in Indian languages, with particular reference to Assamese. Various rule-based and machine learning approaches are available for Named Entity Recognition. We first outline the available approaches and then discuss related research in this field. Assamese, like other Indian languages, is agglutinative and suffers from a lack of appropriate resources: Named Entity Recognition requires large data sets, gazetteer lists, dictionaries and so on, and useful features such as capitalization, as found in English, are not available in Assamese. We also describe some of the issues faced in Named Entity Recognition for Assamese.
... They analyzed the tagged corpus to enumerate some rules for automatic Named Entity tagging. In another work [10], they found location named entities using a suffix stripping approach to identify the root of the word. They collected a corpus of nearly 300,000 words from Asomiya Pratidin. ...
Conference Paper
Full-text available
Machine Translation (MT) is the process of automatically converting one natural language into another while preserving the meaning of the input text in the output text. It is one of the classical problems in the Natural Language Processing (NLP) domain and has wide application in daily life. Although MT research for English and some other languages is at a relatively advanced stage, for most languages it remains far from human-level translation performance. From a computational point of view, MT requires considerable preprocessing and many basic NLP tools and resources. This study gives an overview of the available basic NLP resources in the context of Assamese-English machine translation.
... Another system that was reported to perform named entity recognition was a suffix stripping based system for finding locations. The system took advantage of the fact that in Assamese some location named entities often combine with common suffixes [2]. NER in Assamese was done using a rule-based approach and conditional random fields in [3], which achieved an F-measure of 90-95%. ...
Chapter
Full-text available
Named Entity Recognition (NER) is crucial for information extraction, question answering, document summarization and machine translation, which are undoubtedly important Natural Language Processing (NLP) tasks. This work is a detailed analysis of our previously developed NER system, with emphasis on how individual features contribute to the recognition of person, location and organization named entities and how different combinations of these features affect the performance of the system. In addition, we have evaluated the behaviour of the features as the training and test corpora grow. Since this system is based on supervised learning, it needs a large training corpus tagged with parts of speech and named entities, as well as a test corpus tagged with parts of speech. The overall system performs best when the training corpus contains 5000 words and the test corpus contains 50 named entities; the values obtained are 95% precision, 84% recall and 89% F1-measure. This work adds a new dimension to the use of features for recognition of ENAMEX tags in Assamese corpora.
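As a quick consistency check on the figures reported above, the F1-measure can be recomputed from the stated precision and recall, assuming the standard harmonic-mean definition (the abstract does not spell out the formula used):

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.95 \times 0.84}{0.95 + 0.84}
    = \frac{1.596}{1.79}
    \approx 0.892
```

Under that assumption the result rounds to the reported 89%.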
... Assamese is a highly inflectional language [7]. Word sense disambiguation in Assamese is difficult due to its rich morphology. ...
Chapter
Full-text available
Word sense ambiguity arises from the use of lexemes associated with more than one sense. In this research work, an improvement is proposed and evaluated for our previously developed Assamese Word-Sense Disambiguation (WSD) system, in which the potential of semantic features was evaluated only to a limited extent. As semantic relationship information has a positive effect in most natural language processing (NLP) tasks, the system in this work is developed using a supervised learning approach based on a Naïve Bayes classifier with syntactic as well as semantic features. By incorporating the Semantically Related Words (SRW) feature in our feature set, the performance of the overall system improved to 91.11% in terms of F1-measure, compared with 86% for the previously developed system.
... . As part of the preprocessing steps, it removed punctuation, digits and single-character words. The stemmer performance was evaluated over different domains of 1,800 words. The technique showed improvement in performance over the rule-based system. Technology Development for Indian Languages (TDIL) datasets were used for testing, with 90.48% accuracy. [Padmaja Sharma et al, 2012] introduced a suffix stripping based named entity recognizer in Assamese for location names. NER is an important task for natural language processing. In Assamese, however, it was a challenging task because of the scarcity of resources. The fact that Assamese is an inflectional language makes the job more difficult. The work reported a ...
Article
Full-text available
The authors used Yet Another Suffix Stripper (YASS) to find the base words, or stems, for Mising, one of the languages of northeast India. Mising has over 500,000 speakers and is written in the Roman script. Mising Agom Kébang is the highest body of the Mising people and is dedicated to the development of Mising literature. This suffix remover can be used without in-depth knowledge of the language. The authors successfully used YASS to find stems, with an F-score of around 87%. In the field of information retrieval, the automatic removal of suffixes is very important. As the Mising language does not have a known corpus, the authors created one.
Article
The morphological variations of highly inflected languages that appear in a text impede computer processing and root word determination when extracting an abstract. As a remedy, a lemmatization algorithm is developed, and its effectiveness is evaluated for Word Sense Disambiguation (WSD). Given its usefulness, a lemmatizer is worth considering when developing Natural Language Processing tools for languages rich in morphological variation. Assamese, spoken by over 14 million people in the North-Eastern region of India, is one such highly inflected Indian language. In the present work, after a detailed study of the possible transformations through which surface words are created from lemmas, we have designed an Assamese lemmatizer that applies suitable reverse transformations to a surface word to derive the corresponding lemma. The lemmatizer was observed to handle both inflectional and derivational morphology in Assamese. It was evaluated on various Assamese articles extracted from an Assamese corpus of 50,000 surface words (excluding proper nouns) and achieved 82% accuracy, which is encouraging given that Assamese is a low-resource language and no previous research on lemmatization has been done for it. In view of this result, the lemmatizer was then evaluated for Assamese WSD. For this purpose, 10 highly polysemous Assamese words were selected for sense disambiguation. We also considered various WSD systems and observed that the lemmatizer enhances the effectiveness of all of them, to a statistically significant degree.
Article
Several approaches and solutions have been proposed by researchers to improve Assamese stemming. Such stemmers are important because they are often used in application-oriented projects, especially in developing information retrieval (IR) systems. Assamese stemming can be defined as a process that strips a set of suffixes from words. This process has certain setbacks, such as vocalization ambiguity, incorrect removal and single solution. In this paper, we propose an Assamese stemmer that addresses various drawbacks of earlier proposals and makes efficient use of the features mentioned above. We tested it on 20,000 words from 16 different articles; all possible suffixes in the Assamese language were collected manually with the help of an Assamese linguistic expert. It achieved an accuracy of 86.16%. We also compared the accuracy of the system with other existing approaches, and our system outperforms all of them. In addition, we propose an automatic approach for the evaluation and comparison of Assamese stemmers that takes into account metrics related to the accuracy of results.
Article
Stemming is a basic method for morphological normalization of natural language texts. In this study, we focus on the problem of stemming several resource-poor languages from Eastern India, viz., Assamese, Bengali, Bishnupriya Manipuri and Bodo. While Assamese, Bengali and Bishnupriya Manipuri are Indo-Aryan, Bodo is a Tibeto-Burman language. We design a rule-based approach to remove suffixes from words. To reduce over-stemming and under-stemming errors, we introduce a dictionary of frequent words. We observe that, for these languages, a dominant share of suffixes are single letters, creating problems during suffix stripping. As a result, we introduce an HMM-based hybrid approach to classify the mismatched last character. For each word, the stem is extracted by calculating the most probable path in four HMM states. At each step we measure the stemming accuracy for each language. We obtain 94% accuracy for Assamese and Bengali, and 87% and 82% for Bishnupriya Manipuri and Bodo, respectively, using the hybrid approach. We compare our work with Morfessor [Creutz and Lagus 2005]. As of now, there is no reported work on stemming for Bishnupriya Manipuri and Bodo. Our results on Assamese and Bengali show significant improvement over prior published work [Sarkar and Bandyopadhyay 2008; Sharma et al. 2002, 2003].
Article
Full-text available
This paper presents a rule-based approach for finding the stems in Bengali text, a resource-poor language. It starts by introducing the concept of the orthographic syllable, the basic orthographic unit of Bengali. Then it discusses the morphological structure of the tokens for different parts of speech, formalizes the inflection rule constructs and formulates a quantitative ranking measure for potential candidate stems of a token. These concepts are applied in the design and implementation of an extensible architecture of a stemmer system for Bengali text. The accuracy of the system is calculated to be ~89% and above.
Article
Full-text available
We discuss problems that arise in morphological analysis of highly inflectional natural languages. We focus on word stemming, particularly the problem of identifying root words automatically when access to a substantive computational lexicon is unavailable.
Article
Full-text available
Words play a crucial role in aspects of natural language understanding such as syntactic and semantic processing. Usually, a natural language understanding system either already knows the words that appear in the text, or is able to automatically learn relevant information about a word upon encountering it. Usually, a capable system, human or machine, knows a subset of the entire vocabulary of a language and morphological rules to determine attributes of words not seen before. Developing a knowledge base of legal words and morphological rules is an important task in computational linguistics. In this paper, we describe initial experiments following an approach based on unsupervised learning of morphology from a text corpus developed especially for this purpose. It is a method for conveniently creating a dictionary and a morphology rule base, and is especially suitable for highly inflectional languages like Assamese. Assamese is a major Indian language of the Indic branch of the Indo-European family of languages. It is used by around 15 million people.
Article
Full-text available
Stemming, a well-known IR module, is used to enhance the effectiveness of text retrieval systems. In the FIRE 2008 ad hoc monolingual task, we applied a simple corpus-based technique, based on n-gram matching, to three Indian languages. In our method, we group words that share a common prefix of a given character length and replace each of them with that common prefix. We hope to discover whether this simple method works well in the Indian language context, and the initial results are encouraging.
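The prefix-grouping method described above can be sketched in a few lines of Python. This is only an illustration of the idea as stated in the abstract; the function names and the choice of prefix length are assumptions, and English words are used purely for readability.

```python
from collections import defaultdict


def prefix_stem(word, n):
    """Truncate `word` to its first n characters (shorter words are unchanged)."""
    return word[:n]


def group_by_prefix(words, n):
    """Group words that share the same n-character prefix."""
    groups = defaultdict(list)
    for word in words:
        groups[prefix_stem(word, n)].append(word)
    return dict(groups)


if __name__ == "__main__":
    vocabulary = ["playing", "played", "player", "plan"]
    # With n = 4, the three "play" variants collapse to one index term.
    print(group_by_prefix(vocabulary, 4))
```

Note that Python slices strings by code point. For Indic scripts, where a vowel sign or conjunct spans several code points, a practical implementation would probably need to count orthographic units rather than raw code points.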
Article
Full-text available
This article describes an approach to unsupervised learning of morphology from an unannotated corpus for a highly inflectional Indo-European language called Assamese spoken by about 30 million people. Although Assamese is one of India's national languages, it utterly lacks computational linguistic resources. There exists no prior computational work on this language spoken widely in northeast India. The work presented is pioneering in this respect. In this article, we discuss salient issues in Assamese morphology where the presence of a large number of suffixal determiners, sandhi, samas, and the propensity to use suffix sequences make approximately 50% of the words used in written and spoken text inflected. We implement methods proposed by Gaussier and Goldsmith on acquisition of morphological knowledge, and obtain F-measure performance below 60%. This motivates us to present a method more suitable for handling suffix sequences, enabling us to increase the F-measure performance of morphology acquisition to almost 70%. We describe how we build a morphological dictionary for Assamese from the text corpus. Using the morphological knowledge acquired and the morphological dictionary, we are able to process small chunks of data at a time as well as a large corpus. We achieve approximately 85% precision and recall during the analysis of small chunks of coherent text.
Article
Full-text available
Unsupervised morphological analysis is the task of segmenting words into prefixes, suffixes and stems without prior knowledge of language-specific morphotactics and morpho-phonological rules. This paper introduces a simple, yet highly effective algorithm for unsupervised morphological learning for Bengali, an Indo–Aryan language that is highly inflectional in nature. When evaluated on a set of 4,110 human-segmented Bengali words, our algorithm achieves an F-score of 83%, substantially outperforming Linguistica, one of the most widely-used unsupervised morphological parsers, by about 23%.
Article
Full-text available
Suffix stripping is a pre-processing step required in a number of natural language processing applications. A stemmer is a tool used to perform this step. This paper presents and evaluates a rule-based and an unsupervised Marathi stemmer. The rule-based stemmer uses a set of manually extracted suffix stripping rules, whereas the unsupervised approach learns suffixes automatically from a set of words extracted from raw Marathi text. The performance of both stemmers has been compared on a test dataset consisting of 1500 manually stemmed words.
Article
Full-text available
Stemming is the process of removing affixes from inflected words without doing complete morphological analysis. A stemming algorithm is a procedure to reduce all words with the same stem to a common form [20]. It is useful in many areas of computational linguistics and information retrieval. This technique is used by various search engines to find the best solution for a problem. The algorithm is a basic building block of the stemmer. A stemmer is mainly used in information retrieval systems to improve performance. This paper presents a stemmer for Punjabi that uses a brute force algorithm together with a suffix stripping technique. Similar techniques can be used to build stemmers for other languages such as Hindi, Bengali and Marathi. The stemmer gives good results and can be effective in information retrieval systems. It also reduces the problems of over-stemming and under-stemming.
Article
The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
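The "measure" condition mentioned above can be illustrated with a short Python sketch (the original program was written in BCPL). It assumes the definition from Porter's paper, in which a stem is viewed as [C](VC)^m[V] and m counts the vowel-to-consonant transitions; the helper names and the single rule shown are illustrative and do not reproduce the full algorithm.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' is a vowel when preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of VC sequences in the form [C](VC)^m[V]."""
    pattern = "".join(
        "c" if is_consonant(stem, i) else "v" for i in range(len(stem))
    )
    # Each vowel immediately followed by a consonant starts one VC sequence.
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "v" and b == "c")


def remove_if_measure(word, suffix, min_measure):
    """Strip `suffix` only if the remaining stem has measure > min_measure."""
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > min_measure:
            return stem
    return word


if __name__ == "__main__":
    print(measure("tree"), measure("trouble"), measure("private"))  # 0 1 2
    print(remove_if_measure("replacement", "ement", 1))  # replac
    print(remove_if_measure("cement", "ement", 1))       # cement
```

Making removal conditional on the remaining stem is what keeps a short word such as "cement" intact while "replacement" is reduced.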
Article
Morphology is the area of linguistics concerned with the internal structure of words. Information retrieval has generally not paid much attention to word structure, other than to account for some of the variability in word forms via the use of stemmers. We report on our experiments to determine the importance of morphology, and the effect that it has on performance. We found that grouping morphological variants makes a significant improvement in retrieval performance. Improvements are seen by grouping inflectional as well as derivational variants. We also found that performance was enhanced by recognizing lexical phrases. We describe the interaction between morphology and lexical ambiguity, and how resolving that ambiguity will lead to further improvements in performance.
Conference Paper
The effectiveness of stemming algorithms has usually been measured in terms of their effect on retrieval performance with test collections. This however does not provide any insights which might help in stemmer optimisation. This paper describes a method in which stemming performance is assessed against predefined concept groups in samples of words. This enables various indices of stemming performance and weight to be computed. Results are reported for three stemming algorithms. The validity and usefulness of the approach, and the problems of conceptual grouping, are discussed, and directions for further research are identified.
Article
As participants in the TIDES Surprise language exercise, researchers at the University of Massachusetts helped collect Hindi-English resources and developed a cross-language information retrieval system. Components included normalization, stop-word removal, transliteration, structured query translation, and language modeling using a probabilistic dictionary derived from a parallel corpus. Existing technology was successfully applied to Hindi. The biggest stumbling blocks were collection of parallel English and Hindi text and dealing with numerous proprietary encodings.
A Morphological Analyzer and a Stemmer for Nepali, 2007-01-01 Pan Localization, Working Papers
  • Bal Krishna Bal
  • Prajol Shrestha
A light weight stemmer for Hindi
  • A Ramanathan
  • D Rao
Generating statistical Hindi stemmer from Parallel texts
  • A Chen
  • F C Grey
Suffix Removal and Word Conflation, ALLC Bulletin
  • J Dawson