Supervised named entity recognition in Assamese language

Authors:
Gitimoni Talukdar
Department of Computer Science &
Engineering and IT,
Assam Don Bosco University,
Guwahati, India.
talukdargitimoni@gmail.com
Pranjal Protim Borah
Department of Computer Science &
Engineering and IT,
Assam Don Bosco University,
Guwahati, India.
pranjalborah777@gmail.com
Arup Baruah
Department of Computer Science &
Engineering and IT,
Assam Don Bosco University,
Guwahati, India.
arup.baruah@gmail.com
Abstract— Nouns play a very important role in every natural language. Proper nouns, a subcategory of nouns, represent the names of persons, locations, organizations and so on. The task of recognizing the proper nouns in a text and categorizing them into classes such as person, location, organization and other is called Named Entity Recognition (NER). It is an essential step in many natural language processing applications and makes information extraction easier. NER in most Indian languages has been performed using rule-based, supervised and unsupervised approaches. Our target language is Assamese, the language spoken by most people in the North-Eastern part of India, particularly in Assam. In Assamese, NER has so far been performed using rule-based and suffix-stripping-based approaches. Supervised learning is more useful than rule-based approaches and can be adapted more easily to new domains. This paper reports the first work on Assamese NER using a machine learning technique, namely the Naïve Bayes classifier. Since feature extraction plays the most important role in the performance of any machine learning technique, our aim is to describe a few important features for Assamese NER and to report the performance of the system using these features.
Keywords— Named Entity Recognition; Corpus; Naïve Bayes
Classifier; Morphology; Suffix stripping.
I. INTRODUCTION
Named entity recognition classifies parts of a text into a number of predefined classes [1]. Detecting named entities is challenging because of the unlimited number of ways in which they may appear. Named entities also belong to the open class of words, meaning that new named entities keep being added to the language over time. Most current work on named entity recognition uses machine learning approaches. These approaches are popular because they are less expensive to maintain and can be easily transferred to new languages and domains [2, 3, 4]. Rule-based approaches, in contrast, fail to meet the demands of portability and robustness, and considerable linguistic expertise is required to find the rules with which the system is expected to perform optimally. The maintenance costs of rule-based systems are very high [5]. One of the challenges in Named Entity Recognition is ambiguity, which often degrades the performance of the system [6]. Evidence found within the word and in the context where the word occurs can help to resolve such ambiguity [7, 8, 9].
Assamese is a language of Indo-European origin used by about 30 million people in the North-Eastern part of India [1]. Research on NER in Assamese has started only recently, and computational work on the language is very limited. The first NER system available for Assamese was a rule-based system, reported to find 500 person names and 250 location names [1].
The rest of the paper is organized as follows. Section II describes work related to Assamese NER. Section III discusses the different features that, in general, can be used in the task of NER. Section IV presents the results that we obtained experimentally for Assamese using the Naïve Bayes classifier. Section V concludes the paper.
II. RELATED WORK IN ASSAMESE NER
Named Entity Recognition for Assamese has so far been performed using a rule-based approach and a suffix-stripping-based approach, as described below.
The first work on Assamese NER was reported in [1], where a rule-based NER system was developed. The corpus was built from articles taken from the Assamese online newspaper Pratidin and consisted of 50,000 wordforms
978-1-4799-6629-5/14/$31.00 © 2014 IEEE
which were tagged manually. The paper also gives a
glimpse of different challenges that were faced while
developing the system. The challenges mentioned in the
paper were:
No concept of capitalization: Assamese has no concept of capitalization, which makes Named Entity Recognition difficult. In English, capitalization is an important feature for recognizing proper nouns.
Nested entities: It can be hard to determine the named entity class when one entity is nested inside another, because the individual words may belong to different entity classes while the combined expression belongs to yet another class. For example, in Pandu college (ăđȉ ïĘĊö), Pandu (ăđȉ) is the name of a place and college (ïĘĊö) refers to an organization, which makes it difficult to detect the appropriate named entity class.
Ambiguity: Some Assamese words may be either a common noun or a proper noun. For example, Tora (þįđ) generally means star when used as a common noun, but as a proper noun it may be the name of a girl.
Derivational morphology: Derivational morphology can also create problems for Named Entity Recognition. For example, Assam (ačć) is a location named entity, but combined with the suffix basi (ąđčē) it becomes Assambasi (ačćąđčē), meaning people of Assam, which no longer refers to a named entity.
This paper also reported the following rules for Named Entity tagging in Assamese:
Suffixes such as ĻĊ (loi), į (r) and ĺĠ (ye) indicate the presence of a named entity.
The presence of action verbs such as ĺðĒĊĘõ (khelise) and ĻñĘõ (goise) often implies the presence of named entities.
Words such as ïđĘăđį-ôđĘăđį (Kapur-sapur) and ĺïį-ĺïį (Ker-ker) can never be named entities.
If words such as ̄ĈĔk (Srijukta) or ̄ćþē (Srimati) occur immediately before a word, that word is a named entity.
If words such as ąđi (bai) or ĀđĀđ (dada) occur after a word, the preceding word is a person named entity.
This rule based system found 500 person names and 250
location names.
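The rules listed above can be illustrated with a minimal sketch. It is not the implementation from [1]: the word lists are hypothetical, use the romanized forms quoted above instead of Assamese script, and the function name `tag_candidates` and the labels `PERSON`, `NE` and `O` are assumptions made for illustration. The action-verb rule is left out because the text only says that it "often implies" a nearby named entity.

```python
# Hypothetical romanized word lists taken from the examples quoted in the text.
HONORIFICS = {"Srijukta", "Srimati"}          # the word after these is a named entity
KINSHIP_TERMS = {"bai", "dada"}               # the word before these is a person name
NEVER_ENTITIES = {"Kapur-sapur", "Ker-ker"}   # reduplicated words, never named entities
NE_SUFFIXES = ("loi", "r", "ye")              # suffixes that hint at a named entity


def tag_candidates(tokens):
    """Return one label per token: 'PERSON', 'NE' (class unspecified) or 'O'."""
    labels = ["O"] * len(tokens)
    for i, token in enumerate(tokens):
        if token in NEVER_ENTITIES:
            continue  # exclusion rule wins over every other rule
        if i + 1 < len(tokens) and tokens[i + 1] in KINSHIP_TERMS:
            labels[i] = "PERSON"
        elif i > 0 and tokens[i - 1] in HONORIFICS:
            labels[i] = "NE"
        elif token.endswith(NE_SUFFIXES):
            labels[i] = "NE"
    return labels


labels_1 = tag_candidates(["Srimati", "Mina", "Guwahatiloi", "goise"])
labels_2 = tag_candidates(["Ram", "dada", "Ker-ker"])
```

In this sketch the exclusion list is checked first, so a reduplicated word such as Ker-ker is never tagged even though it ends in one of the listed suffixes. The order in which the remaining rules are applied is a design choice of the sketch, not something the source specifies.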
Another work was reported in [2], where location named entities were found using a suffix-stripping-based approach. The paper observed that Assamese is a highly inflectional language, that is, the same word may have several morphological variants. Location named entities often combine with suffixes, which the paper describes as follows:
Complex suffix: Plain suffixes combine with some consonants to create complex suffixes. For example: ačćēĠđ (AssmIYA) = ačć (Assm) + æĠđ (IYA)
Plain suffix: Suffixes that combine with root words to form their morphological variants are called plain suffixes. For example: ñĔöįđ˶ (Gujrt) = ñĔöįđù (Gujrt) + æ (I)
The paper reported using a corpus of about 300,000 words. Root words were obtained by stripping suffixes. The inputs to the approach were Assamese words occurring in the text. A key file listed all the suffixes that usually combine with location named entities. The output was the stem of the location named entity, obtained by applying the stemming technique to the input word.
The approach can be summarized in the following steps:
The input word is taken.
The suffixes of the input word are checked against the key file.
If the suffix of the input word is present in the key file, the suffix is removed.
Exit.
The approach was reported to give an F-measure of 88%.
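The steps above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation from [2]: the key file is replaced by a hypothetical in-memory set holding the two romanized suffixes quoted earlier (IYA and I), the input words are romanized, and trying the longest suffix first is a choice made here that the source does not describe.

```python
# Hypothetical stand-in for the key file of suffixes (romanized forms from the text).
LOCATION_SUFFIXES = {"IYA", "I"}


def strip_location_suffix(word, suffixes=LOCATION_SUFFIXES):
    """Return the stem after removing one matching suffix, or the word unchanged."""
    # Assumption: longer suffixes are tried first so that "IYA" is preferred over "I".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


stem_complex = strip_location_suffix("AssmIYA")   # complex suffix IYA
stem_plain = strip_location_suffix("GujrtI")      # plain suffix I (hypothetical romanization)
stem_none = strip_location_suffix("Gujrt")        # no listed suffix present
```

The length check keeps a word that consists only of a suffix from being reduced to an empty stem.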
III. FEATURES USED IN NAMED ENTITY RECOGNITION
In the Named Entity Recognition task, the choice of features plays a very significant role in any machine learning technique. For Indian languages, many features have been identified by considering various combinations of the available context words.
A. Features
Features that are used in NER task are given as follows:
188 2014 International Conference on Contemporary Computing and Informatics (IC3I)
1) First word: This feature checks whether the word is the first word of the sentence, because the first word is in most cases a named entity, often occurring in subject position.
2) Context word feature: This feature covers the previous and next words of a particular word. It gives valuable information for recognizing named entities.
3) Part of speech information: The part of speech of the current word and of the surrounding words often helps to recognize named entities.
4) Word suffix: A fixed-length or variable-length suffix of the current word and of the surrounding words can also be treated as a feature. Suffix information helps because named entities often combine with common suffixes.
5) Word prefix: Word prefixes can also be of fixed or variable length. Prefix information helps because named entities often combine with common prefixes.
6) Infrequent word: This feature is based on the observation that rarely occurring words are often named entities. Infrequent words can be found by calculating the frequency of every word in the training corpus. Words that occur less often than a chosen cut-off frequency are considered infrequent.
7) Digit features: Digit features help in categorizing miscellaneous named entities such as numerical expressions, percentages and time expressions.
8) Length of the word: This binary feature is based on the observation that named entities are not usually short words. It checks whether the length of a word is less than a particular value.
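As a rough illustration of how the features above could be computed for one token, the following sketch builds a feature dictionary. It is an assumption-laden example rather than the system described in this paper: the function name `extract_features`, the parameter names, the default affix length and cut-off values, and the sentence boundary markers `<S>` and `</S>` are all hypothetical.

```python
from collections import Counter


def extract_features(tokens, index, frequencies, pos_tags=None,
                     affix_len=3, rare_cutoff=2, min_len=3):
    """Build a feature dictionary for tokens[index] (all thresholds are assumed)."""
    word = tokens[index]
    features = {
        "first_word": index == 0,
        "prev_word": tokens[index - 1] if index > 0 else "<S>",
        "next_word": tokens[index + 1] if index + 1 < len(tokens) else "</S>",
        "suffix": word[-affix_len:],
        "prefix": word[:affix_len],
        "infrequent": frequencies[word] < rare_cutoff,
        "has_digit": any(ch.isdigit() for ch in word),
        "short_word": len(word) < min_len,
    }
    if pos_tags is not None:
        features["pos"] = pos_tags[index]  # POS tag of the current word, if available
    return features


sentence = ["Tora", "Guwahati", "2014", "t"]
word_frequencies = Counter(["Tora", "Guwahati", "Guwahati", "2014", "t"])
tora_features = extract_features(sentence, 0, word_frequencies)
year_features = extract_features(sentence, 2, word_frequencies)
```

A `Counter` returns 0 for unseen words, so words that never occurred in the counted corpus are treated as infrequent.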
IV. EXPERIMENTAL RESULTS
Applying a Naïve Bayes classifier to Named Entity Recognition requires a large annotated corpus for reasonable performance [10]. As no annotated Assamese corpus is openly available, we developed a corpus of approximately 6,000 words. The corpus was created by collecting random Assamese articles in the sports domain from the Assamese newspaper “Janasadharan”. The tagset used for part-of-speech annotation in our corpus mostly contains the following tags:
NN-Noun
NNP-Proper Noun
PRP-Pronoun
CC-Conjunction
JJ-Adjective
RB-Adverb
VAUX-Verb Auxiliary
TABLE I. Number of Tags in Training and Test Corpus
Tag     In Training Corpus   In Test Corpus
NN      3480                 527
NNP     483                  50
PRP     141                  79
CC      140                  51
JJ      191                  91
RB      139                  53
VAUX    219                  135
Of these 6,000 words, approximately 5,000 are used for training and the rest for testing, as shown in TABLE I. A separate tag list is used for NE tags in the training corpus. The POS annotation of the corpora was done with the help of a linguistic expert.
TABLE II. Number of NE Tags in Training and Test Corpus
NE Tag         In Training Corpus   In Test Corpus
Person         193                  28
Location       185                  17
Organization   105                  5
The NE tagset consists of the following ENAMEX tags:
Person name: Denotes the name of a person.
Location name: Denotes the name of a location or
place.
Organization name: Denotes the name of an
organization.
Our Test corpus contains 50 named entities as shown in
TABLE II.
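For orientation, a Naïve Bayes classifier over categorical features can be sketched as below. This is a minimal sketch and not the system evaluated in this paper: the class name `NaiveBayesNER`, the add-one smoothing, the feature names `first_word` and `last_word`, and the toy training examples are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesNER:
    """Categorical Naive Bayes with additive smoothing (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (class, feature name) -> value counts
        self.feature_values = defaultdict(set)      # feature name -> values seen in training

    def fit(self, samples, labels):
        for features, label in zip(samples, labels):
            self.class_counts[label] += 1
            for name, value in features.items():
                self.feature_counts[(label, name)][value] += 1
                self.feature_values[name].add(value)
        return self

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior of the class
            for name, value in features.items():
                seen = self.feature_counts[(label, name)][value]
                vocab = len(self.feature_values[name]) + 1  # one extra slot for unseen values
                score += math.log((seen + self.alpha) / (count + self.alpha * vocab))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, romanized training examples (hypothetical, not taken from the paper's corpus).
train_samples = [
    {"first_word": "Shri", "last_word": "Tendulkar"},
    {"first_word": "Shri", "last_word": "Borah"},
    {"first_word": "Don", "last_word": "College"},
    {"first_word": "Pandu", "last_word": "College"},
]
train_labels = ["Person", "Person", "Organization", "Organization"]
model = NaiveBayesNER().fit(train_samples, train_labels)
prediction_person = model.predict({"first_word": "Shri", "last_word": "Baruah"})
prediction_org = model.predict({"first_word": "Cotton", "last_word": "College"})
```

Each feature contributes one independent log-probability term, so every feature carries equal weight in the final score. Working in log space avoids numerical underflow when many features are multiplied together.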
Five different features are used in our NER system for classifying named entities with the Naïve Bayes classifier. They are as follows:
1) POS Feature: The POS information of the target word and the surrounding words helps in identifying named entities, because the words surrounding a target word in Assamese tend to follow common POS patterns that can be learned from the training corpus.
2) First Word of Compound proper noun: In Assamese, most person names are preceded by a pre-nominal word such as Shri (̄), Shrijut (̄ĈĔþ) or Shrimati (̄ćþē), which serves as the first word of the compound proper noun. If such a first word exists in the compound proper noun, it can be said that a person named entity is present. This feature has been found to be very effective in training our Naïve Bayes system to find person names. For example: ĆđįþēĠ ĒĘïù ĒïáąĀĒn ̄_NNPC ċôēĂ_NNPC ĺþȉĔĊïđĘį_NNP ĒĘïù öñþį ăįđ äĒö ïĊïđþđþ_NNP äĂ eðĂ ăĖÿïēĞđ ĺǘþ ĆĒį ĒĀĘĠ @ Here Shri (̄) Sachin (ċôēĂ) Tendulkar (ĺþȉĔĊïđį) is a compound proper noun, and its first word (̄) indicates that it is a person named entity. NNPC indicates a compound proper noun and NNP indicates the last word, where the compound proper noun ends.
3) Last Word of Compound proper noun: The last word of a compound proper noun is also important. Analysis of the training corpus shows that last words such as college (ïĘĊö), club (Ǔđą) and association (eõĒôĘĠôĂ) often indicate the presence of an organization named entity. For example: ûĂ_NNPC ą_NNPC ïĘĊö_NNP äĒö ĒąöĠē ĻĎĘõ @ Here the last word of the compound proper noun indicates that Don (ûĂ) Bosco (ą) College (ïĘĊö) is an organization. This feature has been found to be very helpful in enumerating organization named entities.
4) Previous Word: The previous word feature also helps in finding named entities. Named entities follow some common previous words, which helps greatly in locating them.
5) Next Word: Some words occur frequently in the positions following named entities. These words play an important role in identifying named entities in our Naïve Bayes system.
Fig. I. Our Naïve Bayes Named Entity Recognition System
Our system, shown in Fig. I, has been tested with training corpora of different sizes. As shown in TABLE III, a better result is obtained when the training corpus is enlarged.
TABLE III. Performance measure of the NER System for different sizes of training corpus.
Training Corpus          Precision   Recall   F1 Measure
Containing 2500 Words    77%         70%      73.33%
Containing 5000 Words    93%         84%      88.4%
Fig. II. Comparison of Results for two different sizes of Corpus
V. CONCLUSION
This is the first step towards supervised Assamese NER using the Naïve Bayes classifier. We have obtained reasonably good performance, with an F1-measure of nearly 88.40%. Some of the named entities are not classified by our system because all the features contribute equally. Moreover, Assamese is a highly inflectional language; better performance can be obtained by handling morphological information and by using much larger annotated corpora.
REFERENCES
[1] Padmaja Sharma, Utpal Sharma and Jugal Kalita, “The first Steps towards Assamese Named Entity Recognition”, Brisbane Convention Center Brisbane Australia, 2010.
[2] Padmaja Sharma, Utpal Sharma and Jugal Kalita, “Suffix Stripping Based NER in Assamese for Location Names”, Computational Intelligence and Signal Processing (CISP), 2012.
[3] David Nadeau and Satoshi Sekine, “A survey of named entity recognition and classification”, Lingvisticae Investigationes, Vol. 30, pp. 3-26, 2007.
[4] B. Sasidhar, P. M. Yohan, Dr. A. Vinaya Babu, Dr. A. Govardhan, “A Survey on Named Entity Recognition in Indian Languages with particular reference to Telugu”, IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 2, March 2011.
[5] Asif Ekbal, Rajewanul Haque, Amitava Das, Venkateswarlu Poka and Sivaji Bandyopadhyay, “Language Independent Named Entity Recognition in Indian Languages”, Proceedings of the IJNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad, India, 2008.
[6] Asif Ekbal and Sivaji Bandyopadhyay, “Bengali Named Entity Recognition using Support Vector Machine”, Proceedings of the IJNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad, India, 2008.
[7] Thoudam Doren Singh, Kishorjit Nongmeikapam, Asif Ekbal and Sivaji Bandyopadhyay, “Named Entity Recognition for Manipuri Using Support Vector Machine”, 23rd Pacific Asia Conference on Language, Information and Computation, pp. 811–818, 2009.
[8] Sujan Kumar Saha, Sanjay Chatterji, Sandipan Dantapat, Sudeshna Sarkar and Pabitra Mitra, “A Hybrid Approach for Named Entity Recognition in Indian Languages”, Proceedings of the IJNLP-08 Workshop on NER for South and South East Asian Languages, Hyderabad, India, 2008.
[9] Wenhui Liao and Sriharsha Veeramachaneni, “A Simple Semi-supervised Algorithm For Named Entity Recognition”, Proceedings of the NAACL HLT Workshop on Semi-supervised Learning for Natural Language Processing, pp. 58–65, June 2009.
[10] Gitimoni Talukdar, Pranjal Protim Borah, Arup Baruah, “A Survey of Named Entity Recognition in Assamese and other Indian Languages”, Proceedings of International Conference on Natural Language Processing and Cognitive Computing, India, 2014.