Supervised Named Entity Recognition in
Assamese language
Gitimoni Talukdar
Department of Computer Science &
Engineering and IT,
Assam Don Bosco University,
Guwahati, India.
talukdargitimoni@gmail.com
Pranjal Protim Borah
Department of Computer Science &
Engineering and IT,
Assam Don Bosco University,
Guwahati, India.
pranjalborah777@gmail.com
Arup Baruah
Department of Computer Science &
Engineering and IT,
Assam Don Bosco University,
Guwahati, India.
arup.baruah@gmail.com
Abstract— Nouns play a very important role in every natural
language. Proper nouns, a subcategory of nouns, represent the
names of persons, locations, organizations, etc. The task of
recognizing the proper nouns in a text and categorizing them
into classes such as person, location, organization and other
is called Named Entity Recognition (NER). It is an essential
step in many natural language processing applications and
makes information extraction easier. NER in most Indian
languages has been performed using rule-based, supervised and
unsupervised approaches. Our target language in this work is
Assamese, the language spoken by most people in the
North-Eastern part of India, particularly in Assam. In
Assamese, NER has so far been performed using rule-based and
suffix-stripping-based approaches. Supervised learning
techniques are more useful and can be adapted to new domains
more easily than rule-based approaches. This paper reports the
first work in Assamese NER using a machine learning technique:
Assamese NER is performed using a Naïve Bayes classifier.
Since feature extraction plays the most important role in the
performance of any machine learning technique, our aim is to
describe a few important features related to Assamese NER and
to report the performance of the system using these features.
Keywords— Named Entity Recognition; Corpus; Naïve Bayes
Classifier; Morphology; Suffix stripping.
I. INTRODUCTION
Named entity recognition classifies parts of a text into a
number of predefined classes [1]. Detecting named entities is
challenging because of the practically unlimited ways in
which they may appear. Named entities also belong to the open
class of words, meaning that new named entities keep being
added to the language over time. Most current work on named
entity recognition uses machine learning approaches. Machine
learning approaches are popular because they are less
expensive to maintain and can be easily transferred to new
languages and domains [2, 3, 4]. In contrast, rule based
approaches fail to cope with the demands of portability and
robustness, and considerable linguistic expertise is required
to find the rules with which the system is expected to give
optimal performance. The maintenance costs of rule based
systems are very high [5]. One of the challenges in Named
Entity Recognition is ambiguity, which often leads to poor
system performance [6]. Evidence found within the word and in
the context where the word occurs can help to resolve such
ambiguity [7, 8, 9].
Assamese is a language of Indo-European origin used by
about 30 million people in the North-Eastern part of India [1].
Research on NER in Assamese has started only recently, and
computational work on the language is very limited. The first
NER system available for Assamese was a rule based system,
reported to find 500 person names and 250 location names [1].
The rest of the paper is organized as follows. Section II
describes work related to Assamese NER. Section III discusses
the features that can, in general, be used in the task of NER.
Section IV presents the results that we obtained
experimentally for Assamese using a Naïve Bayes classifier.
Section V concludes the paper.
II. RELATED WORK IN ASSAMESE NER
Named Entity Recognition for Assamese has so far been
performed using a rule based approach and a suffix stripping
based approach, as described below.
The first work on Assamese NER was reported in [1], where a
rule based NER system was developed. The corpus was built
from articles taken from the online Assamese newspaper
Pratidin and consisted of 50,000 wordforms
978-1-4799-6629-5/14/$31.00 © 2014 IEEE
which were tagged manually. The paper also gives a
glimpse of different challenges that were faced while
developing the system. The challenges mentioned in the
paper were:
• No concept of capitalization: Assamese has no concept of
capitalization, which makes the Named Entity Recognition
task difficult. In English, capitalization is an important
feature for recognizing proper nouns.
• Nested entities: Detecting the named entity class becomes
a problem when there is an entity within an entity,
because the individual words may refer to different entity
classes and the expression formed by combining them may
refer to yet another entity class.
For example: In Pandu college (পাণ্ডু কলেজ), the individual
words are Pandu (পাণ্ডু), which is the name of a place, and
college (কলেজ), which refers to an organization, making it
difficult to detect the appropriate named entity class.
• Ambiguity: In Assamese some words may be common nouns as
well as proper nouns.
For example: Tora (তৰা) generally means star when used as a
common noun, but when used as a proper noun it may be the
name of a girl, thus creating ambiguity.
• Derivational morphology: Derivational morphology also
sometimes creates a problem in Named Entity Recognition.
For example: Assam (অসম) is a location named entity, but
when it is combined with the suffix basi (বাসী) it becomes
Assambasi (অসমবাসী), meaning people of Assam, which no
longer refers to a named entity.
This paper also reported some rules for named entity
tagging in Assamese, as follows:
• Suffixes like লৈ (loi), ৰ (r), য়ে (ye) etc. indicate the
presence of a named entity.
• The presence of action verbs like খেলিছে (khelise), গৈছে
(goise) often implies the presence of named entities.
• Words like কাপোৰ-চাপোৰ (Kapur-sapur), কেৰ-কেৰ (Ker-ker)
can never be named entities.
• If words like শ্ৰীযুক্ত (Srijukta), শ্ৰীমতী (Srimati) etc.
are present in the position preceding a word, then that
word is a named entity.
• If words like বাই (bai), দাদা (dada) occur after a word,
then the preceding word is a person named entity.
This rule based system found 500 person names and 250
location names.
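The positional rules quoted above can be illustrated with a minimal sketch. This is not the implementation from [1]. The cue words are written in romanized form, and the set names, the tag labels 'NE', 'PERSON' and 'O', and the sample names "Rita" and "Mina" are assumptions made only for this illustration.

```python
# Hypothetical cue lists in romanized form, taken from the examples quoted above.
HONORIFICS_BEFORE = {"srijukta", "srimati"}  # the next word is a named entity
KINSHIP_AFTER = {"bai", "dada"}              # the previous word is a person name
NEVER_NE = {"kapur-sapur", "ker-ker"}        # can never be named entities


def tag_by_rules(tokens):
    """Return one tag per token: 'NE', 'PERSON' or 'O' (not tagged)."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for i, word in enumerate(lowered):
        # Rule: a word that follows an honorific is a named entity.
        if word in HONORIFICS_BEFORE and i + 1 < len(tokens):
            tags[i + 1] = "NE"
        # Rule: a word that is followed by a kinship term is a person name.
        if word in KINSHIP_AFTER and i > 0:
            tags[i - 1] = "PERSON"
    # Rule: listed words are never named entities, whatever fired above.
    for i, word in enumerate(lowered):
        if word in NEVER_NE:
            tags[i] = "O"
    return tags


print(tag_by_rules(["Srimati", "Rita", "goise"]))
print(tag_by_rules(["Mina", "bai", "khelise"]))
```

The suffix and action-verb rules are left out because the source describes them only as indications, without stating how they are combined with the other rules.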
Another work was reported in [2], where location named
entities were found by a suffix stripping based approach.
This paper reported that Assamese is a highly inflectional
language, that is, the same word may have different
morphological variants. Location named entities often combine
with suffixes, which the paper describes as follows:
• Complex suffix: Plain suffixes combine with some
consonants to create complex suffixes. For example:
অসমীয়া (AssmIYA) = অসম (Assm) + ঈয়া (IYA)
• Plain suffix: Suffixes that combine with root words to
form their morphological variants are called plain
suffixes. For example:
গুজৰাটী (GujrtI) = গুজৰাট (Gujrt) + ঈ (I)
The paper reported using a corpus of about 300,000 words.
Root words were obtained by stripping suffixes. The inputs
to the approach were Assamese words occurring in the text. A
key file was maintained containing the list of all possible
suffixes that usually combine with location named entities.
The output of the approach was the stem of the location named
entity, obtained by applying the stemming technique to the
input word.
The approach can be summarized in the following steps:
• The input word is taken.
• The suffixes for the input words are checked in the
key file.
• If the suffix in the input word is present in the key
file then remove the suffix.
• Exit.
The approach was reported to give an F-measure of 88%.
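The steps above can be sketched as follows. This is a minimal illustration, not the code of [2]. The suffix list stands in for the key file, the romanized forms (AssmIYA, GujrtI) are assumed transliterations, and trying longer suffixes first is an assumed policy that the source does not state.

```python
# Hypothetical key file contents: romanized suffixes from the examples above.
KEY_SUFFIXES = ["IYA", "I"]


def strip_suffix(word, suffixes=KEY_SUFFIXES):
    """Return the stem of `word` after removing one listed suffix, if any."""
    # Try longer suffixes first so that "IYA" is preferred over "I".
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Only strip when a non-empty stem remains.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word  # no listed suffix found: the word is returned unchanged


for w in ["AssmIYA", "GujrtI", "Assm"]:
    print(w, "->", strip_suffix(w))
```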
III. FEATURES USED IN NAMED ENTITY
RECOGNITION
In the Named Entity Recognition task, feature extraction
plays a very significant role within any machine learning
technique. For Indian languages, many features have been
derived by considering various combinations of the available
context words.
A. Features
The features used in the NER task are as follows:
188 2014 International Conference on Contemporary Computing and Informatics (IC3I)
1) First word: This feature checks whether the word
appears as the first word of the sentence, because the first
word is in most cases a named entity, often occurring in the
position of the subject.
2) Context word feature: The context word feature covers
the next and previous words of a particular word. This
feature gives valuable information for recognizing named
entities.
3) Part of speech information: The part of speech
information of the current word and the surrounding words
often helps to recognize named entities.
4) Word suffix: A fixed length or variable length suffix
of the current word and of the surrounding words can also be
treated as a feature. Suffix information helps because named
entities often combine with common suffixes.
5) Word prefix: Word prefixes can also be of fixed or
variable length. Prefix information helps because named
entities often combine with common prefixes.
6) Infrequent word: This feature is based on the
observation that words which occur rarely tend to be named
entities. Infrequent words can be found by calculating the
frequency of every word in the training corpus. Words that
occur fewer times than a chosen cut-off frequency are
considered infrequent words.
7) Digit features: Digit features help in categorizing
miscellaneous named entities like numerical expressions,
percentages and time expressions.
8) Length of the word: This binary feature is based on the
observation that named entities are not usually short words.
The feature checks whether the length of a word is less than
a particular value.
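Most of the features listed above can be computed directly from a tokenized sentence. The sketch below is only an illustration. The function name, the feature names, the boundary markers `<S>` and `</S>`, the default thresholds and the romanized toy sentence are all assumptions. Part of speech information is omitted because it requires a tagged corpus.

```python
from collections import Counter


def extract_features(tokens, index, word_counts, cutoff=2, affix_len=2, min_len=3):
    """Build a feature dict for tokens[index]; all thresholds are assumed values."""
    word = tokens[index]
    return {
        "first_word": index == 0,
        "prev_word": tokens[index - 1] if index > 0 else "<S>",
        "next_word": tokens[index + 1] if index + 1 < len(tokens) else "</S>",
        "suffix": word[-affix_len:],                     # fixed length suffix
        "prefix": word[:affix_len],                      # fixed length prefix
        "infrequent": word_counts[word] < cutoff,        # below cut-off frequency
        "has_digit": any(ch.isdigit() for ch in word),
        "short_word": len(word) < min_len,
    }


# Toy romanized sentence (hypothetical); counts would come from the training corpus.
sentence = ["Tora", "goise", "Assm", "loi", "2014"]
counts = Counter(sentence + ["goise", "loi"])
features = extract_features(sentence, 0, counts)
print(features)
```

Returning a dictionary keeps each feature named, which makes it easy to switch individual features on or off when comparing their effect.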
IV. EXPERIMENTAL RESULTS
Applying a Naïve Bayes classifier to Named Entity
Recognition requires a large annotated corpus for reasonable
performance [10]. As no annotated corpus for Assamese is
openly available, we have developed a corpus containing
approximately 6,000 words. The corpus was created by
collecting random Assamese articles in the sports domain
from the Assamese newspaper “Janasadharan”. The tagset used
for part of speech annotation in our corpus mostly contains
the following tags:
• NN-Noun
• NNP-Proper Noun
• PRP-Pronoun
• CC-Conjunction
• JJ-Adjective
• RB-Adverb
• VAUX-Verb Auxiliary
TABLE I. Number of Tags in Training and Test Corpus
Tags    In Training Corpus    In Test Corpus
NN      3480                  527
NNP     483                   50
PRP     141                   79
CC      140                   51
JJ      191                   91
RB      139                   53
VAUX    219                   135
Of these 6,000 words, approximately 5,000 are used for
training and the rest for testing, as shown in TABLE I. A
separate tag list is used for NE tags in the training corpus.
The POS annotation of the corpora was done with the help of a
linguistic expert.
TABLE II. Number of NE Tags in Training and Test Corpus
NE Tags         In Training Corpus    In Test Corpus
Person          193                   28
Location        185                   17
Organization    105                   5
The NE tagset consists of the following ENAMEX tags:
• Person name: Denotes the name of a person.
• Location name: Denotes the name of a location or
place.
• Organization name: Denotes the name of an
organization.
Our Test corpus contains 50 named entities as shown in
TABLE II.
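To make the classification step concrete, the sketch below shows a categorical Naïve Bayes classifier that assigns an NE tag from named feature values. It is not the authors' system. The paper does not state how probabilities are estimated, so the add-one smoothing, the class name, the feature names, the extra label "Other" and the romanized toy samples are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesNER:
    """Categorical Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, samples, labels):
        self.label_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.vocab = defaultdict(set)             # feature -> values seen in training
        for feats, label in zip(samples, labels):
            for name, value in feats.items():
                self.value_counts[(label, name)][value] += 1
                self.vocab[name].add(value)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior of the label
            for name, value in feats.items():
                seen = self.value_counts[(label, name)][value]
                # The extra +1 in the denominator reserves mass for unseen values.
                denom = count + len(self.vocab[name]) + 1
                score += math.log((seen + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data in romanized form.
train = [
    ({"pos": "NNP", "prev_word": "shri", "next_word": "goise"}, "Person"),
    ({"pos": "NNP", "prev_word": "shrimati", "next_word": "khelise"}, "Person"),
    ({"pos": "NNP", "prev_word": "don", "next_word": "college"}, "Organization"),
    ({"pos": "NNP", "prev_word": "pandu", "next_word": "club"}, "Organization"),
    ({"pos": "NN", "prev_word": "aji", "next_word": "goise"}, "Other"),
    ({"pos": "NN", "prev_word": "eta", "next_word": "khelise"}, "Other"),
]
model = NaiveBayesNER().fit([f for f, _ in train], [y for _, y in train])
print(model.predict({"pos": "NNP", "prev_word": "shri", "next_word": "khelise"}))
```

Because every feature contributes one factor to the product, all features carry equal weight in the decision.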
Five different features are used in our NER system for
classification of named entities using the Naïve Bayes
classifier. They are as follows:
1) POS Feature: The POS information of the target word and
its surrounding words helps in identifying named entities,
as the words surrounding the target word in Assamese mostly
follow common POS patterns that can be learned from the
training corpus.
2) First Word of Compound proper noun: In Assamese, most
person names are preceded by a pre-nominal word such as Shri
(শ্ৰী), Shrijut (শ্ৰীযুত) or Shrimati (শ্ৰীমতী), which serves as
the first word of the compound proper noun. If such a first
word exists in the compound proper noun, then it can be said
that a person named entity is present. This feature has been
found to be very effective in training our Naïve Bayes system
for finding person names. For example:
ভাৰতীয় ক্ৰিকেট কিংবদন্তি শ্ৰী_NNPC শচীন_NNPC তেণ্ডুলকাৰে_NNP
ক্ৰিকেট জগতৰ পৰা আজি কলকাতাত_NNP আন এখন পৃথক ক্ৰীড়া ক্ষেত্ৰত
ভৰি দিয়ে।
Here Shri (শ্ৰী) Sachin (শচীন) Tendulkar (তেণ্ডুলকাৰ) is a
compound proper noun and its first word (শ্ৰী) indicates that
it is a person named entity. NNPC indicates a compound proper
noun and NNP indicates the last word, where the compound
proper noun ends.
3) Last Word of Compound proper noun: The last word of a
compound proper noun is also an important feature, as
analysis of the training corpus shows that last words such as
college (কলেজ), club (ক্লাব), association (এছচিয়েচন) etc.
often indicate the presence of an organization named entity.
For example:
ডন_NNPC বস্কো_NNPC কলেজ_NNP আজি বিজয়ী হৈছে।
Here the last word of the compound proper noun indicates
that Don (ডন) Bosco (বস্কো) College (কলেজ) is an organization.
This feature has been found to be very helpful in
enumerating organization named entities.
4) Previous Word: The previous word feature also helps in
finding named entities. Named entities follow some common
previous words, which helps greatly in locating them.
5) Next Word: Some words occur frequently in the positions
following named entities. These words play an important role
in identifying the named entities for our Naïve Bayes system.
Fig. I. Our Naïve Bayes Named Entity Recognition System
Our system, as shown in Fig. I, has been tested with
training corpora of different sizes. As shown in TABLE III,
a better result is obtained when the training corpus is
enlarged.
TABLE III. Performance measure of the NER System for different size of training corpus.
Training Corpus          Precision    Recall    F1 Measure
Containing 2500 Words    77%          70%       73.33%
Containing 5000 Words    93%          84%       88.4%
Fig. II. Comparison of Results for two different size of Corpus
V. CONCLUSION
This is the first step towards supervised Assamese NER
using a Naïve Bayes classifier. We have obtained reasonably
good performance, with an F1-measure of nearly 88.40%. Some
of the named entities are not classified by our system due to
the equal contribution of all the features. Moreover,
Assamese is a highly inflectional language; better
performance can be obtained by handling the morphological
information along with huge annotated corpora.
REFERENCES
[1] Padmaja Sharma, Utpal Sharma and Jugal Kalita, “The first Steps
towards Assamese Named Entity Recognition”, Brisbane Convention
Center, Brisbane, Australia, 2010.
[2] Padmaja Sharma, Utpal Sharma and Jugal Kalita, “Suffix Stripping
Based NER in Assamese for Location Names”, Computational
Intelligence and Signal Processing (CISP), 2012.
[3] David Nadeau and Satoshi Sekine, “A survey of named entity
recognition and classification”, Lingvisticae Investigationes, Vol. 30,
pp. 3-26, 2007.
[4] B. Sasidhar, P. M. Yohan, Dr. A. Vinaya Babu, Dr. A. Govardhan,
“A Survey on Named Entity Recognition in Indian Languages with
particular reference to Telugu”, IJCSI International Journal of
Computer Science Issues, Vol. 8, Issue 2, March 2011.
[5] Asif Ekbal, Rajewanul Haque, Amitava Das, Venkateswarlu Poka and
Sivaji Bandyopadhyay, “Language Independent Named Entity
Recognition in Indian Languages”, Proceedings of the IJNLP-08
Workshop on NER for South and South East Asian Languages,
Hyderabad, India, 2008.
[6] Asif Ekbal and Sivaji Bandyopadhyay, “Bengali Named Entity
Recognition using Support Vector Machine”, Proceedings of the
IJNLP-08 Workshop on NER for South and South East Asian
Languages, Hyderabad, India, 2008.
[7] Thoudam Doren Singh, Kishorjit Nongmeikapam, Asif Ekbal and
Sivaji Bandyopadhyay, “Named Entity Recognition for Manipuri
Using Support Vector Machine”, 23rd Pacific Asia Conference on
Language, Information and Computation, pp. 811–818, 2009.
[8] Sujan Kumar Saha, Sanjay Chatterji, Sandipan Dantapat, Sudeshna
Sarkar and Pabitra Mitra, “A Hybrid Approach for Named Entity
Recognition in Indian Languages”, Proceedings of the IJNLP-08
Workshop on NER for South and South East Asian Languages,
Hyderabad, India, 2008.
[9] Wenhui Liao and Sriharsha Veeramachaneni, “A Simple Semi-
supervised Algorithm For Named Entity Recognition”, Proceedings
of the NAACL HLT Workshop on Semi-supervised Learning for
Natural Language Processing, pp. 58–65, June 2009.
[10] Gitimoni Talukdar, Pranjal Protim Borah, Arup Baruah, “A Survey of
Named Entity Recognition in Assamese and other Indian Languages”,
Proceedings of International Conference on Natural Language
Processing and Cognitive Computing, India, 2014.