International Journal on Natural Language Computing (IJNLC), Vol. 3, No. 3, June 2014. DOI: 10.5121/ijnlc.2014.3310
A SURVEY OF NAMED ENTITY RECOGNITION IN ASSAMESE AND OTHER INDIAN LANGUAGES
Gitimoni Talukdar, Pranjal Protim Borah, Arup Baruah
Department of Computer Science and Engineering, Assam Don Bosco University, Guwahati, India.
Abstract
Named Entity Recognition is important when dealing with major Natural Language Processing tasks such as information extraction, question answering, machine translation and document summarization, so in this paper we put forward a survey of Named Entity Recognition in Indian languages with particular reference to Assamese. There are various rule-based and machine learning approaches available for Named Entity Recognition. We first give an overview of the available approaches for Named Entity Recognition and then discuss the related research in this field. Assamese, like other Indian languages, is agglutinative and suffers from a lack of appropriate resources, as Named Entity Recognition requires large data sets, gazetteer lists, dictionaries and so on, and useful features such as the capitalization found in English are absent in Assamese. Apart from this, we also describe some of the issues faced while doing Named Entity Recognition in Assamese.
Keywords
Named entity recognition; Named entity; Annotated corpora; Gazetteer list; Heuristics.
1. INTRODUCTION
A Named Entity is a text element indicating the name of a person, organization or location. It was felt in the Message Understanding Conferences (MUC) of the 1990s that if certain classes of information are extracted from a document beforehand, then the information extraction task becomes much easier. In the later stages of the conferences, Named Entity Recognition systems were asked to classify names, time, date and numerical information. Thus Named Entity Recognition is a task of two stages: first to identify the proper nouns, and then to classify the proper names into categories such as person names, organization names (e.g., government organizations, companies), location names (e.g., countries and cities) and miscellaneous names (e.g., percentage, date, number, time, monetary expressions).
In the MUC conferences the following conventions were used for named entities [8]:
NUMEX for numerical entities
TIMEX for temporal entities
ENAMEX for names
Assamese is an Indo-European language of North-Eastern India spoken by about 30 million people and is one of the national languages of India. It is the official language of the state of Assam. The first work on Named Entity Recognition for Assamese was the rule-based Named Entity Recognition described in [5].
The rest of the paper is divided into six sections. In Section 2 the approaches available for named entity recognition are highlighted, and the features that are commonly used in named entity recognition are elaborated in Section 3. In Section 4 a survey of the research work done in Named Entity Recognition for Indian languages is put forward, followed by a discussion of the challenges faced in Assamese Named Entity Recognition in Section 5. Section 6 describes the performance metrics that are used to evaluate a Named Entity Recognition system, and Section 7 concludes the paper.
2. DIFFERENT APPROACHES
Named Entity Recognition approaches can broadly be classified into three classes, namely rule-based or handcrafted, machine learning or automated, and hybrid models. A brief overview of these approaches is outlined below:
2.1. Rule-based Named Entity Recognition
Early studies in Named Entity Recognition were mainly based on handcrafted rules. This can be seen from the fact that, out of the eight systems in the MUC-7 competition and the sixteen presented at CoNLL-2003, five were based on handcrafted rules [8]. Human-made rules form the main backbone of rule-based Named Entity Recognition. The rule-based approach can be further classified as:
2.1.1. Linguistic Approach
In this approach, expertise in the grammar and the language is required to ascertain the heuristics for classifying the named entities. The performance of this approach depends on coming up with good rules and good heuristics.
2.1.2. List Look up Approach
This approach needs a list called the gazetteer list. The list has to be updated from time to time, and Named Entity Recognition works only for the entities present in the gazetteer list.
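As an illustration, the following minimal Python sketch tags tokens against a hypothetical gazetteer of location names; the gazetteer entries, tag names and example sentence are our own illustrative assumptions, not from any of the surveyed systems.

```python
# Minimal sketch of the list look-up approach: tokens are tagged as named
# entities only if they appear in a (hypothetical) gazetteer list.
LOCATION_GAZETTEER = {"Guwahati", "Assam", "Dibrugarh"}  # illustrative entries

def tag_locations(tokens):
    """Return (token, tag) pairs; 'LOC' if the token is in the gazetteer, else 'O'."""
    return [(tok, "LOC" if tok in LOCATION_GAZETTEER else "O") for tok in tokens]

print(tag_locations(["He", "lives", "in", "Guwahati"]))
# [('He', 'O'), ('lives', 'O'), ('in', 'O'), ('Guwahati', 'LOC')]
```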
2.2. Automated or Machine Learning Approach
Machine learning approaches are advantageous over rule based approaches as they can be easily
trained and can be used in various domains. All these approaches are statistical in nature. Some
machine learning approaches are Conditional Random Field (CRF), Hidden Markov Model
(HMM), Decision Trees (DT), Maximum Entropy Markov Model (MEMM) and Support Vector
Machine (SVM).
2.2.1. Conditional Random Field (CRF)
CRFs are discriminative probabilistic models used for segmenting and labelling data in sequential mode. They can make use of a large number of arbitrary, non-independent features. In a CRF there is a state sequence S and an observed sequence O, where the observed sequence is the sequence of words in a text document or sentence and the state sequence is the corresponding label sequence. If the observed sequence O = (o1, o2, ..., oK) is provided, then the conditional probability of the state sequence S = (s1, s2, ..., sK) is given as
P(S|O) = (1/Z) exp( Σ_k Σ_p λ_p f_p(s_{k-1}, s_k, O, k) )

where Z represents the normalization factor

Z = Σ_S exp( Σ_k Σ_p λ_p f_p(s_{k-1}, s_k, O, k) )

and f_p(s_{k-1}, s_k, O, k) is a feature function whose weight λ_p is learned during the training phase.
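To make the formula concrete, here is a minimal sketch (not the paper's implementation) that evaluates P(S|O) for a toy observation sequence by brute-force enumeration of all state sequences; the feature functions, weights, tags and example words are hypothetical, and a real CRF would compute Z with the forward algorithm and learn the weights from annotated data.

```python
import itertools
import numpy as np

STATES = ["O", "PER"]

def f1(s_prev, s_k, obs, k):
    # fires when a capitalised word is tagged PER
    return 1.0 if obs[k][0].isupper() and s_k == "PER" else 0.0

def f2(s_prev, s_k, obs, k):
    # fires when a PER tag follows a PER tag
    return 1.0 if s_prev == "PER" and s_k == "PER" else 0.0

FEATURES = [f1, f2]
WEIGHTS = np.array([1.5, 0.8])   # the lambda_p values, normally learned in training

def score(states, obs):
    """Unnormalised log-score: weighted sum of feature functions over all positions."""
    total = 0.0
    for k in range(len(obs)):
        s_prev = states[k - 1] if k > 0 else "O"   # assume an 'O' start state
        total += sum(w * f(s_prev, states[k], obs, k)
                     for w, f in zip(WEIGHTS, FEATURES))
    return total

def conditional_probability(states, obs):
    """P(S|O) = exp(score(S, O)) / Z, with Z summed over all state sequences."""
    z = sum(np.exp(score(seq, obs))
            for seq in itertools.product(STATES, repeat=len(obs)))
    return np.exp(score(states, obs)) / z

obs = ["Gitimoni", "writes", "papers"]
print(conditional_probability(("PER", "O", "O"), obs))
```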
2.2.2. Hidden Markov Model (HMM)
This model is based on maximizing the joint probability of the sequence of words and tags. If the word sequence is Word = (word_1, featureset_1), ..., (word_n, featureset_n), where word_i indicates a word and featureset_i is the associated single-token feature set of word_i, then the main aim is to calculate the tag sequence T = (t_1, t_2, ..., t_n) for which the conditional probability of the tag sequence given the word sequence is maximized.
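As an illustration of this idea, the sketch below picks the tag sequence that maximizes the joint probability of words and tags under toy transition and emission tables; all probabilities, tags and words are made up for illustration, and a real HMM tagger would estimate these tables from a corpus and use Viterbi decoding instead of brute-force search.

```python
import itertools

TAGS = ["O", "LOC"]
TRANSITION = {("<s>", "O"): 0.8, ("<s>", "LOC"): 0.2,
              ("O", "O"): 0.7, ("O", "LOC"): 0.3,
              ("LOC", "O"): 0.6, ("LOC", "LOC"): 0.4}
EMISSION = {("O", "in"): 0.5, ("LOC", "in"): 0.01,
            ("O", "Assam"): 0.05, ("LOC", "Assam"): 0.6}

def joint_probability(words, tags):
    """P(words, tags) as a product of transition and emission probabilities."""
    prob, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        prob *= TRANSITION.get((prev, t), 1e-6) * EMISSION.get((t, w), 1e-6)
        prev = t
    return prob

def best_tag_sequence(words):
    """Brute-force argmax over tag sequences (Viterbi would do this efficiently)."""
    return max(itertools.product(TAGS, repeat=len(words)),
               key=lambda tags: joint_probability(words, tags))

print(best_tag_sequence(["in", "Assam"]))   # expected: ('O', 'LOC')
```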
2.2.3. Decision Trees (DT)
A Decision Tree is a tree structure used to make decisions at the internal nodes and obtain a result at the leaf nodes. A path in the tree represents a sequence of decisions leading to the classification at a leaf node. Decision trees are attractive because the rules can be easily understood from the tree. It is a popular tool for prediction and classification.
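As an illustration of how the learned decisions can be read off as rules, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on toy word-level features; the features, labels, examples and library choice are our own assumptions, not from the surveyed systems.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy word-level features and labels (illustrative values only).
feats = [{"suffix2": "ti", "length": 8}, {"suffix2": "ks", "length": 5},
         {"suffix2": "am", "length": 5}, {"suffix2": "ed", "length": 6}]
labels = ["LOC", "O", "LOC", "O"]

vec = DictVectorizer()
tree = DecisionTreeClassifier(random_state=0).fit(vec.fit_transform(feats), labels)

# The learned decisions can be printed as human-readable if/else rules.
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))
print(tree.predict(vec.transform([{"suffix2": "am", "length": 5}])))
```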
2.2.4. Maximum Entropy Markov Model (MEMM)
MEMM is based on the principle that the model which considers all known facts is the one that maximizes entropy. It can handle long-range dependencies and words having multiple features, which were drawbacks of HMM. However, it suffers from the label-bias problem: it is biased towards states with fewer outgoing transitions, because the transition probabilities leaving any given state must sum to one.
2.2.5. Support Vector Machine (SVM)
In a classification task using SVM, the task usually involves training and testing data consisting of data instances, and the goal is to predict the class of the data instances. SVM is one of the well-known binary classifiers, gives good results even for smaller data sets, and can be applied to multi-class problems by extending the algorithm. The SVM classifier is built from the training set to form the classifier model, and the testing data are then classified on the basis of this model with the use of features.
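A minimal sketch of this train-then-classify workflow, assuming scikit-learn is available, is shown below; the word-level features, labels and examples are toy values, not the data used in any of the surveyed systems.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Toy training instances: word-level features with their NE labels (illustrative).
train_feats = [{"word": "Guwahati", "suffix2": "ti", "is_digit": False},
               {"word": "company", "suffix2": "ny", "is_digit": False},
               {"word": "Assam", "suffix2": "am", "is_digit": False},
               {"word": "walks", "suffix2": "ks", "is_digit": False}]
train_labels = ["LOC", "O", "LOC", "O"]

vec = DictVectorizer()
X_train = vec.fit_transform(train_feats)          # build feature vectors
clf = LinearSVC().fit(X_train, train_labels)      # make the classifier model

# Classify a test instance on the basis of the trained model and its features.
test_feats = [{"word": "Dibrugarh", "suffix2": "rh", "is_digit": False}]
print(clf.predict(vec.transform(test_feats)))
```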
2.3. Hybrid model approach
Machine learning approaches and rule-based approaches are combined to obtain higher accuracy in Named Entity Recognition. Possible hybrid approaches include CRF with a rule-based approach, HMM with a rule-based approach, MEMM with a rule-based approach and SVM with a rule-based approach.
3. FEATURES IN NER
Features commonly used for Named Entity Recognition are:
3.1. Surrounding words
Various combinations of the words surrounding a given word, from the previous words to the following words, can be treated as features.
3.2. Context word feature
A class-specific context list can be created of the words that occur frequently in the previous and next positions of words belonging to a particular class. This feature is set to 1 if the surrounding words are found in the class context list.
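A minimal sketch of these two features is given below; the context list, window size, tag class and example sentence are hypothetical.

```python
# Minimal sketch of surrounding-word and context-word features for the token
# at position i (the context list below is a hypothetical per-class list).
PERSON_CONTEXT = {"said", "told", "minister"}   # words often seen next to person names

def context_features(tokens, i):
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "prev_word": prev_word,
        "next_word": next_word,
        "person_context": 1 if {prev_word, next_word} & PERSON_CONTEXT else 0,
    }

print(context_features(["the", "minister", "Gogoi", "said"], 2))
# {'prev_word': 'minister', 'next_word': 'said', 'person_context': 1}
```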
3.3. Digit features
Binary-valued digit features can be helpful in detecting miscellaneous named entities, such as:
ContDigitPeriod: Word contains digits and periods.
ContDigitComma: Word contains digits and commas.
ContDigitSlash: Word contains digits and slashes.
ContDigitHyphen: Word contains digits and hyphens.
ContDigitPercentage: Word contains digits and a percentage sign.
ContDigitSpecial: Word contains digits and special symbols.
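A minimal sketch of these binary digit features, using simple regular-expression checks that we assume for illustration, is given below.

```python
import re

# Minimal sketch of the binary digit features listed above
# (feature names follow the list; patterns are illustrative).
def digit_features(word):
    has_digit = bool(re.search(r"\d", word))
    return {
        "ContDigitPeriod": has_digit and "." in word,
        "ContDigitComma": has_digit and "," in word,
        "ContDigitSlash": has_digit and "/" in word,
        "ContDigitHyphen": has_digit and "-" in word,
        "ContDigitPercentage": has_digit and "%" in word,
        "ContDigitSpecial": has_digit and bool(re.search(r"[^\w\s]", word)),
    }

print(digit_features("12/06/2014"))   # the slash and special-symbol features fire
```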
3.4. Infrequent word
Infrequent or rare words can be found by calculating the frequencies of the words in the training corpus and selecting a cut-off frequency so that only the words occurring less often than the cut-off are treated as rare. This feature is important because it is found that infrequent words are most probably named entities.
3.5. Word suffix
The word suffix feature can be defined in two ways. The first is to use fixed-length suffix information of the surrounding words; the other is to use suffixes of variable length and match them against predefined lists of valid suffixes of named entities.
3.6. Word prefix
The word prefix feature can be defined in two ways. The first is to use fixed-length prefix information of the surrounding words; the other is to use prefixes of variable length and match them against predefined lists of valid prefixes of named entities.
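A minimal sketch of the fixed-length variant of the suffix and prefix features is given below; the maximum affix length and the example word are arbitrary choices.

```python
# Minimal sketch of fixed-length suffix and prefix features (the first way
# described above); lengths and the example word are illustrative.
def affix_features(word, max_len=3):
    feats = {}
    for n in range(1, max_len + 1):
        feats[f"suffix_{n}"] = word[-n:] if len(word) >= n else ""
        feats[f"prefix_{n}"] = word[:n] if len(word) >= n else ""
    return feats

print(affix_features("Guwahati"))
# {'suffix_1': 'i', 'prefix_1': 'G', 'suffix_2': 'ti', 'prefix_2': 'Gu', ...}
```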
3.7. Part of speech information
The part of speech information of the surrounding words can be used for recognizing named
entities.
3.8. Numerical word
This feature can be set to 1 if a token signifies a number.
3.9. Word length
This feature can be used to check whether the length of a word is less than three, because it is found that very short words are generally not named entities.
4. OBSERVATIONS AND DISCUSSIONS
In this section we provide a survey of the research done for Indian languages. Praneeth M Shishtla (2008), "A Character n-gram Based Approach for Improved Recall in Indian Language Named Entity Recognition" [2], used a Telugu corpus containing 4,709 named entities out of 45,714 tokens, a Hindi corpus containing 3,140 named entities out of 45,380 tokens and an English corpus containing 4,287 named entities out of 45,870 tokens. They used the CRF technique with a character-based n-gram technique on two languages, Hindi and Telugu. During training and testing they used morphological analyzers, POS taggers and chunkers along with a feature set of nine features. It was found that n = 4 grams gave an F-measure of 45.18% for 35k words, 42.36% for 30k words, 36.26% for 20k words and 40.96% for 10k words for Hindi; n = 3 grams gave F-measures of 48.93% for 35k words, 44.48% for 30k words, 35.38% for 20k words and 24.2% for 10k words for Telugu; and n = 2 grams gave F-measures of up to 68.46% for 35k words, 67.49% for 30k words, 65.59% for 20k words and 52.92% for 10k words for English.
Another reported work, "Bengali Named Entity Recognition using Support Vector Machine" by Asif Ekbal and Sivaji Bandyopadhyay (2008) [3], used a training corpus of 1,30,000 tokens annotated with sixteen named entity tags for Bengali; the support vector machine approach provided an F-measure of 91.8% on a test set of 1,50,000 words.
"A Hybrid Approach for Named Entity Recognition in Indian Languages" by Sujan Kumar Saha et al. (2008) [4] handled twelve classes of named entities for Bengali, Hindi, Telugu, Urdu and Oriya. A combination of a maximum entropy model, gazetteer lists and some language-dependent rules was used. The system reported F-measures of 65.96% for Bengali, 65.13% for Hindi, 18.75% for Telugu, 35.47% for Urdu and 44.65% for Oriya.
"Domain Focused Named Entity Recognition for Tamil Using Conditional Random Fields" by Vijaya Krishna R and Sobha L (2008) [7] used a CRF approach for Tamil focused on the tourism domain. A corpus of about 94,000 words in the tourism domain was used, along with a tagset of 106 tags and five feature templates. A part of the corpus was used for training and another part for testing. The system reported an F-measure of 80.44%.
"Language Independent Named Entity Recognition in Indian Languages" by Asif Ekbal et al. (2008) [1] used a CRF approach for named entity recognition. The system applied language-dependent features to Hindi and Bengali only, and language-independent features along with contextual information of words to Bengali, Hindi, Telugu, Urdu and Oriya. The system was trained with 502,974 tokens for Hindi, 93,173 tokens for Oriya, 64,026 tokens for Telugu, 35,447 tokens for Urdu and 122,467 tokens for Bengali, and was tested with 38,708 tokens for Hindi, 24,640 tokens for Oriya, 6,356 tokens for Telugu, 3,782 tokens for Urdu and 30,505 tokens for Bengali. An F-measure of 53.46% was obtained for Bengali.
Padmaja Sharma, Utpal Sharma and Jugal Kalita (2010), "The First Steps towards Assamese Named Entity Recognition" [5], developed a handcrafted rule-based Named Entity Recognition system for Assamese. A corpus of about 50,000 words was manually tagged from online Assamese Pratidin articles. The system found 500 person names and 250 location names. They analyzed the tagged corpus to enumerate some rules for automatic Named Entity tagging.
"Suffix Stripping Based Named Entity Recognition in Assamese for Location Names" by Padmaja Sharma, Utpal Sharma and Jugal Kalita (2010) [6] used an Assamese Pratidin corpus containing 300,000 wordforms. A location named entity is produced by generating the root word through suffix removal. In their approach the Assamese word occurring in the text is the input, and a list of suffixes that commonly combine with location named entities is used. The stemmer removes suffixes from the input word by searching for them in a key suffix file; if a match is found, the suffix along with the trailing characters is removed to produce the output, which is the location named entity. The approach is simple and obtained an F-measure of 88%.
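To illustrate the suffix-stripping idea (not the authors' actual implementation), the sketch below strips a suffix from an input word using a hypothetical suffix list written in Latin transliteration rather than Assamese script; the suffixes and example word are made up for illustration.

```python
# Minimal illustration of suffix stripping for location names, following the
# description above; the key-suffix list and the example are hypothetical and
# written in Latin transliteration instead of Assamese script.
LOCATION_SUFFIXES = ["oloi", "ot", "or"]   # illustrative entries of a key suffix file

def strip_location_suffix(word):
    """If a known suffix matches, remove it and return the root as the location NE."""
    for suffix in sorted(LOCATION_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

print(strip_location_suffix("Guwahatioloi"))   # -> 'Guwahati'
```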
Table 1. F-measure achieved in Hindi for different statistical approaches.
Approach Used F-measure (%)
MEMM [4] 65.13
Character based n-gram technique [2] 45.18
Language dependent features [1] 33.12
Table 2. F-measure achieved in Bengali for different statistical approaches.
Approach Used F-measure (%)
MEMM [4] 65.96
Language dependent features [1] 59.39
SVM [3] 91.8
Table 3. F-measure achieved in Telugu for different statistical approaches.
Approach Used F-measure (%)
MEMM [4] 18.75
CRF [13] 92
Language independent features [1] 47.49
Character based n-gram technique [2] 48.93
Table 4. F-measure achieved in Oriya for different statistical approaches.
Approach Used F-measure (%)
MEMM [4] 44.65
Language independent features [1] 28.71
Table 5. F-measure achieved in Assamese for Suffix stripping based approach.
Approach Used F-measure (%)
Suffix stripping based approach [6] 88
Table 6. F-measure achieved in Urdu for different statistical approaches.
Approach Used F-measure (%)
MEMM [4] 35.47
Language independent features [1] 35.52
5. KEY ISSUES IN ASSAMESE NAMED ENTITY RECOGNITION
5.1. Ambiguity in Assamese
Ambiguity occurs between common nouns and proper nouns because many names are taken from the dictionary. For example, the Assamese word জোন (Jon) means moon, which is a common noun, but it may also be the name of a person, which is a proper noun.
5.2. Agglutinative nature
Assamese is agglutinative in nature, and complex words are created by adding affixes that change the meaning of a word. For example, অসম (Assam) is the name of a place and hence a location named entity, but অসমীয়া (AssamIYA), produced by adding the suffix ীয়া (IYA) to অসম (Assam), signifies the people residing in Assam and is therefore not a location named entity [5].
5.3. Lack of capitalization
In Assamese there is no capitalization that can help to recognize the proper nouns as found in
English.
5.4. Nested Entities
When two or more proper nouns are present together it becomes difficult to assign the proper named entity class. For example, in গুৱাহাটী বিশ্ববিদ্যালয় (Gauhati bishabidyalay, 'Gauhati University'), গুৱাহাটী (Gauhati) is a location named entity while বিশ্ববিদ্যালয় (bishabidyalay) refers to an organization, which creates a problem in assigning the proper class.
5.5. Spelling Variation
Variation in the spelling of proper names is another problem in Assamese Named Entity Recognition. For example, in Shree Shreesanth there is confusion over whether Shree in Shreesanth is a pre-nominal word or part of the person named entity.
6. PERFORMANCE METRICS
The accuracy of a Named Entity Recognition system can be evaluated using the following metrics:
Precision (P): Precision is the ratio of relevant results returned to total number of results returned.
It can be represented as: P = W/Y
Recall (R): Recall is the ratio of relevant results returned to all relevant results. It can be
represented as: R= W/T
F1-measure: 2*(P*R) / (P+R)
Where,
W= Number of relevant Named Entities returned.
Y= Total Named Entities returned.
T= Total Named Entities present.
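A minimal sketch computing these metrics from the counts W, Y and T defined above is shown below; the counts used in the example are illustrative only.

```python
# Minimal sketch of the metrics above: W = relevant NEs returned,
# Y = total NEs returned, T = total NEs present (counts are illustrative).
def precision(W, Y):
    return W / Y

def recall(W, T):
    return W / T

def f1_measure(p, r):
    return 2 * p * r / (p + r)

p, r = precision(W=40, Y=50), recall(W=40, T=80)
print(round(p, 2), round(r, 2), round(f1_measure(p, r), 2))   # 0.8 0.5 0.62
```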
7. CONCLUSION
Indian languages suffer greatly from the lack of available annotated corpora, their agglutinative nature, different writing methodologies, difficult morphology and the absence of capitalization, and as a result research on Named Entity Recognition in Indian languages is limited compared to the European languages. We found that rule-based approaches with gazetteer lists, together with some language-independent rules combined with a statistical approach, may give satisfactory results for Named Entity Recognition in Indian languages, given the insufficient data available for training. Our conclusion is that, in a situation where sufficient training data is not available, a hybrid model combining rule-based, statistical and language-independent rules is the better approach to perform Named Entity Recognition in Indian languages.
REFERENCES
Authors:
Gitimoni Talukdar.
B.E. (CSE), Research Scholar.
Pranjal Protim Borah.
B.E. (CSE), Research Scholar.
Arup Baruah.
B.E (CSE), M.Tech (IT)
Assistant Professor, Department of CSE,
Assam Don Bosco University, Guwahati, India.