Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach

Aniket Dalal, Kumar Nagaraj, Uma Sawant, Sandeep Shelke
CSE Department, IIT Bombay, Mumbai
(aniketd,kumar,uma,sandy)@cse.iitb.ac.in
Abstract
We present a statistical approach to Part-of-Speech (POS) tagging and chunking for the Hindi language. Our system employs a Maximum Entropy Markov Model (MEMM), trains from an annotated Hindi corpus and assigns tags (POS tags and chunk labels) to previously unseen text. The model uses multiple features simultaneously to predict the tag for a word. The feature set is broadly classified into context-based features, word features, dictionary features and corpus-based features. Apart from contextual features, which are language independent, we discuss the use of specialized features that capture lexical and morphological properties of Hindi. We evaluated our approach on the NLPAI-ML 2006 contest corpus of around 35000 words annotated with 29 different POS tags and 6 chunk tags. The best accuracies reported by our method on the development data are 89.346% for POS tagging and 87.399% for chunk labelling, on a per-word basis. When the models trained on this data were applied to the final evaluation data of the contest, the F1-measure was 82.22% for POS tagging and 82.40% for chunking.
1 Introduction
Part-of-Speech (POS) tagging is the process of assigning a part of speech, such as noun, verb, pronoun or another lexical class marker, to each word in a sentence. POS tagging is a necessary precursor to other natural language processing tasks like natural language parsing, information retrieval and information extraction.
A word can occur with different lexical class tags in different contexts. The main challenge in POS tagging is resolving this ambiguity in the possible POS tags for a word. Several approaches have been proposed and successfully implemented for English POS tagging. These systems can be grouped as rule-based, statistical and hybrid.
POS tagging can be modelled as a sequence labelling task. Given an input sequence of words $W_n = w_1 w_2 \ldots w_n$, the task is to construct a label sequence $L_n = l_1 l_2 \ldots l_n$, where each label $l_i$ belongs to the set of POS tags. The generated label sequence $\hat{L}_n$ has the highest probability of occurring for the word sequence $W_n$ among all possible label sequences, that is

$$\hat{L}_n = \operatorname*{argmax}_{L_n} \Pr(L_n \mid W_n)$$
Statistical POS tagging methods take this approach. In particular, the Maximum Entropy Markov Model (MEMM) builds a model which captures known information and applies this model to obtain the best label sequence (Ratnaparkhi, 1996; Ratnaparkhi, 1997).
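In an MEMM, this joint probability is typically factored left to right into per-word conditional distributions; we state this standard formulation, which is implicit in the cited work, for concreteness:

$$\Pr(L_n \mid W_n) = \prod_{i=1}^{n} \Pr(l_i \mid c_i)$$

where $c_i$ is the context of the $i$-th word, so the search for $\hat{L}_n$ reduces to scoring one label at a time.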
After POS tags are identified, the next step is chunking, which involves dividing sentences into non-overlapping, non-recursive phrases. In general, full parsing is expensive and not very robust. Chunking, on the other hand, can be much faster and more robust, yet may be sufficient for many applications (information extraction, question answering). It can also serve as a possible first step towards full parsing. In our system, there are six different kinds of chunk labels, namely noun phrase (NP), verb phrase (VG), adjective phrase (JJP), adverb phrase (RBP), conjunct phrase (CP) and others (BLK). The task of identifying chunks and their labels is modelled in the same way as that of identifying POS tags.
In this paper, we present a statistical POS tagger and chunker for the Hindi language. We have built separate models for the two tasks, each satisfying the maximum entropy principle, which can be used to tag unseen text. Our system is tailored to the NLPAI-ML 2006 contest.
This paper is organized as follows. Section 2 gives an overview of maximum entropy models. Feature functions used in Hindi POS tagging and chunking are presented in section 3. Section 4 provides experimental details and results.
2 Maximum Entropy Markov Model
The maximum entropy (ME) principle states that the least biased model which considers all known information is the one which maximizes entropy. The ME technique builds a model which assumes nothing other than the imposed constraints. To build such a model, we define feature functions. A feature function is a boolean function which captures some aspect of the language relevant to the sequence labelling task. An example feature function for POS tagging is
$$f_j(l, c) = \begin{cases} 1 & \text{if the current word is alphanumeric} \\ 0 & \text{otherwise} \end{cases}$$
Here, $l$ is one of the possible labels and $c$ is the context, that is, a set of words surrounding the current word and/or labels of previous words. The relationship between feature functions and labels, as evidenced in the training corpus, is expressed as constraints. The probability distribution which satisfies these constraints and makes no other assumptions has maximum entropy, is unique and can be expressed as (Berger et al., 1996)
$$\Pr(l \mid c) = \frac{1}{z(c)} \exp\left( \sum_{j=1}^{k} \lambda_j f_j(l, c) \right)$$
where $z(c)$ is a normalizing constant. The problem of estimating the $\lambda_j$ parameters is solved using the Generalized Iterative Scaling algorithm (Darroch and Ratcliff, 1972). The learnt model is used for tagging unseen text. In our system, a beam search algorithm is applied during tagging to find the most promising label sequence.
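As an illustration, the following is a minimal sketch of how such a learnt model can be used for decoding; it is not the authors' implementation, and the feature extractor, the weight dictionary and the beam width are assumptions of this sketch.

import math

def features(label, context):
    # Boolean feature functions: return the features that fire for
    # this (label, context) pair, e.g. the alphanumeric test above.
    fired = []
    if context["word"].isalnum():
        fired.append(("alphanumeric", label))
    fired.append(("prev_tag", context["prev_tag"], label))
    return fired

def prob(label, context, weights, labels):
    # Pr(l | c) = exp(sum_j lambda_j * f_j(l, c)) / z(c)
    def score(l):
        return math.exp(sum(weights.get(f, 0.0) for f in features(l, context)))
    return score(label) / sum(score(l) for l in labels)

def beam_search(words, weights, labels, beam_width=5):
    # Keep only the beam_width most probable partial label sequences.
    beam = [([], 0.0)]  # (label sequence so far, log-probability)
    for word in words:
        candidates = []
        for seq, logp in beam:
            context = {"word": word, "prev_tag": seq[-1] if seq else "<s>"}
            for label in labels:
                p = prob(label, context, weights, labels)
                candidates.append((seq + [label], logp + math.log(p)))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beam[0][0]  # most promising complete label sequence

With an empty weight dictionary the distribution is uniform over labels; in practice the GIS-estimated weights take its place.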
3 Feature Functions
3.1 POS tagging features
For the task of Hindi POS tagging, the main feature functions used in our system are listed below (an illustrative extraction sketch follows the list):
Context-based features:
From our empirical analysis, we found that a context window of size four gives the best performance. For a word, the context consists of:
• POS tag of the previous word.
• Combination of POS tags of the previous two words.
• Current word.
• Next word.
Word features:
Word features capture lexical and morphological
properties of the word being tagged. They are:
• Suffixes: whether the word's suffix matches a given suffix.
• Digits: whether the word contains any digits, or is completely numeric.
• Special characters: whether the word contains any special characters, such as '-'.
• Root of the current word, or of the next word (e.g. 'KaRa').
• English word: to handle English words that occasionally appear in Hindi text.
Dictionary feature:
This feature utilizes information present in a standard Hindi dictionary. We define a feature function for each POS tag. For a POS tag $l$, if the word being tagged can occur with label $l$ according to the dictionary, then the corresponding feature is true.
Corpus-based features: These features rely on information extracted from the training corpus. They are:
• Has the word occurred as a proper noun in training.
• All possible tags of the current word, as seen in training.
• Has the word occurred with only a single tag in the training corpus.
• All possible tags of the next word, as seen in training.
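The following sketch shows how features of these four classes might be computed for one word. It is illustrative only; the feature names, as well as the suffix list, dictionary and training-corpus lookup tables, are assumptions rather than the authors' code.

def pos_features(i, words, prev_tags, suffixes, dictionary, train_tags):
    # Illustrative extraction of the four feature classes for word i.
    # `suffixes` is a list of Hindi suffixes, `dictionary` maps a word
    # to the POS tags a Hindi dictionary allows for it, and `train_tags`
    # maps a word to the set of tags seen for it in the training corpus;
    # all three are assumed resources.
    w = words[i]
    feats = {
        # context-based features (window of size four)
        "prev_tag": prev_tags[-1] if prev_tags else "<s>",
        "prev_two_tags": "+".join(prev_tags[-2:]),
        "word": w,
        "next_word": words[i + 1] if i + 1 < len(words) else "</s>",
        # word features
        "has_digit": any(ch.isdigit() for ch in w),
        "is_numeric": w.isdigit(),
        "has_special": "-" in w,
    }
    for s in suffixes:                     # suffix features
        if w.endswith(s):
            feats["suffix=" + s] = True
    for tag in dictionary.get(w, ()):      # dictionary feature, one per POS tag
        feats["dict_allows=" + tag] = True
    seen = train_tags.get(w, set())        # corpus-based features
    feats["seen_tags"] = "+".join(sorted(seen)) if seen else "UNSEEN"
    feats["single_tag_word"] = len(seen) == 1
    return feats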
3.2 Chunking features
The main feature functions used in Hindi chunking are listed below.
Context-based features:
For chunking, the most suitable context window was empirically found to consist of the words, POS tags and chunk labels of the current word and the two words on either side of it. Along the lines of (Singh et al., 2005), we found that for words having specific POS tags (JJ, NN, VFM, PREP, SYM, QF, NEG and RP), adding the current word and its word-POS tag combination as features reduces the performance of the chunker. We call such a POS tag a nonessential-word tag. For a word, the context-based features consist of the following (see the sketch after this list):
• Current word, and the word-POS tag combination, if the POS tag of the current word is not in the list of nonessential-word tags.
• POS tags of all words in the context, individually.
• Combinations of the POS tags of the next two words, of the previous two words, and of the current and previous word, separately.
• Chunk labels of the previous two words, independently.
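A minimal sketch of these context-based features is given below; the padding token and feature names are assumptions made for illustration.

NONESSENTIAL = {"JJ", "NN", "VFM", "PREP", "SYM", "QF", "NEG", "RP"}

def chunk_features(i, words, pos, chunks):
    # Context features for chunking word i, given the sentence `words`,
    # its POS tags `pos` and the chunk labels assigned so far (`chunks`).
    # The "<pad>" boundary token is an assumption of this sketch.
    def at(seq, j, pad="<pad>"):
        return seq[j] if 0 <= j < len(seq) else pad
    feats = {"pos%+d" % d: at(pos, i + d) for d in range(-2, 3)}
    if pos[i] not in NONESSENTIAL:          # skip nonessential-word tags
        feats["word"] = words[i]
        feats["word+pos"] = words[i] + "/" + pos[i]
    feats["next_two_pos"] = at(pos, i + 1) + "+" + at(pos, i + 2)
    feats["prev_two_pos"] = at(pos, i - 2) + "+" + at(pos, i - 1)
    feats["cur+prev_pos"] = pos[i] + "+" + at(pos, i - 1)
    feats["chunk-1"] = at(chunks, i - 1)
    feats["chunk-2"] = at(chunks, i - 2)
    return feats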
Current POS tag based features:
For each POS tag, the list of possible chunk labels for that tag is identified. These chunk labels are used as features. Another feature based on the POS tag of the current word utilizes what we call the tag class. POS tags are classified into different groups based on the most likely chunk label for that POS tag, as seen in the training corpus. For example, all POS tags which are most likely to occur in a noun phrase are grouped under one class. The class of the current word's POS tag is used as a feature.
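The tag-class grouping can be derived directly from the annotated corpus. The sketch below is our illustration of that derivation, not the authors' code; the input format is an assumption.

from collections import Counter, defaultdict

def tag_classes(training_pairs):
    # Group POS tags by their most likely chunk label in the training
    # corpus; `training_pairs` is assumed to be an iterable of
    # (pos_tag, chunk_label) pairs read from the annotated data.
    counts = defaultdict(Counter)
    for pos_tag, chunk_label in training_pairs:
        counts[pos_tag][chunk_label] += 1
    # each tag maps to the class named by its most frequent chunk label
    return {tag: c.most_common(1)[0][0] for tag, c in counts.items()}

For example, tag_classes([("NN", "NP"), ("JJ", "NP"), ("VFM", "VG")]) groups NN and JJ into the NP class and VFM into the VG class.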
4 Experiments
Our system is built for the NLPAI-ML task of POS tagging Indian languages. The tagset of the contest specifies 29 POS tags and 6 chunk labels. The development corpus for the task was provided by the contest organizers.
Figure 1: POS tagging accuracy with varying training-test data split. [Plot: accuracy (approx. 0.85-0.90) against training data size (55-95%).]
Figure 2: Accuracy across runs. [Plot: POS tagging and chunking accuracy (approx. 0.82-0.92) over 10 runs.]
We have conducted experiments with different splits of training and test data. As can be seen in figure 1, POS tagging accuracy increases with the proportion of training data until it reaches 75%, after which there is a reduction in accuracy due to overfitting of the trained model to the training corpus. Beyond a split of 85-15, increasing the training corpus proportion increases the accuracy again, as the test corpus becomes very small. This prompted us to use a 75-25 split of training and test data in our experiments. The results were averaged across different runs, each time randomly picking the training and test data. Figure 2 shows results using a 75-25 split across 10 different runs. Our chunker heavily depends on POS tags and hence, in most cases, its accuracy closely tails the POS tagging accuracy. The best POS tagging accuracy of the system in these runs was 89.34% and the least was 87.04%; the average accuracy over the 10 runs was 88.4%. For chunking, the best per-word accuracy of chunk labels in these runs was 87.29% and the least was 83.25%, with the average being 86.45%.
Tag Precision Recall Instances
CC 0.9159091 0.9482353 425
NN 0.8416244 0.9221357 1798
PREP 0.95674485 0.9804658 1331
QFN 0.897541 0.93191487 235
JJ 0.77710843 0.73295456 352
VFM 0.9081081 0.8993576 934
PRP 0.9819277 0.9702381 840
QF 0.74285716 0.7819549 133
NLOC 0.90384614 0.8468468 111
JVB 0.68 0.6296296 108
VAUX 0.94126505 0.9272997 674
SYM 0.9758389 0.9972565 729
QW 0.9285714 0.8125 48
INTF 0.64285713 0.5869565 46
NNC 0.6839623 0.6415929 226
RP 0.91056913 0.8924303 251
NVB 0.64880955 0.5369458 203
RB 0.8695652 0.7619048 105
VNN 0.91907513 0.9137931 174
VJJ 0.5555556 0.20833333 24
VRB 0.8333333 0.41666666 24
NEG 0.9894737 0.9791667 96
NNPC 0.88 0.6984127 126
NNP 0.7904762 0.53205127 156
RBVB 0.0 0.0 1
UH 0.0 0.0 3
VV 0.0 0.0 1
Table 1: Statistics for individual POS tags in a run with a 75-25 split.
Number of words tagged 9154
Number of words wrongly tagged 975
Accuracy 89.3489%
Table 2: Overall statistics for a run with a 75-25 split.
Detailed statistics for a run of the POS tagger with a 75-25 split are provided in tables 1 and 2. From table 1, we can observe that our system performs well on the more frequently occurring verb forms (VAUX, VFM, VNN), postpositions and pronouns. However, for proper nouns the performance is not satisfactory, because a considerable number of proper nouns are tagged as nouns. This is because, in most cases, the ambiguity between the two can be resolved only at the semantic level. Also, we find that compound tags (NNC, NNPC) are incorrectly tagged as the corresponding non-compound tags (NN, NNP).
5 Conclusion
We have presented a part-of-speech tagger and chunker for Hindi which uses the maximum entropy framework. We also discussed language-dependent as well as language-independent features suitable for Hindi POS tagging and chunking. We have shown that such a system performs well, with an average accuracy of 88.4% for POS tagging and 86.45% for chunking, the best accuracies being 89.35% and 87.39% for POS tagging and chunking, respectively. We believe that further error analysis and more language-specific features would improve the system's performance, particularly in the case of chunking.
6 Acknowledgment
We would like to thank Dr. Pushpak Bhattacharyya for his guidance. We would also like to thank Manish Shrivastava for many helpful suggestions and comments.
References
Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.

J. N. Darroch and D. Ratcliff. 1972. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43(5):1470-1480.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Eric Brill and Kenneth Church, editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133-142. Association for Computational Linguistics, Somerset, New Jersey.

Adwait Ratnaparkhi. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania, May.

Akshay Singh, Sushma Bendre, and Rajeev Sangal. 2005. HMM based chunker for Hindi. In Proceedings of IJCNLP-05, Jeju Island, Republic of Korea, October.