Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach
Aniket Dalal, Kumar Nagaraj, Uma Sawant, Sandeep Shelke
CSE Department, IIT Bombay, Mumbai
{aniketd,kumar,uma,sandy}@cse.iitb.ac.in
Abstract
We present a statistical approach to Part-of-Speech (POS) tagging and chunking for the Hindi language. Our system employs a Maximum Entropy Markov Model (MEMM): it trains on an annotated Hindi corpus and assigns tags (POS tags and chunk labels) to previously unseen text. The model uses multiple features simultaneously to predict the tag for a word. The feature set is broadly classified into context-based features, word features, dictionary features and corpus-based features. Apart from the contextual features, which are language-independent, we discuss the use of specialized features that capture lexical and morphological properties of Hindi. We evaluated our approach on the corpus of the NLPAI-ML 2006 contest, consisting of around 35000 words annotated with 29 different POS tags and 6 chunk tags. The best accuracies achieved by our method on development data are 89.346% for POS tagging and 87.399% for chunk labelling on a per-word basis. When the models trained on the above data were applied to the final evaluation data of the contest, the F1-measure was 82.22% for POS tagging and 82.40% for chunking.
1 Introduction
Part-of-Speech (POS) tagging is the process of assigning a part of speech, such as noun, verb, pronoun or another lexical class marker, to each word in a sentence. POS tagging is a necessary precursor to other natural language processing tasks such as natural language parsing, information retrieval and information extraction.
A word can occur with different lexical class tags in different contexts. The main challenge in POS tagging is resolving this ambiguity among the possible POS tags for a word. Several approaches have been proposed and successfully implemented for English POS tagging. These systems can be grouped as rule-based, statistical and hybrid.
POS tagging can be modelled as a sequence labelling task. Given an input sequence of words W_n = w_1 w_2 ... w_n, the task is to construct a label sequence L_n = l_1 l_2 ... l_n, where each label l_i belongs to the set of POS tags. The generated label sequence L_n has the highest probability of occurring for the word sequence W_n among all possible label sequences, that is

\hat{L}_n = \operatorname*{argmax}_{L_n} \Pr(L_n \mid W_n)

Statistical POS tagging methods take this approach. In particular, the Maximum Entropy Markov Model (MEMM) builds a model which captures known information and applies this model to obtain the best label sequence (Ratnaparkhi, 1996; Ratnaparkhi, 1997).
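To connect this global objective to the per-word model used below, the MEMM factorizes the conditional probability of the label sequence into local decisions, each conditioned on a context c_i. This is the standard decomposition (our rendering, following Ratnaparkhi, 1996; it is not written out explicitly in this paper), and it is an approximation when c_i retains only a limited label history:

\Pr(L_n \mid W_n) \approx \prod_{i=1}^{n} \Pr(l_i \mid c_i)

where c_i bundles the current word, its neighbouring words and the previously assigned labels.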
After POS tags are identified, the next step is chunking, which involves dividing sentences into non-overlapping, non-recursive phrases. In general, full parsing is expensive and not very robust. Chunking, on the other hand, can be much faster and more robust, yet sufficient for many applications (information extraction, question answering). It can also serve as a first step towards full parsing. In our system, there are six different kinds of chunk labels, namely noun phrase (NP), verb phrase (VG), adjective phrase (JJP), adverb phrase (RBP), conjunct phrase (CP) and others (BLK). The task of identifying chunks and their labels is modelled in the same way as that of identifying POS tags.
In this paper, we present a statistical POS tagger and chunker for the Hindi language. We have built separate models for the two tasks, each satisfying the maximum entropy principle, which can be used to tag unseen text. Our system is tailored for the NLPAI-ML 2006 contest.
This paper is organized as follows. Section 2 gives an overview of maximum entropy models. Feature functions used in Hindi POS tagging and chunking are presented in section 3. Section 4 provides experimental details and results.
2 Maximum Entropy Markov Model
The maximum entropy (ME) principle states that the least biased model which considers all known information is the one which maximizes entropy. The ME technique builds a model which assumes nothing other than the imposed constraints. To build such a model, we define feature functions. A feature function is a boolean function which captures some aspect of the language relevant to the sequence labelling task. An example feature function for POS tagging is

f_j(l, c) = \begin{cases} 1 & \text{if the current word is alphanumeric} \\ 0 & \text{otherwise} \end{cases}

Here, l is one of the possible labels and c is the context¹. The relationship between feature functions and labels as evidenced in the training corpus is expressed as constraints. The probability distribution which satisfies these constraints while making no other assumptions has maximum entropy, is unique and can be expressed as (Berger et al., 1996)

\Pr(l \mid c) = \frac{1}{z(c)} \exp\left( \sum_{j=1}^{k} \lambda_j f_j(l, c) \right)

where z(c) is a normalizing constant. The problem of estimating the λ_j parameters is solved using the Generalized Iterative Scaling algorithm (Darroch and Ratcliff, 1972). The learnt model is then used for tagging unseen text. In our system, a beam search is applied during tagging to find the most promising label sequence.
¹ Context is a set of words surrounding the current word and/or the labels of previous words.
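To make the tagging step concrete, here is a minimal sketch of MEMM scoring and beam-search decoding, assuming the λ weights have already been estimated. The function names, the feature interface and the beam width are illustrative assumptions, not details taken from the paper.

```python
import math

def memm_prob(label, context, labels, weights, features):
    """Pr(label | context) under the learnt model: a normalized
    exponential of the weighted feature functions that fire.
    `features(label, context)` returns the names of active features and
    `weights` maps feature names to lambda values (assumed interfaces)."""
    def score(l):
        return math.exp(sum(weights.get(f, 0.0) for f in features(l, context)))
    return score(label) / sum(score(l) for l in labels)

def beam_search(words, labels, weights, features, beam_width=3):
    """Return a high-probability label sequence via beam search."""
    beam = [(0.0, [])]  # (log-probability, partial label sequence)
    for i, word in enumerate(words):
        candidates = []
        for logp, seq in beam:
            context = {"words": words, "index": i, "prev_labels": seq}
            for label in labels:
                p = memm_prob(label, context, labels, weights, features)
                if p > 0.0:
                    candidates.append((logp + math.log(p), seq + [label]))
        # Keep only the beam_width most probable partial sequences.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beam[0][1] if beam else []
```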
3 Feature Functions
3.1 POS tagging features
For the task of Hindi POS tagging, the main feature functions used in our system are listed below.
Context-based features:
From our empirical analysis, we found that a context window of size four gives the best performance. For a word, the context consists of:
- POS tag of the previous word.
- Combination of the POS tags of the previous two words.
- The current word.
- The next word.
Word features:
Word features capture lexical and morphological properties of the word being tagged. They are:
- Suffixes: whether the word ends with a given suffix.
- Digits: whether the word contains any digits, or is entirely numeric.
- Special characters: whether the word contains special characters such as '-'.
- Root of the current word, or of the next word (e.g. 'KaRa').
- English word: to handle English words that occasionally appear in Hindi text.
Dictionary feature:
This feature utilizes information present in a standard Hindi dictionary. We define a feature function for each POS tag. For a POS tag l, if the word being tagged can occur with label l according to the dictionary, then the corresponding feature is true.
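A minimal sketch of this dictionary feature is given below, assuming a `hindi_dict` mapping from a word to the set of POS tags it can take; the dictionary interface and the feature naming are hypothetical.

```python
def dictionary_features(word, pos_tags, hindi_dict):
    """One boolean feature per POS tag: true when the dictionary lists
    `word` as possibly taking that tag. `hindi_dict` maps a word to the
    set of its dictionary tags (assumed interface)."""
    allowed = hindi_dict.get(word, set())
    return {"dict_" + tag: (tag in allowed) for tag in pos_tags}
```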
Corpus-based features:
These features rely on information extracted from the training corpus. They are:
- Whether the word has occurred as a proper noun in training.
- All possible tags of the current word, as seen in training.
- Whether the word has occurred with only a single tag in the training corpus.
- All possible tags of the next word, as seen in training.
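The sketch below shows how the context-based and word features above might be assembled for one word; the feature names and the example suffix list are placeholders for illustration, not the authors' actual inventory.

```python
import re

def extract_features(words, prev_tags, i, suffixes=("ne", "ka", "ki", "ko")):
    """Build a feature set for word i, loosely following the paper's
    context-based and word features (illustrative sketch)."""
    word = words[i]
    features = {
        "word": word,                                   # current word
        "next_word": words[i + 1] if i + 1 < len(words) else "<END>",
        "prev_tag": prev_tags[-1] if prev_tags else "<START>",
        "prev_two_tags": "+".join(prev_tags[-2:]) if len(prev_tags) >= 2 else "<START>",
        "has_digit": any(ch.isdigit() for ch in word),  # contains a digit
        "is_numeric": word.isdigit(),                   # entirely numeric
        "has_special": bool(re.search(r"[-]", word)),   # special characters like '-'
    }
    for s in suffixes:
        features["suffix_" + s] = word.endswith(s)      # suffix-match features
    return features
```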
3.2 Chunking features
The main feature functions used in Hindi chunking are listed below.
Context-based features:
For chunking, the most suitable context window was empirically found to consist of the words, POS tags and chunk labels of the current word and the two words on either side of it. On the lines of (Singh et al., 2005), we found that for words having specific POS tags (JJ, NN, VFM, PREP, SYM, QF, NEG and RP), adding the current word and its word/POS-tag combination as features reduces the performance of the chunker. We call such a POS tag a nonessential-word tag. For a word, the context-based features consist of:
- The current word, and the word/POS-tag combination, if the POS tag of the current word is not in the list of nonessential-word tags.
- POS tags of all words in the context, individually.
- Combinations of the POS tags of the next two words, of the previous two words, and of the current and previous word, separately.
- Chunk labels of the previous two words, independently.
Current POS tag based features:
For each tag, the list of possible chunk labels for that tag is identified. These chunk labels are used as features. Another feature based on the POS tag of the current word utilizes what we call a tag class. POS tags are classified into groups based on the most likely chunk label for each POS tag, as seen in the training corpus. For example, all POS tags which are most likely to occur in a noun phrase are grouped under one class. The class of the current word's POS tag is used as a feature.
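As a rough illustration of the tag-class idea, the sketch below derives the classes from (POS tag, chunk label) counts over the training data; the input format is an assumption, not the paper's actual data structure.

```python
from collections import Counter, defaultdict

def build_tag_classes(tag_chunk_pairs):
    """Group POS tags by their most likely chunk label.

    `tag_chunk_pairs` is assumed to be an iterable of (pos_tag, chunk_label)
    pairs gathered from the training corpus (hypothetical input format)."""
    counts = defaultdict(Counter)
    for pos_tag, chunk_label in tag_chunk_pairs:
        counts[pos_tag][chunk_label] += 1
    # Map each POS tag to the chunk label it most often occurs under;
    # tags sharing that label form one class.
    return {pos: chunk.most_common(1)[0][0] for pos, chunk in counts.items()}

# Example: NN and NNP both fall into the NP class, VFM into the VG class.
classes = build_tag_classes([("NN", "NP"), ("NN", "NP"), ("NNP", "NP"), ("VFM", "VG")])
```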
4 Experiments
Our system is built for the NLPAI-ML task of POS tagging Indian languages. The tagset of the contest specifies 29 POS tags and 6 chunk labels. The development corpus for the task was provided by the contest organizers. We have conducted experiments with different splits of training and test data.

[Figure 1: POS tagging accuracy with varying training-test data split. Accuracy (0.85 to 0.90) is plotted against training data size (55% to 95%).]

[Figure 2: POS tagging and chunking accuracy (0.82 to 0.92) across 10 runs.]
As can be seen in figure 1, POS tagging accuracy increases with the proportion of training data until it reaches 75%, after which there is a reduction in accuracy due to overfitting of the trained model to the training corpus. Beyond an 85-15 split, increasing the training corpus proportion increases the accuracy again, as the test corpus becomes very small. This prompted us to use a 75-25 split of training and test data in our experiments. The results were averaged across different runs, each time randomly picking the training and test data. Figure 2 shows results using a 75-25 split of training and test data across 10 different runs. Our chunker depends heavily on POS tags and hence, in most cases, its accuracy closely tails the POS tagging accuracy. The best POS tagging accuracy of the system in these runs was 89.34% and the lowest was 87.04%; the average accuracy over the 10 runs was 88.4%. For chunking, the best per-word accuracy of chunk labels in these runs was 87.29% and the lowest was 83.25%, with the average being 86.45%.
Tag     Precision    Recall       Instances
CC      0.9159091    0.9482353    425
NN      0.8416244    0.9221357    1798
PREP    0.95674485   0.9804658    1331
QFN     0.897541     0.93191487   235
JJ      0.77710843   0.73295456   352
VFM     0.9081081    0.8993576    934
PRP     0.9819277    0.9702381    840
QF      0.74285716   0.7819549    133
NLOC    0.90384614   0.8468468    111
JVB     0.68         0.6296296    108
VAUX    0.94126505   0.9272997    674
SYM     0.9758389    0.9972565    729
QW      0.9285714    0.8125       48
INTF    0.64285713   0.5869565    46
NNC     0.6839623    0.6415929    226
RP      0.91056913   0.8924303    251
NVB     0.64880955   0.5369458    203
RB      0.8695652    0.7619048    105
VNN     0.91907513   0.9137931    174
VJJ     0.5555556    0.20833333   24
VRB     0.8333333    0.41666666   24
NEG     0.9894737    0.9791667    96
NNPC    0.88         0.6984127    126
NNP     0.7904762    0.53205127   156
RBVB    0.0          0.0          1
UH      0.0          0.0          3
VV      0.0          0.0          1

Table 1: Statistics for individual POS tags in a run with 75-25 split.
Number of words tagged            9154
Number of words wrongly tagged     975
Accuracy (%)                   89.3489

Table 2: Overall statistics for a run with 75-25 split.
Detailed statistics for a run of the POS tagger with a 75-25 split are provided in tables 1 and 2. From table 1, we can observe that our system performs well on the more frequent verb forms (VAUX, VFM, VNN), on postpositions and on pronouns. However, for proper nouns the performance is not satisfactory, because a considerable number of proper nouns are tagged as common nouns; in most cases, the ambiguity between the two can be resolved only at the semantic level. We also find that compound tags (NNC, NNPC) are often incorrectly tagged as the corresponding non-compound tags (NN, NNP).
5 Conclusion
We have presented a part-of-speech tagger and chunker for Hindi based on the maximum entropy framework. We discussed language-dependent as well as language-independent features suitable for Hindi POS tagging and chunking. We have shown that such a system performs well, with an average accuracy of 88.4% for POS tagging and 86.45% for chunking, the best accuracies being 89.35% and 87.39% for POS tagging and chunking, respectively. We believe that further error analysis and more language-specific features would improve the system's performance, particularly in the case of chunking.
6 Acknowledgment
We would like to thank Dr. Pushpak Bhattacharyya for his guidance. We would also like to thank Manish Shrivastava for many helpful suggestions and comments.
References
Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.

J. N. Darroch and D. Ratcliff. 1972. Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43(5):1470–1480.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Eric Brill and Kenneth Church, editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133–142. Association for Computational Linguistics, Somerset, New Jersey.

Adwait Ratnaparkhi. 1997. A simple introduction to maximum entropy models for natural language processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania, May.

Akshay Singh, Sushma Bendre, and Rajeev Sangal. 2005. HMM based chunker for Hindi. In Proceedings of IJCNLP-05, Jeju Island, Republic of Korea, October.