Assamese Word Sense Disambiguation using
Supervised Learning
Pranjal Protim Borah
Department of Computer Science &
Engineering and IT,
Assam Don Bosco University,
Guwahati, India.
pranjalborah777@gmail.com
Gitimoni Talukdar
Department of Computer Science &
Engineering and IT,
Assam Don Bosco University,
Guwahati, India.
talukdargitimoni@gmail.com
Arup Baruah
Department of Computer Science &
Engineering and IT,
Assam Don Bosco University,
Guwahati, India.
arup.baruah@gmail.com
Abstract— Word sense disambiguation (WSD) can be defined as the task of estimating the right sense of a word in its context. It is important as a pre-processing step in information extraction, machine translation, question answering and many other natural language processing tasks. Ambiguity arises when a particular word has more than one possible sense, and finding the correct sense requires thorough knowledge about words. This knowledge is often derived from sources such as the words appearing in the context of the target word, part-of-speech information of the neighbouring words, syntactic relations and local collocations. Our main aim in this paper is to develop an automatic system for WSD in Assamese using a Naive Bayes classifier. To the best of our knowledge, this is the first work on developing an automatic WSD system for the Assamese language. Assamese, the main language of most of the people in the North-Eastern part of India, is a morphologically very rich language. WSD in Assamese is challenging because a word can behave differently when combined with a suffix or a sequence of suffixes, taking on an entirely different sense. WSD often makes use of lexical resources such as a WordNet, a lexicon, and annotated or unannotated corpora for its process of disambiguation.
Keywords— Lexicon; WordNet; Local collocations; Polysemic word; Unigram co-occurrence.
I. INTRODUCTION
Word sense disambiguation is often depicted as a problem whereby, using machine learning approaches, a disambiguator is generated from a manually sense-tagged corpus. A number of features interact with each other to form the model representing the learning algorithm, which in turn is used as the classifier to perform the disambiguation task. Ambiguity, in the context of word sense disambiguation, refers to a word instance having two or more senses. The WSD task is hence to choose the correct sense of the word from a number of predefined possibilities. For example, the Assamese word kali may have four different senses that are used frequently in day-to-day conversation. These four senses are:
• Instrument sense
pujat kali bojuwa hoi
In this sentence the word kali is used to represent an instrument. The sentence means "kali is played in the puja". Here "kali" is an instrument and "puja" is a festival.
• Measurement sense
teu mati kalir hisap rakhe
In this sentence the word kali (here inflected as kalir) is used to represent a way of measurement. The sentence means "He keeps measurements of land areas". Here "kali" means area.
• Ink sense
akhor likhiboloi kali byabahar kora hoi
In this sentence the word kali is used to represent ink. The sentence means "Ink is used to write letters". Here "kali" means ink.
• Time sense
oha kali ami phuribo jam
In this sentence the word kali, together with the word oha, is used to represent tomorrow. The sentence means "We will travel tomorrow". Here "oha" and "kali" together mean tomorrow.

978-1-4799-6629-5/14/$31.00 © 2014 IEEE
From the above examples we can see that the Assamese word kali creates ambiguity by representing different senses in different contexts.
The primary reasons that make WSD challenging are that some dictionary-based definitions of word senses are themselves ambiguous, and that the morphology is difficult to handle. Manual sense tagging can be done by trained linguists, but inter-annotator agreement remains a problem, as different annotators may assign different senses to the same word [1]. Another problem in WSD is that a lot of common-sense or world knowledge is involved, which makes it difficult to work with dictionaries alone. In supervised learning, a program should automatically induce world knowledge and contextual features from a training corpus; these are then used to train the classification model.
The paper is organized as follows. Section II gives an overview of related work in WSD, section III summarizes the various knowledge sources used in WSD, section IV describes our approach to WSD using a Naive Bayes classifier together with experimental results, and section V concludes the paper.
II. RELATED WORK
In this section we discuss some related work in word sense disambiguation.
Anagha Kulkarni, Michael Heilman, Maxine Eskenazi and Jamie Callan (2006), in "Word Sense Disambiguation for Vocabulary Learning" [2], used supervised and unsupervised approaches to perform word sense disambiguation for vocabulary learning. Word-meaning pairs were learned instead of words. The system was developed to improve English vocabulary, and the supervised approaches were found to be more accurate than the unsupervised ones.
Rigau et al. (1997) reported an 8% increase in precision when a combination of disambiguation methods was used. The study included some methods based on the most frequent sense.
In "Word Sense Disambiguation by Web Mining", Peter D. Turney developed a supervised WSD approach for the National Research Council (NRC). The approach used both syntactic and semantic features; Brill's rule-based part-of-speech tagger and the Weka machine learning software were used, and word co-occurrences were considered for inducing the semantic features.
Manish Sinha, Mahesh Kumar Reddy R., Pushpak Bhattacharyya, Prabhakar Pandey and Laxmi Kashyap, in "Hindi Word Sense Disambiguation" [8], reported the first attempt at developing an automatic WSD system for an Indian language. The Hindi WordNet was used to disambiguate Hindi words. Only nouns were considered, and the accuracy ranged from 40% to 70%.
In "Word Sense Disambiguation using statistical models of Roget's categories trained on large corpora" by David Yarowsky (1992) [3], word associations were used for finding the senses of words. A supervised learning approach and a thesaurus were proposed as solutions to the WSD problem.
In "Knowledge based approaches to Nepali Word Sense Disambiguation" (2014) by Arindam Roy, Sunita Sarkar and Bipul Syam Purkayastha [4], overlap-based, conceptual-distance and semantic-graph based approaches were used to perform WSD in Nepali. The accuracies for nouns and adjectives using the overlap-based approach were approximately 54% and 42% respectively. The experimental results showed that the combination of the conceptual-distance and semantic-graph based approaches gave better results than the overlap-based approach.
In "A Decision Tree Based Word Sense Disambiguation System in Manipuri Language" by Richard Laishram Singh, Krishnendu Ghosh, Kishorjit Nongmeikapam and Sivaji Bandyopadhyay (2014) [5], positional and contextual features were suggested for developing a word sense disambiguation system for the Manipuri language.
III. KNOWLEDGE SOURCES USED IN WSD
A. Lexical knowledge
Lexical knowledge forms the main background of unsupervised WSD approaches. The main lexical knowledge sources are as follows:
1) Part of speech (POS) information: It often helps to disambiguate senses partially or fully when the target word's POS information is given [6].
2) Sense frequency: It is the number of times the sense
of a particular word has been used. Sense frequency is often
used in WSD algorithms where the algorithm selects the
most frequent sense for the target word.
3) Selectional restrictions: They often reduce the number of possible word senses by applying semantic constraints, as certain word senses can only occur with particular subjects and objects [6].
2014 International Conference on Contemporary Computing and Informatics (IC3I) 947
4) Sense glosses: These constitute examples and explanations of word senses. The target word's context and the gloss may have some words in common, which can be taken into account when assigning a sense to the target word.
5) Subject code: It refers to classes assigned to each meaning of the target word. The subject code can be determined from indicative words, and the sense is selected based on the association between subject code and word sense. Indicative words can be derived from the training corpus [3][6].
6) Concept trees: They represent the relationships of the target word in the form of semantic networks [7]. Meronymy, hyponymy, synonymy, hypernymy and holonymy are the most familiar relationships.
B. World knowledge
World knowledge can be acquired automatically from
the corpus during the training phase. Some contextual
features that can be used for machine learning technique are
given below:
1) Domain-specific knowledge: Like selectional restrictions, it puts semantic restrictions on the usage of word senses. The training corpus provides the domain-specific knowledge [9].
2) Indicative words: These often indicate the senses of target words, and usually occur in the immediate neighbourhood of the target word.
3) Parallel corpora: These use two languages, a primary and a secondary language. Some verbs and nouns can be aligned, and if they refer to a common concept, this information can be used to find the senses of some words in the primary language.
4) Syntactic features: These reflect the structure and constituents of sentences. A syntactic feature can be implemented as a Boolean feature that is set to 1 if a syntactic object exists, or as a feature indicating whether a word occurs in the position of prepositional complement, direct object, subject or indirect object [9][1].
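The Boolean implementation mentioned above can be sketched as follows (our illustration only; the role names are hypothetical and not taken from [9] or [1]):

```python
def syntactic_features(roles):
    """Encode which grammatical positions the target word occupies
    as Boolean (0/1) indicator features."""
    positions = ["subject", "direct_object", "indirect_object",
                 "prepositional_complement"]
    return {"is_" + p: int(p in roles) for p in positions}

# A target word observed as a direct object yields one indicator set to 1.
print(syntactic_features({"direct_object"}))
```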
IV. EXPERIMENTAL RESULTS
Natural language processing in Assamese is difficult because of limited computational resources such as annotated corpora, an openly available machine-readable dictionary and a WordNet. For the experiment, a POS- and sense-tagged Assamese corpus with a knowledge source was first created, containing 25 highly polysemic words, using articles collected from an Assamese class X textbook. The training corpus contains approximately 750 words and the test corpus approximately 1300 words, in which the set of 25 polysemic words appears 73 times and 135 times respectively, covering all possible senses. The statistical details of the training and test corpora are shown below in TABLE I and TABLE II.
TABLE I. Numbers of different tags present in the Training Corpus.
Tags Number of entries
NN 385
PRP 40
JJ 101
VB 95
RB 15
VAUX 88
CC 5
PREP 15
QF 10
TABLE II. Numbers of different tags present in the Test Corpus.
Tags Number of entries
NN 686
PRP 97
JJ 56
VB 235
RB 32
VAUX 89
CC 23
PREP 51
QF 39
A sense tag list is created for the polysemic words present in the corpus; it gives the appropriate sense for each polysemic word appearing in the training corpus. POS and sense tagging was done under the supervision of a linguistic expert.
The task of WSD using a Naive Bayes classifier with richer features can achieve high accuracy [10]. Keeping this in mind, we have used a Naive Bayes classifier in our work. The Naive Bayes approach has been widely used in classification work, and was first applied to the WSD task by William A. Gale, Kenneth W. Church and David Yarowsky in "A Method for Disambiguating Word Senses in a Large Corpus" (1992) [11]. Naive Bayes classifiers work on the assumption that all the features representing a problem are class-conditionally independent. In the word sense disambiguation problem, the feature vector is represented by F = (f1, f2, ..., fn) and the k possible senses of the ambiguous word are represented by (S1, S2, ..., Sk). Classifying the right sense of the target ambiguous word w is then the task of finding the sense Si that maximizes the conditional probability P(w = Si | F).
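A minimal Python sketch of this decision rule (our illustration, not the paper's implementation), with add-one smoothing and a few hypothetical romanized context words as features:

```python
import math
from collections import Counter, defaultdict

def train_nb(instances):
    """Estimate sense priors and per-sense feature counts from
    (feature_list, sense) training pairs."""
    sense_counts = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in instances:
        sense_counts[sense] += 1
        for f in feats:
            feat_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feat_counts, vocab

def classify(feats, sense_counts, feat_counts, vocab):
    """Return the sense Si maximizing log P(Si) + sum_j log P(fj | Si),
    using add-one smoothing over the feature vocabulary."""
    total = sum(sense_counts.values())
    best_sense, best_lp = None, float("-inf")
    for sense, count in sense_counts.items():
        lp = math.log(count / total)
        denom = sum(feat_counts[sense].values()) + len(vocab)
        for f in feats:
            lp += math.log((feat_counts[sense][f] + 1) / denom)
        if lp > best_lp:
            best_sense, best_lp = sense, lp
    return best_sense

# Hypothetical training instances for the ambiguous word "kali"
# (romanized context words as features; illustrative only).
train = [
    (["pujat", "bojuwa"], "instrument"),
    (["mati", "hisap"], "measurement"),
    (["akhor", "likhiboloi"], "ink"),
    (["oha", "jam"], "time"),
]
model = train_nb(train)
print(classify(["akhor", "likhiboloi"], *model))  # → ink
```

Working in log space avoids underflow when many features are multiplied, and the smoothing keeps unseen feature-sense pairs from zeroing out a sense.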
Word sense disambiguation requires knowledge sources such as a machine-readable dictionary and possibly a WordNet. To overcome the lack of these resources, we have developed a lexicon containing the words used in the corpus, in which word senses are included along with the corresponding synonyms, with the help of an Assamese-English dictionary [12]. In addition, the most frequently appearing words and collocations are also recorded with respect to each particular sense in our lexicon.
The features used in our system are:
1) Unigram Co-occurrence (UCO): Co-occurrences are pairs of words that tend to occur in the same context, not necessarily in any order and with a variable number of intermediary words [13]. Here we have considered a window of two, i.e. the two previous and the two next words, where available.
2) POS of Target Word (POST): Some words have separate possible senses in different parts of speech.
3) POS of Next Word (POSN): The Assamese language uses auxiliary verbs to represent an action related to a noun, and these auxiliary verbs usually appear immediately after the noun concerned. The POS of the word following the target word therefore also helps in WSD.
4) Local Collocation (LC): Words in an ordered sequence that occur together frequently, representing a particular sense of a polysemic word contained in that sequence, help in word sense disambiguation [14].
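Extraction of these four features from a POS-tagged sentence can be sketched roughly as follows (our illustration; the tag names follow TABLE I, and the choice of adjacent word bigrams as local collocations is an assumption, not the paper's exact implementation):

```python
def extract_features(tokens, tags, i):
    """Build the feature list for the target word at position i.

    tokens, tags: parallel lists of words and POS tags for one sentence.
    """
    feats = []
    # 1) Unigram co-occurrences: two previous and two next words, if available.
    for j in range(max(0, i - 2), min(len(tokens), i + 3)):
        if j != i:
            feats.append("UCO=" + tokens[j])
    # 2) POS of the target word.
    feats.append("POST=" + tags[i])
    # 3) POS of the next word (e.g. an auxiliary verb following a noun).
    if i + 1 < len(tokens):
        feats.append("POSN=" + tags[i + 1])
    # 4) Local collocations: ordered sequences containing the target word.
    if i >= 1:
        feats.append("LC=" + tokens[i - 1] + "_" + tokens[i])
    if i + 1 < len(tokens):
        feats.append("LC=" + tokens[i] + "_" + tokens[i + 1])
    return feats

# Romanized example sentence "pujat kali bojuwa hoi" with target "kali".
print(extract_features(["pujat", "kali", "bojuwa", "hoi"],
                       ["NN", "NN", "VB", "VAUX"], 1))
```

Each feature is prefixed with its type so that, for instance, the word "pujat" as a co-occurrence and as part of a collocation remain distinct features for the classifier.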
A number of experiments have been carried out with different combinations of the features available in our system. As shown in TABLE III, our system gives an F1-measure of 55.6% when the unigram co-occurrence feature is used alone. The combination of the POS-of-target-word and POS-of-next-word features gives an F1-measure of 33.3%. When unigram co-occurrence, POS of target word and POS of next word are combined, an F1-measure of 62.9% is obtained. Unigram co-occurrence and local collocation together give an improved F1-measure of 73.3%. The best result is obtained when all the features are combined.
TABLE III. Performance measures of the system for different combinations of features.
Features Precision Recall F1 Measure
UCO 62.5% 50% 55.6%
POST+ POSN 37.5% 30% 33.3%
UCO+POST+POSN 66.7% 60% 62.9%
UCO+LC 77.8% 70% 73.3%
All Together 86% 86% 86%
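The F1-measure in TABLE III is the harmonic mean of precision and recall; for instance, the first two rows can be checked as follows (a small sketch; the later rows do not reproduce exactly from the rounded percentages shown, presumably because they were computed from unrounded counts):

```python
def f1_measure(precision, recall):
    """F1 is the harmonic mean of precision and recall (as fractions)."""
    return 2 * precision * recall / (precision + recall)

# First two rows of TABLE III, values in percent:
print(round(100 * f1_measure(0.625, 0.50), 1))  # → 55.6  (UCO alone)
print(round(100 * f1_measure(0.375, 0.30), 1))  # → 33.3  (POST+POSN)
```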
The system performed best, with an F1-measure of 86%, when all four features were combined. The results also show that the combination of the local collocation and unigram co-occurrence features (73.3%) performs better than the unigram feature alone, than the combination of the two POS features, and than the combination of the unigram feature with the two POS features.
V. CONCLUSION
This is a first step towards Assamese word sense disambiguation. Due to the limited availability of open computational resources for the Assamese language, some feature extraction processes are difficult to implement, and this lack of standard computational resources is the major limitation of our system. For better performance, the relations among synsets in WordNet could be used, with careful handling of morphology. As WSD is an important component of most NLP research, this work should have a fruitful effect on Assamese natural language processing.
VI. ACKNOWLEDGMENTS
We would like to thank Ms. Deepa Hazarika for her
valuable technical advice and tremendous support in the
process of POS tagging.
REFERENCES
[1] C. Fellbaum and M. Palmer, "Manual and Automatic Semantic Annotation with WordNet", Proceedings of the NAACL 2001 Workshop, 2001.
[2] Anagha Kulkarni, Michael Heilman, Maxine Eskenazi and Jamie Callan, "Word Sense Disambiguation for Vocabulary Learning", Proceedings of the Ninth International Conference on Spoken Language Processing, 2006.
[3] David Yarowsky, "Word Sense Disambiguation using statistical models of Roget's categories trained on large corpora", in Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 454-460, Nantes, France, 1992.
[4] Arindam Roy, Sunita Sarkar and Bipul Syam Purkayastha, "Knowledge based approaches to Nepali Word Sense Disambiguation", Proceedings of the International Conference on Natural Language Processing and Cognitive Computing, India, 2014.
[5] Richard Laishram Singh, Krishnendu Ghosh, Kishorjit Nongmeikapam and Sivaji Bandyopadhyay, "A Decision Tree Based Word Sense Disambiguation System in Manipuri Language", Proceedings of the International Conference on Natural Language Processing and Cognitive Computing, 2014.
[6] M. Stevenson and Y. Wilks, "The Interaction of Knowledge Sources in Word Sense Disambiguation", Computational Linguistics, Vol. 27, No. 3, pp. 321-349, 2001.
[7] C. Fellbaum, WordNet: An Electronic Lexical Database, Cambridge: MIT Press, 1998.
[8] Manish Sinha, Mahesh Kumar Reddy, Prabhakar Pande, Laxmi Kashyap and Pushpak Bhattacharyya, "Hindi Word Sense Disambiguation", International Symposium on Machine Translation, Natural Language Processing and Translation Support Systems, Delhi, India, November 2004.
[9] P. Hastings et al., "Inferring the meaning of verbs from context", Proceedings of the Twentieth Annual Conference of the Cognitive Science Society (CogSci-98), Madison, Wisconsin, 1998.
[10] Cuong Anh Le and Akira Shimazu, "High WSD accuracy using Naive Bayesian classifier with rich features", PACLIC 18, 2004.
[11] W. Gale, K. Church and D. Yarowsky, "A Method for Disambiguating Word Senses in a Large Corpus", Computers and the Humanities, vol. 26, pp. 415-439, 1992.
[12] Hemchandra Baruah, Hem Kosha, Hemkosh Prakashan, Guwahati, sixth edition, 1985.
[13] F. A. Smadja, "Lexical co-occurrence: The missing link", Literary and Linguistic Computing, 4(3): 163-168, 1989.
[14] D. Yarowsky, "One sense per collocation", in Proceedings of the Workshop on Human Language Technology, pp. 266-271, March 1993.