Assamese Word Sense Disambiguation using
Supervised Learning
Pranjal Protim Borah
Department of Computer Science &
Engineering and IT,
Assam Don Bosco University,
Guwahati, India.
pranjalborah777@gmail.com
Gitimoni Talukdar
Department of Computer Science &
Engineering and IT,
Assam Don Bosco University,
Guwahati, India.
talukdargitimoni@gmail.com
Arup Baruah
Department of Computer Science &
Engineering and IT,
Assam Don Bosco University,
Guwahati, India.
arup.baruah@gmail.com
Abstract— Word sense disambiguation (WSD) can be defined as the task of estimating the right sense of a word in its context. It is important as a pre-processing step in information extraction, machine translation, question answering and many other natural language processing tasks. Ambiguity in word sense arises when a particular word has more than one possible sense. Finding the correct sense requires thorough knowledge about words, which is often derived from sources such as the words appearing in the context of the target word, part-of-speech information of neighbouring words, syntactic relations and local collocations. Our main aim in this paper is to develop an automatic system for WSD in Assamese using a Naive Bayes classifier. To the best of our knowledge, this is the first work on developing an automatic WSD system for the Assamese language. Assamese, the main language of most of the people in the North-Eastern part of India, is a morphologically very rich language. WSD in Assamese is a challenging task because a word combined with a suffix or a sequence of suffixes can take on an entirely different sense. WSD often makes use of lexical resources such as WordNet, lexicons, and annotated or unannotated corpora for its process of disambiguation.
Keywords— Lexicon; WordNet; Local collocations; Polysemic word; Unigram co-occurrence.
I. INTRODUCTION
Word sense disambiguation is often framed as a problem in which a disambiguator is learned from a manually sense-tagged corpus using machine learning approaches. A number of features interact to form the model for the learning algorithm, which in turn is used as a classifier to perform the disambiguation. Ambiguity, in the context of word sense disambiguation, refers to a word instance having two or more senses. The WSD task is therefore to choose the correct sense of the word from a number of predefined possibilities. Words can have more than one possible sense; for example, the Assamese word "kali" has four different senses that are used frequently in day-to-day discussion. These four senses are:
Instrument Sense
"pujat kali bojuwa hoi"
In this sentence the word "kali" is used to represent an instrument. The sentence means "the kali is played at the puja". Here "kali" means a musical instrument and "puja" means a festival.
Way of Measurement Sense
"teu mati kalir hisap rakhe"
In this sentence the word "kali" is used to represent a way of measurement. The sentence means "He keeps measurements of land areas". Here "kali" means area.
Ink Sense
"akhor likhiboloi kali byabahar kora hoi"
In this sentence the word "kali" is used to represent ink. The sentence means "Ink is used to write letters". Here "kali" means ink.
Time Sense
"oha kali ami phuribo jam"
978-1-4799-6629-5/14/$31.00 © 2014 IEEE
In this sentence the words "oha" and "kali" are used together to represent tomorrow. The sentence means "We will travel tomorrow". Here "oha" and "kali" together mean tomorrow.
From the above examples we can see that the Assamese word "kali" creates ambiguity by representing different senses in different contexts.
The primary reasons that make WSD a challenging task are that some dictionary-based definitions of word senses are themselves ambiguous, and that morphology is difficult to handle. Manual tagging of word senses can be done by trained experts in linguistics, but inter-annotator agreement remains a problem, since different annotators may assign different senses to the same word [1]. Another problem in WSD is that a lot of common sense or world knowledge is involved, which sometimes makes it difficult to work with dictionaries. In the supervised learning technique, a program should automatically induce world knowledge and contextual features from a training corpus; these are then used to train the classification model.
The rest of the paper is organized as follows. Section II gives an overview of related work in WSD, followed by Section III, which summarizes various knowledge sources used in WSD. Section IV illustrates our approach to WSD using a Naive Bayes classifier along with experimental results, and Section V concludes the paper.
II. RELATED WORK
In this section we discuss some related work in word sense disambiguation.
Anagha Kulkarni, Michael Heilman, Maxine Eskenazi and Jamie Callan (2006), in "Word Sense Disambiguation for Vocabulary Learning" [2], used supervised and unsupervised approaches to perform word sense disambiguation for vocabulary learning. Word-meaning pairs were learned instead of words. The system was developed to improve vocabulary in English, and the supervised approaches were found to be more accurate than the unsupervised ones.
Rigau et al. (1997) reported an 8% increase in precision when a combination of disambiguation methods was used. The study included some methods involving the most frequent sense.
Peter D. Turney, in "Word Sense Disambiguation by Web Mining", developed a supervised word sense approach for the National Research Council (NRC). The approach used both syntactic and semantic features; Brill's rule-based part-of-speech tagger and the Weka machine learning software were used, and word co-occurrence probabilities were considered for inducing the semantic features.
Manish Sinha, Mahesh Kumar Reddy R., Pushpak Bhattacharyya, Prabhakar Pandey and Laxmi Kashyap, in "Hindi Word Sense Disambiguation" [8], reported the first attempt at developing an automatic WSD system for an Indian language. The Hindi WordNet was used to disambiguate Hindi words. Only nouns were considered, and the accuracy ranged from 40% to 70%.
In "Word Sense Disambiguation using statistical models of Roget's categories trained on large corpora" by David Yarowsky, 1992 [3], word associations were used to find the senses of words; a supervised learning approach combined with a thesaurus was proposed as a solution to the WSD problem.
In "Knowledge based approaches to Nepali Word Sense Disambiguation" (2014) by Arindam Roy, Sunita Sarkar and Bipul Syam Purkayastha [4], overlap-based, conceptual-distance and semantic-graph based approaches were used to perform WSD in Nepali. The accuracy for nouns and adjectives using the overlap-based approach was approximately 54% and 42% respectively. As evident from the experimental results, the combination of the conceptual-distance and semantic-graph based approaches gave better results than the overlap-based approach.
Finally, "A Decision Tree Based Word Sense Disambiguation System in Manipuri Language" by Richard Laishram Singh, Krishnendu Ghosh, Kishorjit Nongmeikapam and Sivaji Bandyopadhyay (2014) [5] suggested positional and contextual features for developing a word sense disambiguation system for the Manipuri language.
III. KNOWLEDGE SOURCES USED IN WSD
A. Lexical knowledge
Lexical knowledge forms the main backbone of unsupervised WSD approaches. The main lexical knowledge sources are as follows:
1) Part of speech (POS) information: It often helps to
disambiguate senses in a partial or full manner when the
target word’s POS information is given [6].
2) Sense frequency: It is the number of times the sense
of a particular word has been used. Sense frequency is often
used in WSD algorithms where the algorithm selects the
most frequent sense for the target word.
3) Selectional Restrictions: It often reduces the number
of possible word senses by applying semantic constraints on
the word sense as certain word senses can only occur with
particular subjects and objects [6].
2014 International Conference on Contemporary Computing and Informatics (IC3I)
4) Sense glosses: These constitute examples and explanations of word senses. The target word's context and the gloss may have some words in common, which can be taken into consideration when assigning a sense to the target word.
5) Subject code: It refers to classes assigned to each meaning of the target word. The subject code can be determined from indicative words, and the sense is selected depending on the association between subject code and word sense. Indicative words can be derived from the training corpus [3][6].
6) Concept trees: These represent the relationships of the target word in the form of semantic networks [7]. Meronymy, hyponymy, synonymy, hypernymy and holonymy are the most familiar relationships.
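As a small illustration of how the sense-frequency source (item 2 above) can be used on its own, the sketch below implements a most-frequent-sense baseline that ignores context entirely. The sense-tagged occurrences here are invented for illustration and are not taken from the paper's corpus:

```python
from collections import Counter

# Hypothetical sense-tagged occurrences of the Assamese word "kali";
# the counts are invented for illustration only.
tagged_occurrences = [
    ("kali", "TIME"), ("kali", "TIME"), ("kali", "INK"),
    ("kali", "TIME"), ("kali", "INSTRUMENT"),
]

def most_frequent_sense(word, tagged):
    """Most-frequent-sense baseline: pick the sense that was tagged
    most often for this word in the training data."""
    counts = Counter(sense for w, sense in tagged if w == word)
    return counts.most_common(1)[0][0]

print(most_frequent_sense("kali", tagged_occurrences))  # TIME
```

This baseline is what many WSD algorithms fall back on when contextual evidence is insufficient.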
B. World knowledge
World knowledge can be acquired automatically from
the corpus during the training phase. Some contextual
features that can be used for machine learning technique are
given below:
1) Domain-specific knowledge: Like selectional restrictions, it places semantic restrictions on the usage of word senses. The training corpus provides the domain-specific knowledge [9].
2) Indicative words: These often indicate the senses of target words and usually occur as words surrounding the target word.
3) Parallel corpora: These involve two languages, a primary language and a secondary language. Some verbs and nouns can be aligned, and if they refer to a common concept, this information can be used to find the senses of some words in the primary language.
4) Syntactic features: These reflect the structure of sentences and their constituents. A syntactic feature can be implemented as a Boolean feature that is set to 1 if a syntactic object exists, or as a feature indicating whether a word occurs in the position of prepositional complement, direct object, subject or indirect object [9][1].
IV. EXPERIMENTAL RESULTS
Natural language processing in Assamese is difficult because of limited computational resources such as annotated corpora, openly available machine-readable dictionaries and a WordNet. For the experiment, a POS- and sense-tagged Assamese corpus with a knowledge source was first created, covering 25 highly polysemic words, using articles collected from an Assamese class X textbook. The training corpus contains approximately 750 words and the test corpus approximately 1300 words, in which the set of 25 polysemic words appears 73 times and 135 times respectively, covering all possible senses. The statistical details of the training and test corpora are shown below in TABLE I and TABLE II.
TABLE I. Numbers of different tags present in the Training Corpus.
Tags Number of entries
NN 385
PRP 40
JJ 101
VB 95
RB 15
VAUX 88
CC 5
PREP 15
QF 10
TABLE II. Numbers of different tags present in the Test Corpus.
Tags Number of entries
NN 686
PRP 97
JJ 56
VB 235
RB 32
VAUX 89
CC 23
PREP 51
QF 39
A sense tag list was created for the polysemic words present in the corpus. It gives the appropriate sense for each polysemic word appearing in the training corpus. POS and sense tagging was done under the supervision of a linguistic expert.
The task of WSD using a Naive Bayes classifier with rich features can obtain high accuracy [10]. Keeping this in mind, we have used a Naive Bayes classifier in our work. The Naive Bayes approach has been widely used in classification work and was first applied to the WSD task by William A. Gale, Kenneth W. Church and David Yarowsky in "A Method for Disambiguating Word Senses in a Large Corpus", 1992 [11]. Naive Bayes classifiers work on the assumption that all the features representing a problem are class-conditionally independent. In the word sense disambiguation problem, the feature vector is represented by F = (f1, f2, ..., fn) and the k possible senses of the ambiguous word by (S1, S2, ..., Sk). Classifying the right sense of the target ambiguous word w is then the task of finding the sense Si that maximizes the conditional probability P(w = Si | F).
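The decision rule above can be sketched as follows. This is an illustrative implementation, not the authors' code: it chooses the sense maximizing log P(S) plus the sum of log P(f | S), and uses add-one smoothing, a detail the paper does not specify. The toy training contexts for "kali" are invented:

```python
import math
from collections import defaultdict

class NaiveBayesWSD:
    """Pick the sense S_i maximizing P(S_i) * prod_j P(f_j | S_i),
    which is proportional to P(w = S_i | F) under the
    class-conditional independence assumption."""

    def __init__(self):
        self.sense_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, examples):
        # examples: list of (feature list F, sense label S)
        for features, sense in examples:
            self.sense_counts[sense] += 1
            for f in features:
                self.feature_counts[sense][f] += 1
                self.vocab.add(f)

    def classify(self, features):
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, float("-inf")
        for sense, count in self.sense_counts.items():
            # log P(S) plus sum of log P(f | S), add-one smoothed
            score = math.log(count / total)
            denom = sum(self.feature_counts[sense].values()) + len(self.vocab)
            for f in features:
                score += math.log((self.feature_counts[sense][f] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

# Toy usage: hypothetical context words for two senses of "kali"
clf = NaiveBayesWSD()
clf.train([
    (["puja", "bojuwa"], "INSTRUMENT"),
    (["akhor", "likhiboloi"], "INK"),
])
print(clf.classify(["puja"]))  # INSTRUMENT
```

Working in log space avoids floating-point underflow when many features are multiplied together.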
Word sense disambiguation requires knowledge sources such as a machine-readable dictionary and possibly a WordNet. To overcome the lack of these resources, we have developed a lexicon containing the words used in the corpus, in which word senses are included along with their corresponding synonyms, with the help of an Assamese-English dictionary [12]. In addition, the most frequently appearing words and collocations for each particular sense are also available in our lexicon.
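One plausible in-memory shape for such a lexicon maps each word to its senses, with synonyms and frequent collocations stored per sense. The field names and entries below are our own illustration, not the paper's actual format:

```python
# Hypothetical lexicon entry for the polysemous word "kali":
# per-sense synonyms and frequently co-occurring collocations.
lexicon = {
    "kali": {
        "INK": {
            "synonyms": ["ink"],
            "collocations": ["akhor", "likhiboloi"],  # writing context
        },
        "TIME": {
            "synonyms": ["tomorrow", "yesterday"],
            "collocations": ["oha"],  # "oha kali" = tomorrow
        },
    }
}

def senses_of(word):
    """Return the list of known senses for a word, empty if unknown."""
    return list(lexicon.get(word, {}))

print(senses_of("kali"))  # ['INK', 'TIME']
```

Keeping collocations per sense lets the classifier look up sense-indicative neighbours directly during feature extraction.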
The features used in our system are:
1) Unigram Co-occurrences (UCO): Co-occurrences are pairs of words that tend to occur in the same context, not necessarily in any order and with a variable number of intermediary words [13]. Here we have considered a window of two, i.e., the two previous words and the two next words, where available.
2) POS of Target Word (POST): Some words have different possible senses in different parts of speech.
3) POS of Next Word (POSN): The Assamese language uses auxiliary verbs to represent an action related to a noun, and these auxiliary verbs usually appear immediately after the noun. The POS of the word following the target word therefore also helps in WSD.
4) Local Collocation (LC): An ordered sequence of words that occurs together frequently, and represents a particular sense of a polysemic word contained in that sequence every time, helps in word sense disambiguation [14].
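A minimal sketch of extracting these four features from a POS-tagged sentence follows. The token/tag representation, the helper name, and the choice of a (previous, target, next) triple as the local collocation are our assumptions, not details taken from the paper:

```python
def extract_features(tokens, pos_tags, i, window=2):
    """Features for the target word at index i:
    UCO  - unigram co-occurrences within a +/- window of 2 words,
    POST - POS of the target word,
    POSN - POS of the next word (if any),
    LC   - local collocation: the ordered (prev, target, next) sequence."""
    n = len(tokens)
    uco = [tokens[j] for j in range(max(0, i - window), min(n, i + window + 1))
           if j != i]
    post = pos_tags[i]
    posn = pos_tags[i + 1] if i + 1 < n else None
    lc = (tokens[i - 1] if i > 0 else None,
          tokens[i],
          tokens[i + 1] if i + 1 < n else None)
    return {"UCO": uco, "POST": post, "POSN": posn, "LC": lc}

# Example: "teu mati kalir hisap rakhe" with hypothetical POS tags
tokens = ["teu", "mati", "kalir", "hisap", "rakhe"]
tags = ["PRP", "NN", "NN", "NN", "VB"]
feats = extract_features(tokens, tags, 2)
print(feats["UCO"])  # ['teu', 'mati', 'hisap', 'rakhe']
```

The window truncates gracefully at sentence boundaries, matching the "if available" qualification in the UCO description.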
A number of experiments have been carried out with different combinations of the features available in our system. As shown in TABLE III, our system gives an F1-measure of 55.6% when the unigram co-occurrence feature is used alone. The combination of the POS-of-target-word and POS-of-next-word features gives an F1-measure of 33.3%. When the unigram co-occurrence, POS-of-target-word and POS-of-next-word features are combined, an F1-measure of 62.9% is obtained. Unigram co-occurrence and local collocation together give an improved F1-measure of 73.3%. The best result is obtained when all the features are combined.
TABLE III. Performance Measure of the system for different combination
of features.
Features Precision Recall F1 Measure
UCO 62.5% 50% 55.6%
POST+ POSN 37.5% 30% 33.3%
UCO+POST+POSN 66.7% 60% 62.9%
UCO+LC 77.8% 70% 73.3%
All Together 86% 86% 86%
The system performs best, with an F1-measure of 86%, when all four features are combined. The results also show that the combination of the local collocation and unigram co-occurrence features (73.3%) outperforms the unigram feature used alone (55.6%), the combination of the POST and POSN features (33.3%), and the combination of the UCO, POST and POSN features (62.9%).
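The reported F1-measures are the harmonic mean of precision and recall. For instance, the first two rows of TABLE III can be checked directly; the remaining rows differ slightly from this recomputation, presumably because the reported precision and recall values are themselves rounded:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (inputs in percent)."""
    return 2 * precision * recall / (precision + recall)

# Checking TABLE III rows from the reported precision/recall
print(round(f1(62.5, 50.0), 1))  # 55.6  (UCO alone)
print(round(f1(37.5, 30.0), 1))  # 33.3  (POST + POSN)
print(round(f1(86.0, 86.0), 1))  # 86.0  (all features)
```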
V. CONCLUSION
This is a first step towards Assamese word sense disambiguation. Due to the limited availability of open computational resources for the Assamese language, some feature extraction processes are difficult to implement; this lack of standard computational resources is the major limitation of our system. For better performance, the relations among synsets in WordNets could be used, along with careful handling of morphology. As WSD is an important component of most NLP research, this work should have a fruitful effect on Assamese natural language processing.
VI. ACKNOWLEDGMENTS
We would like to thank Ms. Deepa Hazarika for her
valuable technical advice and tremendous support in the
process of POS tagging.
REFERENCES
[1] C. Fellbaum and M. Palmer, "Manual and Automatic Semantic Annotation with WordNet", Proceedings of the NAACL 2001 Workshop, 2001.
[2] A. Kulkarni, M. Heilman, M. Eskenazi and J. Callan, "Word Sense Disambiguation for Vocabulary Learning", Proceedings of the Ninth International Conference on Spoken Language Processing, 2006.
[3] D. Yarowsky, "Word Sense Disambiguation using statistical models of Roget's categories trained on large corpora", in Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pp. 454-460, Nantes, France, 1992.
[4] A. Roy, S. Sarkar and B. S. Purkayastha, "Knowledge based approaches to Nepali Word Sense Disambiguation", Proceedings of the International Conference on Natural Language Processing and Cognitive Computing, India, 2014.
[5] R. L. Singh, K. Ghosh, K. Nongmeikapam and S. Bandyopadhyay, "A Decision Tree Based Word Sense Disambiguation System in Manipuri Language", Proceedings of the International Conference on Natural Language Processing and Cognitive Computing, 2014.
[6] M. Stevenson and Y. Wilks, "The Interaction of Knowledge Sources in Word Sense Disambiguation", Computational Linguistics, vol. 27, no. 3, pp. 321-349, 2001.
[7] C. Fellbaum, WordNet: An Electronic Lexical Database, Cambridge: MIT Press, 1998.
[8] M. Sinha, M. K. Reddy, P. Pande, L. Kashyap and P. Bhattacharyya, "Hindi Word Sense Disambiguation", International Symposium on Machine Translation, Natural Language Processing and Translation Support Systems, Delhi, India, November 2004.
[9] P. Hastings et al., "Inferring the meaning of verbs from context", Proceedings of the Twentieth Annual Conference of the Cognitive Science Society (CogSci-98), Madison, Wisconsin, 1998.
[10] C. A. Le and A. Shimazu, "High WSD accuracy using Naive Bayesian classifier with rich features", PACLIC 18, 2004.
[11] W. Gale, K. Church and D. Yarowsky, "A Method for Disambiguating Word Senses in a Large Corpus", Computers and the Humanities, vol. 26, pp. 415-439, 1992.
[12] H. Baruah, Hemkosha, Hemkosh Prakashan, Guwahati, sixth edition, 1985.
[13] F. A. Smadja, "Lexical co-occurrence: The missing link", Literary and Linguistic Computing, vol. 4, no. 3, pp. 163-168, 1989.
[14] D. Yarowsky, "One sense per collocation", in Proceedings of the Workshop on Human Language Technology, pp. 266-271, March 1993.