WSD for Assamese Language
Pranjal Protim Borah, Gitimoni Talukdar and Arup Baruah
Abstract Word sense ambiguity arises when a lexeme is associated with more than
one sense. In this research work, an improvement has been proposed and evaluated
for our previously developed Assamese Word-Sense Disambiguation (WSD) system,
in which the potential of semantic features had been evaluated only to a limited
extent. As semantic relationship information benefits most natural language
processing (NLP) tasks, in this work the system is developed based on a supervised
learning approach using a Naïve Bayes classifier with syntactic as well as semantic
features. By incorporating the Semantically Related Words (SRW) feature in our
feature set, the overall performance of the system has been improved to 91.11% in
terms of F1-measure, compared to 86% for the previously developed system.
Keywords Word sense ambiguity · Naïve Bayes classifier · Semantic feature ·
Corpus · Prior probability
1 Introduction
The existence of multiple senses for a single lexeme is one of the common
characteristics of natural languages. This property of words creates ambiguity
P. P. Borah (B)
Department of Design, Indian Institute of Technology Guwahati, Guwahati, India
e-mail: pranjalborah777@gmail.com
G. Talukdar
Department of Computer Science and Engineering, Royal Group of Institutions,
Guwahati, India
e-mail: talukdargitimoni@gmail.com
A. Baruah
Department of Computer Science and Engineering, Assam Don Bosco University,
Guwahati, India
e-mail: arup.baruah@gmail.com
© Springer Nature Singapore Pte Ltd. 2019
J. Kalita et al. (eds.), Recent Developments in Machine Learning and Data Analytics,
Advances in Intelligent Systems and Computing 740,
https://doi.org/10.1007/978-981-13-1280-9_11
by allowing the reader to deduce more than one meaning for a single word. Most of
the time it is easy for a human to detect the appropriate sense, but for automated
language processing systems this disambiguation becomes a critical task. In
computational linguistics, WSD can be considered the task of determining the
correct sense of a word that creates ambiguity in its context.
Ambiguity is the quality of having more than one permissible interpretation. In
computational linguistics, a sentence is said to be ambiguous if it can be understood
in two or more possible ways [1]. Word sense ambiguity occurs when a word or
a lexeme is associated with more than one meaning or sense. It is a long-standing
problem in NLP, which has been discussed in reference to machine translation [2].
For example, in the English language:
Sentence A1 He was mad about stars at the age of nine.
Sentence A2 About 20,000 years ago, the last ice age ended.
In the above sentences, the word ‘age’ is ambiguous. In Sentence A1, the word
‘age’ means ‘how long something has existed’ and in Sentence A2 the word ‘age’
refers to ‘an era of history having some distinctive feature’.
For example, in the Assamese language:
Sentence A3 খনত মা বাহ পািত আে
(meeting khanat mAnuhe bAh pAti Ase)
People have assembled together in the meeting.
Sentence A4 েযাৱা কািলৰ জােক চৰাই বাহ েবাৰ ভািঙ েপলাে
(juwA kAli dhumuhA jAke sorAi bAh bur vAngi pelAle)
Last night storm has broken the birds’ nests.
In the above sentences, the word ‘bAh’ (বাহ) is ambiguous. In Sentence A3, the
word ‘bAh’ (বাহ) means ‘to gather together’ or ‘a drove’, and in Sentence A4 the
word ‘bAh’ (বাহ) refers to ‘a structure in which animals lay eggs or give birth to
their young ones’ or ‘a nest’.
WSD is one of the challenging tasks in the area of computational linguistics. It
came to be recognized as an important computational task in the late 1940s, with
the beginnings of machine translation [2]. Weaver introduced WSD in 1949 in his
well-known memorandum on machine translation, making machine translation the
area in which the first attempt at WSD was carried out [3]. According to Weaver,
context as well as statistical semantic studies play crucial parts in WSD [3].
WSD for Assamese language reported by Borah et al. [4] achieved 86% F1-measure
by combining four features (Unigram Co-occurrences, POS of Target Word, POS of
Next Word and Local Collocation) with a Naïve Bayes classifier. Another Naïve
Bayes classifier-based WSD task reported by Sarmah and Sarma [5] obtained an
accuracy of 71%, with a further 7% improvement gained by adopting an iterative
learning mechanism. The work by Sarmah and Sarma [6] used a Decision Tree model
for the Assamese WSD task and reported an average F-measure of 0.611 for 10
Assamese ambiguous words.
2 Methodology
Assamese is a highly inflectional language [7], and word sense disambiguation in
Assamese is difficult due to its rich morphology. A subset of homonymous and
polysemous words of the Assamese language has been selected for this research work.
In this project, the WSD system is designed based on a supervised learning approach,
which demands a large set of resources such as an annotated corpus and a lexical
database in the Assamese language.
The inputs to the Assamese WSD system are the training data (training corpus)
and the test data (test corpus). The system uses an external knowledge source, the
Lexicon, for feature extraction in both the training and testing phases. The
sense-tagged test corpus is the expected output of the system. In our previous work,
an F1-measure of 86% was obtained using the features Unigram Co-occurrence (UCO),
Parts of Speech of Target word (POST), Parts of Speech of Next word (POSN) and
Local Collocation (LC) [4]. In this research work, a new feature, Semantically
Related Words (SRW), has been employed in addition to the above-mentioned features.
2.1 Classification Process
The Naïve Bayes machine learning process, when combined with a richer set of
features, can help obtain high accuracies in classification [8]. Naïve Bayes is a
well-known and common classification approach; Gale, Church and Yarowsky were the
first researchers to use the Naïve Bayes technique for a WSD task, in their 1992
work ‘A Method for Disambiguating Word Senses in a Large Corpus’ [9].
The Naïve Bayes classifier works on the assumption that all the features used to
classify the test case are class-conditionally independent. If the feature vector
is defined by F = (f1, f2, …, fn) and the k possible senses of the ambiguous word
w are defined by S = {S1, S2, …, Sk}, then in order to classify the true sense of
w we have to find the sense Si that maximizes the conditional probability P(Si | F).
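The sense-selection step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the probability tables, senses and smoothing constant are hypothetical, and log-probabilities are summed to avoid floating-point underflow.

```python
import math

def classify_sense(features, senses, prior, cond_prob, use_prior=True):
    """Return the sense Si maximizing P(Si) * prod_j P(fj | Si),
    computed in log space to avoid floating-point underflow."""
    best_sense, best_score = None, float("-inf")
    for s in senses:
        score = math.log(prior[s]) if use_prior else 0.0
        for f in features:
            # a small default probability stands in for unseen (feature, sense) pairs
            score += math.log(cond_prob.get((f, s), 1e-6))
        if score > best_score:
            best_sense, best_score = s, score
    return best_sense

# toy probabilities for the two senses of 'bAh' in the examples above
senses = ["drove", "nest"]
prior = {"drove": 0.5, "nest": 0.5}
cond = {("meeting", "drove"): 0.8, ("meeting", "nest"): 0.1,
        ("storm", "drove"): 0.1, ("storm", "nest"): 0.7}
print(classify_sense(["meeting"], senses, prior, cond))  # drove
```

Setting use_prior=False corresponds to the "without prior probability" configuration reported later, where only the feature likelihoods decide the sense.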
2.2 Features
In supervised word sense disambiguation, where the corpus plays an important role,
the classification system becomes solely reliant on the information obtained from
the training corpus. Syntactic features are extracted from the POS information
incorporated in the training corpus, whereas semantic features can also be used
depending on the availability of a lexical database. The five features used in
this Assamese WSD system are described below.
Unigram Co-occurrences (UCO). Co-occurrences are word pairs that tend to occur in
the same context (not necessarily in a particular order) with a variable number of
intermediate words [10]. Assuming a window of size m (that is, m previous words
and m next words are available with respect to the target ambiguous word, where m
ranges from 1 to 5), a list of unigrams (most frequently appearing words) is
maintained in the Lexicon for every sense of the ambiguous words.
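The windowed context extraction described above can be sketched as follows; the token list and index are hypothetical, built from the transliteration of Sentence A3.

```python
def unigram_cooccurrences(tokens, target_index, m):
    """Collect the m words before and m words after the target
    ambiguous word; order is ignored and window edges are clipped."""
    left = tokens[max(0, target_index - m):target_index]
    right = tokens[target_index + 1:target_index + 1 + m]
    return left + right

# transliterated Sentence A3 with 'bAh' at index 3 and window m = 2
tokens = ["meeting", "khanat", "mAnuhe", "bAh", "pAti", "Ase"]
print(unigram_cooccurrences(tokens, 3, 2))  # ['khanat', 'mAnuhe', 'pAti', 'Ase']
```

These context words would then be matched against the per-sense unigram lists stored in the Lexicon.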
POS of Target Word (POST). All or some of the senses of a multi-semantic word may
appear in different parts of speech, and the contribution of this feature depends
on whether they do. This POS information is maintained in the Lexicon for every
sense of each ambiguous word, and the computation is done by likelihood estimation.
POS of Next Word (POSN). The structure of the Assamese language is such that it
often makes use of auxiliary verbs to indicate the action with regard to a noun,
and these auxiliary verbs frequently appear immediately after that noun. This is
why the part of speech of the next word (POSN) contributes to WSD. This information
is obtained from the training corpus by likelihood estimation.
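The likelihood estimation used for the two POS features can be sketched as relative-frequency counting over sense-tagged training pairs; the senses and tags below are invented for illustration.

```python
from collections import Counter

def pos_likelihoods(examples):
    """Estimate P(next-word POS | sense) as the relative frequency
    of each (sense, pos) pair seen in the sense-tagged corpus."""
    sense_counts = Counter(sense for sense, _ in examples)
    pair_counts = Counter(examples)
    return {pair: count / sense_counts[pair[0]]
            for pair, count in pair_counts.items()}

data = [("drove", "VERB"), ("drove", "VERB"), ("drove", "NOUN"),
        ("nest", "VERB")]
probs = pos_likelihoods(data)
print(probs[("drove", "VERB")])  # 2 of the 3 'drove' examples
```

The same counting scheme applies to POST, with the target word's own POS tag in place of the next word's.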
Local Collocation (LC). A collocation is a group of words that most often occur
together in a particular sequence and signal a distinct sense of a multi-semantic
word appearing among them [11]. Collocation makes a great contribution to WSD if
unique collocations exist for all or some of the possible senses of the
multi-semantic words. A list of collocations is maintained in the Lexicon for
every sense of each ambiguous word.
Semantically Related Words (SRW). Information about semantic relations has a
strong effect in most language processing tasks. Generally, for the task of WSD,
this information is collected from WordNet [12]. In our work, the required
semantic information is provided by the Lexicon, as an Assamese WordNet is not
readily available. This Lexicon contains synonyms of ambiguous words corresponding
to every particular sense, and for each synonym there is a separate entry in the
Lexicon containing information such as sense-id, POS, synonyms, unigrams and
collocations. The semantic closeness of the target ambiguous word with its
synonyms in a particular sense represents the closeness between the ambiguous
word and that particular sense. The relation between the synonyms and the context
is measured in terms of Unigram Co-occurrences. If the synonyms are also
multi-semantic words, then the same process is repeated.
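A minimal sketch of how the SRW feature might score a sense through its synonyms' unigram lists. The Lexicon fragments shown are hypothetical, and the recursion for synonyms that are themselves multi-semantic is omitted.

```python
def srw_score(context, sense_synonyms, unigram_lists):
    """Score a candidate sense by counting context words that also
    appear in the Lexicon unigram lists of the sense's synonyms."""
    score = 0
    for synonym in sense_synonyms:
        score += len(set(context) & set(unigram_lists.get(synonym, [])))
    return score

# context of Sentence A4 scored against the 'nest' sense of 'bAh'
context = ["storm", "broke", "birds", "nests"]
lexicon_unigrams = {"nest": ["birds", "eggs"], "den": ["animal", "cave"]}
print(srw_score(context, ["nest", "den"], lexicon_unigrams))  # 1
```

The sense whose synonyms share the most unigrams with the context would be judged semantically closest.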
3 Results and Discussion
The process of WSD starts with discovering the ambiguous words in the test corpus
and finally performs the classification of these ambiguous words. Once the WSD
task is completed, we need to focus on the performance measure of the system. The
accuracy of the system can be measured in terms of the following:
Fig. 1 F1-measure considering prior probability with respect to change in window
size (m)
Precision (P). Precision is the ratio of relevant results returned to the total
number of results returned.
Recall (R). Recall is the ratio of relevant results returned to the number of
relevant results possible.
F1-measure. The F1-measure is the harmonic mean of precision and recall:
F1-measure = 2 × (Precision × Recall)/(Precision + Recall)
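The three measures can be computed directly from the counts. The example counts below are hypothetical, chosen so that precision equals recall, as happens when every ambiguous word is attempted.

```python
def prf1(relevant_returned, total_returned, total_possible):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = relevant_returned / total_returned
    recall = relevant_returned / total_possible
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 123 correct out of 135 ambiguous words, all of them attempted
p, r, f1 = prf1(123, 135, 135)
print(round(f1 * 100, 2))  # 91.11
```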
The disambiguation process for the test corpus has been carried out several times,
considering the features in different combinations along with varying window sizes
(range 1–5) for the feature UCO. Moreover, for each combination of features we
have two sets of results, obtained with and without considering the prior
probabilities, as shown in Table 1.
For rows 2, 3, 5, 10, 12, 14 and 23 in Table 1, there is no change in F1-measure
with respect to the change in window size (m), as these combinations include
neither the UCO nor the SRW feature, the only features that rely on window size (m).
As shown in Fig. 1, the line diagram of F1-measure (considering prior probability)
with respect to the change in window size (m) clearly indicates that the maximum
value of F1-measure is obtained for the features UCO and SRW at m = 4 and m = 2,
respectively.
As shown in Fig. 2, the line diagram of F1-measure (without considering prior
probability) with respect to the change in window size (m) indicates that the
maximum value of F1-measure is obtained for the features UCO and SRW at m = 3, 4
and m = 2, respectively. The combination of UCO and SRW gives the same accuracy
values for m = 3 and m = 4 (Row 8 in Table 1).
The bar diagram in Fig. 3, representing the F1-measures for the individual
features UCO, POST, POSN, SRW and LC when m = 3, indicates that the contribution
of UCO is the highest among the features. It is also found that, for window size
m = 3, the F1-measure without prior probability exceeds the F1-measure with prior
probability 76% of the time.
Figure 4 represents the F1-measures, without considering prior probabilities and
with window size m = 3, for a selected set of feature combinations. The
combination of all five features together gives an accuracy of 91.1%.
Table 1 F1-measure for different combinations of features and window sizes of UCO

S. No.  Combination of features     With prior probability          Without prior probability
                                    m=1   m=2   m=3   m=4   m=5     m=1   m=2   m=3   m=4   m=5
1       UCO                         74    75.5  77    79.2  77.7    79.2  82.2  82.9  82.9  81.4
2       POST                        67.4  67.4  67.4  67.4  67.4    62.9  62.9  62.9  62.9  62.9
3       POSN                        59.2  59.2  59.2  59.2  59.2    58.5  58.5  58.5  58.5  58.5
4       SRW                         62.2  63.7  62.9  62.9  60      57    59.2  57.7  57.7  56.2
5       LC                          66.6  66.6  66.6  66.6  66.6    68.8  68.8  68.8  68.8  68.8
6       UCO+POST                    80.7  82.9  83.7  84.4  82.9    82.9  85.9  88.1  89.6  88.1
7       UCO+POSN                    70.3  73.3  76    77    80      74.8  76.2  80    81.4  80
8       UCO+SRW                     77    79.2  80.7  80.7  77.7    79.2  82.2  82.2  82.2  80
9       UCO+LC                      82.2  84.4  85.9  86.6  85.9    82.9  85.9  88.1  88.8  88.1
10      POST+POSN                   66.6  66.6  66.6  66.6  66.6    68.1  68.1  68.1  68.1  68.1
11      POST+SRW                    72.5  74.8  74.8  74    71.8    71.8  74    73.3  73.3  71.8
12      POST+LC                     73.3  73.3  73.3  73.3  73.3    74.8  74.8  74.8  74.8  74.8
13      POSN+SRW                    60.7  62.9  62.9  62.9  62.2    61.4  62.9  62.9  62.9  62.2
14      POSN+LC                     67.4  67.4  67.4  67.4  67.4    70.3  70.3  70.3  70.3  70.3
15      SRW+LC                      74    75.5  75.5  74.8  73.3    75.5  77.7  76.2  75.5  74.8
16      UCO+POST+POSN               75.5  79.2  80.7  81.4  80.7    80    80.7  83.7  83.7  82.2
17      UCO+POST+SRW                82.9  85.1  85.9  85.1  82.9    82.9  85.9  87.4  88.8  86.6
18      UCO+POST+LC                 86.6  88.1  88.8  89.6  88.8    85.1  88.8  90.3  91.1  90.3
19      UCO+POSN+SRW                74    77    87.5  87.5  76.2    77    78.5  80.7  80    78.5
20      UCO+POSN+LC                 76.2  79.2  82.2  84.4  84.4    82.2  84.4  88.1  88.1  87.4
21      UCO+SRW+LC                  82.9  85.9  87.4  86.6  85.9    82.9  85.9  86.6  86.6  85.9
22      POST+POSN+SRW               68.8  71.1  71.1  71.1  71.1    70.3  69.6  71.8  71.8  71.1
23      POST+POSN+LC                73.3  73.3  73.3  73.3  73.3    75.5  75.5  75.5  75.5  75.5
24      POSN+SRW+LC                 69.6  71.8  72.5  73.3  73.3    74    75.5  75.5  75.5  74.8
25      UCO+POST+POSN+SRW           77.7  81.4  82.9  82.9  81.4    80.7  81.4  83.7  83.7  82.2
26      UCO+POST+POSN+LC            80.7  83.7  85.9  86.6  86.6    85.1  85.9  88.8  90.3  89.6
27      POST+POSN+SRW+LC            74.8  77    77    77    77      77    78.5  79.2  80    79.2
28      POSN+SRW+LC+UCO             80.7  83.7  85.9  85.9  85.1    83.7  85.9  88.1  87.4  86.6
29      UCO+POST+LC+SRW             86.6  88.1  88.8  88.8  88.1    85.1  88.8  89.6  90.3  89.6
30      ALL                         82.9  85.9  88.1  88.1  87.4    88.1  88.8  91.1  91.1  90.3
Fig. 2 F1-measure without prior probability with respect to change in window size (m)
Fig. 3 F1-measures of individual features with prior probability (a) and without
prior probability (b)
Fig. 4 Change in F1-measure with respect to addition of features
4 Conclusion
In this paper, we reported an advancement in our Assamese WSD task through the
addition of a semantic feature to the Naïve Bayes classification process. A number
of experiments have been carried out considering different combinations of
features to disambiguate 135 multi-semantic words present in the test corpus
(size 1300 words). Half of the experiments were carried out using prior
probability and the other half without it. Across all these experiments, the
obtained accuracy values range from 56.2 to 91.11%. The highest F1-measure of
91.11% (without considering prior probability) was obtained using all five
features UCO, POST, POSN, SRW and LC with window size m = 3. However, with prior
probability for the same set of features and window size, the F1-measure obtained
was 88.1%. By incorporating the Semantically Related Words (SRW) feature in our
existing feature set, the overall performance of the system has been improved to
91.11% in terms of F1-measure, compared to 86% for the previously developed
system. The basic drawback of this system is the size of the training and test
corpora, due to which the current system is suitable only for a selected set of
nouns, adjectives, verbs, pronouns and quantifiers with limited morphological
variation. A larger set of words of different parts of speech can be included in
the word sense disambiguation task with the use of WordNet and more extensive
handling of morphology. The contribution of the features and the effect of prior
probability discussed in this paper will be helpful for future work in Assamese WSD.
References
1. Jurafsky, D.: Speech & Language Processing. Pearson Education, India (2000)
2. Kaplan, A.: An experimental study of ambiguity and context. Mech. Transl. 2(2), 39–46 (1955)
3. Weaver, W.: Translation. Mach. Transl. Lang. 14, 15–23 (1955)
4. Borah, P.P., Talukdar, G., Baruah, A.: Assamese word sense disambiguation using supervised
learning. In: 2014 International Conference on Contemporary Computing and Informatics
(IC3I). IEEE (2014)
5. Sarmah, J., Sarma, S.K.: Word sense disambiguation for Assamese. In: 2016 IEEE 6th Inter-
national Conference on Advanced Computing (IACC). IEEE (2016)
6. Sarmah, J., Sarma, S.K.: Decision tree based supervised word sense disambiguation for
Assamese. Int. J. Comput. Appl. 141(1) (2016)
7. Sharma, P., Sharma, U., Kalita, J.: Suffix stripping based NER in Assamese for location names.
In: 2012 2nd National Conference on Computational Intelligence and Signal Processing (CISP).
IEEE (2012)
8. Le, C.A., Shimazu, A.: High WSD accuracy using Naïve Bayesian classifier with rich features.
In: Proceedings of the 18th Pacific Asia Conference on Language, Information and Computation
(2004)
9. Gale, W.A., Church, K.W., Yarowsky, D.: A method for disambiguating word senses in a large
corpus. Comput. Humanit. 26(5-6), 415–439 (1992)
10. Smadja, F.A.: Lexical co-occurrence: The missing link. Literary Linguist. Comput. 4(3),
163–168 (1989)
11. Yarowsky, D.: One sense per collocation. Pennsylvania University Philadelphia, Department
of Computer and Information Science (1993)
12. Pedersen, T., Patwardhan, S., Michelizzi, J.: WordNet::Similarity: measuring
the relatedness of concepts. In: Demonstration Papers at HLT-NAACL 2004.
Association for Computational Linguistics (2004)