WSD for Assamese Language
Pranjal Protim Borah, Gitimoni Talukdar and Arup Baruah
Abstract Word sense ambiguity arises when a lexeme is associated with more than one sense. In this research work, an improvement has been proposed and evaluated for our previously developed Assamese Word-Sense Disambiguation (WSD) system, in which the potential of semantic features was explored only to a limited extent. Since semantic relationship information benefits most natural language processing (NLP) tasks, the system presented here follows a supervised learning approach using a Naïve Bayes classifier with both syntactic and semantic features. By incorporating the Semantically Related Words (SRW) feature into our feature set, the overall system performance has improved to 91.11% F1-measure, compared to 86% for the previously developed system.
Keywords Word sense ambiguity · Naïve Bayes classifier · Semantic feature · Corpus · Prior probability
1 Introduction
The existence of multiple senses for a single lexeme is a common characteristic of natural languages. This property of words creates ambiguity
P. P. Borah (B)
Department of Design, Indian Institute of Technology Guwahati, Guwahati, India
e-mail: pranjalborah777@gmail.com
G. Talukdar
Department of Computer Science and Engineering, Royal Group of Institutions,
Guwahati, India
e-mail: talukdargitimoni@gmail.com
A. Baruah
Department of Computer Science and Engineering, Assam Don Bosco University,
Guwahati, India
e-mail: arup.baruah@gmail.com
© Springer Nature Singapore Pte Ltd. 2019
J. Kalita et al. (eds.), Recent Developments in Machine Learning and Data Analytics,
Advances in Intelligent Systems and Computing 740,
https://doi.org/10.1007/978-981-13-1280-9_11
by allowing the reader to deduce more than one meaning for a single word. Most of the time it is easy for a human to identify the appropriate sense, but for automated language processing systems this disambiguation becomes a critical task. In computational linguistics, WSD is the task of determining the correct sense of a word that is ambiguous in its context.
Ambiguity is the quality of having more than one permissible interpretation. In
computational linguistics, a sentence is said to be ambiguous if it can be understood
in two or more possible ways [1]. Word sense ambiguity occurs when a word or
a lexeme is associated with more than one meaning or sense. It is a long-standing
problem in NLP, which has been discussed in reference to machine translation [2].
For example, in the English language:
Sentence A1 He was mad about stars at the age of nine.
Sentence A2 About 20,000 years ago, the last ice age ended.
In the above sentences, the word ‘age’ is ambiguous. In Sentence A1, the word
‘age’ means ‘how long something has existed’ and in Sentence A2 the word ‘age’
refers to ‘an era of history having some distinctive feature’.
For example, in the Assamese language:
Sentence A3 মিটিং খনত মানুহে বাহ পাতি আছে
(meeting khanat mAnuhe bAh pAti Ase)
People have assembled together in the meeting.
Sentence A4 যোৱা কালিৰ ধুমুহা জাকে চৰাই বাহ বোৰ ভাঙি পেলালে
(juwA kAli dhumuhA jAke sorAi bAh bur vAngi pelAle)
Last night's storm broke the birds' nests.
In the above sentences, the word ‘bAh’ (বাহ) is ambiguous. In Sentence A3, the word ‘bAh’ (বাহ) means ‘to gather together’ or ‘a drove’, and in Sentence A4 the word ‘bAh’ (বাহ) refers to ‘a structure in which animals lay eggs or give birth to their young ones’, or ‘a nest’.
WSD is one of the challenging tasks in the area of computational linguistics. In the late 1940s, WSD came to be recognized as an important computational task, especially with the advent of machine translation [2]. In 1949, Weaver introduced WSD in his well-known memorandum on machine translation [3]. Machine translation is the area in which the first attempt to perform WSD was carried out [3]. According to Weaver, both context and statistical semantic studies play crucial parts in WSD [3].
WSD for the Assamese language reported by Borah et al. [4] achieved 86% F1-measure by incorporating four features (Unigram Co-occurrences, POS of Target Word, POS of Next Word and Local Collocation) with a Naïve Bayes classifier. Another Naïve Bayes classifier-based WSD task, reported by Sarmah and Sarma [5], obtained an accuracy of 71%, a 7% improvement achieved by adopting an iterative learning mechanism. The work by Sarmah and Sarma [6] used a Decision Tree model for the Assamese WSD task and reported an average F-measure of 0.611 for 10 Assamese ambiguous words.
2 Methodology
Assamese is a highly inflectional language [7]. Word sense disambiguation in Assamese is difficult due to its rich morphology. A subset of homonymous and polysemous words of the Assamese language has been selected for this research work. The WSD system is designed following a supervised learning approach, which demands a large set of resources such as an annotated corpus and a lexical database in Assamese.
The inputs to the Assamese WSD system are the training data (training corpus) and the test data (test corpus). The system uses an external knowledge source, the Lexicon, for feature extraction in both the training and testing phases. The sense-tagged test corpus is the expected output of the system. In our previous work, 86% F1-measure was obtained using the features Unigram Co-occurrence (UCO), Parts of Speech of Target word (POST), Parts of Speech of Next word (POSN) and Local Collocation (LC) [4]. In this research work, a new feature, Semantically Related Words (SRW), has been employed in addition to the above-mentioned features.
2.1 Classification Process
The Naïve Bayes machine learning process, when combined with a richer set of features, can achieve high classification accuracy [8]. Naïve Bayes is a well-known and widely used classification approach; Gale, Church and Yarowsky were the first researchers to use the Naïve Bayes technique for a WSD task, in their 1992 work ‘A Method for Disambiguating Word Senses in a Large Corpus’ [9].
The Naïve Bayes classifier works on the assumption that all the features used to classify the test case are class-conditionally independent. If the feature vector is defined by F = (f1, f2, ..., fn) and the k possible senses of the ambiguous word are defined by S = (S1, S2, ..., Sk), then to classify the true sense of the target ambiguous word w, we have to find the sense Si that maximizes the conditional probability P(w = Si | F).
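For concreteness, the following Python sketch shows this decision rule with add-one smoothed, count-based likelihood estimates and an optional prior term (mirroring the with/without prior probability runs reported in Sect. 3). It is a minimal illustration under our own assumptions, not the authors' implementation; all names are illustrative.

import math
from collections import defaultdict

class NaiveBayesWSD:
    # Picks the sense Si that maximizes
    # log P(Si) + sum_j log P(fj | Si), i.e. the decision rule above.
    def __init__(self):
        self.sense_count = defaultdict(int)                          # C(Si)
        self.feature_count = defaultdict(lambda: defaultdict(int))   # C(fj, Si)
        self.vocab = set()

    def train(self, examples):
        # examples: (feature list, sense) pairs from the sense-tagged corpus
        for features, sense in examples:
            self.sense_count[sense] += 1
            for f in features:
                self.feature_count[sense][f] += 1
                self.vocab.add(f)

    def classify(self, features, use_prior=True):
        total = sum(self.sense_count.values())
        best_sense, best_score = None, float("-inf")
        for sense, count in self.sense_count.items():
            # log prior; the experiments in Sect. 3 also omit this term
            score = math.log(count / total) if use_prior else 0.0
            denom = sum(self.feature_count[sense].values()) + len(self.vocab)
            for f in features:
                # add-one smoothing for unseen feature/sense pairs
                score += math.log((self.feature_count[sense][f] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense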
2.2 Features
In supervised word sense disambiguation, where the corpus plays an important role, the classification system becomes solely reliant on the information obtained from the training corpus. Syntactic features are extracted from the POS information incorporated in the training corpus, whereas semantic features can also be used depending on the availability of a lexical database. The five features used in this Assamese WSD system are described below.
Unigram Co-occurrences (UCO). Co-occurrences are word pairs that tend to occur in the same context (not necessarily in a particular order) with a variable number of intermediate words [10]. Assuming a window of size m (that is, m previous words and m next words are available with respect to the target ambiguous word, where m ranges from 1 to 5), a list of unigrams (most frequently appearing words) is maintained in the Lexicon for every sense of the ambiguous words.
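As a minimal sketch (our illustration, not the authors' code), the UCO window around a target word could be extracted as follows, assuming a pre-tokenized sentence:

def unigram_cooccurrences(tokens, target_index, m=3):
    # Up to m words on each side of the target word (window size m),
    # order-insensitive, as described for the UCO feature above.
    left = tokens[max(0, target_index - m):target_index]
    right = tokens[target_index + 1:target_index + 1 + m]
    return set(left) | set(right)

# Example with the transliterated Sentence A4 and window size m = 2:
tokens = "juwA kAli dhumuhA jAke sorAi bAh bur vAngi pelAle".split()
print(unigram_cooccurrences(tokens, tokens.index("bAh"), m=2))
# {'jAke', 'sorAi', 'bur', 'vAngi'}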
POS of Target Word (POST). All or some of the senses of a multi-semantic word may occur as different parts of speech. The contribution of this feature therefore depends on whether or not the possible senses of a multi-semantic word have different POS. This POS information is maintained in the Lexicon for every sense of each ambiguous word, and the computation is done by likelihood estimation.
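For illustration, the likelihood estimate P(POS | sense) can be read off training-corpus counts; this small helper is hypothetical, not the authors' code:

from collections import Counter

def pos_given_sense(pairs):
    # pairs: (POS tag, sense) observations for the target word in the
    # training corpus; returns the relative-frequency estimate.
    pair_counts = Counter(pairs)
    sense_counts = Counter(sense for _, sense in pairs)
    return lambda pos, sense: pair_counts[(pos, sense)] / sense_counts[sense]

est = pos_given_sense([("NN", "nest"), ("NN", "nest"), ("VB", "gather")])
print(est("NN", "nest"))  # 1.0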
POS of Next Word (POSN). The structure of the Assamese language is such that it often makes use of auxiliary verbs to indicate the action associated with a noun. Frequently, these auxiliary verbs appear in the position immediately following that noun. This is why the part of speech of the next word (POSN) contributes to WSD. This information is obtained from the training corpus by likelihood estimation.
Local Collocation (LC). A collocation is a group of words that most often occur together in a particular sequence, and it signals a distinct sense for a multi-semantic word appearing within that sequence [11]. Collocation makes a great contribution to WSD if unique collocations exist for all or some of the possible senses of the multi-semantic words. A list of collocations is maintained in the Lexicon for every sense of each ambiguous word.
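Unlike UCO, word order matters for collocations; a short illustrative sketch (assumed layout, not the authors' code) follows:

def local_collocation(tokens, target_index, left=1, right=1):
    # Ordered sequence of words immediately around the target word,
    # e.g. (previous word, target word, next word) for left = right = 1.
    start = max(0, target_index - left)
    return tuple(tokens[start:target_index + right + 1])

tokens = "sorAi bAh pAti Ase".split()
print(local_collocation(tokens, 1))  # ('sorAi', 'bAh', 'pAti')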
Semantically Related Words (SRW). Information about semantic relations has a considerable effect on most language processing tasks. Generally, for the task of WSD, this information is collected from WordNet [12]. In our work, the required semantic information is provided by the Lexicon, as an Assamese WordNet is not readily available. This Lexicon contains synonyms of ambiguous words corresponding to every particular sense, and for each synonym there is a separate entry in the Lexicon containing information such as sense-id, POS, synonyms, unigrams and collocations. The semantic closeness of the target ambiguous word to its synonyms in a particular sense represents the closeness between the ambiguous word and that particular sense. The relation between the synonyms and the context is measured in terms of Unigram Co-occurrences. If the synonyms are themselves multi-semantic words, the same process is repeated.
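One plausible realization of this recursive synonym lookup is sketched below. The Lexicon layout (a dictionary of entries holding synonym, unigram and sense lists) is our assumption for illustration, not the paper's actual data structure:

def srw_score(context_words, sense_entry, lexicon, depth=1):
    # sense_entry: Lexicon record for one sense of the ambiguous word,
    # assumed to hold {"synonyms": [...]}. Each synonym has its own
    # Lexicon entry with "unigrams" and, if multi-semantic, "senses".
    score = 0
    for syn in sense_entry.get("synonyms", []):
        entry = lexicon.get(syn, {})
        # synonym-context relation measured via unigram co-occurrences
        score += sum(1 for w in context_words if w in entry.get("unigrams", ()))
        # if the synonym is itself multi-semantic, repeat the process
        # (recursion bounded here to keep the sketch well behaved)
        if depth > 0:
            for sub_sense in entry.get("senses", []):
                score += srw_score(context_words, sub_sense, lexicon, depth - 1)
    return score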
3 Results and Discussion
The process of WSD starts with the phase of discovering ambiguous words in the test corpus and finally performs the classification of these ambiguous words. Once the WSD task is completed, we need to measure the performance of the system. The accuracy of the system is measured in terms of the following:
Fig. 1 F1-measure considering prior probability with respect to change in window size (m) [line chart; x-axis: window size, y-axis: F1-measure]
Precision (P). Precision is the ratio of relevant results returned to the total number of results returned.
Recall (R). Recall is the ratio of relevant results returned to the number of results possible.
F1-measure. F1-measure is calculated as the harmonic mean of precision and recall:
F1-measure = 2 × (Precision × Recall) / (Precision + Recall)
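As a quick worked example (the counts are illustrative, chosen only to reproduce the headline figure), these measures can be computed directly:

def evaluate(relevant_returned, total_returned, total_possible):
    precision = relevant_returned / total_returned
    recall = relevant_returned / total_possible
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 123 of 135 ambiguous words disambiguated correctly (illustrative):
p, r, f1 = evaluate(123, 135, 135)
print(f"P = {p:.4f}, R = {r:.4f}, F1 = {f1:.4f}")  # F1 = 0.9111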
The disambiguation process for the test corpus has been carried out several times by considering the features in different combinations along with varying window sizes (range 1–5) for the feature UCO. Moreover, for each combination of features, we have two sets of results, obtained with and without considering the prior probabilities, as shown in Table 1.
For rows 2, 3, 5, 10, 12, 14 and 23 in Table 1, there is no change in F1-measure with respect to the change in window size (m), as these combinations of features include neither the UCO nor the SRW feature, the two features that depend on window size (m).
As shown in Fig. 1, the line diagram of F1-measure (considering prior probability) with respect to the change in window size (m) clearly indicates that the maximum values of F1-measure for the features UCO and SRW are obtained at m = 4 and m = 2, respectively.
As shown in Fig. 2, the line diagram of F1-measure (without considering prior probability) with respect to the change in window size (m) indicates that the maximum values of F1-measure for the features UCO and SRW are obtained at m = 3, 4 and m = 2, respectively. However, the combination of UCO and SRW gives the same accuracy for m = 3 and m = 4 (Row 8 in Table 1).
The bar diagram in Fig. 3, representing the F1-measures for the individual features UCO, POST, POSN, SRW and LC when m = 3, indicates that the contribution of UCO is the highest compared to the other features. It is also found that, for window size m = 3, the F1-measure without prior probability is greater than the F1-measure with prior probability 76% of the time.
Figure 4 represents the F1-measures without considering prior probabilities, with window size m = 3, for a selected set of feature combinations. The combination of all five features together gives an accuracy measure of 91.1%.
Table 1 F1-measure for different combinations of features and window sizes of UCO

S. No.  Combination of features   F1-measure with prior probability    F1-measure without prior probability
                                  m=1   m=2   m=3   m=4   m=5          m=1   m=2   m=3   m=4   m=5
1       UCO                       74    75.5  77    79.2  77.7         79.2  82.2  82.9  82.9  81.4
2       POST                      67.4  67.4  67.4  67.4  67.4         62.9  62.9  62.9  62.9  62.9
3       POSN                      59.2  59.2  59.2  59.2  59.2         58.5  58.5  58.5  58.5  58.5
4       SRW                       62.2  63.7  62.9  62.9  60           57    59.2  57.7  57.7  56.2
5       LC                        66.6  66.6  66.6  66.6  66.6         68.8  68.8  68.8  68.8  68.8
6       UCO+POST                  80.7  82.9  83.7  84.4  82.9         82.9  85.9  88.1  89.6  88.1
7       UCO+POSN                  70.3  73.3  76    77    80           74.8  76.2  80    81.4  80
8       UCO+SRW                   77    79.2  80.7  80.7  77.7         79.2  82.2  82.2  82.2  80
9       UCO+LC                    82.2  84.4  85.9  86.6  85.9         82.9  85.9  88.1  88.8  88.1
10      POST+POSN                 66.6  66.6  66.6  66.6  66.6         68.1  68.1  68.1  68.1  68.1
11      POST+SRW                  72.5  74.8  74.8  74    71.8         71.8  74    73.3  73.3  71.8
12      POST+LC                   73.3  73.3  73.3  73.3  73.3         74.8  74.8  74.8  74.8  74.8
13      POSN+SRW                  60.7  62.9  62.9  62.9  62.2         61.4  62.9  62.9  62.9  62.2
14      POSN+LC                   67.4  67.4  67.4  67.4  67.4         70.3  70.3  70.3  70.3  70.3
15      SRW+LC                    74    75.5  75.5  74.8  73.3         75.5  77.7  76.2  75.5  74.8
16      UCO+POST+POSN             75.5  79.2  80.7  81.4  80.7         80    80.7  83.7  83.7  82.2
17      UCO+POST+SRW              82.9  85.1  85.9  85.1  82.9         82.9  85.9  87.4  88.8  86.6
18      UCO+POST+LC               86.6  88.1  88.8  89.6  88.8         85.1  88.8  90.3  91.1  90.3
19      UCO+POSN+SRW              74    77    87.5  87.5  76.2         77    78.5  80.7  80    78.5
20      UCO+POSN+LC               76.2  79.2  82.2  84.4  84.4         82.2  84.4  88.1  88.1  87.4
21      UCO+SRW+LC                82.9  85.9  87.4  86.6  85.9         82.9  85.9  86.6  86.6  85.9
22      POST+POSN+SRW             68.8  71.1  71.1  71.1  71.1         70.3  69.6  71.8  71.8  71.1
23      POST+POSN+LC              73.3  73.3  73.3  73.3  73.3         75.5  75.5  75.5  75.5  75.5
24      POSN+SRW+LC               69.6  71.8  72.5  73.3  73.3         74    75.5  75.5  75.5  74.8
25      UCO+POST+POSN+SRW         77.7  81.4  82.9  82.9  81.4         80.7  81.4  83.7  83.7  82.2
26      UCO+POST+POSN+LC          80.7  83.7  85.9  86.6  86.6         85.1  85.9  88.8  90.3  89.6
27      POST+POSN+SRW+LC          74.8  77    77    77    77           77    78.5  79.2  80    79.2
28      POSN+SRW+LC+UCO           80.7  83.7  85.9  85.9  85.1         83.7  85.9  88.1  87.4  86.6
29      UCO+POST+LC+SRW           86.6  88.1  88.8  88.8  88.1         85.1  88.8  89.6  90.3  89.6
30      ALL                       82.9  85.9  88.1  88.1  87.4         88.1  88.8  91.1  91.1  90.3
Fig. 2 F1-measure without prior probability with respect to change in window size (m) [line chart; x-axis: window size, y-axis: F1-measure]
Fig. 3 Features with prior probability (a) and without prior probability (b) [bar charts; y-axes: (a) F1-measure with prior probability, (b) F1-measure without prior probability]
Fig. 4 Change in F1-measure with respect to addition of features [bar chart; y-axis: F1-measure]
4 Conclusion
In this paper, we reported an improvement of our Assamese WSD system obtained by adding a semantic feature to the Naïve Bayes classification process. A number of experiments were carried out considering different combinations of features to disambiguate 135 multi-semantic words present in the test corpus (size 1300 words). Half of the experiments were carried out using the prior probability and the other half without it. Across all these experiments, the obtained accuracy values range from 56.2% to 91.11%. The highest F1-measure of 91.11% (without using the prior probability) was obtained using all five features UCO, POST, POSN, SRW and LC with window size m = 3. Using the prior probability with the same set of features and window size, the F1-measure obtained was 88.1%. The performance of the overall system has thus improved to 91.11% F1-measure, compared to 86% for the previously developed system, by incorporating the Semantically Related Words (SRW) feature into our existing feature set. The basic drawback of this system is the small size of the training and test corpora, due to which the current system is suitable only for a selected set of nouns, adjectives, verbs, pronouns and quantifiers that are less affected by morphology. With the use of a WordNet and more extensive handling of morphology, a larger set of words of different parts of speech can be included in the word sense disambiguation task. The contribution of the features and the effect of prior probability discussed in this paper will be helpful for future work in Assamese WSD.
References
1. Jurafsky, D.: Speech & Language Processing. Pearson Education, India (2000)
2. Kaplan, A.: An experimental study of ambiguity and context. Mech. Transl. 2(2), 39–46 (1955)
3. Weaver, W.: Translation. Mach. Transl. Lang. 14, 15–23 (1955)
4. Borah, P.P., Talukdar, G., Baruah, A.: Assamese word sense disambiguation using supervised
learning. In: 2014 International Conference on Contemporary Computing and Informatics
(IC3I). IEEE (2014)
5. Sarmah, J., Sarma, S.K.: Word sense disambiguation for Assamese. In: 2016 IEEE 6th Inter-
national Conference on Advanced Computing (IACC). IEEE (2016)
6. Sarmah, J., Sarma, S.K.: Decision tree based supervised word sense disambiguation for
Assamese. Int. J. Comput. Appl. 141(1) (2016)
7. Sharma, P., Sharma, U., Kalita, J.: Suffix stripping based NER in Assamese for location names.
In: 2012 2nd National Conference on Computational Intelligence and Signal Processing (CISP).
IEEE (2012)
8. Le, C.A., Shimazu, A.: High WSD accuracy using Naïve Bayesian classifier with rich features.
In: Proceedings of the 18th Pacific Asia Conference on Language, Information and Computation
(2004)
9. Gale, W.A., Church, K.W., Yarowsky, D.: A method for disambiguating word senses in a large
corpus. Comput. Humanit. 26(5-6), 415–439 (1992)
10. Smadja, F.A.: Lexical co-occurrence: The missing link. Literary Linguist. Comput. 4(3),
163–168 (1989)
11. Yarowsky, D.: One sense per collocation. Pennsylvania University Philadelphia, Department
of Computer and Information Science (1993)
12. Pedersen, T., Patwardhan, S., Michelizzi, J.: WordNet::Similarity: measuring the relatedness of concepts. In: Demonstration Papers at HLT-NAACL 2004. Association for Computational Linguistics (2004)