Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pages 353–356,
Los Angeles, California, June 2010. c ?2010 Association for Computational Linguistics
The Effect of Ambiguity on the Automated Acquisition of WSD Examples
Mark Stevenson and Yikun Guo
Department of Computer Science,
University of Sheffield,
Regent Court, 211 Portobello,
Sheffield, S1 4DP
firstname.lastname@example.org and email@example.com
Several methods for automatically gen-
erating labeled examples that can be
used as training data for WSD systems
have been proposed, including a semi-
supervised approach based on relevance
feedback (Stevenson et al., 2008a). This
approach was shown to generate examples
that improved the performance of a WSD
system for a set of ambiguous terms from
the biomedical domain. However, we find
on other data sets. The levels of ambigu-
ity in these data sets are analysed and we
suggest this is the reason for this negative
Several studies, for example (Mihalcea et al.,
2004; Pradhan et al., 2007), have shown that su-
pervised approaches to Word Sense Disambigua-
tion (WSD) outperform unsupervised ones. But
these rely on labeled training data which is diffi-
cult to create and not always available (e.g. (Wee-
ber et al., 2001)). Various techniques for creating
labeled training data automatically have been sug-
gested in the literature. Stevenson et al. (2008a)
describe a semi-supervised approach that used rel-
evance feedback (Rocchio, 1971) to analyse ex-
isting labeled examples and use the information
produced to generate further ones. The approach
was tested on the biomedical domain and the addi-
tional examples found to improve performance of
a WSD system. However, biomedical documents
represent a restricted domain. In this paper the
same approach is tested against two data sets that
are not limited to a single domain.
2Application to a Range of Data Sets
In this paper the relevance feedback approach de-
ing three data sets: the NLM-WSD corpus (Wee-
ber et al., 2001) which Stevenson et al. (2008a)
used for their experiments, the Senseval-3 lexical
sample task (Mihalcea et al., 2004) and the coarse-
grained version of the SemEval English lexical
sample task (Pradhan et al., 2007).
2.1 Generating Examples
To generate examples for a particular sense of an
ambiguous term all of the examples where the
term is used in that sense are considered to be
any other sense of the term is used are considered
to be “irrelevant documents”.
back (Rocchio, 1971) is used to generate a set of
query terms designed to identify relevant docu-
ments, and therefore instances of the sense. The
top five query terms are used to retrieve docu-
ments and these are used as labeled examples of
the sense. Further details of this process are de-
scribed by Stevenson et al. (2008a).
This process requires a collection of documents
that can be queried to generate the additional
examples.For the NLM-WSD data set we
used PubMed, a database of biomedical journal
abstracts queried using the Entrez retrieval sys-
sites/gquery). The British National Corpus
(BNC) was used for Senseval-3 and SemEval.1
Lucene (http://lucene.apache.org) was
used to index the BNC and retrieve examples.
1We also experimented with the English WaCky corpus
(Baroni et al., 2009) which contains nearly 2 billion words
automatically retrieved from the web. However, results were
not as good as when the BNC was used.
We use a WSD system that has been shown to
perform well when evaluated against ambiguities
found in both general text and the biomedical do-
main (Stevenson et al., 2008b). Medical Subject
Headings (MeSH), a controlled vocabulary used
and used as additional features for the NLM-WSD
data set since they have been shown to improve
performance. The features are combined using
the Vector Space Model, a simple memory-based
Experiments were carried out comparing perfor-
mance when the WSD system was trained using
either the examples in the original data set (orig-
inal), the examples generated from these using
the relevance feedback approach (additional) or a
combination of these (combined). The Senseval-
3 and SemEval corpora are split into training and
test portions so the training portion is used as the
original data set and the WSD system evaluated
against the held-back data. As there is no such
recognised standard split for the NLM-WSD cor-
pus, 10-fold cross-validation was used. For each
fold the training portion is used as the original data
set and automatically generated examples created
by examining just that part of the data. Evaluation
is carried out against the fold’s test data and the
average result across the 10 folds reported.
Table 1 shows the results of this experiment.2
Examples generated using the relevance feedback
approach only improve results for one data set, the
NLM-WSD corpus. In this case there is a sig-
nificant improvement (Mann-Whitney, p < 0.01)
when the original and automatically generated ex-
amples are combined. There is no such improve-
ment for the other two data sets: WSD results us-
ing the additional data are noticeably worse than
when the original data is used alone and, although
performance improves when these examples are
combined with the original data, results are still
lower than using the original data. When exam-
ples are combined there is a drop in performance
of 1.2% and 2.9% for SemEval and Senseval-3 re-
2Results reported here for the NLM-WSD corpus are
slightly different from those reported by (Stevenson et al.,
2008a). We used an additional feature (MeSH headings),
which improved the baseline performance, and more query
terms which improved the quality of the additional examples
for all three data sets.
Table 1: Results of relevance feedback approach
applied to three data sets
These results indicate that the relevance feed-
back approach described by Stevenson et al.
(2008a) is not able to generate useful examples for
the Senseval-3 and SemEval data sets, although it
can for the NLM-WSD data set. We hypothesise
that these corpora contain different levels of ambi-
guity which effect suitability of the approach.
3 Analysis of Ambiguities
The three data sets are compared using measures
designed to determine the level of ambiguity they
contain. Section 3.1 reports results using various
widely used measures based on the distribution of
senses. Section 3.2 introduces a measure based
on the semantic similarity between the possible
senses of ambiguous terms.
Three measures for characterising the difficulty of
WSD data sets based on their sense distribution
were used. The first is the widely applied most
frequent sense (MFS) baseline (McCarthy et al.,
2004), i.e. the proportion of examples for an am-
biguous term that are labeled with the commonest
sense. The second is number of senses per am-
biguous term. The final measure, the entropy of
thesense distribution, has beenshown tobea good
indication of disambiguation difficulty (Kilgarriff
and Rosenzweig, 2000). For two of these mea-
sures (number of senses and entropy) a higher fig-
ure indicates greater ambiguity while for the MFS
measure a lower figure indicates a more difficult
Table 2 shows the results of computing these
measures averaged across all terms in the cor-
pus. For two measures (number of senses and en-
tropy) the NLM-WSD corpus is least ambiguous,
Senseval-3 the most ambiguous with SemEval be-
tween them. The MFS scores are very similar for
two data sets (NLM-WSD and SemEval), both of
which are much higher than for Senseval-3.
These measures suggest that the NLM-WSD
corpus is less ambiguous than the other two and
also that the Senseval-3 corpus is the most am-
biguous of the three.
Table 2: Properties of Data Sets using sense distri-
We also developed a measure that takes into ac-
count the similarity in meaning between the possi-
ble senses for an ambiguous term. This measure is
similar to the one used by Passoneau et al. (2009)
to analyse levels of inter-annotator agreement in
word sense annotation. Our measure is shown in
equation 1 where Senses is the set of possible
senses for an ambiguous term, |Senses| = n and
pairs). The similarity between a pair of senses,
sim(x,y), can be computed using any lexical sim-
ilarity measure, see Pedersen et al. (2004). Essen-
tially this measure computes the mean of the sim-
ilarities between each pair of senses for the term.
ing two of its members (i.e the set of unordered
sim measure =
One problem with comparing the data sets used
here is that they use a range of sense invento-
ries. Although lexical similarity measures have
been applied to WordNet (Pedersen et al., 2004)
and UMLS (Pedersen et al., 2007), it is not clear
that the scores they produce can be meaningfully
compared. To avoid this problem we mapped the
sense inventories onto a single resource: WordNet
The mapping was most straightforward for
Senseval-3 which uses WordNet 1.7.1 and could
be automatically mapped onto WordNet 3.0 senses
using publicly available mappings (Daud´ e et al.,
2000). The SemEval data contains a mapping
from the OntoNotes senses to groups of WordNet
2.1 senses. The first sense from this group was
mapped to WordNet 3.0 using the same mappings.
Mapping the NLM-WSD corpus was more
problematic and had to be carried out manually by
comparing sense definitions in UMLS and Word-
Net 3.0. We had expected this process to be diffi-
cult but found clear mappings for the majority of
senses. There were even found cases in which the
sense definitions were identical in both resources.
(The most likely reason for this is that some of
the resources that are included in the UMLS were
used to compile WordNet.) Another, more serious,
problem is related to the annotation scheme used
in the NLM-WSD corpus. If none of the possi-
ble senses in UMLS were judged to be appropri-
ate the annotators could label the sense as “None”.
We did not map these senses since it would require
examining each instance to determine the most ap-
propriate sense or senses in WordNet and we ex-
pected this to be error prone. In addition, there is
term labeled with “None” refer to the same mean-
ing. All of the “None” senses were removed from
the NLM-WSD data set and any terms where there
were more than ten instances marked as “None”
were also rejected from the similarity analysis.
This allowed us to compute the similarity score
for just 20 examples (40% of the total) although
we felt that this was a large enough sample to pro-
vide insight into the data set.
The WordNet::Similarity package (Ped-
ersen et al., 2004) was used to compute similar-
ity scores. Results are reported for three of the
measures in this package. (Other measures pro-
duced similar results.) The simple path measure
computes the similarity between a pair of nodes in
WordNet as the reciprocal of the number of edges
in the shortest path between them, the LCh mea-
sure (Leacock et al., 1998) also uses information
about the length of the shortest path between a pair
of nodesand combines this with information about
the maximum depth in WordNet and the JCn mea-
sure (Jaing and Conrath, 1997) makes use of in-
formation theory to assign probabilities to each of
the nodes in the WordNet hierarchy and computes
similarity based on these scores.
Table 3 shows the values of equation 1 for
the three similarity measures with scores averaged
across terms. These results indicate that for all
measures the Senseval-3 data set contains the most
ambiguity and NLM-WSD the least. This analysis
is consistent with the one carried out using mea-
sures based on sense distributions (Section 3.1)
Measure Download full-text
Table 3: Semantic similarity for each data set us-
ing a variety of measures
and suggest that the senses in the NLM-WSD data
set are more clearly distinguished than the other
This paper has explored a semi-supervised ap-
proach to the generation of labeled training data
for WSD that is based on relevance feedback
(Stevenson et al., 2008a). It was tested on three
data sets but was only found to generate examples
that were accurate enough to improve WSD per-
formance for one of these. The data set in which
a performance improvement was observed repre-
sented a limited domain (biomedicine) while the
designed to quantify the level of ambiguity were
applied to these data sets including ones based on
the distribution of senses and another designed to
quantify similarities between senses. These mea-
sures provided evidence that the corpus for which
the relevance feedback approach was successful
contained less ambiguity than the other two and
this suggests that the relevance feedback approach
is most appropriate when the level of ambiguity is
The experiments described in this paper high-
light the importance of the level of ambiguity on
the relevance feedback approach’s ability to gen-
erate useful labeled examples. Since it is semi-
supervised the ambiguity level can be checked us-
ing the measures used in this paper (Section 3)
and the performance of any automatically gener-
ated examples can be compared with the manu-
ally labeled ones (see Section 2.3) before deciding
whether or not they should be applied.
collection of very large linguistically processed
The wacky wide web: a
Language Resources and
J. Daud´ e, L. Padr´ o, and G. Rigau. 2000. Mapping
wordnets using structural information. In Proceed-
ings of ACL ’00, pages 504–511, Hong Kong.
J. Jaing and D. Conrath.
ity based on corpus statistics and lexical taxonomy.
In Proceedings of International Conference on Re-
search in Computational Linguistics, Taiwan.
1997. Semantic similar-
A. Kilgarriff and J. Rosenzweig. 2000. Framework
and results for English SENSEVAL. Computers and
the Humanities, 34(1-2):15–48.
C. Leacock, M. Chodorow, and G. Miller.
Using corpus statistics and WordNet relations for
sense identification.Computational Linguistics,
D. McCarthy, R. Koeling, J. Weeds, and J. Carroll.
text. In Proceedings of ACL’04, pages 279–286,
R. Mihalcea, T. Chklovski, and A. Kilgarriff. 2004.
The Senseval-3 English lexical sample task.
Proceedings of Senseval-3, pages 25–28, Barcelona,
R. Passoneau, A. Salleb-Aouissi, and N. Ide. 2009.
Making sense of word sense variation. In Proceed-
ings of SEW-2009, pages 2–9, Boulder, Colorado.
T. Pedersen, S. Patwardhan, and Michelizzi.
Wordnet::similarity - measuring the relatedness of
concepts. In Proceedings of AAAI-04, pages 1024–
1025, San Jose, CA.
T. Pedersen, S. Pakhomov, S. Patwardhan, and
C. Chute. 2007. Measures of semantic similarity
and relateness in the biomedical domain. Journal of
Biomedical Informatics, 40(3):288–299.
S. Pradhan, E. Loper, D. Dligach, and M. Palmer.
2007.SemEval-2007 Task-17: English Lexical
Sample, SRL and All Words.
SemEval-2007, pages 87–92, Prague, Czech Repub-
In Proceedings of
J. Rocchio. 1971. Relevance feedback in Informa-
tion Retrieval.In G. Salton, editor, The SMART
Retrieval System – Experiments in Automatic Doc-
ument Processing. Prentice Hall, Englewood Cliffs,
M. Stevenson, Y. Guo, and R. Gaizauskas.
Acquiring Sense Tagged Examples using Relevance
Feedback. In Proceedings of the Coling 2008, pages
809–816, Manchester, UK, August.
M. Stevenson, Y. Guo, R. Gaizauskas, and D. Martinez.
2008b. Disambiguation of biomedical text using di-
verse sources of information. BMC Bioinformatics,
M. Weeber, J. Mork, and A. Aronson. 2001. Devel-
oping a Test Collection for Biomedical Word Sense
Disambiguation. In Proceedings of AMIA Sympo-
sium, pages 746–50, Washington, DC.