Framing Word Sense Disambiguation as a Multi-Label Problem
for Model-Agnostic Knowledge Integration
Simone Conia Roberto Navigli
Sapienza NLP Group
Department of Computer Science
Sapienza University of Rome
{conia,navigli}@di.uniroma1.it
Abstract

Recent studies treat Word Sense Disambiguation (WSD) as a single-label classification problem in which one is asked to choose only the best-fitting sense for a target word, given its context. However, gold data labelled by expert annotators suggest that maximizing the probability of a single sense may not be the most suitable training objective for WSD, especially if the sense inventory of choice is fine-grained. In this paper, we approach WSD as a multi-label classification problem in which multiple senses can be assigned to each target word. Not only does our simple method bear a closer resemblance to how human annotators disambiguate text, but it can also be extended seamlessly to exploit structured knowledge from semantic networks to achieve state-of-the-art results in English all-words WSD.
1 Introduction

Word Sense Disambiguation (WSD) is traditionally framed as the task of associating a word in context with its correct meaning from a finite set of possible choices (Navigli, 2009). Following this definition, recently proposed neural models were trained to maximize the probability of the most appropriate meaning while minimizing the probability of the other possible choices (Huang et al., 2019; Vial et al., 2019; Blevins and Zettlemoyer, 2020; Bevilacqua and Navigli, 2020). Although this training objective proved to be extremely effective, and even led Bevilacqua and Navigli (2020) to reach the estimated upper bound of inter-annotator agreement for WSD performance on the unified evaluation framework of Raganato et al. (2017b), adhering to it underplays a fundamental aspect of how human annotators disambiguate text.

Indeed, past studies have observed that it is not uncommon for a word to have multiple appropriate meanings in a given context, meanings that can be used interchangeably under some circumstances because their boundaries are not clear cut (Tuggy, 1993; Kilgarriff, 1997; Hanks, 2000; Erk and McCarthy, 2009). This is especially evident if the underlying sense inventory is fine-grained, as the complexity, and therefore the performance, of WSD is tightly coupled to sense granularity (Lacerra et al., 2020). The difficulty an annotator faces in choosing the most appropriate meaning from a fine-grained sense inventory becomes clear from an analysis of gold standard datasets: a non-negligible 5% of the target words are annotated with two or more sense labels in several gold standard datasets, including Senseval-2 (Edmonds and Cotton, 2001), Senseval-3 (Snyder and Palmer, 2004), SemEval-2007 (Pradhan et al., 2007), SemEval-2013 (Navigli et al., 2013), and SemEval-2015 (Moro and Navigli, 2015). Therefore, we follow Erk and McCarthy (2009), Jurgens (2012), and Erk et al. (2013), and argue that forcing a system to treat WSD as a single-label classification problem, learning that only one sense is correct for a word in a given context, does not reflect how human beings disambiguate text.

In contrast to recent work, we approach WSD as a soft multi-label classification problem in which multiple senses can be assigned to each target word. We show that not only does this simple method bring significant improvements at low or no additional cost in terms of training and inference times and number of trainable parameters, but it can also be seamlessly extended to integrate senses from relational knowledge in structured form, e.g., similarity, hypernymy and hyponymy relations from semantic networks such as WordNet (Miller, 1995) and BabelNet (Navigli and Ponzetto, 2012). While structured knowledge has been naturally utilized by graph-based algorithms for WSD (Agirre and Soroa, 2009; Moro et al., 2014; Scozzafava et al., 2020), the incorporation of such information into neural approaches has recently been garnering significant attention. However, currently available models can only take advantage of this knowledge with purposely-built layers (Bevilacqua and Navigli, 2020) that require additional complexity and/or trainable parameters. To the best of our knowledge, the work presented in this paper is the first to integrate structured knowledge into a neural architecture at negligible cost in terms of training time and number of parameters, while at the same time attaining state-of-the-art results in English all-words WSD.
2 Method

Single-label vs multi-label. WSD is the task of selecting the best-fitting sense $s$ among the possible senses $S_w$ of a target word $w$ in a given context $c = \langle w_1, w_2, \ldots, w_n \rangle$, where $S_w$ is a subset of a predefined sense inventory $S$. Abstracting away from the intricacies of any particular supervised model for WSD, the output of a WSD system provides a probability $y_i$ for each sense $s_i \in S_w$. Recently proposed machine learning models (Kumar et al., 2019; Barba et al., 2020; Blevins and Zettlemoyer, 2020; Bevilacqua and Navigli, 2020, inter alia) are trained to maximize the probability of the single most appropriate sense $\hat{s}$ by minimizing the cross-entropy loss $\mathcal{L}_{CE}$:

$$\mathcal{L}_{CE}(w, \hat{s}) = -\log(y_{\hat{s}}) \qquad (1)$$
We observe that this loss function is only suitable for single-label classification problems. In the case of WSD, this is equivalent to assuming that there is just a single appropriate sense $\hat{s} \in S_w$ for the target word $w$ in the given context $c$, that is, that $\hat{s}$ is clearly dissimilar from any other sense in $S_w$. Indeed, minimizing the cross-entropy loss in order to maximize the probability of two or more senses generates conflicting training signals; at the same time, choosing to ignore one of the correct senses results in a loss of valuable information.

Since there is a not insignificant number of instances where multiple similar senses of the target word $w$ fit the given context $c$ (see Section 1), we frame WSD as a multi-label classification problem in which a machine learning model is trained to predict whether a sense $s \in S_w$ is appropriate for a word $w$ in a given context $c$, independently of the other senses in $S_w$. This is simply equivalent to minimizing the binary cross-entropy loss $\mathcal{L}_{BCE}$ on the probabilities of the candidate senses $S_w$:

$$\mathcal{L}_{BCE}(w, \hat{S}_w) = -\sum_{\hat{s} \in \hat{S}_w} \log(y_{\hat{s}}) - \sum_{s \in S_w \setminus \hat{S}_w} \log(1 - y_s) \qquad (2)$$

where $\hat{S}_w \subseteq S_w$ is the set of appropriate senses for the target word $w$ in the given context $c$. We note that this simple yet fundamental change in paradigm does not come with increased computational complexity, as $|S_w|$ is usually small. Moreover, it is independent of the underlying model used to calculate the output probabilities and, therefore, it does not increase the number of trainable parameters.
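
As a concrete illustration, the objective in Equation 2 can be implemented as a binary cross-entropy over the full sense inventory, masked so that only the candidate senses of the target word contribute. The following is a minimal PyTorch sketch under our own assumptions about tensor shapes and naming; it is not the released code.

```python
import torch
import torch.nn.functional as F

def multilabel_wsd_loss(logits, gold_mask, candidate_mask):
    """Binary cross-entropy restricted to the candidate senses S_w (Eq. 2).

    logits:         (batch, |S|) raw sense scores
    gold_mask:      (batch, |S|) float, 1.0 for senses in the gold set,
                    0.0 elsewhere
    candidate_mask: (batch, |S|) float, 1.0 for senses in S_w (or S_w^+),
                    0.0 elsewhere
    """
    # Per-sense BCE: -[t*log(y) + (1-t)*log(1-y)], computed on logits
    # for numerical stability.
    per_sense = F.binary_cross_entropy_with_logits(
        logits, gold_mask, reduction="none"
    )
    # Only the candidate senses of each target word contribute, so each
    # probability is learned independently of the others.
    per_sense = per_sense * candidate_mask
    return per_sense.sum(dim=-1).mean()
```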
Knowledge integration. If our model benefits from learning to assign multiple similar senses to a target word in a given context, then it makes sense that the very same model may also benefit from learning what related senses can be assigned to that word. For example, in the sentence "the quick brown fox jumps over the lazy dog", our model may formulate a better representation of fox if it is also trained to learn that any fox is a canine (hypernymy relation) or that the fox species includes arctic foxes, red foxes, and kit foxes (hyponymy relations). In this way, not only would the model learn that canines, foxes and arctic foxes are closely related, but it would also learn that canines and arctic foxes may have the ability to jump, and this could act as a data augmentation strategy, especially for those senses that do not appear in the training set.
There is a growing interest in injecting relational information from knowledge bases into neural networks but, so far, recent attempts have required purposely-designed strategies or layers. Among others, Kumar et al. (2019) aid their model with a gloss encoder that uses the WordNet graph structure; Vial et al. (2019) adopt a preprocessing strategy aimed at clustering related senses to decrease the number of output classes; Bevilacqua and Navigli (2020) introduce a logit aggregation layer that takes into account the neighboring meanings in the WordNet graph.
In contrast, our multi-labeling approach to WSD can be seamlessly extended to integrate relational knowledge from semantic networks such as WordNet without any increase in architectural complexity, training time, or number of trainable parameters. We simply relax the definition of the set of possible senses $S_w$ for a word $w$ to include all the senses related to a sense in $S_w$. More formally, let $G = (S, R)$ be a semantic network where $S$ is a sense inventory and $R$ is the set of semantic connections between any two senses. Then we define $S^+_w$ to also include every sense $s_j$ that is connected to any sense $s_i \in S_w$ by an edge $(s_i, s_j) \in R$, that is, $S^+_w = S_w \cup \{ s_j : (s_i, s_j) \in R,\ s_i \in S_w \}$. The loss function is updated accordingly to maximize not only the probability of the correct senses, but also the probability of their related senses:

$$\mathcal{L}_{BCE}(w, \hat{S}^+_w) = -\sum_{\hat{s} \in \hat{S}^+_w} \log(y_{\hat{s}}) - \sum_{s \in S^+_w \setminus \hat{S}^+_w} \log(1 - y_s) \qquad (3)$$

where $\hat{S}^+_w = \hat{S}_w \cup \{ s_j : (\hat{s}_i, s_j) \in R,\ \hat{s}_i \in \hat{S}_w \}$. We note that the increase in the number of possible choices ($|S^+_w| \geq |S_w|$) and correct meanings ($|\hat{S}^+_w| \geq |\hat{S}_w|$) does not hinder the learning process, since each probability is computed independently of the others. Finally, we stress that our approach to structured knowledge integration is completely model-agnostic, as it is independent of the architecture of the underlying supervised model.
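
In practice, the expanded candidate set $S^+_w$ can be built offline with a single hop over WordNet edges. Below is an illustrative sketch using NLTK's WordNet interface; the function name and the relation grouping are our own choices, made to mirror the relations listed in Section 3, and are not taken from the released code.

```python
from nltk.corpus import wordnet as wn

# One-hop relations grouped roughly as in the paper's ablation (Table 2):
# similarity-style relations vs. generalization/specification relations.
SIMILARITY = ("similar_tos", "also_sees", "verb_groups")
TAXONOMY = ("hypernyms", "hyponyms", "instance_hypernyms", "instance_hyponyms")

def expand_candidates(synsets, relations=SIMILARITY + TAXONOMY):
    """Build S_w^+ from S_w by following one hop of WordNet edges."""
    expanded = set(synsets)
    for synset in synsets:
        for rel in relations:
            expanded.update(getattr(synset, rel)())
        # derivationally-related-forms is defined on lemmas, not synsets.
        for lemma in synset.lemmas():
            expanded.update(
                l.synset() for l in lemma.derivationally_related_forms()
            )
    return expanded

# Example: expand the candidate senses of the noun "fox".
print(expand_candidates(wn.synsets("fox", pos=wn.NOUN)))
```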
Model description. In order to assess the benefits of our multi-labeling approach and avoid improvements that may not be related to the overall objective of this paper, we conduct our experiments with a simple WSD model. Similarly to Bevilacqua and Navigli (2020), this model is simply composed of BERT (large-cased, frozen), a non-linear layer, and a linear classifier. Thus, given a word $w$ in context, we build a contextualized representation $e_w \in \mathbb{R}^{d_{BERT}}$ of the word $w$ as the average of the corresponding hidden states of the last four layers of BERT, apply a non-linear transformation to obtain $h_w \in \mathbb{R}^{d_h}$ with $d_h = 512$, and finally apply a linear projection to $o_w \in \mathbb{R}^{|S|}$ to compute the sense scores. More formally:

$$e_w = \mathrm{BatchNorm}\left(\frac{1}{4}\sum_{i=1}^{4} b^i_w\right)$$
$$h_w = \mathrm{Swish}(W_h e_w + b_h)$$
$$o_w = W_o h_w + b_o$$

where $b^i_w$ is the hidden state of the $i$-th layer of BERT from the topmost one, $\mathrm{BatchNorm}(\cdot)$ is the batch normalization operation, and $\mathrm{Swish}(x) = x \cdot \mathrm{sigmoid}(x)$ is the Swish activation function (Ramachandran et al., 2017).
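
The architecture above can be sketched in a few lines of PyTorch with HuggingFace's Transformers. This is a simplified rendition under stated assumptions (subword-to-word pooling and batching details are omitted; the class name is ours), not the released implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SimpleWSDModel(nn.Module):
    """Frozen BERT encoder + non-linear layer + linear sense classifier."""

    def __init__(self, num_senses, d_h=512, encoder_name="bert-large-cased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(
            encoder_name, output_hidden_states=True
        )
        for p in self.bert.parameters():  # BERT is left frozen
            p.requires_grad = False
        d_bert = self.bert.config.hidden_size
        self.norm = nn.BatchNorm1d(d_bert)
        self.hidden = nn.Linear(d_bert, d_h)
        self.classifier = nn.Linear(d_h, num_senses)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # e_w: average of the hidden states of the last four layers.
        e = torch.stack(out.hidden_states[-4:]).mean(dim=0)
        batch, seq_len, d = e.shape
        e = self.norm(e.view(-1, d)).view(batch, seq_len, d)
        # h_w: Swish(x) = x * sigmoid(x), available as SiLU in PyTorch.
        h = nn.functional.silu(self.hidden(e))
        return self.classifier(h)  # o_w: one score per sense in S
```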
3 Experiments and Results

Experimental setup. We train our models in different configurations to assess the individual contribution of several factors. First, we compare our baseline model trained with a single-label objective (Equation 1) to the same model trained with a multi-label objective (Equation 2). Then, we gradually include structured knowledge in the form of WordNet relations using Equation 3, starting from similarity relations (similar-to, also-see, verb-group, and derivationally-related-form), and incrementally including generalization and specification relations (hypernymy, hyponymy, instance-hypernymy, instance-hyponymy). In order to keep a level playing field with single-label systems, we choose only the meaning with the highest probability for our multi-label models.
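
Concretely, this evaluation protocol amounts to a candidate-restricted argmax over the sense scores; the following helper (our own naming, not from the released code) is a minimal sketch:

```python
import torch

def predict_sense(logits: torch.Tensor, candidate_mask: torch.Tensor) -> torch.Tensor:
    """Return the index of the highest-scoring candidate sense.

    logits:         (batch, |S|) sense scores o_w
    candidate_mask: (batch, |S|) boolean mask of S_w (or S_w^+)
    """
    masked = logits.masked_fill(~candidate_mask, float("-inf"))
    return masked.argmax(dim=-1)
```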
Datasets. We evaluate the models on the Unified Evaluation Framework for English all-words WSD proposed by Raganato et al. (2017b). This evaluation includes five gold standard datasets, namely, Senseval-2, Senseval-3, SemEval-2007, SemEval-2013, and SemEval-2015. Following standard practice, we use the smallest gold standard, SemEval-2007, as our development set, and the remaining ones as test sets. We distinguish between two settings: closed and open. In the former setting, we include systems that only use SemCor (Miller et al., 1994) as the training corpus, while in the latter we also include those systems that use WordNet glosses and examples and/or Wikipedia.
Hyperparameters. We use the pretrained version of BERT-large-cased (Devlin et al., 2019) available in HuggingFace's Transformers library (Wolf et al., 2020) to build our contextualized embeddings (Section 2). BERT is left frozen, that is, its parameters are not updated during training. Each model is trained for 25 epochs using Adam (Kingma and Ba, 2015) with a learning rate of $10^{-4}$. We avoid hyperparameter tuning and opt for values that are close to the ones reported in the literature so as to have a fairer comparison.
Comparison systems. In order to have a comprehensive comparison with the current state of the art in WSD, we include the work of:
                                     SE2   SE3   SE07  SE13  SE15 | Nouns Verbs  Adj   Adv | ALL
SemCor only
Raganato et al. (2017a)              72.0  69.1  64.8  66.9  71.5 | 71.5  57.5  75.0  83.8 | 69.9
BERT-Large                           76.3  73.2  66.2  71.7  74.1 |  –     –     –     –   | 73.5
Hadiwinoto et al. (2019)             75.5  73.6  68.1  71.1  76.2 |  –     –     –     –   | 73.7
Peters et al. (2019)                  –     –     –     –     –   |  –     –     –     –   | 75.1
Vial et al. (2019)                    –     –     –     –     –   |  –     –     –     –   | 75.6
Vial et al. (2019) - Ensemble        77.5  77.4  69.5  76.0  78.3 | 79.6  65.9  79.5  85.5 | 76.7
This work                            78.4  77.8  72.2  76.7  78.2 | 80.1  67.0  80.5  86.2 | 77.6

SemCor + definitions / examples
Loureiro and Jorge (2019)            76.3  75.6  68.1  75.1  77.0 | 78.0  64.0  80.7  84.5 | 75.4
Scarlini et al. (2020a)               –     –     –    78.7   –   | 80.4   –     –     –   |  –
Conia and Navigli (2020)             77.1  76.4  70.3  76.2  77.2 | 78.7  65.6  81.1  84.7 | 76.4
Bevilacqua et al. (2020)             78.0  75.4  71.9  77.0  77.6 | 79.9  64.8  79.2  86.4 | 76.7
Huang et al. (2019)                  77.7  75.2  72.5  76.1  80.4 |  –     –     –     –   | 77.0
Scarlini et al. (2020b)              78.0  77.1  71.0  77.3  83.2 | 80.6  68.3  80.5  83.5 | 77.9
Blevins and Zettlemoyer (2020)       79.4  77.4  74.5  79.7  81.7 | 81.4  68.5  83.0  87.9 | 79.0
Bevilacqua and Navigli (2020)        80.8  79.0  75.2  80.7  81.8 | 82.9  69.4  82.9  87.6 | 80.1
This work                            80.4  77.8  76.2  81.8  83.3 | 82.9  70.3  83.4  85.5 | 80.2

Table 1: WSD results in F1 scores on Senseval-2 (SE2), Senseval-3 (SE3), SemEval-2007 (SE07), SemEval-2013 (SE13), SemEval-2015 (SE15), and, broken down by part of speech (Nouns, Verbs, Adj, Adv), on the concatenation of all the datasets (ALL). Top: closed setting (only SemCor allowed as the training corpus, without definitions and/or examples). Bottom: open setting (WordNet glosses and examples are also used for training). Dashes mark scores not reported in the original works.
WSD  Sim  See  Rel  Vrb  Hpe  Hpo  HpeI HpoI | SE07  ALL
SL    –    –    –    –    –    –    –    –   | 69.0  74.7
ML    –    –    –    –    –    –    –    –   | 69.2  75.7
ML    ✓    ✓    ✓    ✓    –    –    –    –   | 70.6  76.6
ML    ✓    ✓    ✓    ✓    ✓    –    –    –   | 71.0  77.0
ML    ✓    ✓    ✓    ✓    –    ✓    –    –   | 72.5  77.4
ML    ✓    ✓    ✓    ✓    ✓    ✓    –    –   | 72.2  77.6
ML    ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓   | 72.2  77.6

Table 2: WSD results in F1 scores on SemEval-2007 (SE07) and the concatenation of all the datasets (ALL). SL/ML: single-label/multi-label. Sim: similar-to. See: also-see. Rel: derivationally-related-forms. Vrb: verb-groups. Hpe: hypernymy. Hpo: hyponymy. HpeI: instance-hypernyms. HpoI: instance-hyponyms.
- Raganato et al. (2017a), which was one of the first to propose a neural sequence model for WSD based on a stack of BiLSTM layers;
- BERT-Large, a simple 1-nearest-neighbor approach based on the last hidden state of the BERT-large-cased model (Loureiro and Jorge, 2019);
- Hadiwinoto et al. (2019), which was among the first to exploit pretrained contextualized models for WSD;
- Peters et al. (2019), which incorporated WSD knowledge directly into the training process of BERT;
- Huang et al. (2019), which tasked the model with learning which gloss is the most appropriate for a word in context;
- Bevilacqua et al. (2020), which tackled WSD as a gloss generation problem;
- Loureiro and Jorge (2019) and Conia and Navigli (2020), which created and enhanced sense embeddings with relational knowledge from WordNet and BabelNet;
- Scarlini et al. (2020a), which proposed nominal sense embeddings built by exploiting BabelNet to automatically retrieve sense-specific contexts;
- Scarlini et al. (2020b), which extended the above approach to non-nominal senses and multiple languages;

alongside the aforementioned work of Vial et al. (2019), Blevins and Zettlemoyer (2020), and Bevilacqua and Navigli (2020).
The systems are divided into two groups in Table 1: in the upper part we compare our approach against those systems that do not take advantage of information coming from WordNet glosses and/or examples, while in the lower part we also include those systems that make use of such knowledge.
Results. The first two rows of Table 2 show the results of switching from a single-label to a multi-label approach for WSD: this single change already brings a significant improvement in performance (+1.0% in F1 score, significant with p < 0.1, χ² test). Not only that, increasing the number and variety of WordNet relations further increases the performance of the model, with hyponyms being particularly beneficial (+0.8% in F1 score). Unfortunately, including instance hypernyms and instance hyponyms does not bring further improvements; this may be due to the relatively low number of instances that can take advantage of such relations in SemCor.

Nonetheless, the results obtained set a new state of the art among single and ensemble systems trained only on SemCor without the use of additional training data or resources external to WordNet such as Wikipedia, surpassing the previous state-of-the-art non-ensemble system of Vial et al. (2019) by 2.0% in F1 score (significant with p < 0.05, χ² test), as shown in Table 1. When further trained on the WordNet glosses and examples, our model attains state-of-the-art results (+1.2% and +0.1% in F1 score compared to the systems of Blevins and Zettlemoyer (2020) and Bevilacqua and Navigli (2020), respectively), despite being simpler than most of the techniques it is compared against.
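
The paper reports χ² significance without detailing the construction of the test; one common setup, shown here purely as an illustrative sketch with hypothetical counts, compares the correct/incorrect tallies of two systems in a 2×2 contingency table:

```python
from scipy.stats import chi2_contingency

def chi2_p_value(correct_a: int, correct_b: int, total: int) -> float:
    """p-value for the difference between two systems' correct counts."""
    table = [
        [correct_a, total - correct_a],
        [correct_b, total - correct_b],
    ]
    _, p, _, _ = chi2_contingency(table)
    return p

# Hypothetical counts for two systems on the same evaluation set.
print(chi2_p_value(correct_a=5600, correct_b=5530, total=7250))
```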
4 Conclusion

WSD is a key task in Natural Language Understanding with several open challenges, the granularity of sense inventories undoubtedly being the most pressing issue (Navigli, 2018). We departed from recent work on WSD and investigated the effect of tackling the task as a multi-label classification problem. Not only is our approach simple and model-agnostic, but it can also be seamlessly extended to integrate relational knowledge in structured form from semantic networks such as WordNet, at no extra cost in terms of architectural complexity, training times, and number of parameters.

Our experiments show that our method, thanks to its more comprehensive notion of loss over equally valid and structurally-related senses, achieves state-of-the-art results in English all-words WSD, especially when there is a lower amount of annotated text available. These results open the path to further research in this direction, from exploring more complex models and richer knowledge bases to exploiting multiple labels in innovative disambiguation settings which can overcome the fine granularity of sense inventories. Moreover, our knowledge integration approach could potentially be applied to address the knowledge acquisition bottleneck in multilingual WSD (Pasini, 2020; Pasini et al., 2021). Finally, with the rise of ever more complex general and specialized pretrained models, we believe that our simple model-agnostic approach can be another step towards knowledge-based (self-)supervision.

We release our software and model checkpoints at https://github.com/SapienzaNLP/multilabel-wsd.
Acknowledgments

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union's Horizon 2020 research and innovation programme.

This work was supported in part by the MIUR under the grant "Dipartimenti di eccellenza 2018-2022" of the Department of Computer Science of Sapienza University.
References

Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for Word Sense Disambiguation. In EACL 2009, 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, Athens, Greece, March 30 - April 3, 2009.

Edoardo Barba, Luigi Procopio, Niccolò Campolungo, Tommaso Pasini, and Roberto Navigli. 2020. MuLaN: Multilingual label propagation for Word Sense Disambiguation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020.

Michele Bevilacqua, Marco Maru, and Roberto Navigli. 2020. Generationary or: "How we went beyond word sense inventories and learned to gloss". In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online.

Michele Bevilacqua and Roberto Navigli. 2020. Breaking through the 80% glass ceiling: Raising the state of the art in Word Sense Disambiguation by incorporating knowledge graph information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020.

Terra Blevins and Luke Zettlemoyer. 2020. Moving down the long tail of Word Sense Disambiguation with gloss informed bi-encoders. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020.

Simone Conia and Roberto Navigli. 2020. Conception: Multilingually-enhanced, human-readable concept vector representations. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional Transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota.

Philip Edmonds and Scott Cotton. 2001. SENSEVAL-2: Overview. In Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, SENSEVAL@ACL 2001, Toulouse, France, July 5-6, 2001.

Katrin Erk and Diana McCarthy. 2009. Graded word sense assignment. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, 6-7 August 2009, Singapore, A meeting of SIGDAT, a Special Interest Group of the ACL.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2013. Measuring word meaning in context. Computational Linguistics, 39(3):511–554.

Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung Gan. 2019. Improved Word Sense Disambiguation using pre-trained contextualized word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019.

Patrick Hanks. 2000. Do word meanings exist? Computers and the Humanities, 34(1-2):205–215.

Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. 2019. GlossBERT: BERT for Word Sense Disambiguation with gloss knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019.

David Jurgens. 2012. An evaluation of graded sense disambiguation using word sense induction. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, *SEM 2012, June 7-8, 2012, Montréal, Canada.

Adam Kilgarriff. 1997. I don't believe in word senses. Computers and the Humanities, 31(2):91–113.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Sawan Kumar, Sharmistha Jat, Karan Saxena, and Partha P. Talukdar. 2019. Zero-shot Word Sense Disambiguation using sense definition embeddings. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers.

Caterina Lacerra, Michele Bevilacqua, Tommaso Pasini, and Roberto Navigli. 2020. CSI: A coarse sense inventory for 85% word sense disambiguation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020.

Daniel Loureiro and Alípio Jorge. 2019. Language modelling makes sense: Propagating representations through WordNet for full-coverage Word Sense Disambiguation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers.

George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41.

George A. Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G. Thomas. 1994. Using a semantic concordance for sense identification. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.

Andrea Moro and Roberto Navigli. 2015. SemEval-2015 task 13: Multilingual all-words Sense Disambiguation and entity linking. In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2015, Denver, Colorado, USA, June 4-5, 2015.

Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity Linking meets Word Sense Disambiguation: A Unified Approach. Transactions of the Association for Computational Linguistics (TACL), 2.

Roberto Navigli. 2009. Word Sense Disambiguation: A survey. ACM Comput. Surv., 41(2):10:1–10:69.

Roberto Navigli. 2018. Natural language understanding: Instructions for (present and future) use. In IJCAI, pages 5697–5702.

Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. SemEval-2013 task 12: Multilingual Word Sense Disambiguation. In Proceedings of the 7th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2013, Atlanta, Georgia, USA, June 14-15, 2013.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Tommaso Pasini. 2020. The knowledge acquisition bottleneck problem in multilingual Word Sense Disambiguation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 4936–4942.

Tommaso Pasini, Alessandro Raganato, and Roberto Navigli. 2021. XL-WSD: An extra-large and cross-lingual evaluation framework for word sense disambiguation. In Proc. of AAAI.

Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019.

Sameer Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. SemEval-2007 task-17: English lexical sample, SRL and all words. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval@ACL 2007, Prague, Czech Republic, June 23-24, 2007.

Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017a. Neural sequence learning models for Word Sense Disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017.

Alessandro Raganato, José Camacho-Collados, and Roberto Navigli. 2017b. Word Sense Disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers.

Prajit Ramachandran, Barret Zoph, and Quoc V. Le. 2017. Searching for activation functions. arXiv, abs/1710.05941.

Bianca Scarlini, Tommaso Pasini, and Roberto Navigli. 2020a. SensEmBERT: Context-enhanced sense embeddings for multilingual Word Sense Disambiguation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020.

Bianca Scarlini, Tommaso Pasini, and Roberto Navigli. 2020b. With more contexts comes better performance: Contextualized sense embeddings for all-round Word Sense Disambiguation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online.

Federico Scozzafava, Marco Maru, Fabrizio Brignone, Giovanni Torrisi, and Roberto Navigli. 2020. Personalized PageRank with syntagmatic information for multilingual Word Sense Disambiguation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), demos, Online.

Benjamin Snyder and Martha Palmer. 2004. The English all-words task. In Proceedings of the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, SENSEVAL@ACL 2004, Barcelona, Spain, July 25-26, 2004.

David Tuggy. 1993. Ambiguity, polysemy, and vagueness. Cognitive Linguistics, 4:273–290.

Loïc Vial, Benjamin Lecouteux, and Didier Schwab. 2019. Sense vocabulary compression through the semantic knowledge of WordNet for neural Word Sense Disambiguation.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace's Transformers: State-of-the-art natural language processing.
The knowledge acquisition bottleneck strongly affects the creation of multilingual sense-annotated data, hence limiting the power of supervised systems when applied to multilingual Word Sense Disambiguation. In this paper, we propose a semi-supervised approach based upon a novel label propagation scheme, which, by jointly leveraging contextualized word embeddings and the multilingual information enclosed in a knowledge base, projects sense labels from a high-resource language, i.e., English, to lower-resourced ones. Backed by several experiments, we provide empirical evidence that our automatically created datasets are of a higher quality than those generated by other competitors and lead a supervised model to achieve state-of-the-art performances in all multilingual Word Sense Disambiguation tasks. We make our datasets available for research purposes at https://github.com/SapienzaNLP/mulan.