Incorporating Word Sense Disambiguation in Neural Language Models
Jan Philip Wahle1, Terry Ruas1, Norman Meuschke1, Bela Gipp1
1University of Wuppertal, Rainer-Gruenter-Str. 21, D-42119, Wuppertal, Germany
1last@uni-wuppertal.de
Abstract
We present two supervised (pre-)training meth-
ods to incorporate gloss definitions from lex-
ical resources into neural language models
(LMs). The training improves our models’
performance for Word Sense Disambiguation
(WSD) but also benefits general language un-
derstanding tasks while adding almost no pa-
rameters. We evaluate our techniques with
seven different neural LMs and find that XL-
Net is more suitable for WSD than BERT. Our
best-performing methods exceed state-of-the-
art WSD techniques on the SemCor 3.0 dataset
by 0.5% F1 and increase BERT’s performance
on the GLUE benchmark by 1.1% on average.
1 Introduction
WSD seeks to determine the meaning of words
in a context and is a fundamental challenge
in natural language processing (NLP) (Weaver,
1955;Navigli,2009). Knowledge-based methods
for WSD (Camacho-Collados et al.,2015) lever-
age lexical knowledge databases (LKB), such as
WordNet (Miller,1995;Fellbaum,1998). Super-
vised WSD techniques (Pasini and Navigli,2020)
rely on annotated data, while unsupervised tech-
niques (Chaplot and Salakhutdinov,2018) explore,
e.g., larger contexts or topic modeling.
Recently, supervised WSD methods (Huang
et al.,2019;Bevilacqua and Navigli,2020) increas-
ingly rely on word representations from BERT (De-
vlin et al.,2019), although advances in bidirec-
tional transformers seem more promising for the
task (Yang et al.,2019;Clark et al.,2020). We
define an end-to-end WSD approach applicable to
any LM and evaluate seven of these novel models
regarding their suitability for WSD.
Aside from WSD, pre-trained word representa-
tions have become crucial for LMs and almost any
other NLP task (Mikolov et al., 2013a; Radford et al., 2018). LMs are trained on large unlabeled
corpora and often ignore relevant information on
word senses available in LKB (e.g., gloss¹).
We propose two supervised methods that inte-
grate WordNet knowledge in LMs during the pre-
training phase and show the improved semantic
representations benefit WSD and other tasks, such
as text-similarity. The repository for all experi-
ments is publicly available².
2 Related Work
In the same way that word2vec (Mikolov et al., 2013b) inspired many models in NLP (Bojanowski et al., 2017; Ruas et al., 2020), BERT (Devlin et al., 2019) has inspired a wave of recent models (Yang et al., 2019; Clark et al.,
2020). These novel BERT-based models achieve
higher performance in several NLP tasks but are—
with few exceptions—neglected in the WSD do-
main (Loureiro et al.,2020).
Using the Transformer (Vaswani et al.,2017)
architecture, BERT (Devlin et al.,2019) has two
pre-training tasks to capture general language as-
pects, i.e., Masked Language Model (MLM) and
Next Sentence Prediction (NSP). BERT is a strong
baseline but recent studies show the model has
not reached its full capacity; its training scheme
still offers opportunities for improvement. Al-
BERT (Lan et al.,2019) and DistilBERT (Sanh
et al.,2019) make BERT more efficient through pa-
rameter adjustments and distilled knowledge, while
RoBERTa (Liu et al.,2019) increases BERT’s train-
ing volume. XLNet (Yang et al.,2019) focuses
on improving the training objective, while ELEC-
TRA (Clark et al.,2020) and BART (Lewis et al.,
2020) propose a discriminative denoising method
to distinguish real from plausible artificially gen-
erated input tokens. Methods related to ours still
use BERT’s semantic representations to perform
¹ Brief definition of a synonym set (synset) (Miller, 1995)
² https://tinyurl.com/y66h2fhp
WSD (Huang et al.,2019;Du et al.,2019;Peters
et al.,2019;Levine et al.,2019).
Directly related to our work, GlossBERT (Sent-
CLS-WS) (Huang et al.,2019) uses WordNet’s
glosses to fine-tune BERT for the WSD task. Gloss-
BERT classifies a marked word in a sentence into
one of its possible definitions. Du et al. (2019)
fine-tune BERT in the WSD task using word-sense
definitions from WordNet combining an encoder
and classifier. KnowBERT (KBERT) (Peters et al.,
2019) incorporates LKB into BERT using knowl-
edge attention and a recontextualization mecha-
nism. KBERT-W+W, the best-performing model
of Peters et al. (2019), surpasses BERTBASE at the cost of 400M parameters and 32% more training time. Our methods are computationally
cheaper alternatives as they do not require adjust-
ing the embeddings from the LKB or using word-
piece attention. Recent contributions in WSD, e.g.,
LMMS (Loureiro and Jorge,2019), BEM (Blevins
and Zettlemoyer,2020), GLU (Hadiwinoto et al.,
2019), and EWISER (Bevilacqua and Navigli,
2020), enhance BERT’s semantic representation
via context or external knowledge, but do not ex-
plore generalizing the model to other NLP tasks.
3 Methods
We introduce a method to perform WSD with ar-
bitrary LMs and explore architectural changes to
increase our model’s performance (Section 3.1).
We hypothesize that WSD can complement MLM
as polysemous words occur frequently in natural
language. In Section 3.2, we present a variation of
our method that improves polysemy resolution and
keeps general language understanding.
3.1 Language Model Gloss Classification
With Language Model Gloss Classification
(LMGC), we propose a model-independent end-to-
end WSD approach to classify ambiguous words
from sentences into one of WordNet’s glosses. This
approach enables applying different LMs for WSD.
LMGC performs classification using the final
representations of its underlying transformer. The
classification approach is closely related to that of Huang et al. (2019). Each input sequence starts with an aggregate token (e.g., the “[CLS]” token in BERT), followed by an annotated sentence containing the ambiguous word and a candidate gloss for that word from a lexical resource, such as WordNet.
Sentence and gloss are concatenated with a sep-
arator token and pre-processed using the model’s
tokenizer. We modify the input sequence according
to Huang et al. (2019) with two supervision signals:
(1) highlighting the ambiguous tokens with two
special tokens and (2) adding the polysemous word
before the gloss.
Considering the findings of Du et al. (2019) and Huang et al. (2019), we apply a linear layer to the ag-
gregate representation of the sequence to perform
classification rather than using token embeddings.
In contrast to sequential binary classification, we
suggest modifying the prediction to a parallel multi-
classification construct, similar to Kågebäck and
Salomonsson (2016). Therefore, we stack the k candidate sentence-gloss pairs and classify them in one forward pass using softmax.
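As a concrete illustration, the following minimal sketch shows how such stacked sentence-gloss pairs could be scored in a single forward pass with a Hugging Face encoder. It is not the released implementation of this paper; the marking scheme, the 2-way head, the model name, and the helper names are assumptions made for illustration.

```python
# Illustrative LMGC-style sketch (not the paper's released code).
# Assumes a Hugging Face encoder (here bert-base-uncased) and that the
# candidate glosses for the target word were already retrieved from WordNet.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
head = nn.Linear(encoder.config.hidden_size, 2)  # one plausible reading of the 2H + 2 extra parameters

def lmgc_scores(sentence, target, candidate_glosses):
    # Supervision signals following Huang et al. (2019): mark the ambiguous
    # word in the sentence and prepend it to each candidate gloss.
    marked = sentence.replace(target, f'" {target} "', 1)
    batch = tokenizer(
        [marked] * len(candidate_glosses),
        [f"{target} : {g}" for g in candidate_glosses],
        padding=True, truncation=True, max_length=160, return_tensors="pt",
    )
    cls = encoder(**batch).last_hidden_state[:, 0]   # aggregate ([CLS]) states
    logits = head(cls)[:, 1]                         # "correct gloss" score per pair
    return torch.softmax(logits, dim=-1)             # distribution over the k candidate senses

# Hypothetical usage: `glosses` is a placeholder list of WordNet definitions.
# probs = lmgc_scores("He sat on the bank of the river.", "bank", glosses)
```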
To reduce training time to 1/3 of that required by the approach of Huang et al. (2019), we reduce the sequence length of all models from 512 to 160 tokens³ as the computational cost of transformers grows quadratically with the sequence length.

³ 99.8% of the dataset can be represented with 160 tokens; we truncate the remaining sequences to this limit.
3.2 LMGC with Masked Language Modeling
LMGC focuses on improving the performance in
WSD rather than leveraging the model’s obtained
knowledge from lexical resources for language un-
derstanding. We assume the transfer learning be-
tween LMs and WSD increases the likelihood of
understanding polysemous words in related tasks.
Thus, we employ LMGC into MLM as an addi-
tional supervised training objective (LMGC-M) to
incorporate lexical knowledge into our pre-training.
LMGC-M performs a forward pass using an-
notated examples from our corpus in which we
masked words with a certain probability. More-
over, LMGC-M uses LMGC as a second objective,
similar to NSP in BERT. To prevent underfitting,
due to task difficulty, we only mask words in the
context of the polysemous word. Before inference,
we fine-tune LMGC without masks.
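A rough sketch of how the two objectives might be combined in one pass is shown below, assuming a standard masked-LM head on top of the same encoder. The masking helper, the context heuristic, and the equal loss weighting are illustrative assumptions rather than the exact recipe; plain cross-entropy is used for the gloss objective here for brevity, whereas the setup in Section 4.1 uses focal loss.

```python
# Illustrative LMGC-M sketch: joint gloss classification + MLM in one pass.
# Not the authors' implementation.
import torch
from torch import nn
from transformers import AutoTokenizer, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
gloss_head = nn.Linear(model.config.hidden_size, 2)

def lmgc_m_loss(batch, gloss_labels, mask_prob=0.15):
    input_ids = batch["input_ids"].clone()
    mlm_labels = torch.full_like(input_ids, -100)   # -100 is ignored by the MLM loss
    # Mask only tokens in the context of the polysemous word; here "context"
    # is approximated as the first segment (token_type_ids == 0), excluding
    # special tokens -- a simplifying assumption.
    special = (input_ids == tokenizer.cls_token_id) | (input_ids == tokenizer.sep_token_id)
    context = (batch["token_type_ids"] == 0) & ~special
    masked = context & (torch.rand(input_ids.shape, device=input_ids.device) < mask_prob)
    mlm_labels[masked] = input_ids[masked]
    input_ids[masked] = tokenizer.mask_token_id

    out = model(input_ids=input_ids,
                attention_mask=batch["attention_mask"],
                token_type_ids=batch["token_type_ids"],
                labels=mlm_labels,
                output_hidden_states=True)
    cls = out.hidden_states[-1][:, 0]                # aggregate representation
    gloss_loss = nn.functional.cross_entropy(gloss_head(cls), gloss_labels)
    return out.loss + gloss_loss                     # MLM loss + gloss classification loss
```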
4 Experiments
We evaluate our methods using the SemCor
(3.0) (Miller et al.,1993;Raganato et al.,2017)
and GLUE (Wang et al.,2019b) benchmarks. Sem-
Cor is a popular all-words WSD benchmark for
English (Huang et al.,2019;Peters et al.,2019)
and one of the largest manually annotated datasets
(approx. 226k word sense annotations from WordNet, cf. Table 1). GLUE (Wang et al., 2019b) is
a collection of nine language understanding tasks
widely used to validate the generalization of LMs
for different linguistic phenomena (Devlin et al.,
2019;Lan et al.,2019). All GLUE tasks are single-
sentence or sentence-pair classifications, except
STS-B, which is a regression task.
Dataset | Noun | Verb | Adj. | Adv. | Total | Pos. | Neg.
SemCor | 87k | 88.3k | 31.7k | 18.9k | 226k | 226.5k | 1.79m
SE2 | 1k | 517 | 445 | 254 | 2.3k | 2.4k | 14.2k
SE3 | 900 | 588 | 350 | 12 | 1.8k | 1.8k | 15.3k
SE7 | 159 | 296 | 0 | 0 | 455 | 459 | 4.5k
SE13 | 1.6k | 0 | 0 | 0 | 1.6k | 1.6k | 9.7k
SE15 | 531 | 251 | 160 | 80 | 1k | 1.2k | 6.5k

Table 1: SemCor training corpus details: POS-tag counts (columns Noun to Total) and class distribution for LMGC (columns Pos. and Neg.).
4.1 Setup
We initialized all models using the base configuration of their underlying transformer (e.g., XLNetBASE with L=12, H=768, A=12). Both our methods have 2H + 2 more parameters than their baseline (e.g., LMGC (BERT) has 110M
parameters). For each polysemous word, we re-
trieved all possible gloss definitions from WordNet
to create sentence-gloss inputs. We increased the
hidden dropout probability to 0.2 as we observed overfitting for most models. Further, we treated the class imbalance of positive and negative examples (Table 1) with focal loss (Lin et al., 2017) (γ = 2, α = 0.25). Following Devlin et al. (2019), we used a batch size of 32 sequences, the AdamW optimizer with a learning rate of 2e-5, trained for three epochs, and chose the best
model according to validation loss. We applied the
same hyperparameter configuration for all models
used in both the SemCor and GLUE benchmarks.
The training was performed on 1 NVIDIA Tesla
V100 GPU for 3 hours per epoch.
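For reference, the binary focal loss with these settings can be written compactly as follows; this is the standard formulation from Lin et al. (2017), shown as an illustrative sketch rather than an excerpt from the repository code.

```python
# Binary focal loss (Lin et al., 2017) for the positive/negative
# sentence-gloss imbalance; standard formulation with gamma=2, alpha=0.25.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """logits: raw positive-class scores; targets: 0/1 gloss labels."""
    bce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    p_t = torch.exp(-bce)                                   # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```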
For all GLUE tasks except STS-B, we transformed the aggregate embedding into a classification vector by applying a new weight matrix $W \in \mathbb{R}^{K \times H}$, where $K$ is the number of labels. For STS-B, we applied a new weight matrix $V \in \mathbb{R}^{1 \times H}$ that transforms the aggregate into a single value.
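In code, these task heads amount to a single new linear layer each; a minimal sketch, assuming an aggregate embedding of size H = 768 and an illustrative label count:

```python
# GLUE task heads: W in R^{K x H} for classification, V in R^{1 x H} for the
# STS-B regression; batch size and label count below are illustrative.
import torch
from torch import nn

H = 768                                       # hidden size of the base models
cls_head = nn.Linear(H, 3)                    # e.g., K = 3 labels (MNLI)
reg_head = nn.Linear(H, 1)                    # STS-B similarity score

aggregate = torch.randn(8, H)                 # batch of aggregate ([CLS]) embeddings
class_logits = cls_head(aggregate)            # shape (8, K)
similarity = reg_head(aggregate).squeeze(-1)  # shape (8,)
```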
4.2 Results & Discussion
Table 2 reports the results of applying LMGC to
different transformer models. Our rationale for
choosing the models was two-fold. First, we ex-
System | SE7 | SE2 | SE3 | SE13 | SE15 | All
BERT (2019) | 71.9 | 77.8 | 74.6 | 76.5 | 79.7 | 76.6
RoBERTa (2019) | 69.2 | 77.5 | 73.8 | 77.2 | 79.7 | 76.3
DistilBERT (2019) | 66.2 | 74.9 | 70.7 | 74.6 | 77.1 | 73.5
AlBERT (2019) | 71.4 | 75.9 | 73.9 | 76.8 | 78.7 | 75.7
BART (2020) | 67.2 | 77.6 | 73.1 | 77.5 | 79.7 | 76.1
XLNet (2019) | 72.5 | 78.5 | 75.6 | 79.1 | 80.1 | 77.2
ELECTRA (2020) | 62.0 | 71.5 | 67.0 | 73.9 | 76.0 | 70.9

Table 2: SemCor test results of LMGC for base transformer models. Bold font indicates the best results.
System | SE7 | SE2 | SE3 | SE13 | SE15 | All
GAS (2018b) | - | 72.2 | 70.5 | 67.2 | 72.6 | 70.6
CAN (2018a) | - | 72.2 | 70.2 | 69.1 | 72.2 | 70.9
HCAN (2018a) | - | 72.8 | 70.3 | 68.5 | 72.8 | 71.1
LMMSBERT (2019) | 68.1 | 76.3 | 75.6 | 75.1 | 77.0 | 75.4
GLU (2019) | 68.1 | 75.5 | 73.6 | 71.1 | 76.2 | 74.1
GlossBERT (2019) | 72.5 | 77.7 | 75.2 | 76.1 | 80.4 | 77.0
BERTWSD (2019) | - | 76.4 | 74.9 | 76.3 | 78.3 | 76.3
KBERT-W+W (2019) | - | - | - | - | - | 75.1
LMGC (BERT) | 71.9 | 77.8 | 74.6 | 76.5 | 79.7 | 76.6
LMGC-M (BERT) | 72.9 | 78.2 | 75.5 | 76.3 | 79.5 | 77.0
LMGC (XLNet) | 72.5 | 78.5 | 75.6 | 79.1 | 80.1 | 77.2
LMGC-M (XLNet) | 73.0 | 79.1 | 75.9 | 79.0 | 80.3 | 77.5

Table 3: SemCor test results compared to state-of-the-art techniques. Bold font indicates the best results.
plore models closely related to or based on BERT.
Either the chosen models improve BERT through
additional training time and data (RoBERTa), or
compress the architecture with minimal perfor-
mance loss (DistilBERT, AlBERT). Second, we
chose models that significantly change the train-
ing objective (XLNet), or employ a discriminative
learning approach (ELECTRA, BART). In Table 3,
we compare our techniques to other contributions
in WSD. We report all results of SemCor according
to Raganato et al. (2017).
RoBERTa shows a lower F1 score than BERT although it uses more data and training time. DistilBERT and AlBERT perform worse than BERT, which we expected given that they use significantly fewer parameters. However, AlBERT achieves a reasonable performance with only 10% of BERT’s parameters. The results of ELECTRA and BART show the models’ denoising approach is not suitable for our WSD setup. Besides, BART achieves a similar performance as BERT but uses 26% more parameters. XLNet consistently performs better than BERT on all evaluation sets while using only marginally more parameters. Therefore, we
selected it for our models’ variation. Considering large models, preliminary experiments² showed a difference of 0.08% in F1 between BERTBASE and BERTLARGE for the SemCor datasets, which is in line with Blevins and Zettlemoyer (2020). Thus, we consider the base configuration sufficient for our experiments.
System | CoLA (mc) | SST-2 (acc) | MRPC (F1) | STS-B (sc) | QQP (acc) | MNLI m/mm (acc) | QNLI (acc) | RTE (acc) | Average
BERTBASE | 52.1 | 93.5 | 88.9 | 85.8 | 89.3 | 84.6/83.4 | 90.5 | 66.4 | 81.4
GlossBERT | 32.8 | 90.4 | 75.2 | 90.4 | 68.5 | 81.3/80 | 83.6 | 47.3 | 70.7
LMGC (BERT) | 31.1 | 89.2 | 81.9 | 89.2 | 87.4 | 81.4/80.3 | 85.4 | 60.2 | 74.5
LMGC-M (BERT) | 55.0 | 94.2 | 87.1 | 88.1 | 90.8 | 85.3/84.2 | 90.1 | 69.7 | 82.5

Table 4: GLUE test results, grouped into classification (CoLA, SST-2), semantic similarity (MRPC, STS-B, QQP), and natural language inference (MNLI, QNLI, RTE) tasks. As in BERT, we exclude the problematic WNLI set. We report F1-score for MRPC, Spearman correlations (sc) for STS-B, Matthews correlations (mc) for CoLA, and accuracy (acc) for the other tasks (with matched/mismatched accuracy for MNLI). Bold font indicates the best results.
Table 3 shows an overall improvement when comparing LMGC to the other approaches. LMGC (BERT) generally outperforms the baseline BERTWSD approach and KBERT-W+W, which
has four times the number of parameters. We
can outperform GlossBERT in all test sets by us-
ing an optimal transformer (XLNet) and adjust-
ments in the training procedure. We exclude
EWISER (Bevilacqua and Navigli, 2020), which explores additional knowledge beyond gloss definitions (e.g., knowledge graphs). We leave for future
work the investigation of BEM (Blevins and Zettle-
moyer,2020), a recently published bi-encoder with
separate encoders for context and gloss that are
learned simultaneously.
LMGC-M often outperforms LMGC, which we
assume is due to LMGC-M’s similarity to discriminative fine-tuning (Howard and Ruder, 2018). We
combine LMGC and MLM in one pass, achiev-
ing higher accuracy in WSD and improving gen-
eralization. To show that WSD training allows
language models to achieve better generalization,
we fine-tune the weights of our approaches in the
GLUE (Wang et al.,2019b) datasets. The results
in Tables 3 and 4 show LMGC-M outperforms the
state-of-the-art in the WSD task and successfully
transfers the acquired knowledge to general lan-
guage understanding datasets. We exclude XLNet
from the comparison to show that the additional
performance is attributable mainly to our method, not to the improvement of XLNet over BERT. The
number of polysemous words in the GLUE bench-
mark is generally high, supporting the training de-
sign of our method. We provide more details about
polysemy in GLUE in our repository2.
We evaluated our proposed methods against the
best-performing model in WSD (Table 3) on the
GLUE datasets (Table 4). Comparing LMGC-M
with the official BERTBASE model, we achieved
an average increase in performance of 1.1%. In this
work, we did not compare LMGC-M to other WSD
methods performing worse than GlossBERT in the
WSD task (Table 3) due to their computational re-
quirements (i.e., KBERT-W+W is 32% slower).
Unsurprisingly, LMGC and GlossBERT performed
well in WSD but could not maintain performance on
other GLUE tasks. LMGC-M either outperformed
the underlying baseline (BERT) on most tasks or
it obtained comparable results. Therefore, incorporating MLM into our WSD architecture leverages
LMGC’s semantic representation and improves its
natural language understanding capabilities.
5 Conclusions and Future Work
We proposed two methods (LMGC, LMGC-M) that
allow for (pre-)training WSD models, which is es-
sential for many NLP tasks (e.g., text-similarity).
Our techniques perform WSD by combining neural
language models with lexical resources from Word-
Net. We exceeded state-of-the-art WSD methods
(+0.5%) and improved the performance over BERT
in general language understanding tasks (+1.1%).
Our future work will include testing generaliza-
tion on the WiC (Pilehvar and Camacho-Collados,
2019) and SuperGLUE (Wang et al., 2019a)
datasets. Furthermore, we want to test discriminative fine-tuning (Howard and Ruder, 2018) against our parallel approach and perform an ablation
study to investigate which components of our meth-
ods are most beneficial. We also leave for future
work to incorporate knowledge from other sources
(e.g., Wikidata, Wikipedia).
References
Michele Bevilacqua and Roberto Navigli. 2020. Break-
ing Through the 80% Glass Ceiling: Raising the
State of the Art in Word Sense Disambiguation by In-
corporating Knowledge Graph Information. In Pro-
ceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 2854–
2864, Online. Association for Computational Lin-
guistics.
Terra Blevins and Luke Zettlemoyer. 2020. Moving
Down the Long Tail of Word Sense Disambiguation
with Gloss Informed Bi-encoders. In Proceedings
of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 1006–1017, On-
line. Association for Computational Linguistics.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and
Tomas Mikolov. 2017. Enriching Word Vectors with
Subword Information. Transactions of the Associa-
tion for Computational Linguistics, 5:135–146.
José Camacho-Collados, Mohammad Taher Pilehvar,
and Roberto Navigli. 2015. NASARI: A Novel Ap-
proach to a Semantically-Aware Representation of
Items. In Proceedings of the 2015 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, pages 567–577, Denver, Colorado. Association
for Computational Linguistics.
Devendra Singh Chaplot and Ruslan Salakhutdinov.
2018. Knowledge-based Word Sense Disambigua-
tion using Topic Models. arXiv:1801.01900 [cs].
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and
Christopher D. Manning. 2020. ELECTRA: Pre-
training Text Encoders as Discriminators Rather
Than Generators. arXiv:2003.10555 [cs].
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
Deep Bidirectional Transformers for Language Un-
derstanding. arXiv:1810.04805 [cs].
Jiaju Du, Fanchao Qi, and Maosong Sun. 2019.
Using BERT for Word Sense Disambiguation.
arXiv:1909.08358 [cs].
Christiane Fellbaum, editor. 1998. WordNet: An Elec-
tronic Lexical Database. Language, Speech, and
Communication. MIT Press, Cambridge, Mass.
Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung
Gan. 2019. Improved Word Sense Disambiguation
Using Pre-Trained Contextualized Word Represen-
tations. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
5296–5305, Hong Kong, China. Association for
Computational Linguistics.
Jeremy Howard and Sebastian Ruder. 2018. Universal
Language Model Fine-tuning for Text Classification.
arXiv:1801.06146 [cs, stat].
Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing
Huang. 2019. GlossBERT: BERT for Word Sense
Disambiguation with Gloss Knowledge. In Proceed-
ings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 3507–3512, Hong
Kong, China. Association for Computational Lin-
guistics.
Mikael Kågebäck and Hans Salomonsson. 2016. Word
Sense Disambiguation using a Bidirectional LSTM.
Proceedings of the 5th Workshop on Cognitive As-
pects of the Lexicon (CogALex-V), pages 51–56.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman,
Kevin Gimpel, Piyush Sharma, and Radu Sori-
cut. 2019. ALBERT: A Lite BERT for Self-
supervised Learning of Language Representations.
arXiv:1909.11942 [cs].
Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos,
Or Sharir, Shai Shalev-Shwartz, Amnon Shashua,
and Yoav Shoham. 2019. SenseBERT: Driving
Some Sense into BERT. arXiv:1908.05646 [cs].
Mike Lewis, Yinhan Liu, Naman Goyal, Mar-
jan Ghazvininejad, Abdelrahman Mohamed, Omer
Levy, Veselin Stoyanov, and Luke Zettlemoyer.
2020. BART: Denoising sequence-to-sequence pre-
training for natural language generation, translation,
and comprehension. In Proceedings of the 58th An-
nual Meeting of the Association for Computational
Linguistics, pages 7871–7880, Online. Association
for Computational Linguistics.
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming
He, and Piotr Dollar. 2017. Focal Loss for Dense
Object Detection. In 2017 IEEE International Con-
ference on Computer Vision (ICCV), pages 2999–
3007, Venice. IEEE.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,
Luke Zettlemoyer, and Veselin Stoyanov. 2019.
RoBERTa: A Robustly Optimized BERT Pretrain-
ing Approach. arXiv:1907.11692 [cs].
Daniel Loureiro and Alípio Jorge. 2019. Language
Modelling Makes Sense: Propagating Represen-
tations through WordNet for Full-Coverage Word
Sense Disambiguation. In Proceedings of the 57th
Annual Meeting of the Association for Computa-
tional Linguistics, pages 5682–5691, Florence, Italy.
Association for Computational Linguistics.
Daniel Loureiro, Kiamehr Rezaee, Mohammad Taher
Pilehvar, and Jose Camacho-Collados. 2020. Lan-
guage Models and Word Sense Disambiguation: An
Overview and Analysis. arXiv:2008.11608 [cs].
Fuli Luo, Tianyu Liu, Zexue He, Qiaolin Xia, Zhifang
Sui, and Baobao Chang. 2018a. Leveraging Gloss
Knowledge in Neural Word Sense Disambiguation
by Hierarchical Co-Attention. In Proceedings of
the 2018 Conference on Empirical Methods in Nat-
ural Language Processing, pages 1402–1411. Asso-
ciation for Computational Linguistics.
Fuli Luo, Tianyu Liu, Qiaolin Xia, Baobao Chang,
and Zhifang Sui. 2018b. Incorporating Glosses into
Neural Word Sense Disambiguation. In Proceed-
ings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Pa-
pers), pages 2473–2482. Association for Computa-
tional Linguistics.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
Dean. 2013a. Efficient Estimation of Word Repre-
sentations in Vector Space. arXiv:1301.3781 [cs].
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Cor-
rado, and Jeffrey Dean. 2013b. Distributed Repre-
sentations of Words and Phrases and their Composi-
tionality. arXiv:1310.4546 [cs, stat].
George A. Miller. 1995. WordNet: A lexical
database for English. Communications of the ACM,
38(11):39–41.
George A. Miller, Claudia Leacock, Randee Tengi, and
Ross T. Bunker. 1993. A semantic concordance. In
Proceedings of the Workshop on Human Language
Technology - HLT ’93, page 303, Princeton, New Jer-
sey. Association for Computational Linguistics.
Roberto Navigli. 2009. Word sense disambiguation: A
survey. ACM Computing Surveys, 41(2):1–69.
Tommaso Pasini and Roberto Navigli. 2020. Train-
O-Matic: Supervised Word Sense Disambiguation
with no (manual) effort. Artificial Intelligence,
279:103215.
Matthew E. Peters, Mark Neumann, Robert Logan,
Roy Schwartz, Vidur Joshi, Sameer Singh, and
Noah A. Smith. 2019. Knowledge Enhanced Con-
textual Word Representations. In Proceedings of
the 2019 Conference on Empirical Methods in Natu-
ral Language Processing and the 9th International
Joint Conference on Natural Language Process-
ing (EMNLP-IJCNLP), pages 43–54, Hong Kong,
China. Association for Computational Linguistics.
Mohammad Taher Pilehvar and Jose Camacho-
Collados. 2019. WiC: The Word-in-Context Dataset
for Evaluating Context-Sensitive Meaning Represen-
tations. In Proceedings of the 2019 Conference of
the North, pages 1267–1273. Association for Com-
putational Linguistics.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2018. Language
models are unsupervised multitask learners.
Alessandro Raganato, Jose Camacho-Collados, and
Roberto Navigli. 2017. Word sense disambiguation:
A unified evaluation framework and empirical com-
parison. In Proceedings of the 15th Conference of
the European Chapter of the Association for Compu-
tational Linguistics: Volume 1, Long Papers, pages
99–110, Valencia, Spain. Association for Computa-
tional Linguistics.
Terry Ruas, Charles Henrique Porto Ferreira, William
Grosky, Fabrício Olivetti de França, and Déb-
ora Maria Rossi de Medeiros. 2020. Enhanced
word embeddings using multi-semantic representa-
tion through lexical chains. Information Sciences,
532:16–32.
Victor Sanh, Lysandre Debut, Julien Chaumond, and
Thomas Wolf. 2019. DistilBERT, a distilled ver-
sion of BERT: Smaller, faster, cheaper and lighter.
arXiv:1910.01108 [cs].
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Gar-
nett, editors, Advances in Neural Information Pro-
cessing Systems 30, pages 5998–6008. Curran Asso-
ciates, Inc. Https://arxiv.org/abs/1706.03762.
Alex Wang, Yada Pruksachatkun, Nikita Nangia,
Amanpreet Singh, Julian Michael, Felix Hill, Omer
Levy, and Samuel Bowman. 2019a. SuperGLUE: A
stickier benchmark for general-purpose language un-
derstanding systems. In H. Wallach, H. Larochelle,
A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Gar-
nett, editors, Advances in Neural Information Pro-
cessing Systems 32, pages 3266–3280. Curran Asso-
ciates, Inc.
Alex Wang, Amanpreet Singh, Julian Michael, Fe-
lix Hill, Omer Levy, and Samuel R. Bowman.
2019b. GLUE: A Multi-Task Benchmark and Anal-
ysis Platform for Natural Language Understanding.
arXiv:1804.07461 [cs].
Warren Weaver. 1955. Translation. In William N.
Locke and Donald A. Boothe, editors, Machine
translation of languages : fourteen essays, pages 15–
23. MIT Press, Cambridge, MA.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-
bonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019.
XLNet: Generalized Autoregressive Pretraining for
Language Understanding. arXiv:1906.08237 [cs].