ExtEnD: Extractive Entity Disambiguation


Abstract

Local models for Entity Disambiguation (ED) have today become extremely powerful, in most part thanks to the advent of large pre-trained language models. However, despite their significant performance achievements, most of these approaches frame ED through classification formulations that have intrinsic limitations, both computationally and from a modeling perspective. In contrast with this trend, here we propose ExtEnD, a novel local formulation for ED where we frame this task as a text extraction problem, and present two Transformer-based architectures that implement it. Based on experiments in and out of domain, and training over two different data regimes, we find our approach surpasses all its competitors in terms of both data efficiency and raw performance. ExtEnD outperforms its alternatives by as few as 6 F1 points on the more constrained of the two data regimes and, when moving to the other, higher-resourced regime, sets a new state of the art on 4 out of 6 benchmarks under consideration, with average improvements of 0.7 F1 points overall and 1.1 F1 points out of domain. In addition, to gain better insights from our results, we also perform a fine-grained evaluation of our performances on different classes of label frequency, along with an ablation study of our architectural choices and an error analysis. We release our code and models for research purposes at https://
Edoardo Barba and Luigi Procopio and Roberto Navigli
Sapienza NLP Group, Sapienza University of Rome
{edoardo.barba, luigi.procopio, roberto.navigli}
1 Introduction
Being able to associate entity mentions in a given text with the correct entity they refer to is a crucial task in Natural Language Processing (NLP). Formally referred to as Entity Disambiguation (ED), this task entails, given a mention m occurring in a text, identifying the correct entity out of a set of candidates e1, . . . , en, coming from a reference knowledge base (KB). First introduced by Bunescu and Paşca (2006), ED aims to identify the actors involved in human language and, as such, has shown potential in downstream applications like Question Answering (Yin et al., 2016), Information Extraction (Ji and Grishman, 2011; Guo et al., 2013), Text Generation (Puduppully et al., 2019) and Semantic Parsing (Bevilacqua et al., 2021; Procopio et al., 2021).
Since the advent of Deep Learning within the NLP community, this task has mostly been framed as a multi-label classification problem (Shahbazi et al., 2019; Broscheit, 2019), especially leveraging the bi-encoder paradigm (Humeau et al., 2020; Wu et al., 2020). However, although simple and yet powerful enough to push scores past 90% inKB Micro F1 on standard benchmarks, this formulation suffers from a number of downsides. First, the actual disambiguation is only modeled through a dot product between independent mention and entity vectors, which may not capture complex mention-entity interactions. Second, from a computational perspective, entities are represented through high-dimensional vectors that are cached in a pre-computed index. Thus, classifying against a large KB has a significant memory cost that, in fact, scales linearly with respect to the number of entities. Besides this, adding a new entity also requires modifying the index itself. To address these issues, De Cao et al. (2021b) have recently proposed an auto-regressive formulation where, given mentions in their context, models are trained to generate, token-by-token, the correct entity identifier.
While this approach has addressed the aforementioned issues effectively, it requires an auto-regressive decoding process, which has speed implications, and, what is more, does not let the model see its possible output choices, something that has shown significant potential in other semantic tasks (Barba et al., 2021a).
[Footnote: An entity identifier is a textual description of the entity; in De Cao et al. (2021b), they use the titles of Wikipedia articles, since their reference KB is Wikipedia.]
In this work, we focus on these shortcomings and, inspired by this latter research trend, propose Extractive Entity Disambiguation (ExtEnD), the first entity disambiguator that frames ED as a text extraction task.
Given as input a context c in which a mention m occurs, along with a text representation for each of the possible candidates e1, . . . , en, a model has to extract the span associated with the text representation of the entity that best suits m. We implement this formulation through two architectures: i) a Transformer system (Vaswani et al., 2017; Devlin et al., 2019) that features an almost identical modeling power to that of previous works, and ii) a variant that relaxes the computational requirements of our approach when using common Transformer-based architectures. Evaluating our two systems over standard benchmarks, we find our formulation to be particularly suited to ED. In particular, when restricting training resources to the AIDA-CoNLL dataset (Hoffart et al., 2011) only, ExtEnD appears to be significantly more data-efficient than its alternatives, surpassing them by more than 6 inKB Micro F1 points on average across in-domain and out-of-domain datasets. Furthermore, when pre-training on external Wikipedia data as in De Cao et al. (2021b), our system sets a new state of the art on 4 out of the 6 benchmarks under consideration, with average improvements of 0.7 F1 points overall and 1.1 F1 points when moving out of domain. Finally, we also perform a thorough investigation of our system's performances, providing insights and pinpointing the reasons behind our improvements via a fine-grained evaluation on different label-frequency classes.
Our contributions are therefore as follows:
- We propose a new framing of ED as a text extraction task;
- We put forward two architectures that implement our formulation, whose average score across different benchmarks surpasses all previous works in both data regimes we consider;
- We perform a thorough analysis of our systems' performances, evaluating their behavior over different label-frequency classes.
We release our code and models for re-
search purposes at
2 Related Work
Entity Disambiguation (ED) is the task of identifying, given a mention in context, the most suitable entity among a set of candidates stored in a knowledge base (KB). Generally the last step in an Entity Linking system (Broscheit, 2019), coming immediately after mention detection and candidate generation, this task has been the object of a vast and diverse literature, with approaches typically clustered into two groups, depending on how they model co-occurring mentions in the same document. On the one hand, global models strive to enforce a global coherence across the disambiguations within the same document, leveraging different techniques and heuristics to approximate this objective (Hoffart et al., 2011; Moro et al., 2014; Yamada et al., 2016; Ganea and Hofmann, 2017; Le and Titov, 2018; Yang et al., 2018).
On the other hand, local models disambiguate each mention independently of the others, conditioning the entity choice only on the mention and its context. Thanks to the advent of large pre-trained language models, this group has recently witnessed a significant improvement in performances, which are nowadays on par with, or even above, those achieved by state-of-the-art global systems (Shahbazi et al., 2019). These approaches usually frame ED as a multi-label classification problem (Broscheit, 2019) and a diverse set of formulations have been proposed. Among these, the bi-encoder paradigm (Bromley et al., 1994; Humeau et al., 2020) has been particularly successful (Gillick et al., 2019; Tedeschi et al., 2021; Botha et al., 2020): here, two encoders are trained to learn vector representations in a shared space for mentions in context and entities, respectively. Classification of a given mention is then performed by retrieving the entity whose representation is closest according to some metric (e.g. cosine similarity).
Although remarkably powerful, these formulations present a number of disadvantages, such as their large memory footprint (each entity in the KB needs to be represented by a high-dimensional vector) and the fact that the actual disambiguation process is only expressed via a dot product of independently computed vectors, potentially neglecting mention-entity interactions. While a number of works (Logeswaran et al., 2019; Wu et al., 2020) attempt to address the latter issue via multi-stage approaches where a cross-encoder is stacked after an initial bi-encoder or other retrieval functions,
[Footnote: Approximation is necessary as the exact computation of coherence objectives is NP-hard (Le and Titov, 2018).]
[Footnote: This bi-encoder, rather than performing the actual classification, is tasked to generate a filtered set of candidates.]
Figure 1: Illustration of ExtEnD on the example sentence After a long fight Superman saved Metropolis. The model takes as input a sentence with the target mention to disambiguate, Metropolis, explicitly marked (for better visualization, we resort here to highlighting with a different color rather than surrounding it with special tokens), along with the text representation of each candidate. As in our experiments, the knowledge base here is Wikipedia and the candidate text representations are Wikipedia page titles. Then, the model performs the disambiguation by indicating the start and end token of the span containing the predicted entity representation.
an interesting alternative direction that tackles both problems was recently presented by De Cao et al. (2021b): the authors frame ED as a generation problem and, leveraging an auto-regressive formulation, train a sequence-to-sequence model to generate the correct entity identifier for a given mention and its context.
Nevertheless, while this approach can model more complex interactions, some of these can only occur indirectly inside the backtracking of their beam search. Furthermore, the disambiguation involves an auto-regressive decoding that, although mitigated by later efforts (De Cao et al., 2021a), has intrinsic speed limitations. In contrast, here we propose an extractive formulation, where a model receives as input the mention, its context and the text representation of each candidate, and has to extract the span corresponding to the representation of the entity that best matches the (mention, context) pair under consideration. Note that this differs from the aforementioned cross-encoder formulations (Logeswaran et al., 2019; Wu et al., 2020) where, instead, each entity was encoded together with the (mention, context) pair, but independently from all the other entities. With our schema, complex mention-entity and entity-entity interactions can be explicitly modeled by the neural system, as all the information is provided in input.
Glancing over other related tasks in the area of semantics, arguably closest to our work is ESC (Barba et al., 2021a), where the authors propose a new framing of Word Sense Disambiguation (WSD) as an extractive sense comprehension task. Yet, differently from their work, we propose here a new framing for ED, i.e. we focus on entity descriptions rather than word sense definitions, present a baseline system that implements it and devise an additional architecture that deals with the computational challenges that arise from such an implementation.
3 Model
We now introduce ExtEnD, our proposed approach for ED. We first present the formulation we adopt (Section 3.1) and, then, describe the two architectures that implement it (Section 3.2).
3.1 Formulation
Inspired by recent trends in other semantic tasks (Barba et al., 2021a), we formulate Entity Disambiguation as a text extraction problem: given a query q and a context c, a model has to learn to extract the text span of c that best answers q. Formally, let m be a mention occurring in a context cm and denote by Cnd(m) = {cnd1, . . . , cndn} the set of n text representations associated with each candidate of m. Then, we formulate ED as follows: we treat the tuple (m, cm) and the concatenation of cnd1, . . . , cndn as the query q and the context c, respectively, and train a model to extract the text span from c associated with the correct entity for m; the overall process is illustrated in Figure 1. This formulation helps to better model the input provided, with the possible candidates of m included in the contextualization process, while also disposing of large output vocabularies as in De Cao et al. (2021b) and, yet, not resorting to auto-regressive decoding strategies.
3.2 Architectures
To implement our formulation, we consider two Transformer-based architectures. For both of these, the input is composed of the concatenation of the query q and the context c, subword-tokenized and separated by a [SEP] special symbol. Since q is a tuple in our formulation, whereas Transformer models only support text sequences as input, we convert q into a string by taking only cm and surrounding the text span where m occurs with the special tokens <t> and </t>. Additionally, to better separate entity candidate representations and ease their full span identification, we add a trailing special symbol </ec> to each of them; henceforth, we denote this resulting modified context by x̂c.
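As an illustrative sketch (our own code, not the authors' implementation; the exact whitespace handling around the special tokens is an assumption), the input construction described above can be written as:

```python
def build_input(context, mention_start, mention_end, candidates):
    """Build an ExtEnD-style input: the mention is wrapped in <t> ... </t>,
    and the candidate text representations are concatenated after [SEP],
    each followed by a trailing </ec> marker."""
    query = (context[:mention_start] + "<t> " +
             context[mention_start:mention_end] + " </t>" +
             context[mention_end:])
    candidate_context = " ".join(c + " </ec>" for c in candidates)
    return query + " [SEP] " + candidate_context

# Example from Figure 1: the mention "Metropolis" spans characters 34-44.
example = build_input(
    "After a long fight Superman saved Metropolis",
    34, 44,
    ["Metropolis (1927 film)", "Metropolis-Hastings algorithm",
     "Metropolis (comics)"],
)
```

The model then has to extract, from the candidate portion of this single sequence, the span of the correct entity representation.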
As our first architecture, we use two independent classification heads on top of BART (Lewis et al., 2020) computing, respectively, for each word w in x̂c, whether w is the start or the end of the correct entity representation. We train the model with a cross-entropy criterion over the start and end positions. At inference time, we select the entity candidate representation whose joint probability over the two heads is highest.
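A minimal sketch of this inference step (our own illustrative code with toy probabilities; in the real model, the per-token distributions come from the two classification heads):

```python
def select_candidate(candidate_spans, start_probs, end_probs):
    """candidate_spans: list of (start_idx, end_idx) token spans, one per
    candidate representation. Returns the index of the candidate whose joint
    start * end probability is highest."""
    best_idx, best_score = -1, float("-inf")
    for i, (s, e) in enumerate(candidate_spans):
        score = start_probs[s] * end_probs[e]
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

# Toy example: three candidates occupying token spans (10,13), (14,18), (19,21).
start_probs = {10: 0.1, 14: 0.2, 19: 0.7}
end_probs = {13: 0.2, 18: 0.3, 21: 0.6}
pred = select_candidate([(10, 13), (14, 18), (19, 21)], start_probs, end_probs)
```

Restricting the argmax to candidate spans guarantees the prediction is always a well-formed entity representation.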
However, framing ED as we propose here implies that the length of the input to the model scales linearly with the number of output choices n. Taking into account that the attention mechanism of Transformer architectures has quadratic complexity and that several pre-trained models actually support inputs only up to a fixed maximum length, this might pose significant computational limitations depending on the dataset and knowledge base under consideration. To cope with these technical challenges, we consider a second system, similar to the previous one but for two main differences. First, we change the underlying Transformer model, replacing BART with a pre-trained Longformer model (Beltagy et al., 2020), a Transformer architecture with an attention mechanism that is linear with respect to the input length and that can handle longer sequences. This linear complexity is achieved by essentially applying a sliding attention window over each token but for a few pre-selected ones (e.g. [CLS]), which instead feature a symmetric global attention: they attend upon and are attended by all the other tokens in the input sequence. This global mechanism is intended to be task-specific and enables the model to learn representations potentially close to those standard fully-attentive Transformers would learn, while still keeping the overall attention complexity linear with respect to the input size. Therefore, as our second modification, we adapt this global pattern to our setting, activating it on the [CLS] special token and on the first token of each candidate representation; this allows us to better mimic the original quadratic mechanism, where different entity candidate representations can also attend upon each other. Furthermore, differently from Beltagy et al. (2020), we disable the global attention mechanism on the tokens in the query q. In Section 5, we report and discuss the impact of these modifications. We illustrate the proposed architecture in Figure 2.
[Footnote: For instance, the implementation of BART available in HuggingFace Transformers (Wolf et al., 2020) supports inputs only up to 1024 subwords.]
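The global attention pattern described above can be sketched as follows (our own illustrative code; the token positions are hypothetical, and the mask follows the 1 = global, 0 = local convention used by common Longformer implementations):

```python
def build_global_attention_mask(seq_len, cls_position, candidate_start_positions):
    """1 marks tokens with symmetric global attention ([CLS] and the first
    token of each candidate representation); all remaining tokens, including
    the query, use only local windowed attention."""
    mask = [0] * seq_len
    mask[cls_position] = 1
    for pos in candidate_start_positions:
        mask[pos] = 1
    return mask

# Toy sequence of 24 tokens: [CLS] at 0, candidates starting at tokens 12, 16, 20.
mask = build_global_attention_mask(24, 0, [12, 16, 20])
```

Because only a handful of positions are global, the fraction of globally attending tokens (GA% in Table 3) stays small while candidate representations can still attend upon each other.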
4 Entity Disambiguation Evaluation
We now assess the effectiveness of ExtEnD on Entity Disambiguation. We first introduce the experimental setup we consider (Section 4.1). Then, we present the results achieved by ExtEnD, both in terms of raw performances (Section 4.2) and via a breakdown of its behavior on different classes of label frequency (Section 4.3). For ease of readability, we focus here only on the Longformer-based architecture, which we consider as our main model. We defer the comparison with the BART-based system to Section 5.
4.1 Experimental Setup
Data To evaluate ExtEnD on Entity Disambiguation, we reproduce the same setting used by De Cao et al. (2021b). Specifically, we adopt their same candidate sets, which were originally proposed by Le and Titov (2018), use Wikipedia titles (e.g. Metropolis (comics)) as the text representation for entities and perform training, along with in-domain evaluation, on the AIDA-CoNLL dataset (Hoffart et al., 2011, AIDA); similarly, we use their cleaned versions of MSNBC, AQUAINT, ACE2004, CWEB and WIKI (Guo and Barbosa, 2018; Evgeniy et al., 2013) for out-of-domain evaluation.
While we use this AIDA-only training scenario, which we refer to as AIDA, to test the data efficiency of ExtEnD, most ED systems actually make use of additional data and information originating from Wikipedia at training time. We denote
[Footnote: These candidate sets were generated relying upon count statistics from Wikipedia, a large Web corpus and the YAGO knowledge base.]
Figure 2: Longformer-based architecture for ExtEnD. The input context and the candidate textual representations are fed to the model in the same sequence, separated by a [SEP] special token. The mention is surrounded by the two special tokens <t> and </t> and, for the sake of readability, we omit the trailing special tokens </ec>. We highlight in red the tokens with global attention. Best seen in colors.
this additional training scenario where Wikipedia is part of the training resources as Wikipedia+AIDA. Specifically, as our system is a supervised neural classifier, we follow De Cao et al. (2021b) and utilize BLINK data (Wu et al., 2020) for pre-training in this setting. A brief description of each dataset follows:
- AIDA: one of the largest manually annotated corpora for Entity Linking and Disambiguation. It contains articles from the Reuters Corpus with manually labeled mentions. The validation and test sets feature 4,791 and 4,485 samples, respectively.
- MSNBC: a small news corpus with articles from MSNBC on different topics. It contains 656 annotated instances.
- AQUAINT: a news corpus composed of documents with news coming from the Xinhua News Service, the New York Times and the Associated Press. It contains 727 annotated instances.
- ACE2004: a manually annotated subset of the ACE co-reference data set (Doddington et al., 2004). It contains 257 annotated instances.
- CWEB: a dataset automatically extracted from the ClueWeb corpus by Guo and Barbosa (2018) containing English Websites, consisting of 11,154 annotated instances.
- WIKI: an automatically extracted corpus comprised of Wikipedia pages released by Evgeniy et al. (2013), with 6,821 annotated instances.
- BLINK: a dataset made up of 9 million (document, entity, mention) triples automatically extracted from Wikipedia.
For each of these resources, we use the preprocessed datasets, along with the mention candidate sets, made available by De Cao et al. (2021b) in the authors' official repository.
Evaluation Following common practice in the ED literature, results over the evaluation datasets are expressed in terms of inKB Micro F1. Furthermore,
[Footnote: The evaluation datasets are all freely available for research purposes.]
                          In-domain  Out-of-domain                          Avgs
Model                        AIDA  MSNBC  AQUAINT  ACE2004  CWEB  WIKI   Avg  Avg-OOD

Wikipedia + AIDA
Ganea and Hofmann (2017)     92.2   93.7    88.5     88.5   77.9  77.5  86.4   85.2
Guo and Barbosa (2018)       89.0   92.0    87.0     88.0   77.0  84.5  86.2   85.7
Yang et al. (2018)           95.9   92.6    89.9     88.5   81.8  79.2  88.0   86.4
Shahbazi et al. (2019)       93.5   92.3    90.1     88.7   78.4  79.8  87.1   85.9
Yang et al. (2019)           93.7   93.8    88.2     90.1   75.6  78.8  86.7   85.3
Le and Titov (2019)          89.6   92.2    90.7     88.1   78.2  81.7  86.8   86.2
Fang et al. (2019)           94.3   92.8    87.5     91.2   78.5  82.8  87.9   86.6
De Cao et al. (2021b)        93.3   94.3    89.9     90.1   77.3  87.4  88.8   87.8
ExtEnD-Large + BLINK         92.6   94.7    91.6     91.8   77.7  88.8  89.5   88.9

AIDA only
De Cao et al. (2021b)        88.6   88.1    77.1     82.3   71.9  71.7  79.5   78.2
Tedeschi et al. (2021)       92.5   89.2    69.5     91.3   68.5  64.0  79.2   76.5
ExtEnD-Base                  87.9   92.6    84.5     89.8   74.8  74.9  84.1   83.3
ExtEnD-Large                 90.0   94.5    87.9     88.9   76.6  76.7  85.8   84.9

Table 1: Results (inKB Micro F1) on the in-domain and out-of-domain settings when training on the AIDA training split only (bottom) and when using additional resources coming from Wikipedia (top). We mark in bold the best scores and underline the second best.
to better highlight the performance on the out-of-domain datasets, we report both the average score over those and AIDA (Avg) and over those alone (Avg-OOD), that is, when the result on AIDA is excluded from the average.
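As a reference for the metric used throughout (our own illustrative sketch): on gold-mention benchmarks with candidate sets, inKB Micro F1 is computed only over mentions whose gold entity is in the KB, and with exactly one prediction per mention it coincides with accuracy over those mentions.

```python
def inkb_micro_f1(predictions, golds):
    """predictions, golds: lists of entity ids, one per mention (None marks a
    gold entity missing from the KB, which inKB evaluation skips). With one
    prediction per mention, micro precision = micro recall = F1."""
    inkb = [(p, g) for p, g in zip(predictions, golds) if g is not None]
    correct = sum(1 for p, g in inkb if p == g)
    return correct / len(inkb)

# Hypothetical predictions for three mentions, the last one out-of-KB.
score = inkb_micro_f1(
    ["Metropolis_(comics)", "Superman", "Metropolis_(1927_film)"],
    ["Metropolis_(comics)", "Superman", None],
)
```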
Comparison Systems In order to contextualize ExtEnD's performances within the current landscape of Entity Disambiguation, we evaluate our approach against recent state-of-the-art systems in the literature. Specifically, we consider:
- Global Models: Ganea and Hofmann (2017); Guo and Barbosa (2018); Yang et al. (2018, 2019); Le and Titov (2019); Fang et al. (2019);
- Local Models: Shahbazi et al. (2019) and Tedeschi et al. (2021);
- The auto-regressive approach proposed by De Cao et al. (2021b).
Setup As previously mentioned, we use the Longformer model (Beltagy et al., 2020) as our reference architecture and retrieve the pre-trained weights, for both its base and large variants, from the HuggingFace Transformers library (Wolf et al., 2020); we refer to these variants as ExtEnD-Base and ExtEnD-Large. Following standard practice, we use the last encoder output for the representation of each token and a simple linear layer on top of it to compute the start and end token probability distributions. We use a 64-token attention window and fine-tune the whole architecture using the Rectified Adam (Liu et al., 2020) optimizer. We use gradient accumulation and token-based batching, evaluate the model on the validation dataset at fixed step intervals, and enforce early stopping with a patience expressed in evaluation rounds. We train every model for a single run on a GeForce RTX 3090 graphics card with 24 gigabytes of VRAM. Due to computational constraints, we do not perform any hyperparameter tuning, except for the attention window, for which we try multiple values, and select the other hyperparameters following previous literature. We implement our work in PyTorch (Paszke et al., 2019), using classy as the underlying framework.
4.2 Results
We report in Table 1 (top) the inKB Micro F1 scores that ExtEnD and its comparison systems attain on the evaluation datasets in the Wikipedia+AIDA setting.
Arguably the most interesting finding we report is the improvement ExtEnD achieves over its comparison systems. ExtEnD-Large + BLINK, that is, ExtEnD-Large pre-trained on BLINK and then fine-tuned on AIDA, sets a new state of the art on 4 out of the 6 datasets, with the only exceptions being in-domain AIDA and CWEB, where we fall short compared to the global model of Yang et al. (2018). On the average scores, ExtEnD pushes performances up by 0.7 points, and this improvement becomes even more marked when considering the out-of-domain datasets alone (1.1 points). These results suggest that our approach is indeed well-suited for ED and, furthermore, is particularly effective when scaling out of domain.
[Footnote: We note that, due to computational and hardware constraints, we were unable to match the training configuration of De Cao et al. (2021b) and our pre-training performed a significantly smaller number of updates. The scores reported here are therefore likely an underestimate of what our approach can achieve.]
Additionally, we also evaluate ExtEnD on the AIDA-only training setting and compare against De Cao et al. (2021b) and Tedeschi et al. (2021), the only systems available in this setting. As shown in Table 1 (bottom), ExtEnD behaves better, with higher average scores. In particular, ExtEnD-Base, despite featuring fewer parameters, fares better (by almost 5 average points) than De Cao et al. (2021b). Moreover, the out-of-domain results, which are also markedly higher, further confirm our previous hypothesis as regards the benefits of our approach in out-of-domain scalability. Paired together, these results highlight the higher data efficiency that our formulation achieves, in comparison to its alternatives.
4.3 Fine-grained Results
Inspired by standard practices in the evaluation of Word Sense Disambiguation systems (Blevins and Zettlemoyer, 2020; Barba et al., 2021a), we perform a fine-grained analysis where we break down the performances of our model into different classes of label frequency. To this end, we partition both the AIDA test set and the concatenation of all the out-of-domain datasets into three different subsets: i) MFC, containing all the instances in the test set where the target mention is associated with its most frequent candidate in the training corpus (i.e. the AIDA training split); ii) LFC, containing all the instances in the test set annotated with a less frequent candidate of the target mention that appeared at least once in the training corpus; iii) Unseen, containing all the instances in the test set whose mention was never seen in the training corpus.
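A sketch of this partitioning (our own illustrative code; `train_pairs` stands in for the (mention, entity) annotations of the AIDA training split, and instances whose mention was seen but whose gold candidate never appeared are simplified into LFC here):

```python
from collections import Counter

def frequency_class(mention, gold_entity, train_pairs):
    """Assign a test instance to MFC, LFC or Unseen based on how often its
    mention was annotated with each candidate in the training corpus."""
    counts = Counter((m, e) for m, e in train_pairs)
    mention_counts = {e: c for (m, e), c in counts.items() if m == mention}
    if not mention_counts:
        return "Unseen"
    most_frequent = max(mention_counts, key=mention_counts.get)
    return "MFC" if gold_entity == most_frequent else "LFC"

# Toy training annotations for illustration.
train = [("Metropolis", "Metropolis (comics)"),
         ("Metropolis", "Metropolis (comics)"),
         ("Metropolis", "Metropolis (1927 film)")]
```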
We then evaluate all the systems of the AIDA setting, except for De Cao et al. (2021b), for which the original model is unavailable, on these six test sets. To put the results in perspective, we introduce a simple baseline (PEM-MFC) that consists in always predicting the most frequent candidate for each mention, taking mention-candidate frequencies from Le and Titov (2018).

                         In-domain           Out-of-domain
Model                    MFC   LFC   UNS     MFC   LFC   UNS
PEM-MFC                  79.2  12.6  74.0    82.2  37.1  66.1
Tedeschi et al. (2021)   95.8  60.9  89.0    91.1  43.0  61.7
ExtEnD-Base              94.2  53.2  87.1    94.0  43.9  75.0
ExtEnD-Large             94.8  62.4  89.1    94.3  48.1  77.0

Table 2: Results (inKB Micro F1) when training on the AIDA training split only, on the MFC, LFC and UNS (Unseen) partitions for both in-domain and out-of-domain settings. We mark in bold the best scores.
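The PEM-MFC baseline can be sketched as follows (our own illustrative code; `pem` is a hypothetical stand-in for the mention-candidate frequency table, whose real values come from Le and Titov (2018)):

```python
def pem_mfc_predict(mention, pem):
    """Always predict the candidate with the highest mention-candidate
    frequency (the 'most frequent candidate') for the given mention."""
    candidates = pem.get(mention, {})
    if not candidates:
        return None  # mention absent from the frequency table
    return max(candidates, key=candidates.get)

# Hypothetical frequency table for illustration.
pem = {"Metropolis": {"Metropolis (comics)": 120,
                      "Metropolis (1927 film)": 45,
                      "Metropolis-Hastings algorithm": 3}}
pred = pem_mfc_predict("Metropolis", pem)
```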
As we can see from Table 2, PEM-MFC is a rather strong baseline, confirming that the distribution with which each mention is annotated with one of its possible candidates is heavily skewed towards the most frequent ones. Indeed, the gap between the performances of all the models on the MFC split and the LFC split is rather large, with a difference of almost 50 points in the out-of-domain setting. While future work should investigate the performances on these splits more in depth, here we can see that ExtEnD-Base and ExtEnD-Large generally outperform their competitors in the LFC and Unseen splits, in both the in-domain and out-of-domain settings. This highlights the strong generalization capabilities of our proposed approach, which is able to better handle rare or unseen instances at the cost of only 1 point in F1 score on the MFC split of the in-domain setting.
5 Model Ablation
While the above-mentioned experiments showed our approach to be rather effective, we only focused on the Longformer-based architecture, to which we resorted owing to the computational challenges we mentioned in Section 3.2. We now investigate this model choice, evaluating first how the BART-based system fares. Then, we ablate the attention pattern we propose for the Longformer and, finally, discuss the trade-off between our two proposed architectures.
BART Strictly speaking, the results we reported in the previous Section are not exactly conclusive as to whether or not our formulation is beneficial. Indeed, while it is true that we use a new formulation, we also rely upon a Transformer model that none of our comparison systems considered. Therefore, to better pinpoint the origin of the improvements, we train our BART-based architecture in the AIDA setting; we refer to this model as BART. Note that the underlying Transformer is identical to that of De Cao et al. (2021b), except for the final classification heads.
As shown in Table 3, BART with our extractive formulation attains significantly better performances. This finding suggests that the overall improvement does indeed originate from our extractive formulation. Furthermore, as the two systems are entirely identical except for the framing adopted, this finding further underlines the data efficiency of our approach.
Longformer Ablations We now compare our chosen global attention strategy with two standard alternatives. First, we consider the schema originally proposed by Beltagy et al. (2020) for question-answering tasks, where all the tokens in the input query (i.e. the text containing the mention) have global attention (Longformer-query). Then, we compare against a variant where the only token with global attention enabled is the start-of-sequence token (i.e. [CLS]) (Longformer-CLS). Table 3 shows how the three systems behave, reporting both their in-domain and out-of-domain scores, along with the average percentage of tokens in the input sequence with global attention enabled (GA%). From these results, we can see that i) our approach fares the best and that ii) Longformer-CLS achieves performances almost in the same ballpark, making it a viable option for more computationally limited scenarios.
BART and Longformer Finally, we compare our two architectures. As we can see from Table 3, BART performs better on the in-domain dataset, whereas the Longformer outperforms it in the out-of-domain setting. Nevertheless, neither of these differences is very significant and, thus, this result confirms our initial hypothesis that using our second architecture is a valid approximation of the standard quadratic attention strategy for the extractive Entity Disambiguation task.
6 Error Analysis
To further investigate the generalization capabilities of ExtEnD, we performed a black-box testing (Ribeiro et al., 2020) of our system leveraging the available test sets. Apart from the problem of label frequencies (e.g. unseen entities), we discovered two additional main classes of errors, namely: i) insufficient context, and ii) titles alone might not be enough.
[Footnote: The model of De Cao et al. (2021b) has a single head over the whole output vocabulary, whereas we have two (start and end).]

Model                    In-domain  Out-of-domain   GA%
De Cao et al. (2021b)      88.6        78.2        100.0
ExtEnD                     90.0        84.9         21.1
Longformer-query           89.2        84.1         43.3
Longformer-CLS             88.8        84.3          0.8
BART                       90.4        84.5        100.0

Table 3: Results (inKB Micro F1) of the ablation study for the in-domain and out-of-domain settings, along with the percentage of global tokens (GA%). We mark in bold the best scores.
Insufficient Context Since each mention has, on average, several candidates, the probability of having multiple valid candidates given the input context is far from negligible. For instance, let us consider the following example: "In the last game Ronaldo scored two goals despite coming from a bad injury.". In this sentence, the mention Ronaldo can refer both to Cristiano Ronaldo, the Portuguese player, and to Ronaldo de Lima, the Brazilian player. While this particular problem holds for several instances in the test sets, the performance drop is, in fact, mitigated by the label skewness towards the most frequent candidates. Indeed, the model appears always to predict the most frequent candidate for this kind of instance, therefore being right in the majority of cases.
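This fallback behaviour can be illustrated with a toy prior-based stand-in (the frequency counts below are invented): when the context cannot discriminate between candidates, following the training-set prior still yields the most frequent entity, which is correct for the majority of such instances.

```python
# Hypothetical training-set frequencies for the mention "Ronaldo".
candidate_priors = {
    "Cristiano Ronaldo": 1450,
    "Ronaldo (Brazilian footballer)": 610,
    "Ronaldo Koeman": 35,
}


def most_frequent_candidate(priors):
    """A stand-in for the behaviour the model appears to learn:
    back off to the most frequent candidate."""
    return max(priors, key=priors.get)


print(most_frequent_candidate(candidate_priors))  # Cristiano Ronaldo
```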
Titles might not be enough For both comparability and performance purposes, the text representation we use for a given entity in this work is simply its Wikipedia title. While article titles in Wikipedia are rather informative, in several circumstances they do not contain enough information to make them sufficiently distinguishable from other candidates. For example, several pages describing "Persons" are titled with just their respective names and surnames. This kind of identifier is especially ineffective if the mentions taken into consideration were not present in the training dataset, or were rare or unseen during the underlying Transformer pre-training. To this end, we strongly believe that future research might benefit from focusing on enriching entities' identifiers by adding a small description of the articles (summary), or at least some keywords representing the domain the entity belongs to.
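A sketch of the enrichment suggested above, assuming a short summary is available for each candidate (the summaries here are invented): the identifier becomes the title plus a parenthesised description, which could then replace the bare title in the extractive input.

```python
def enriched_identifier(title, summary=None):
    """Extend a bare Wikipedia title with an optional short description."""
    return f"{title} ({summary})" if summary else title


# Invented summaries; in practice they could be the first sentence of the
# article or a keyword for the entity's domain.
candidates = [
    enriched_identifier("Cristiano Ronaldo", "Portuguese footballer"),
    enriched_identifier("Ronaldo (Brazilian footballer)",
                        "Brazilian striker, World Cup winner"),
]
print(candidates[0])  # Cristiano Ronaldo (Portuguese footballer)
```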
7 Conclusion
In this work we presented ExtEnD, a novel local formulation for ED that frames this task as a text extraction problem: given as input a string containing a marked mention in context and the text representation of each entity in its candidate set, a model has to extract the span corresponding to the text representation of the correct entity. Together with this formulation, we also presented two Transformer models that implement it and, by evaluating them across several experiments, we found our approach to be particularly suited to ED. First, it is extremely data efficient, surpassing its alternatives by more than 6 F1 points when considering an AIDA-only training setting. Second, pre-training on BLINK data enables the model to set a new state of the art on 4 out of 6 benchmarks under consideration and yield average improvements of 0.7 F1 points overall and 1.1 F1 points when focusing only on out-of-domain evaluation datasets.
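The extractive formulation summarised above can be sketched as follows (a toy illustration, not the released implementation; the mention markers, the separator and the scorer are assumptions):

```python
def build_input(context, candidates, sep=" </s> "):
    """Concatenate the context (with the mention already marked) and the
    textual representation of every candidate entity."""
    return context + sep + sep.join(candidates)


def extract_entity(span_scores, candidates):
    """Pick the candidate whose span receives the highest score; in the
    real model these scores come from the Transformer's start/end heads."""
    best = max(range(len(candidates)), key=lambda i: span_scores[i])
    return candidates[best]


candidates = ["Cristiano Ronaldo", "Ronaldo (Brazilian footballer)"]
text = build_input("In the last game <m> Ronaldo </m> scored two goals.",
                   candidates)
print(extract_entity([0.9, 0.4], candidates))  # Cristiano Ronaldo
```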
As future work, we plan to relax the require-
ments towards the candidate set and explore adapt-
ing this local formulation to a global one, so as to
enforce coherence across predictions. For instance,
we believe integrating the feedback loop strategy
we proposed in Barba et al. (2021b) would be an
interesting direction to pursue.
Acknowledgments
The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487. This work was partially supported by the MIUR under the grant "Dipartimenti di eccellenza 2018-2022" of the Department of Computer Science of Sapienza University.
References
Edoardo Barba, Tommaso Pasini, and Roberto Navigli. 2021a. ESC: Redesigning WSD with extractive sense comprehension. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4661–4672, Online. Association for Computational Linguistics.
Edoardo Barba, Luigi Procopio, and Roberto Navigli.
2021b. ConSeC: Word sense disambiguation as con-
tinuous sense comprehension. In Proceedings of the
2021 Conference on Empirical Methods in Natural
Language Processing, pages 1492–1503, Online and
Punta Cana, Dominican Republic. Association for
Computational Linguistics.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. CoRR, abs/2004.05150.
Michele Bevilacqua, Rexhina Blloshmi, and Roberto
Navigli. 2021. One SPRING to rule them both: Sym-
metric AMR semantic parsing and generation without
a complex pipeline. In Proceedings of AAAI.
Terra Blevins and Luke Zettlemoyer. 2020. Moving
down the long tail of word sense disambiguation
with gloss informed bi-encoders. In Proceedings
of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 1006–1017, Online.
Association for Computational Linguistics.
Jan A. Botha, Zifei Shan, and Daniel Gillick. 2020. En-
tity Linking in 100 Languages. In Proceedings of the
2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 7833–7845,
Online. Association for Computational Linguistics.
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard
Säckinger, and Roopak Shah. 1994. Signature verifi-
cation using a "siamese" time delay neural network.
In Advances in Neural Information Processing Sys-
tems, volume 6. Morgan-Kaufmann.
Samuel Broscheit. 2019. Investigating entity knowl-
edge in BERT with simple neural end-to-end entity
linking. In Proceedings of the 23rd Conference on
Computational Natural Language Learning (CoNLL),
pages 677–685, Hong Kong, China. Association for
Computational Linguistics.
Razvan Bunescu and Marius Pa¸sca. 2006. Using en-
cyclopedic knowledge for named entity disambigua-
tion. In 11th Conference of the European Chapter of
the Association for Computational Linguistics, pages
9–16, Trento, Italy. Association for Computational Linguistics.
Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021a.
Highly parallel autoregressive entity linking with dis-
criminative correction. In Proceedings of the 2021
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 7662–7669, Online and
Punta Cana, Dominican Republic. Association for
Computational Linguistics.
Nicola De Cao, Gautier Izacard, Sebastian Riedel, and
Fabio Petroni. 2021b. Autoregressive entity retrieval.
In 9th International Conference on Learning Repre-
sentations, ICLR 2021, Virtual Event, Austria, May
3-7, 2021.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, Volume 1 (Long and Short Papers), pages
4171–4186, Minneapolis, Minnesota. Association for
Computational Linguistics.
George R. Doddington, Alexis Mitchell, Mark A. Przy-
bocki, Lance A. Ramshaw, Stephanie M. Strassel,
and Ralph M. Weischedel. 2004. The automatic con-
tent extraction (ACE) program - tasks, data, and eval-
uation. In Proceedings of the Fourth International
Conference on Language Resources and Evaluation,
LREC 2004, May 26-28, 2004, Lisbon, Portugal. Eu-
ropean Language Resources Association.
Gabrilovich Evgeniy, Ringgaard Michael, and Subra-
manya Amarnag. 2013. FACC1: Freebase anno-
tation of clueweb corpora, version 1 (release date
2013-06-26, format version 1, correction level 0).
Zheng Fang, Yanan Cao, Qian Li, Dongjie Zhang,
Zhenyu Zhang, and Yanbing Liu. 2019. Joint en-
tity linking with deep reinforcement learning. In The
World Wide Web Conference, WWW 2019, San Fran-
cisco, CA, USA, May 13-17, 2019, pages 438–447.
Octavian-Eugen Ganea and Thomas Hofmann. 2017.
Deep joint entity disambiguation with local neural
attention. In Proceedings of the 2017 Conference on
Empirical Methods in Natural Language Processing,
pages 2619–2629, Copenhagen, Denmark. Associa-
tion for Computational Linguistics.
Daniel Gillick, Sayali Kulkarni, Larry Lansing, Alessan-
dro Presta, Jason Baldridge, Eugene Ie, and Diego
Garcia-Olano. 2019. Learning dense representations
for entity retrieval. In Proceedings of the 23rd Con-
ference on Computational Natural Language Learn-
ing (CoNLL), pages 528–537, Hong Kong, China.
Association for Computational Linguistics.
Stephen Guo, Ming-Wei Chang, and Emre Kiciman.
2013. To link or not to link? a study on end-to-
end tweet entity linking. In Proceedings of the 2013
Conference of the North American Chapter of the
Association for Computational Linguistics: Human
Language Technologies, pages 1020–1030, Atlanta,
Georgia. Association for Computational Linguistics.
Zhaochen Guo and Denilson Barbosa. 2018. Robust named entity disambiguation with random walks. Semantic Web, 9(4):459–479.
Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino,
Hagen Fürstenau, Manfred Pinkal, Marc Spaniol,
Bilyana Taneva, Stefan Thater, and Gerhard Weikum.
2011. Robust disambiguation of named entities in
text. In Proceedings of the 2011 Conference on Em-
pirical Methods in Natural Language Processing,
pages 782–792, Edinburgh, Scotland, UK. Associa-
tion for Computational Linguistics.
Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux,
and Jason Weston. 2020. Poly-encoders: Architec-
tures and pre-training strategies for fast and accurate
multi-sentence scoring. In 8th International Confer-
ence on Learning Representations, ICLR 2020, Addis
Ababa, Ethiopia, April 26-30, 2020.
Heng Ji and Ralph Grishman. 2011. Knowledge base
population: Successful approaches and challenges.
In Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human
Language Technologies, pages 1148–1158, Portland,
Oregon, USA. Association for Computational Linguistics.
Phong Le and Ivan Titov. 2018. Improving entity link-
ing by modeling latent relations between mentions.
In Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume
1: Long Papers), pages 1595–1604, Melbourne, Aus-
tralia. Association for Computational Linguistics.
Phong Le and Ivan Titov. 2019. Boosting entity linking
performance by leveraging unlabeled documents. In
Proceedings of the 57th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 1935–
1945, Florence, Italy. Association for Computational Linguistics.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer Levy,
Veselin Stoyanov, and Luke Zettlemoyer. 2020.
BART: Denoising sequence-to-sequence pre-training
for natural language generation, translation, and com-
prehension. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
pages 7871–7880, Online. Association for Computa-
tional Linguistics.
Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu
Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han.
2020. On the variance of the adaptive learning rate
and beyond. In 8th International Conference on
Learning Representations, ICLR 2020, Addis Ababa,
Ethiopia, April 26-30, 2020.
Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee,
Kristina Toutanova, Jacob Devlin, and Honglak Lee.
2019. Zero-shot entity linking by reading entity de-
scriptions. In Proceedings of the 57th Annual Meet-
ing of the Association for Computational Linguistics,
pages 3449–3460, Florence, Italy. Association for
Computational Linguistics.
Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2:231–244.
Adam Paszke, Sam Gross, Francisco Massa, Adam
Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca
Antiga, Alban Desmaison, Andreas Kopf, Edward
Yang, Zachary DeVito, Martin Raison, Alykhan Te-
jani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang,
Junjie Bai, and Soumith Chintala. 2019. PyTorch:
An imperative style, high-performance deep learning
library. In H. Wallach, H. Larochelle, A. Beygelz-
imer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,
Advances in Neural Information Processing Systems
32, pages 8024–8035. Curran Associates, Inc.
Luigi Procopio, Rocco Tripodi, and Roberto Navigli.
2021. SGL: Speaking the graph languages of se-
mantic parsing via multilingual translation. In Pro-
ceedings of the 2021 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages
325–337, Online. Association for Computational Linguistics.
Ratish Puduppully, Li Dong, and Mirella Lapata. 2019.
Data-to-text generation with entity modeling. In Pro-
ceedings of the 57th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 2023–
2035, Florence, Italy. Association for Computational Linguistics.
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin,
and Sameer Singh. 2020. Beyond accuracy: Be-
havioral testing of NLP models with CheckList. In
Proceedings of the 58th Annual Meeting of the Asso-
ciation for Computational Linguistics, pages 4902–
4912, Online. Association for Computational Linguistics.
Hamed Shahbazi, Xiaoli Z. Fern, Reza Ghaeini, Rasha Obeidat, and Prasad Tadepalli. 2019. Entity-aware ELMo: Learning contextual entity representation for entity disambiguation. CoRR, abs/1908.05762.
Simone Tedeschi, Simone Conia, Francesco Cecconi,
and Roberto Navigli. 2021. Named Entity Recog-
nition for Entity Linking: What works and what’s
next. In Findings of the Association for Computa-
tional Linguistics: EMNLP 2021, pages 2584–2596,
Punta Cana, Dominican Republic. Association for
Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010. Curran Associates, Inc.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
Chaumond, Clement Delangue, Anthony Moi, Pier-
ric Cistac, Tim Rault, Remi Louf, Morgan Funtow-
icz, Joe Davison, Sam Shleifer, Patrick von Platen,
Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu,
Teven Le Scao, Sylvain Gugger, Mariama Drame,
Quentin Lhoest, and Alexander Rush. 2020. Trans-
formers: State-of-the-art natural language processing.
In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing: System
Demonstrations, pages 38–45, Online. Association
for Computational Linguistics.
Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian
Riedel, and Luke Zettlemoyer. 2020. Scalable zero-
shot entity linking with dense entity retrieval. In
Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
pages 6397–6407, Online. Association for Computa-
tional Linguistics.
Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and
Yoshiyasu Takefuji. 2016. Joint learning of the em-
bedding of words and entities for named entity disam-
biguation. In Proceedings of The 20th SIGNLL Con-
ference on Computational Natural Language Learn-
ing, pages 250–259, Berlin, Germany. Association
for Computational Linguistics.
Xiyuan Yang, Xiaotao Gu, Sheng Lin, Siliang Tang,
Yueting Zhuang, Fei Wu, Zhigang Chen, Guoping
Hu, and Xiang Ren. 2019. Learning dynamic context
augmentation for global entity linking. In Proceed-
ings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 271–281, Hong
Kong, China. Association for Computational Linguistics.
Yi Yang, Ozan Irsoy, and Kazi Shefaet Rahman. 2018.
Collective entity disambiguation with structured gra-
dient tree boosting. In Proceedings of the 2018 Con-
ference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers), pages
777–786, New Orleans, Louisiana. Association for
Computational Linguistics.
Wenpeng Yin, Mo Yu, Bing Xiang, Bowen Zhou, and
Hinrich Schütze. 2016. Simple question answering
by attentive convolutional neural network. In Pro-
ceedings of COLING 2016, the 26th International
Conference on Computational Linguistics: Technical
Papers, pages 1746–1756, Osaka, Japan. The COL-
ING 2016 Organizing Committee.