On the Reproducibility of the
TAGME Entity Linking System
Faegheh Hasibi1, Krisztian Balog2, and Svein Erik Bratsberg1
1Norwegian University of Science and Technology, Trondheim, Norway
2University of Stavanger, Stavanger, Norway
Abstract. Reproducibility is a fundamental requirement of scientific research.
In this paper, we examine the repeatability, reproducibility, and generalizability
of TAGME, one of the most popular entity linking systems. By comparing re-
sults obtained from its public API with (re)implementations from scratch, we
obtain the following findings. The results reported in the TAGME paper cannot
be repeated due to the unavailability of data sources. Part of the results are repro-
ducible through the provided API, while the rest are not reproducible. We further
show that the TAGME approach is generalizable to the task of entity linking in
queries. Finally, we provide insights gained during this process and formulate
lessons learned to inform future reproducibility efforts.
1 Introduction
Recognizing and disambiguating entity occurrences in text is a key enabling compo-
nent for semantic search [14]. Over the recent years, various approaches have been
proposed to perform automatic annotation of documents with entities from a reference
knowledge base, a process known as entity linking [7, 8, 10, 12, 15, 16]. Of these,
TAGME [8] is one of the most popular and influential ones. TAGME is specifically
designed for efficient (“on-the-fly”) annotation of short texts, like tweets and search
queries. The latter task, i.e., annotating search queries with entities, was evaluated at
the recently held Entity Recognition and Disambiguation Challenge [1], where the first
and second ranked systems both leveraged or extended TAGME [4, 6]. Despite the ex-
plicit focus on short text, TAGME has been shown to deliver competitive results on
long texts as well [8]. TAGME comes with a web-based interface and a RESTful API
is also provided.3 The good empirical performance coupled with the aforementioned
convenience features make TAGME one of the obvious must-have baselines for entity
linking research. The influence and popularity of TAGME is also reflected in citations;
the original TAGME paper [8] (from now on, simply referred to as the TAGME paper)
has been cited around 50 times according to the ACM digital library and nearly 200
times according to Google scholar, at the time of writing. The authors have also pub-
lished an extended report [9] (with more algorithmic details and experiments) that has
received over 50 citations according to Google scholar.
Our focus in this paper is on the repeatability, reproducibility, and generalizability of
the TAGME system; these are obvious desiderata for reliable and extensible research.
The recent SIGIR 2015 workshop on Reproducibility, Inexplicability, and Generalizability
of Results (RIGOR)4 defined these properties as follows:
Repeatability: "Repeating a previous result under the original conditions (e.g., same
dataset and system configuration)."
Reproducibility: "Reproducing a previous result under different, but comparable
conditions (e.g., different, but comparable dataset)."
Generalizability: "Applying an existing, empirically validated technique to a different
IR task/domain than the original."
We address each of these aspects in our study, as explained below.
Repeatability. Although TAGME facilitates comparison by providing a publicly
available API, it is not sufficient for the purpose of repeatability. The main reason is
that the API works much like a black-box; it is impossible to check whether it corre-
sponds to the system described in [8]. Actually, it is acknowledged that the API deviates
from the original publication,5but the differences are not documented anywhere. An-
other limiting factor is that the API cannot be used for efficiency comparisons due to
the network overhead. We report on the challenges around repeating the experiments
in [8] and discuss why the results are not repeatable.
Reproducibility. TAGME has been re-implemented in several research papers, see,
e.g., [2, 3, 11]; these, however, do not report on the reproducibility of results. In addi-
tion, there are some technical challenges involved in the TAGME approach that have
not always been dealt with properly in the original paper and accordingly in these re-
implementations (as confirmed by some of the respective authors).6 We examine the
reproducibility of TAGME, as introduced in [8], and show that some of the results are
not reproducible, while others are reproducible only through the TAGME API.
Generalizability. We test generalizability by applying TAGME to a different task:
entity linking in queries (ELQ). This task has been devised by the Entity Recognition
and Disambiguation (ERD) workshop [1], and has been further elaborated on in [11].
The main difference between conventional entity linking and ELQ is that the latter
accepts that a query might have multiple interpretations, i.e., the output is not a single
annotation, but (possibly multiple) sets of entities that are semantically related to each
other. Even though TAGME has been developed for a different problem (where only a
single interpretation is returned), we show that it is generalizable to the ELQ task.
Before we proceed let us make a disclaimer. In the course of this study, we made a
best effort to reproduce the results presented in [8] based on the information available to
us: the TAGME papers [8, 9] and the source code kindly provided by the authors. Our
main goal with this work is to learn about reproducibility, and is in no way intended to
be a criticism of TAGME. The communication with the TAGME authors is summarized
in Sect. 6. The resources developed within this paper as well as detailed responses from
the TAGME authors (and any possible future updates) are made publicly available at
5 This deviation is also mentioned in [5, 18].
6 Personal communication with the authors of [3, 8, 11].
[Fig. 1 diagram: text → Parsing (all candidate mention-entity pairs) → Disambiguation (single entity per mention) → Pruning]
Fig. 1. Annotation pipeline in the TAGME system.
2 Overview of TAGME
In this section, we provide an overview of the TAGME approach, as well as the test
collections and evaluation metrics used in the TAGME papers [8, 9].
2.1 Approach
TAGME performs entity linking in a pipeline of three steps: (i) parsing, (ii) disam-
biguation, and (iii) pruning (see Figure 1). We note that while Ferragina and Scaiella
[8] describe multiple approaches for the last two steps, we limit ourselves to their final
suggestions; these are also the choices implemented in the TAGME API.
Before describing the TAGME pipeline, let us define the notation used throughout
this paper. Entity linking is the task of annotating an input text T with entities E from a
reference knowledge base, which is Wikipedia here. T contains a set of entity mentions
M, where each mention m ∈ M can refer to a set of candidate entities E(m). These
need to be disambiguated such that each mention points to a single entity e(m).
Parsing. In the first step, TAGME parses the input text and performs mention detec-
tion using a dictionary of entity surface forms. For each entry (surface form) the set of
entities recognized by that name is recorded. This dictionary is built by extracting en-
tity surface forms from four sources: anchor texts of Wikipedia articles, redirect pages,
Wikipedia page titles, and variants of titles (removing parts after the comma or in paren-
theses). Surface forms consisting of numbers only or of a single character, or below a
certain number of occurrences (2) are discarded. Further filtering is performed on the
surface forms with low link probability (i.e., < 0.001). Link probability is defined as:

lp(m) = P(link|m) = link(m) / freq(m),   (1)

where freq(m) denotes the total number of times mention m occurs in Wikipedia (as
a link or not), and link(m) is the number of times mention m appears as a link.
To detect entity mentions, TAGME matches all n-grams of the input text, up to
n = 6, against the surface form dictionary. For an n-gram contained by another one,
TAGME drops the shorter n-gram if it has lower link probability than the longer one.
The output of this step is a set of mentions with their corresponding candidate entities.
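The parsing step can be sketched as follows. This is a toy illustration only: the surface-form dictionary, its counts, and the containment handling below are simplified assumptions, not TAGME's actual data structures.

```python
# Simplified sketch of TAGME's parsing step (toy data, not the real dictionary).
# Every n-gram (n <= 6) found in the surface-form dictionary becomes a candidate
# mention; a shorter n-gram contained in a longer matching one is dropped if the
# longer one has a higher link probability.

def link_probability(mention, stats):
    """lp(m) = link(m) / freq(m), per Eq. (1)."""
    link, freq = stats[mention]
    return link / freq

def parse(text, stats, max_n=6):
    tokens = text.lower().split()
    # Collect all matching n-grams as (start, end) token spans.
    spans = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in stats:
                spans[(i, i + n)] = ngram
    # Drop a shorter mention if a longer mention containing it has a
    # higher link probability.
    kept = []
    for (s, e), m in spans.items():
        dominated = any(
            s2 <= s and e <= e2 and (e2 - s2) > (e - s)
            and link_probability(m2, stats) > link_probability(m, stats)
            for (s2, e2), m2 in spans.items()
        )
        if not dominated:
            kept.append(m)
    return kept

# Toy statistics: mention -> (times appearing as a link, total occurrences).
stats = {
    "world cup": (80, 100),
    "world": (5, 1000),
    "france": (60, 200),
}
print(parse("france world cup 98", stats))
```

Here "world" is discarded because it is contained in "world cup", which has a much higher link probability.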
Disambiguation. Entity disambiguation in TAGME is performed using a voting schema,
that is, the score of each mention-entity pair is computed as the sum of votes given by
candidate entities of all other mentions in the text. Formally, given the set of mentions
M, the relevance score of the entity e to the mention m is defined as:

rel(m, e) = Σ_{m' ∈ M \ {m}} vote(m', e),   (2)
4 Faegheh Hasibi, Krisztian Balog, and Svein Erik Bratsberg
where vote(m', e) denotes the agreement between the entities of mention m' and the entity
e, computed as follows:

vote(m', e) = ( Σ_{e' ∈ E(m')} relatedness(e, e') · commonness(e', m') ) / |E(m')|   (3)

Commonness is the probability of an entity being the link target of a given mention [13]:

commonness(e', m') = P(e'|m') = link(e', m') / link(m'),   (4)

where link(e', m') is the number of times entity e' is used as a link destination for m'
and link(m') is the total number of times m' appears as a link. Relatedness measures
the semantic association between two entities [17]:

relatedness(e, e') = ( log(max(|in(e)|, |in(e')|)) − log(|in(e) ∩ in(e')|) ) / ( log(|E|) − log(min(|in(e)|, |in(e')|)) ),   (5)

where in(e) is the set of entities linking to entity e and |E| is the total number of entities.
Once all candidate entities are scored using Eq. (2), TAGME selects the best entity
for each mention. Two approaches are suggested for this purpose: (i) disambiguation by
classifier (DC) and (ii) disambiguation by threshold (DT), of which the latter is selected
as the final choice. Due to efficiency concerns, entities with commonness below a given
threshold τ are discarded from the DT computations. The set of commonness-filtered
candidate entities for mention m is Eτ(m) = {e ∈ E(m) | commonness(e, m) ≥ τ}.
Then, DT considers the top-ε entities for each mention (ranked by Eq. (2)) and selects
the one with the highest commonness score:

e(m) = arg max { commonness(e, m) : e ∈ Eτ(m) ∧ e ∈ top_ε[rel(m, e)] }.   (6)

At the end of this stage, each mention in the input text is assigned a single entity, which
is the most pertinent one to the input text.
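The voting scheme and the DT selection can be sketched as follows. All entities, commonness and relatedness values, τ, and ε below are illustrative assumptions chosen only to exercise Eqs. (2), (3), and (6).

```python
# Sketch of TAGME's disambiguation-by-threshold (DT), per Eqs. (2), (3), (6).
# Toy, hand-picked relatedness and commonness values; tau and epsilon are
# illustrative assumptions.

E = {"m1": ["A", "B"], "m2": ["C", "D"]}
commonness = {("A", "m1"): 0.7, ("B", "m1"): 0.3,
              ("C", "m2"): 0.9, ("D", "m2"): 0.1}
relatedness = {frozenset(p): v for p, v in [
    (("A", "C"), 0.8), (("A", "D"), 0.1),
    (("B", "C"), 0.2), (("B", "D"), 0.9),
]}

def vote(m_other, e):
    # Eq. (3): average of relatedness * commonness over candidates of m_other.
    cands = E[m_other]
    total = sum(relatedness[frozenset((e, e2))] * commonness[(e2, m_other)]
                for e2 in cands)
    return total / len(cands)

def rel(m, e):
    # Eq. (2): sum of votes from all other mentions.
    return sum(vote(m2, e) for m2 in E if m2 != m)

def disambiguate(m, tau=0.02, epsilon=2):
    # Eq. (6): keep candidates with commonness >= tau, take the top-epsilon
    # by rel, then pick the one with the highest commonness.
    cands = [e for e in E[m] if commonness[(e, m)] >= tau]
    top = sorted(cands, key=lambda e: rel(m, e), reverse=True)[:epsilon]
    return max(top, key=lambda e: commonness[(e, m)])

print({m: disambiguate(m) for m in E})
```

In this toy instance the voting scores and commonness agree, so "m1" resolves to A and "m2" to C.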
Pruning. The aim of the pruning step is to filter out non-meaningful annotations, i.e.,
assign NIL to the mentions that should not be linked to any entity. TAGME hinges on
two features to perform pruning: link probability (Eq. (1)) and coherence. The coher-
ence of an entity is computed with respect to the candidate annotations of all the other
mentions in the text:
coherence(e, T) = ( Σ_{e' ∈ E(T) \ {e}} relatedness(e, e') ) / ( |E(T)| − 1 ),   (7)

where E(T) is the set of distinct entities assigned to the mentions in the input text.
TAGME takes the average of the link probability and the coherence score to generate a
ρ score for each entity, which is then compared to the pruning threshold ρ_NA. Entities
with ρ < ρ_NA are discarded, while the rest of them are served as the final result.
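The pruning step can be sketched as follows; the annotations, link probabilities, relatedness values, and the ρ_NA value are all illustrative assumptions.

```python
# Sketch of TAGME's pruning step: rho = average of link probability and
# coherence (Eq. (7)); annotations with rho < rho_NA are dropped.
# All numbers below are illustrative assumptions.

annotated = {"france": "FRANCE", "world cup": "FIFA_WORLD_CUP", "98": "1998"}
link_prob = {"france": 0.30, "world cup": 0.80, "98": 0.01}
relatedness = {frozenset(p): v for p, v in [
    (("FRANCE", "FIFA_WORLD_CUP"), 0.7),
    (("FRANCE", "1998"), 0.1),
    (("FIFA_WORLD_CUP", "1998"), 0.2),
]}

def coherence(e, entities):
    # Eq. (7): average relatedness to all other assigned entities.
    others = [e2 for e2 in entities if e2 != e]
    return sum(relatedness[frozenset((e, e2))] for e2 in others) / len(others)

def prune(annotated, rho_na=0.2):
    entities = set(annotated.values())
    result = {}
    for m, e in annotated.items():
        rho = (link_prob[m] + coherence(e, entities)) / 2
        if rho >= rho_na:
            result[m] = e
    return result

print(prune(annotated))
```

The weakly linked, weakly coherent annotation for "98" falls below ρ_NA and is pruned.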
2.2 Test collections
Two test collections are used in [8]: WIKI-DISAMB30 and WIKI-ANNOT30. Both
comprise snippets of around 30 words, extracted from a Wikipedia snapshot of November
2009, and are made publicly available.7 In WIKI-DISAMB30, each snippet is linked to
a single entity; in WIKI-ANNOT30 all entity mentions are annotated. We note that the
sizes of these test collections (number of snippets) deviate from what is reported in the
TAGME paper: WIKI-DISAMB30 and WIKI-ANNOT30 contain around 2M and 185K
snippets, while the reported numbers are 1.4M and 180K, respectively. This suggests
that the published test collections might be different from the ones used in [8].
2.3 Evaluation Metrics
TAGME is evaluated using three variations of precision and recall. The so-called standard
precision and recall (P and R) are employed for evaluating the disambiguation
phase, using the WIKI-DISAMB30 test collection. The two other metrics, annotation
and topics precision and recall, are employed for measuring the end-to-end performance
on the WIKI-ANNOT30 test collection. The annotation metrics (Pann and Rann) compare
both the mention and the entity against the ground truth, while the topics metrics
(Ptopics and Rtopics) only consider entity matches. The TAGME papers [8, 9] provide
little information about the evaluation metrics. In particular, the computation of the
standard precision and recall is rather unclear; we discuss it later in Sect. 4.2. Details
are missing regarding the two other metrics too: (i) How are overall precision, recall, and
F-measure computed for the annotation metrics? Are they micro- or macro-averaged?
(ii) What are the matching criteria for the annotation metrics? Are partially matching
mentions accepted or only exact matches? In what follows, we formally define the
annotation and topics metrics, based on the most likely interpretation we established from
the TAGME paper and from our experiments.
We write G(T) = {(m̂1, ê1), ..., (m̂n̂, ên̂)} for the ground truth annotations of the input
text T, and S(T) = {(m1, e1), ..., (mn, en)} for the annotations identified by the
system. Neither G(T) nor S(T) contains NULL annotations. The TAGME paper
follows [12], which uses macro-averaging in computing annotation precision and recall:8

Pann = |G(T) ∩ S(T)| / |S(T)|,   Rann = |G(T) ∩ S(T)| / |G(T)|   (8)

The annotation (m̂, ê) matches (m, e) if two conditions are fulfilled: (i) the entities match
(ê = e), and (ii) the mentions match or contain each other (m̂ = m, m̂ ⊆ m, or m ⊆ m̂).
We note that the TAGME paper refers to "perfect match" of the mentions, while we
use a more relaxed version of matching (by considering containment matches). This
relaxation results in the highest possible Pann and Rann, but even those are below the
numbers reported in [8] (cf. Sect. 4.2).
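Under this interpretation, the relaxed matching and the per-text Pann and Rann of Eq. (8) can be sketched as follows. This is a toy example; for simplicity it assumes that at most one system annotation matches each ground truth annotation.

```python
# Sketch of the relaxed annotation matching and per-text Pann/Rann (Eq. (8)),
# as interpreted in this section. Toy annotations, not real system output.

def matches(gold, sys):
    (gm, ge), (m, e) = gold, sys
    # (i) entities must match; (ii) mentions equal or contain each other.
    return ge == e and (gm == m or gm in m or m in gm)

def p_r_ann(G, S):
    # |G(T) ∩ S(T)| under the relaxed match relation (assumes at most one
    # system annotation matches each ground truth annotation).
    tp = sum(any(matches(g, s) for g in G) for s in S)
    return tp / len(S), tp / len(G)

G = [("world cup", "FIFA_WORLD_CUP"), ("france", "FRANCE")]
S = [("the world cup", "FIFA_WORLD_CUP"), ("98", "1998")]
p, r = p_r_ann(G, S)
print(p, r)
```

Containment lets "the world cup" match the ground truth mention "world cup", while the spurious "98" annotation costs precision and the missed "france" costs recall.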
The topics precision and recall (Ptopics and Rtopics) [16] only consider entity matches
(ê = e) and are micro-averaged over the set of all texts F:

Ptopics = Σ_{T∈F} |G(T) ∩ S(T)| / Σ_{T∈F} |S(T)|,   Rtopics = Σ_{T∈F} |G(T) ∩ S(T)| / Σ_{T∈F} |G(T)|.   (9)

For all metrics, the overall F-measure is computed from the overall precision and recall.

8 As explained later by the TAGME authors, they in fact used micro-averaging. This contradicts
the referred paper [12], which explicitly defines Pann and Rann as being macro-averaged.
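The micro-averaged topics metrics of Eq. (9) can be sketched as follows, with toy ground truth and system entity sets:

```python
# Sketch of the micro-averaged topics metrics of Eq. (9): only entity matches
# count, and counts are pooled over all texts before dividing.

def topics_metrics(collection):
    tp = sys_total = gold_total = 0
    for gold, sys in collection:
        tp += len(gold & sys)
        sys_total += len(sys)
        gold_total += len(gold)
    p = tp / sys_total
    r = tp / gold_total
    f = 2 * p * r / (p + r)
    return p, r, f

# Two toy texts: (ground-truth entity set, system entity set).
collection = [
    ({"FRANCE", "FIFA_WORLD_CUP"}, {"FRANCE"}),
    ({"TAGME"}, {"TAGME", "WIKIPEDIA"}),
]
p, r, f = topics_metrics(collection)
print(round(p, 3), round(r, 3), round(f, 3))
```

Note that pooling the counts first (micro-averaging) weights every annotation equally, whereas macro-averaging would weight every text equally.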
3 Repeatability
By definition (cf. Sect. 1), repeatability means that a system should be implemented
under the same conditions as the reference system. In our case, the repeatability of the
TAGME experiments in [8] is dependent on the availability of (i) the knowledge base
and (ii) the test collections (text snippets and gold standard annotations).
The reference knowledge base is Wikipedia; specifically, the TAGME paper uses a
dump from November 2009, while the API employs a dump from July 2012. Unfortunately,
neither of these dumps is available on the web, nor could they be provided by the
TAGME authors upon request. We encountered problems with the test collections too.
As we already explained in Sect. 2.2, there are discrepancies between the number of
snippets the test collections (WIKI-DISAMB30 and WIKI-ANNOT30) actually contain
and what is reported in the paper. The latter number is lower, suggesting that the
results in [8] are based only on subsets of the collections.9 Further, WIKI-DISAMB30
is split into training and test sets in the TAGME paper, but those splits are not available.
Due to these reasons, which could all be classified under the general heading of
unavailability of data, we conclude that the TAGME experiments in [8] are not repeatable.
In the next section, we make a best effort at establishing the most similar conditions,
that is, we attempt to reproduce their results.
4 Reproducibility
This section reports on our attempts to reproduce the results presented in the TAGME
paper [8]. The closest publicly available Wikipedia dump is from April 2010,10 which is
about five months newer than the one used in [8]. On a side note we should mention that
we were (negatively) surprised by how difficult it proved to find Wikipedia snapshots
from the past, esp. from this period. We have (re)implemented TAGME based on the
description in the TAGME papers [8, 9] and, when in doubt, we checked the source
code. For a reference comparison, we also include the results from (i) the TAGME API
and (ii) the Dexter entity linking framework [3]. Even though the implementation in
Dexter (specifically, the parser) slightly deviates from the original TAGME system, it
is still useful for validation, as that implementation is done by a third (independent)
group of researchers. We do not include results from running the source code provided
to us because it requires the Wikipedia dump in a format that is no longer available for
the 2010 dump we have access to; running it on a newer Wikipedia version would give
results identical to the API. In what follows, we present the challenges we encountered
during the implementation in Sect. 4.1 and then report on the results in Sect. 4.2.
9 It was later explained by the TAGME authors that they actually used only 1.4M out of 2M
snippets from WIKI-DISAMB30, as Weka could not load more than that into memory. From
WIKI-ANNOT30 they used all snippets; the difference is merely a matter of approximation.
4.1 Implementation
During the (re)implementation of TAGME, we encountered several technical challenges,
which we describe here. These could be traced back to differences between the approach
described in the paper and the source code provided by the authors. Without address-
ing these differences, the results generated by our implementation are far from what is
expected and are significantly worse than those by the original system.
Link probability computation. Link probability is one of the main statistical features
used in TAGME. We noticed that the computation of link probability in TAGME
deviates from what is defined in Eq. (1): instead of computing the denominator freq(m) as
the number of occurrences of mention m in Wikipedia, TAGME computes the number
of documents that mention m appears in. Essentially, document frequency is used
instead of term (phrase) frequency. This is most likely due to efficiency considerations,
as the former is much cheaper to compute. However, a lower denominator in Eq. (1)
means that the resulting link probability is a higher value than it is supposed to be. In
fact, this change in the implementation means that it is actually not link probability, but
more like keyphraseness that is being computed. Keyphraseness [15] is defined as:

kp(m) = P(keyword|m) = key(m) / df(m),

where key(m) denotes the number of Wikipedia articles where mention m is selected as
a keyword, i.e., linked to an entity (any entity), and df(m) is the number of articles
containing the mention m. Since in Wikipedia a link is typically created only for the
first occurrence of an entity (link(m) ≈ key(m)), we can assume that the numerators
of link probability and keyphraseness are identical. This would mean that TAGME
as a matter of fact uses keyphraseness. Nevertheless, as our goal in this paper is to
reproduce the TAGME results, we followed their implementation of this feature, i.e.,
using document frequency in the denominator.11
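The effect of the changed denominator can be illustrated with toy counts; all numbers below are assumptions for illustration only.

```python
# Sketch contrasting link probability as defined in Eq. (1) with the
# document-frequency variant that TAGME actually computes (toy counts).

link_m = 40    # times mention m appears as a link
freq_m = 400   # total occurrences of m in Wikipedia (phrase frequency)
df_m = 100     # number of articles containing m (document frequency)

lp_paper = link_m / freq_m   # Eq. (1): term-frequency denominator
lp_tagme = link_m / df_m     # df-based variant: a strictly higher value

print(lp_paper, lp_tagme)
```

Since df(m) ≤ freq(m), the df-based score is always at least as high as Eq. (1), which is why the thresholds tuned for one variant do not transfer to the other.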
Relatedness computation. We observed that the relatedness score, defined in Eq. (5), is
computed as 1 − relatedness(e, e'); furthermore, for entities with zero inlinks or no
common inlinks, the score is set to zero. These details are not explicitly mentioned in
the paper, while they have a significant impact on the overall effectiveness of TAGME.
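The relatedness score as actually implemented, i.e., 1 − Eq. (5) with the zero-inlink special cases, can be sketched as follows; the inlink sets and |E| below are illustrative assumptions.

```python
import math

# Sketch of the relatedness score as implemented in TAGME: 1 - Eq. (5),
# with zero for entities that have no inlinks or no common inlinks.

def relatedness_impl(in_e1, in_e2, n_entities):
    # Special cases observed in the source code: no inlinks or no shared
    # inlinks yield a relatedness of zero.
    if not in_e1 or not in_e2:
        return 0.0
    common = len(in_e1 & in_e2)
    if common == 0:
        return 0.0
    raw = ((math.log(max(len(in_e1), len(in_e2))) - math.log(common)) /
           (math.log(n_entities) - math.log(min(len(in_e1), len(in_e2)))))
    return 1.0 - raw  # invert so that higher means more related

print(relatedness_impl({1, 2, 3}, {2, 3, 4}, 1000))  # heavily overlapping
print(relatedness_impl({1, 2, 3}, set(), 1000))      # no inlinks
```

Without the 1 − (...) inversion and the zero special cases, a reimplementation produces scores with the opposite orientation, which is one reason naive reimplementations diverge badly.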
Pruning based on commonness. In addition to the filtering methods mentioned in the
parsing step (cf. Sect. 2.1), TAGME filters entities with a commonness score below 0.001;
this is not documented in the TAGME papers. We followed this filtering approach, as
it makes the system considerably faster.
4.2 Results
We report results for the intermediate disambiguation phase and for the end-to-end
entity linking task. For all reproducibility experiments, we set the ρ_NA threshold to 0.2, as
it delivers the best results and is also the recommended value in the TAGME paper.
11 The proper implementation of link probability would result in lower values (as the denominator
would be higher) and would likely require a different threshold value than what is suggested
in [8]. This goes beyond the scope of our paper.
Table 1. Results of TAGME reproducibility on the WIKI-DISAMB30 test collection.
Method P R F
Original paper [8] 0.915 0.909 0.912
TAGME API 0.775 0.775 0.775
Table 2. Results of TAGME reproducibility on the WIKI-ANNOT30 test collection.
Method Pann Rann Fann Ptopics Rtopics Ftopics
Original paper [8] 0.7627 0.7608 0.7617 0.7841 0.7748 0.7794
TAGME API 0.6945 0.7136 0.7039 0.7017 0.7406 0.7206
TAGME-wp10 (our) 0.6143 0.4987 0.5505 0.6499 0.5248 0.5807
Dexter 0.5722 0.5959 0.5838 0.6141 0.6494 0.6313
Disambiguation phase. For evaluating the disambiguation phase, we submitted the
snippets from the WIKI-DISAMB30 test collection to the TAGME API, with the pruning
threshold set to 0. This setting ensures that no pruning is performed and the output
we get back is what is supposed to be the outcome of the disambiguation phase. We
tried different methods for computing precision and recall, but we were unable to get
the results that are reported in the original TAGME paper (see Table 1). We therefore
relaxed the evaluation conditions in the following way: if any of the entities returned
by the disambiguation phase matches the ground truth entity for the given snippet, then
we set both precision and recall to 1; otherwise they are set to 0. This gives us an upper
bound for the performance that can be achieved on the WIKI-DISAMB30 test collection;
any other interpretation of precision or recall would result in a lower number. What we
found is that even with these relaxed conditions the F-score is far below the reported
value (0.775 vs. 0.912). One reason for the differences could be the discrepancy be-
tween the number of snippets in the test collection and the ones used in [8]. Given the
magnitude of the differences, even against their own API, we decided not to go further
to get the results for our implementation of TAGME. We conclude that this set of results
is not reproducible, due to insufficient experimental details (test collection and metrics).
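The relaxed upper-bound computation described above can be sketched as follows, with toy snippets standing in for the real test collection:

```python
# Sketch of the relaxed evaluation used above: for each snippet, P = R = 1
# if any returned entity matches the single ground-truth entity, else 0.
# This yields an upper bound on WIKI-DISAMB30 performance.

def relaxed_scores(collection):
    hits = sum(1 for gold_entity, returned in collection
               if gold_entity in returned)
    p = r = hits / len(collection)
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

# Toy collection: (ground-truth entity, entities returned for the snippet).
collection = [
    ("FRANCE", {"FRANCE", "PARIS"}),
    ("TAGME", {"WIKIPEDIA"}),
]
print(relaxed_scores(collection))
```

Any stricter interpretation of P and R can only lower these numbers, which is what makes the 0.775 figure an upper bound.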
End-to-end performance. Table 2 shows end-to-end system performance according to
the following implementations: the TAGME API, our implementation using a Wikipedia
snapshot from April 2010, and the Dexter implementation using a Wikipedia snapshot
from March 2013. For all experiments, we compute the evaluation metrics described in
Sect. 2.3. We observe that the API results are lower than in the original paper, but the
difference is below 10%. We attribute this to the fact that the ground truth is generated
from a 2009 version of Wikipedia, while the API is based on the version from 2012.
Concerning our implementation and Dexter (bottom two rows in Table 2) we find
that they are relatively close to each other, but both of them are lower than the TAGME
API results; the relative difference to the API results is -19% for our implementation
and -12% for Dexter in Ftopics score. Ceccarelli et al. [3] also report on deviations, but
they attribute these to the processing of Wikipedia: “we observed that our implemen-
tation always improves over the WikiMiner online service, and that it behaves only
slightly worse then TAGME after the top 5 results, probably due to a different pro-
cessing of Wikipedia.” The difference between Dexter and our implementation stems
from the parsing step. Dexter relies on its own parsing method and removes overlapping
mentions at the end of the annotation process. We, on the other hand, follow TAGME
and delete overlapping mentions in the parsing step (cf. Sect. 2.1). By analyzing our
results, we observed that this parsing policy resulted in early pruning of some correct
entities and led accordingly to lower results.
Our experiments show that the end-to-end results reported in [8] are reproducible
through the TAGME API, but not by (re)implementation of the approach by a third
partner. This is due to undocumented deviations from the published description.
5 Generalizability
To test the generalizability of TAGME, we apply it to a (slightly) different entity linking
task: entity linking in queries (ELQ). As discussed in [1, 11], the aim of this task is to
detect all possible entity linking interpretations of the query. This is different from
conventional entity linking, where a single annotation is created. Let us consider the
query “france world cup 98” to get a better understanding of the differences between the
two tasks. In this example, both FRANCE and FRANCE NATIONAL FOOTBALL TEAM
are valid entities for the mention "france." In conventional entity linking, we link each
mention to a single entity, e.g., "france" → FRANCE, "world cup" → FIFA WORLD
CUP. For the ELQ task, on the other hand, we detect all entity linking interpretations
of the query, where each interpretation is a set of semantically related entities, e.g.,
{FRANCE NATIONAL FOOTBALL TEAM, 1998 FIFA WORLD CUP}. In other words, the output of conventional entity linking systems is a
set of mention-entity pairs, while entity linking in queries returns set(s) of entity sets.
Applying a conventional entity linker to the ELQ task restricts the output to a single
interpretation, but can deliver solid performance nonetheless [1]. TAGME has great
potential to be generalized to the ELQ task, as it is designed to operate with short texts.
We detail our experimental setup in Sect. 5.1 and report on the results in Sect. 5.2.
5.1 Experimental Setup
Implementations. We compare four different implementations to assess the generaliz-
ability of TAGME to the ELQ task: the TAGME API, our implementation of TAGME
with two different Wikipedia versions, one from April 2010 and another from May 2012
(which is part of the ClueWeb12 collection), and Dexter’s implementation of TAGME.
Including results using the 2012 version of Wikipedia facilitates a better comparison
between the TAGME API and our implementation, as they both use similar Wikipedia
dumps. It also demonstrates how the version of Wikipedia might affect the results.
Datasets and evaluation metrics. We use two publicly available test collections devel-
oped for the ELQ task: ERD-dev [1] and Y-ERD [11]. ERD-dev includes 99 queries,
while Y-ERD offers a larger selection, containing 2398 queries. The annotations in these
test collections are confined to proper noun entities from a specific Freebase snapshot.12
Table 3. TAGME results for the entity linking in queries task.
Method ERD-dev Y-ERD
Pstrict Rstrict Fstrict Pstrict Rstrict Fstrict
TAGME API 0.8352 0.8062 0.8204 0.7173 0.7163 0.7168
TAGME-wp10 (our) 0.7143 0.7088 0.7115 0.6518 0.6515 0.6517
TAGME-wp12 (our) 0.7363 0.7234 0.7298 0.6535 0.6532 0.6533
Dexter 0.7363 0.7073 0.7215 0.6989 0.6979 0.6984
We therefore remove entities that are not present in this snapshot in a post-filtering step.
In all the experiments, ρ_NA is set to 0.1, as it delivers the highest results both for the API
and for our implementations, and is also the recommendation of the TAGME API.
Evaluation is performed in terms of precision, recall, and F-measure (macro-averaged over
all queries), as proposed in [1]; this variant is referred to as strict evaluation in [11].
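A simplified sketch of this macro-averaged evaluation follows; for illustration, each query's interpretations are flattened into a single entity set, which is a simplification of the actual strict metric defined in [1, 11].

```python
# Simplified sketch of macro-averaged precision/recall for entity linking
# in queries: per-query scores over (flattened) entity sets, averaged over
# all queries. Toy data only.

def strict_eval(queries):
    ps, rs = [], []
    for gold, sys in queries:
        if not gold and not sys:
            ps.append(1.0); rs.append(1.0)
            continue
        tp = len(gold & sys)
        ps.append(tp / len(sys) if sys else 0.0)
        rs.append(tp / len(gold) if gold else 0.0)
    p = sum(ps) / len(ps)
    r = sum(rs) / len(rs)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy queries: (ground-truth entity set, system entity set).
queries = [
    ({"FRANCE_NATIONAL_FOOTBALL_TEAM", "1998_FIFA_WORLD_CUP"},
     {"FRANCE_NATIONAL_FOOTBALL_TEAM"}),
    ({"TAGME"}, {"TAGME"}),
]
p, r, f = strict_eval(queries)
print(round(p, 3), round(r, 3), round(f, 3))
```

Unlike the micro-averaged topics metrics of Sect. 2.3, every query contributes equally here, regardless of how many entities it contains.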
5.2 Results
Table 3 presents the TAGME generalizability results. Similar to the reproducibility ex-
periments, we find that the TAGME API provides substantially better results than any
of the other implementations. The most fair comparison between Dexter and our imple-
mentations is the one against TAGME-wp12, as that has the Wikipedia dump closest in
date. For ERD-dev they deliver similar results, while for Y-ERD Dexter has a higher
F-score (but the relative difference is below 10%). Concerning different Wikipedia ver-
sions, the more recent one performs better on the ERD-dev test collection, while the
difference is negligible for Y-ERD. If we take the larger test collection, Y-ERD, to be
the more representative one, then we find that TAGME API > Dexter > TAGME-wp10,
which is consistent with the reproducibility results in Table 2. However, the relative dif-
ferences between the approaches are smaller here. We thus conclude that the TAGME
approach can be generalized to the ELQ task.
6 Discussion and Conclusions
TAGME is an outstanding entity linking system. The authors offer invaluable resources
for the reproducibility of their approach: the test collections, source code, and a REST-
ful API. In this paper we have attempted to (re)implement the system described in [8],
reproduce their results, and generalize the approach to the task of entity linking in
queries. Our experiments have shown that some of the results are not reproducible,
even with the API provided by the authors. For the rest of the results, we have found
that (i) the results reported in the paper are higher than what can be reproduced using
their API, and (ii) the TAGME API gives higher numbers than what is achievable by a
third-party implementation (not only ours, but also that of Dexter [2]). Based on these
findings, we recommend to use the TAGME API, much like a black-box, when entity
linking is performed as part of a larger task. For a reliable and meaningful comparison
between TAGME and a newly proposed entity linking method, the TAGME approach
should be (re)implemented, like it has been done in some prior work, see, e.g., [2, 11].
Post-acceptance responses from the TAGME authors. Upon the acceptance of this pa-
per, the TAGME authors clarified some of the issues that surfaced in this study. This
information came only after the paper was accepted, even though we have raised our
questions during the writing of the paper (at that time, however, the reply we got only
included the source code and the fact that they no longer have the Wikipedia dumps
used in the paper). We integrated their responses throughout the paper as much as it
was possible; we include the rest of them here. First, it turns out that the public API
as well as the provided source code correspond to a newer, updated version (“version
2”) of TAGME. The source code for the original version (“version 1”) described in
the TAGME papers [8, 9] is no longer available. This means that even if we managed
to find the Wikipedia dump used in the TAGME papers and ran their source code, we
would have not been able to reproduce their results. Furthermore, TAGME performs
additional undocumented optimizations when parsing spots, filtering inappropriate
spots, and computing relatedness, as the authors explained. Another reason for
the differences in performance might have to do with how links are extracted from
Wikipedia. TAGME uses wiki page-to-page link records, while our implementation (as
well as Dexter’s) extracts links from the body of the pages. This affects the computation
of relatedness, as the former source contains 20% more links than the latter. (It should
be noted that this file was not available for the 2010 and 2012 Wikipedia dumps.) The
authors also clarified that all the evaluation metrics are micro-averaged and explained
how the disambiguation phase was evaluated. We refer the interested reader to the on-
line appendix of this paper for further details.
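To make the impact of the link-extraction choice concrete, the relatedness function in question is the Wikipedia Link-based Measure of Milne and Witten [17], computed from the in-link sets of two entities. A minimal sketch follows (the function name and the set-based representation are ours, not TAGME's):

```python
import math

def milne_witten_relatedness(in_links_a, in_links_b, total_articles):
    """Wikipedia Link-based Measure of Milne & Witten [17].

    in_links_a / in_links_b are the sets of Wikipedia articles linking to
    the two entities; whether these sets are built from the page-to-page
    link records or from links parsed out of page bodies is exactly the
    choice discussed above, and it changes all three quantities below."""
    a, b = set(in_links_a), set(in_links_b)
    common = a & b
    if not common:
        return 0.0
    numerator = math.log(max(len(a), len(b))) - math.log(len(common))
    denominator = math.log(total_articles) - math.log(min(len(a), len(b)))
    # Clamp at zero: the raw formula can go negative for weakly related pairs.
    return max(0.0, 1.0 - numerator / denominator)
```

Since both the in-link sets and the overall article count enter the formula, a link source containing 20% more links shifts every pairwise score, which is consistent with the performance differences observed above.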
Lessons learned. Even though we have only partially succeeded in reproducing the
TAGME results, we have gained invaluable insights about reproducibility requirements
during the process. Lessons learned include the following: (i) all technical details that
affect effectiveness or efficiency should be explained (or at least mentioned) in the paper;
sharing the source code helps, but finding answers in a large codebase can be highly
non-trivial; (ii) if there are differences between the published approach and the publicly
available source code or API (typically, the latter being an updated version), these
should be made explicit; (iii) authors are encouraged to keep all data sources used
in a published paper (in particular, historical Wikipedia dumps, especially in specific
formats, are more difficult to find than one might think), so that these can be shared upon
request by other researchers; (iv) evaluation metrics should be explained in detail.
Maintaining an “online appendix” to a publication is a practical way of providing some
of these extra details that would not fit in the paper due to space limits, and would have
the additional advantage of being easily editable and extensible.
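As an illustration of point (iv): the micro-averaging clarified by the authors pools true and false positives across all documents before computing precision, recall, and F1, in contrast to macro-averaging, which averages per-document scores. A minimal sketch, assuming annotations are sets of (mention, entity) pairs and exact matching; the matching criterion is our assumption, not a documented TAGME detail:

```python
def micro_prf(system_annotations, gold_annotations):
    """Micro-averaged precision, recall, and F1 over a test collection.

    Both arguments map a document id to a set of (mention, entity) pairs.
    Counts are pooled over all documents before the metrics are computed."""
    tp = fp = fn = 0
    for doc_id in set(system_annotations) | set(gold_annotations):
        system = system_annotations.get(doc_id, set())
        gold = gold_annotations.get(doc_id, set())
        tp += len(system & gold)   # correct annotations
        fp += len(system - gold)   # spurious annotations
        fn += len(gold - system)   # missed annotations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With micro-averaging, documents with many annotations dominate the score, which is one reason the averaging mode must be stated explicitly for results to be reproducible.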
7 Acknowledgement
We would like to thank Paolo Ferragina and Ugo Scaiella for sharing the TAGME
source code with us and for the insightful discussions and clarifications later on. We
also thank Diego Ceccarelli for the discussion on link probability computation and for
providing help with the Dexter API.
References
[1] D. Carmel, M.-W. Chang, E. Gabrilovich, B.-J. P. Hsu, and K. Wang. ERD’14: En-
tity recognition and disambiguation challenge. SIGIR Forum, 48(2):63–77, 2014.
[2] D. Ceccarelli, C. Lucchese, S. Orlando, R. Perego, and S. Trani. Dexter: An open
source framework for entity linking. In Proc. of the Sixth International Workshop
on Exploiting Semantic Annotations in Information Retrieval, pages 17–20, 2013.
[3] D. Ceccarelli, C. Lucchese, S. Orlando, R. Perego, and S. Trani. Learning relat-
edness measures for entity linking. In Proc. of CIKM ’13, pages 139–148, 2013.
[4] Y.-P. Chiu, Y.-S. Shih, Y.-Y. Lee, C.-C. Shao, M.-L. Cai, S.-L. Wei, and H.-H.
Chen. NTUNLP approaches to recognizing and disambiguating entities in long
and short text at the ERD challenge 2014. In Proc. of Entity Recognition & Dis-
ambiguation Workshop, pages 3–12, 2014.
[5] M. Cornolti, P. Ferragina, and M. Ciaramita. A framework for benchmarking
entity-annotation systems. In Proc. of WWW ’13, pages 249–260, 2013.
[6] M. Cornolti, P. Ferragina, M. Ciaramita, H. Schütze, and S. Rüd. The SMAPH
system for query entity recognition and disambiguation. In Proc. of Entity Recog-
nition & Disambiguation Workshop, pages 25–30, 2014.
[7] S. Cucerzan. Large-scale named entity disambiguation based on Wikipedia data.
In Proc. of EMNLP-CoNLL ’07, pages 708–716, 2007.
[8] P. Ferragina and U. Scaiella. TAGME: On-the-fly annotation of short text frag-
ments (by Wikipedia entities). In Proc. of CIKM ’10, pages 1625–1628, 2010.
[9] P. Ferragina and U. Scaiella. Fast and accurate annotation of short texts with
Wikipedia pages. CoRR, abs/1006.3498, 2010.
[10] X. Han, L. Sun, and J. Zhao. Collective entity linking in web text: A graph-based
method. In Proc. of SIGIR ’11, pages 765–774, 2011.
[11] F. Hasibi, K. Balog, and S. E. Bratsberg. Entity Linking in Queries: Tasks and
Evaluation. In Proc. of the ICTIR ’15, pages 171–180, 2015.
[12] S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective anno-
tation of Wikipedia entities in web text. In Proc. of KDD ’09, pages 457–466, 2009.
[13] O. Medelyan, I. H. Witten, and D. Milne. Topic indexing with Wikipedia. In Proc.
of the AAAI WikiAI workshop, pages 19–24, 2008.
[14] E. Meij, K. Balog, and D. Odijk. Entity linking and retrieval for semantic search.
In Proc. of WSDM ’14, pages 683–684, 2014.
[15] R. Mihalcea and A. Csomai. Wikify!: Linking documents to encyclopedic knowl-
edge. In Proc. of CIKM ’07, pages 233–242, 2007.
[16] D. Milne and I. H. Witten. Learning to link with Wikipedia. In Proc. of CIKM
’08, pages 509–518, 2008.
[17] D. Milne and I. H. Witten. An effective, low-cost measure of semantic relatedness
obtained from Wikipedia links. In Proc. of AAAI Workshop on Wikipedia and
Artificial Intelligence: An Evolving Synergy, pages 25–30, 2008.
[18] R. Usbeck, M. Röder, A.-C. Ngonga Ngomo, C. Baron, A. Both, M. Brümmer,
D. Ceccarelli, M. Cornolti, D. Cherix, B. Eickmann, P. Ferragina, C. Lemke,
A. Moro, R. Navigli, F. Piccinno, G. Rizzo, H. Sack, R. Speck, R. Troncy, J. Wait-
elonis, and L. Wesemann. GERBIL: General entity annotator benchmarking
framework. In Proc. of WWW ’15, pages 1133–1143, 2015.