LamAPI: a Comprehensive Tool for String-based Entity Retrieval with Type-based Filters
Roberto Avogadro1, Marco Cremaschi1, Fabio D'Adda1, Flavio De Paoli1, and Matteo Palmonari1
Università degli Studi di Milano - Bicocca, 20126 Milano, Italy
{roberto.avogadro,marco.cremaschi,fabio.dadda,flavio.depaoli,matteo.palmonari}@unimib.it
Abstract. When information is available in unstructured or semi-structured formats, e.g., tables or texts, finding links between the strings appearing in these sources and the entities they refer to in some background Knowledge Graph (KG) is a key step to integrate, enrich and extend the data and/or the KGs. This Entity Linking task is usually decomposed into Entity Retrieval and Entity Disambiguation because of the large entity search space. In this paper, we present an Entity Retrieval service (LamAPI) and discuss the impact of different retrieval configurations, i.e., query and filtering strategies, on the retrieval of entities. The approach is to augment the search activity with extra information, like types, associated with the strings in the original datasets. The results have been empirically validated against public datasets.
Keywords: Entity Linking · Entity Retrieval · Entity Disambiguation · Knowledge Graph
1 Introduction
A key advantage of developing Knowledge Graphs (KGs) is that they effectively support the integration of data coming in different formats and structures [4]. In semantic data integration, KGs provide identifiers and descriptions of entities, thus supporting the integration of data such as tables or texts. The table-to-KG matching problem, also referred to as semantic table interpretation, has recently attracted much attention in the research community [9,8,2] and is a key step to enrich data [4,15] and to construct and extend KGs from semi-structured data [22,10]. When information is available in unstructured or semi-structured formats, e.g., tables or texts, finding links between the strings (or mentions) appearing in these sources and the entities they refer to in some background KG is a key step to integrate, enrich and extend the data and/or the KGs. We name this task Entity Linking (EL); it comes in different flavours depending on the considered data formats, but with some shared features.
For example, because of the ample entity search space, most approaches to EL include a first step where candidate entities for the input string are collected, i.e., Entity Retrieval (ER) [18], and a second step where the string is disambiguated by selecting one, or none, of the candidate entities, i.e., Entity Disambiguation (ED) [17]. In most approaches, ER returns a ranked list of candidates, while disambiguation consists of re-ranking the input list. Entity Disambiguation is at the heart of EL, with different approaches that leverage different kinds of evidence depending on the format and features of the input text [21]. However, the ER step is also significant, considering that its results define an upper bound for the performance of end-to-end linking: if an entity is not among the set of candidates, it cannot be selected as the target for the link. Also, while it is in principle possible to scroll the list of candidates at arbitrary depth, maintaining acceptable efficiency requires cutting off the results of ER at a reasonable depth.
Approaches to entity search can either resort to existing lookup APIs, e.g., the DBpedia SPARQL Query Editor^1, DBpedia Spotlight^2 or the Wikidata Query Service^3, or use recent approaches to dense ER [23], where entities are searched in a pre-trained dense space, an approach that is becoming especially popular in EL for textual data. The APIs reported above provide access to a SPARQL endpoint because the elements are stored in Resource Description Framework (RDF) format. Such endpoints are usually set up on local dumps of the original KGs to avoid network latency and increase efficiency. For instance, DBpedia can be accessed through OpenLink Virtuoso, a row-wise transaction-oriented RDBMS with a SPARQL query engine^4, and Wikidata through Blazegraph^5, a high-performance graph database providing RDF/SPARQL-based APIs. An issue with these solutions is the time required for downloading and setting up the datasets: Wikidata 2019 requires several days to set up^6, since the full dump is about 1.1 TB (uncompressed). Moreover, writing SPARQL queries may be an issue, since specific knowledge of the searched Knowledge Graph (KG) is required, besides knowledge of the required syntax. Some limitations related to the use of these endpoints are:
– the SPARQL endpoint response time is directly proportional to the size of the returned data; as a consequence, sometimes it is not even possible to get a result because the endpoint fails with a timeout;
– the number of requests per second may be severely limited for online endpoints (to preserve service availability) or computationally too expensive for local endpoints (a reasonable configuration requires at least 64 GB of RAM and substantial CPU resources);
– there are intrinsic limits to the expressiveness of the SPARQL language (e.g., full-text search, which is required for matching table mentions, can be obtained only with extremely slow "contains" or "regex" queries^7).
^1 dbpedia.org/sparql
^2 www.dbpedia-spotlight.org
^3 query.wikidata.org
^4 virtuoso.openlinksw.com
^5 blazegraph.com
^6 addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits-part-1/
^7 docs.openlinksw.com/virtuoso/rdfsparqlrulefulltext/
Regarding approaches to dense ER, some limitations can be mentioned [20,12]:
– the results are strictly tied to the type of representation used; consequently, careful and tedious feature engineering is required when designing these systems;
– generalising a trained entity linking model to other KGs or domains is challenging, because the feature design process depends strongly on the specific KG and on domain knowledge;
– these systems depend heavily on external data: the effectiveness of the algorithms is directly affected by the quality of the training data, and their applicability is therefore restricted.
Information Retrieval (IR) approaches based on search engines still provide valuable solutions to support entity search, mainly because they do not require training, work with any KG, and easily adapt to changes in the reference KG. Although IR-based entity search has been used extensively, especially in table-to-KG matching [13,11,1,19], its use has frequently been left to custom optimisations and not adequately discussed or documented in scientific papers. As a result, researchers willing to apply such solutions must develop them from scratch, including data indexing techniques, query formulation and service set-up.
In this paper, we aim to present: i) LamAPI, a comprehensive tool for IR-based ER, augmented with type-based filtering features, and ii) a study of the impact of different retrieval configurations, i.e., query and filtering strategies, on the retrieval of entities. The tool supports string-based retrieval, but also hard and soft filters [5] based on an input entity type (i.e., rdf:type for DBpedia and Property:P31 for Wikidata). Hard type filters remove non-matching results, while soft type filters promote or demote results when an exact match is not feasible. These filters are useful to support EL either in texts (e.g., by exploiting entity types returned by a classifier [16,14]) or in tables (e.g., by exploiting a known column type (rdf:type) to filter out irrelevant entities). While the tool is general, it is especially designed to support EL for semi-structured data. In our study, we therefore focus on evaluating different retrieval strategies with/without filters on EL in the table-to-KG matching setting, considering two large KGs, Wikidata and DBpedia. Finally, the tool also contains mappings among the latter two KGs and Wikipedia, thus supporting cross-KG bridges. The tool and all the resources used for the experiments are released following the FAIR Guiding Principles^8. LamAPI is released under the Apache 2.0 licence.
The rest of this article is organised as follows. Section 2 presents a brief analysis of the state of the art on string-based entity retrieval techniques. Section 3 describes the services offered by LamAPI. Section 4 introduces the Gold Standards and the configuration parameters, and discusses the evaluation results. Section 5 presents the LamAPI retrieval service implementation. Finally, we conclude the paper and discuss future directions in Section 6.
^8 www.nature.com/articles/sdata201618
2 String-based entity retrieval
Given a KG containing a set of entities E and a collection of named-entity mentions M, the goal of EL is to map each entity mention m ∈ M to its corresponding entity e ∈ E in the KG. As described above, a typical EL service consists of the following modules [21]:
1. Entity Retrieval (ER). In this module, for each entity mention m ∈ M, irrelevant entities in the KG are filtered out to return a set Em of candidate entities: the entities that mention m may refer to. To achieve this goal, state-of-the-art techniques have been used, such as name dictionary-based techniques, surface form expansion from the local document, and methods based on search engines.
2. Entity Disambiguation (ED). In this module, the entities in the set Em are ranked more accurately to select the correct entity among the candidates. In practice, this is a re-ranking activity that considers other information (e.g., contextual information) besides the simple textual mention m used in the ER module.
According to the experiments conducted in [7], the role of the ER module is critical, since it should ensure the presence of the correct entity in the returned set so that the ED module can find it. Hence, the main contribution of this work is to discuss retrieval configurations, i.e., query and filtering strategies, for retrieving entities.
Name dictionary-based techniques are the main approaches to ER; such techniques leverage different combinations of features (e.g., labels, aliases, Wikipedia hyperlinks) to build an offline dictionary D of links between string names and entities, which is used to generate the set of candidate entities. The most straightforward approach considers exact matching between the textual mention m and the string names inside D. Partial matching (e.g., fuzzy and/or n-gram search) can also be considered.
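To make the distinction concrete, the following minimal Python sketch (ours, not LamAPI code; the dictionary D and helper names are purely illustrative) contrasts exact matching with character-trigram overlap for candidate generation:

# Minimal sketch of dictionary-based candidate retrieval (illustrative, not LamAPI code).

def ngrams(s: str, n: int = 3) -> set:
    """Split a string into character n-grams, e.g. 'albert' -> {'alb', 'lbe', 'ber', 'ert'}."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

# Offline dictionary D: surface forms (labels, aliases, anchor texts) -> entity ids.
D = {
    "albert einstein": ["Albert_Einstein"],
    "albert einstein atv": ["Albert_Einstein_ATV"],
}

def exact_candidates(mention: str) -> list:
    """Exact matching: the mention must equal a surface form in D."""
    return D.get(mention.lower(), [])

def ngram_candidates(mention: str, threshold: float = 0.5) -> list:
    """Partial matching: rank surface forms by n-gram (Jaccard) overlap with the mention."""
    m_grams = ngrams(mention)
    scored = []
    for surface, entities in D.items():
        s_grams = ngrams(surface)
        score = len(m_grams & s_grams) / len(m_grams | s_grams)
        if score >= threshold:
            scored.extend((e, score) for e in entities)
    return sorted(scored, key=lambda x: x[1], reverse=True)

print(exact_candidates("albert einstien"))   # [] - a typo breaks exact matching
print(ngram_candidates("albert einstien"))   # still finds Albert_Einstein

The example shows why partial matching matters for table data: a single misspelt character defeats the exact lookup, while trigram overlap still retrieves the intended entity.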
Besides pure string matching, type constraints (using types/classes of the KG) associated with string mentions can be exploited to filter candidate entities. In such a case, the dictionary needs to be augmented with the types associated with linked entities to enable hard or soft filtering. Listings 1.1 and 1.2 report an example of how type constraints can influence the result of candidate entity retrieval for the "manchester" textual mention. The former shows the result without constraints: cities like Manchester in England or Manchester Parish in Jamaica are reported (note the similarity score equal to 1.00). The latter shows the result when type constraints are applied: types like "SoccerClub" and "SportsClub" allow for the promotion of soccer clubs such as "Manchester United F.C.", which is now ranked first (similarity score 0.83).
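The difference between the two filtering modes can be sketched as follows (an illustrative Python fragment with our own naming and data, not LamAPI internals):

# Illustrative hard vs. soft type filtering over retrieved candidates (not LamAPI internals).

candidates = [
    {"id": "Manchester", "types": {"City", "Settlement"}, "score": 1.00},
    {"id": "Manchester_United_F.C.", "types": {"SoccerClub", "SportsClub"}, "score": 0.83},
]

def hard_filter(candidates, wanted: set):
    """Hard filter: keep only candidates whose types intersect the wanted types."""
    return [c for c in candidates if c["types"] & wanted]

def soft_filter(candidates, wanted: set, boost: float = 0.5):
    """Soft filter: promote candidates matching the wanted types, demote the others."""
    rescored = [
        {**c, "score": c["score"] + (boost if c["types"] & wanted else -boost)}
        for c in candidates
    ]
    return sorted(rescored, key=lambda c: c["score"], reverse=True)

print(hard_filter(candidates, {"SoccerClub"}))  # only Manchester_United_F.C. survives
print(soft_filter(candidates, {"SoccerClub"}))  # Manchester_United_F.C. ranked first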
In this domain, similar approaches have been proposed, such as MTab [13] entity search, which provides keyword search, fuzzy search and aggregation search. Another relevant approach is EPGEL [11], where candidate entity generation uses both a keyword and a fuzzy search; this approach also uses BERT [6] to create a profile for each entity to improve the search results. The LinkingPark [1] method proposes a weighted combination of keyword, trigram and fuzzy search to maximise recall during the candidate generation process; in addition, it verifies the presence of typos before generating candidates. Compared with these works, LamAPI provides an n-gram search and the possibility to include type constraints in the candidate search, so as to apply type/concept filtering in the ER step. Furthermore, LamAPI provides several additional services to help researchers in tasks like EL.
Listing 1.1. DBpedia lookup without type constraints.

{
  "id": "Manchester",
  "label": "Manchester",
  "type": "City Settlement ...",
  "ed_score": 1
},
{
  "id": "Manchester_Parish",
  "label": "Manchester",
  "type": "Settlement PopulatedPlace",
  "ed_score": 1
}

Listing 1.2. DBpedia lookup with type constraints.

{
  "id": "Manchester_United_F.C.",
  "label": "Manchester U",
  "type": "SoccerClub SportsClub ...",
  "ed_score": 0.833
},
{
  "id": "Manchester_City_F.C.",
  "label": "Manchester C",
  "type": "SoccerClub SportsClub ...",
  "ed_score": 0.833
}
3 LamAPI
The current version of LamAPI integrates DBpedia (v. 2016-10 and v. 2022.03.01) and Wikidata (v. 20220708), which are the most popular free KGs. However, any KG, even a private and domain-specific one, could be integrated. The only constraint is to support indexing as described in Section 3.1.
3.1 Knowledge Graphs indexing
DBpedia, Wikidata and the like are very large KGs that require an enormous amount of time and resources to perform ER, so we created a more compact representation of these data suitable for ER tasks. For each KG, we downloaded a dump (e.g., 'latest-all.json.bz2' for Wikidata, about 71 GB and split across multiple files) and created a local copy in a single file by extracting and storing all triples (e.g., 96,580,491 entities for Wikidata). We then created an index with ElasticSearch^9, an engine that can search and analyse huge volumes of data in near real-time. These customised local copies of the KGs are then used to create endpoints that provide EL retrieval services. The advantage is that these services can work on partitions of the original KGs to improve performance, saving time and using fewer resources.
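To give a rough idea of this set-up, the following is a minimal indexing and querying sketch using the official elasticsearch Python client (v8-style API). The index layout, field names and connection details are our assumptions for illustration, not the actual LamAPI schema:

# Sketch of indexing compact entity documents into ElasticSearch
# (index and field names are illustrative; this is not the LamAPI schema).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {
    "id": "Albert_Einstein",
    "label": "Albert Einstein",
    "aliases": ["A. Einstein", "Einstein"],
    "types": ["Scientist", "Person"],
    "description": "German-born theoretical physicist",
}
es.index(index="dbpedia_2022_03_01", id=doc["id"], document=doc)

# A label query with spelling tolerance (edit distance <= 2):
hits = es.search(
    index="dbpedia_2022_03_01",
    query={"match": {"label": {"query": "Albert Einstien", "fuzziness": 2}}},
)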
3.2 LamAPI services
Among the several services provided by LamAPI to search and retrieve information in a KG, we discuss here Lookup, the service relevant for entity retrieval.
^9 www.elastic.co
Lookup: given a string input, it retrieves a set of candidate entities from the reference KG. The request can be qualified by setting some attributes:

limit: an integer value that specifies the number of entities to retrieve. The default value is 100; empirical experiments showed that this limit allows a good level of coverage.
kg: specifies which KG and version to use. The default is dbpedia_2022_03_01; other possible values include dbpedia_2016_10.
fuzzy: a boolean value. When true, it matches tokens inside a string with an edit distance (Levenshtein distance) less than or equal to 2, which gives greater tolerance to spelling errors. When false, the fuzzy operator is not applied to the input.
ngrams: a boolean value. When true, it enables n-gram search. After many empirical experiments, we set the 'n' of n-grams to 3: a lower value can introduce bias into the search, while a higher value is less effective against spelling errors. With n-grams equal to 3, "albert einstein" is split into ['alb', 'lbe', 'ber', 'ert', ...]. When false, n-gram search is not applied to the input.
types: this parameter allows the specification of a list of types (e.g., rdf:type for DBpedia and Property:P31 for Wikidata) associated with the input string to filter the retrieved entities. This attribute plays a key role in re-ranking the candidates, allowing a more accurate search based on the input types.
The following example discusses the difference between a SPARQL query and the LamAPI Lookup service. Listings 1.3 and 1.4 show a search using the mention "Albert Einstein" as input. LamAPI allows queries to be expressed with a simpler syntax than the equivalent query in SPARQL. The Lookup service also handles misspelt mentions. Finally, another limitation of SPARQL is the difficulty of creating a sensible ranking of candidates.
Listing 1.3. Search example using a SPARQL query.

select distinct ?s where {
  ?s ?p ?o .
  FILTER (?p IN (rdfs:label)) .
  ?o bif:contains "Albert Einstein" .
}
order by strlen(str(?s))
LIMIT 100

Listing 1.4. Example of LamAPI Lookup service.

/lookup/entity-retrieval?
  name="Albert Einstein"&
  limit=100&
  token=insideslab-lamapi-2022&
  kg=dbpedia_2022_03_01&
  fuzzy=False&
  ngrams=False
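Programmatically, the Lookup call of Listing 1.4 can be issued in a few lines of Python. This is a sketch: the host is the public demo instance cited in Section 5, and we assume the response body is a JSON list of candidates shaped as in Listings 1.5 and 1.6:

# Sketch of querying the LamAPI Lookup endpoint over HTTP.
import requests

params = {
    "name": "Albert Einstein",
    "limit": 100,
    "token": "insideslab-lamapi-2022",
    "kg": "dbpedia_2022_03_01",
    "fuzzy": False,
    "ngrams": False,
}
# The host below is the public demo cited in the paper; adapt it to your deployment.
response = requests.get("https://lamapi.ml/lookup/entity-retrieval", params=params)
response.raise_for_status()
# We assume the body is a JSON list of candidates as in Listings 1.5-1.6.
for candidate in response.json()[:3]:
    print(candidate["id"], candidate["score"])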
Examples of results returned for the input string "Albert Einstein" are shown in Listings 1.5 and 1.6, for Wikidata and DBpedia, respectively. Each candidate entity is described, in the W3C Reconciliation API^10 format, by the unique identifier id in the chosen KG, a string label name reporting the official name of the entity, a set of types associated with the entity, each one described by its unique identifier id and the corresponding string label name, and an optional description of the entity (e.g., DBpedia does not provide descriptions, while Wikidata does). Moreover, a score with the edit distance measure (Levenshtein distance) between the input textual mention and the entity label is reported.
^10 reconciliation-api.github.io/specs/latest/
The score provides a candidate ranking that can be used by the Entity Disambiguation (ED) module for a straightforward selection of the actual link. The intuition is that when there is one candidate with a score above a certain threshold, it can be selected, whereas when multiple candidates share the same score, or the highest score is very low, further investigation is needed to find the correct entity. LamAPI provides additional services that enrich the filter on types: thanks to the specific APIs described below, it is possible to identify additional types given a type (rdf:type) as input. This technique, which we call types extension, allows the constraints (in this case, on the type) to be relaxed in case of uncertainty about which type to use as a filter.
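The score-based selection intuition described above can be expressed as a small heuristic. This is our illustrative sketch; the threshold value is not one prescribed by LamAPI:

# Sketch of the score-based selection heuristic (illustrative threshold).

def select_entity(candidates, threshold: float = 0.9):
    """Return the top candidate if it is a clear winner, otherwise defer to full ED."""
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    top = ranked[0]
    tie = len(ranked) > 1 and ranked[1]["score"] == top["score"]
    if top["score"] >= threshold and not tie:
        return top          # confident: link directly
    return None             # ambiguous or low score: needs disambiguation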
Type-similarity: given the unique id of a type as input, it retrieves the top k most similar types by computing a ranking based on cosine similarity. Examples of results returned for the input types Philosopher and Scientist are shown in Listings 1.7 and 1.8, for Wikidata and DBpedia, respectively.
Listing 1.5. Lookup: returned data from Wikidata.

{
  "id": "Q937",
  "label": "Albert Einstein",
  "description": "German-born ...",
  "type": "Q19350898 Q16389557 ... Q5",
  "score": 1.0
},
{
  "id": "Q356303",
  "label": "Albert Einstein",
  "description": "American actor ...",
  "type": "Q33999 Q2526255 ... Q5",
  "score": 1.0
}

Listing 1.6. Lookup: returned data from DBpedia.

{
  "id": "Albert_Einstein",
  "label": "Albert Einstein",
  "description": "...",
  "type": "Scientist Animal ...",
  "score": 1.0
},
{
  "id": "Albert_Einstein_ATV",
  "label": "Albert Einstein ATV",
  "description": "...",
  "type": "SpaceMission Event ...",
  "score": 0.789
}
Listing 1.7. Type-similarity: returned data from Wikidata.

Q4964182 (philosopher)
{
  "type": "Q4964182 (philosopher)",
  "cosine_similarity": 1.0
},
{
  "type": "Q2306091 (sociologist)",
  "cosine_similarity": 0.865
}

Q901 (scientist)
{
  "type": "Q901 (scientist)",
  "cosine_similarity": 1.0
},
{
  "type": "Q19350898 (theoretical ...)",
  "cosine_similarity": 0.912
}
...

Listing 1.8. Type-similarity: returned data from DBpedia.

Philosopher
{
  "type": "Philosopher",
  "cosine_similarity": 1.0
},
{
  "type": "Economist",
  "cosine_similarity": 0.684
}

Scientist
{
  "type": "Scientist",
  "cosine_similarity": 1.0
},
{
  "type": "Medician",
  "cosine_similarity": 0.723
}
...
4 Validation
In this Section, different retrieval configurations, i.e., query and filtering strategies, are illustrated and validated.
The dataset used for validation is 2T 2020 [3]: 2T comprises 180 tables with around 70,000 unique cells. It is characterised by cells with intentional orthographic errors, so ER with misspelt words can be tested. The dataset is available for both the Wikidata and DBpedia KGs, so it is possible to compare the results for both KGs using the same tables.
The validation process starts with a set of mentions M and a number k of candidates associated with each mention. The Lookup service returns a set Em that includes all the candidates found. The returned set is then checked against the 2T ground truth to verify which of the correct entities are present and at what position they appear in the ranked results in Em. We compute the coverage with the following formula:

coverage = (# candidates found) / (# total candidates to find)    (1)
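For reference, the coverage of Equation (1) can be computed as in the following sketch (our code; the data structures are illustrative):

# Sketch of the coverage computation in Equation (1) (illustrative data structures).

def coverage(gold: dict, retrieved: dict) -> float:
    """gold: mention -> correct entity id; retrieved: mention -> ranked candidate ids."""
    found = sum(
        1 for mention, entity in gold.items()
        if entity in retrieved.get(mention, [])
    )
    return found / len(gold)

gold = {"manchester": "Manchester_United_F.C."}
retrieved = {"manchester": ["Manchester", "Manchester_United_F.C."]}
print(coverage(gold, retrieved))  # 1.0 - the correct entity is among the candidates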
Table 1 presents coverage values for lookups based on label matching over mentions, with fuzzy and n-gram searches enabled. The experiments were conducted using 20 parallel processes on a server with 40 Intel Xeon Silver 4114 CPUs @ 2.20GHz and 40 GB of RAM.
Tables 2 and 3 show the coverage using the constraint on types. To select and expand types, four methods were applied.
1. Type: this method considers only the type or set of types (seed types) indicated in the call to the Lookup service, and it does not carry out any expansion of types.
2. Type Co-occurrence: starting from the seed types, it extracts additional types based on the co-occurrence of types in the KG.
3. Type Cosine Similarity: the seed types are extended by using the cosine similarity of RDF2Vec^11 embeddings (see the sketch after this list).
4. Soft Inference: the seed types are extended using a Feed Forward Neural Network that takes as input the RDF2Vec vector of an entity, linked to a mention, and predicts the possible types for the input entity [5].
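A minimal sketch of this kind of similarity-based type expansion, assuming precomputed type embeddings (the vectors and type names below are made up for illustration):

# Sketch of seed-type expansion via cosine similarity over type embeddings
# (e.g., RDF2Vec vectors); names and data are illustrative.
import numpy as np

type_vectors = {
    "Philosopher": np.array([0.9, 0.1, 0.3]),
    "Economist":   np.array([0.7, 0.2, 0.4]),
    "City":        np.array([0.0, 0.9, 0.1]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand_types(seed: str, k: int = 2) -> list:
    """Return the k types most similar to the seed type."""
    seed_vec = type_vectors[seed]
    ranked = sorted(
        ((t, cosine(seed_vec, v)) for t, v in type_vectors.items() if t != seed),
        key=lambda x: x[1], reverse=True,
    )
    return ranked[:k]

print(expand_types("Philosopher"))  # [('Economist', ...), ('City', ...)]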
Table 2 shows that the first method achieves the highest coverage; the best result is obtained by adding two types. Type Co-occurrence and Type Cosine Similarity are both 'idempotent' methods. The Soft Inference technique uses the entities obtained by a pre-linking step; since not all entity vectors are available, the set of types cannot always be extended. Table 3 reports the results for Wikidata. In this case, too, the best results are achieved using the first method. The achieved coverage is the highest because this KG has a comprehensive hierarchy with more detailed types.
Even if lower, the coverage values obtained with type expansion methods are promising. We must consider that the exact type to use as a filter is often not known a priori in a real scenario: for example, to select a type, a user should know the profile of a KG and how it is used to describe entities. Thanks to the methods described above, the search results will contain entities belonging to other types but still related to the input.
^11 rdf2vec.org
Table 1. Coverage results and response times for different searches in Wikidata and DBpedia v. 2022.03.01.

Methods                | DBpedia Coverage | DBpedia Time | Wikidata Coverage | Wikidata Time
N-gram                 | 0.842            | 228 s        | 0.787             | 649 s
Fuzzy                  | 0.806            | 226 s        | 0.805             | 766 s
Token                  | 0.561            | 227 s        | 0.530             | 230 s
N-gram + Fuzzy         | 0.891            | 267 s        | 0.926             | 1649 s
N-gram + Token         | 0.883            | 229 s        | 0.891             | 807 s
Fuzzy + Token          | 0.812            | 226 s        | 0.825             | 773 s
N-gram + Fuzzy + Token | 0.895            | 270 s        | 0.929             | 1577 s
Table 2. Coverage results for 2T DBpedia, varying the number of types used as a filter.

Methods                | w/o type | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10
Type                   | 0.892    | 0.904 | 0.905 | 0.904 | 0.889 | 0.884 | 0.879 | 0.872 | 0.870 | 0.867 | 0.848
Type Co-occurrence     | 0.892    | 0.886 | 0.896 | 0.886 | 0.856 | 0.884 | 0.830 | 0.834 | 0.834 | 0.833 | 0.823
Type Cosine Similarity | 0.892    | 0.892 | 0.886 | 0.889 | 0.885 | 0.881 | 0.873 | 0.869 | 0.825 | 0.825 | 0.830
Soft Inference         | 0.892    | 0.885 | 0.872 | 0.884 | 0.882 | 0.879 | 0.885 | 0.886 | 0.878 | 0.874 | 0.869
Table 3. Coverage results for 2T Wikidata, varying the number of types used as a filter.

Methods                | w/o type | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10
Type                   | 0.929    | 0.941 | 0.939 | 0.946 | 0.946 | 0.947 | 0.947 | 0.945 | 0.945 | 0.943 | 0.944
Type Co-occurrence     | 0.929    | 0.854 | 0.808 | 0.796 | 0.793 | 0.795 | 0.797 | 0.797 | 0.795 | 0.796 | 0.795
Type Cosine Similarity | 0.929    | 0.853 | 0.853 | 0.852 | 0.851 | 0.850 | 0.849 | 0.849 | 0.848 | 0.847 | 0.845
5 The LamAPI retrieval service
LamAPI is implemented in Python using ElasticSearch and MongoDB. A demonstration instance is publicly available^12 through a Swagger documentation page for testing purposes (Figures 1 and 2). The LamAPI repository^13 is also publicly available, so the code can be downloaded and customised if needed.
For completeness, a list of the LamAPI services with their descriptions is provided below.
Types: given the unique id of an entity as input, it retrieves all the types of which the entity is an instance. The service relies on vector similarity measures among the types in the KG to compute the answer. For DBpedia entities, the service returns direct types, transitive types and the Wikidata types of the related entity, while for Wikidata it returns only the list of concepts/types of the input entity.
Literals: given the unique id of an entity as input, it retrieves all relationships (predicates) and literal values (objects) associated with that entity.
Predicates: given the unique ids of two entities as input, it retrieves all the relationships (predicates) between them.
Objects: given the unique id of an entity as input, it retrieves all related objects and predicates.
^12 lamapi.ml
^13 bitbucket.org/discounimib/lamapi
Fig. 1. LamAPI documentation page.
Fig. 2. LamAPI Lookup service.
Type-predicates: given the unique ids of two types as input, it retrieves all predicates that relate entities of the input types, with a frequency score associated with each predicate.
Labels: given the unique id of an entity as input, it retrieves all the related labels and aliases (rdfs:label).
WikiPageWikiLinks: given the unique id of an entity as input, it retrieves the links from a WikiPage to other WikiPages.
Same-as: given the unique id of an entity as input, it returns the corresponding entity in both Wikidata and DBpedia (schema:sameAs).
Wikipedia-mapping: given the unique id or curid of a Wikipedia entity, it returns the corresponding entity in Wikidata and DBpedia.
Literal-recogniser: given an array of strings as input, the endpoint returns the type of each literal by applying a set of regex rules. The recognised literal types are dates (e.g., 1997-08-26, 1997.08.26, 1997/08/26), numbers (e.g., 2.797.800.564, 25 thousand, +/- 34657, 2 km), URLs, emails and times (e.g., 12.30pm, 12pm).
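A minimal sketch of such rule-based recognition might look as follows (our illustrative rules; the actual LamAPI rule set is richer):

# Sketch of a rule-based literal recogniser (the actual LamAPI rule set is richer).
import re

RULES = [
    ("date",   re.compile(r"^\d{4}[-./]\d{2}[-./]\d{2}$")),              # 1997-08-26
    ("time",   re.compile(r"^\d{1,2}(\.\d{2})?(am|pm)$", re.I)),         # 12.30pm, 12pm
    ("url",    re.compile(r"^https?://\S+$")),
    ("email",  re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("number", re.compile(r"^[+-]?[\d.,]+(\s*(thousand|km))?$", re.I)),  # 25 thousand, 2 km
]

def recognise(values):
    """Map each input string to the first matching literal type (or 'unknown')."""
    out = []
    for v in values:
        v = v.strip()
        label = next((name for name, rx in RULES if rx.match(v)), "unknown")
        out.append((v, label))
    return out

print(recognise(["1997/08/26", "12.30pm", "2 km", "someone@example.org"]))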
6 Conclusions
Effective Entity Retrieval services are crucial to support the task of Entity Linking for unstructured and semi-structured datasets. In this paper, we discussed how different strategies can help reduce the search space and therefore deliver more accurate results, saving time, computing power and storage capacity. The results have been empirically validated against public datasets of tabular data. Preliminary experiments with textual data are encouraging. We plan to complete such validation activities and further develop LamAPI to provide full support for any format of input dataset. In addition, other search and filtering strategies will be implemented and tested to provide users with a complete set of alternatives, along with information on when and how each can be usefully adopted.
References
1. Chen, S., Karaoglu, A., Negreanu, C., Ma, T., Yao, J.G., Williams, J., Jiang, F.,
Gordon, A., Lin, C.Y.: Linkingpark: An automatic semantic table interpretation
system. Journal of Web Semantics 74, 100733 (2022)
2. Cutrona, V., Chen, J., Efthymiou, V., Hassanzadeh, O., Jimenez-Ruiz, E., Sequeda,
J., Srinivas, K., Abdelmageed, N., Hulsebos, M., Oliveira, D., Pesquita, C.: Results
of semtab 2021. In: 20th International Semantic Web Conference. vol. 3103, pp.
1–12. CEUR Workshop Proceedings (March 2022)
3. Cutrona, V., Bianchi, F., Jiménez-Ruiz, E., Palmonari, M.: Tough tables: Carefully
evaluating entity linking for tabular data. In: The Semantic Web – ISWC 2020.
pp. 328–343. Springer International Publishing, Cham (2020)
4. Cutrona, V., Ciavotta, M., Paoli, F.D., Palmonari, M.: ASIA: a tool for assisted
semantic interpretation and annotation of tabular data. In: Proceedings of the
ISWC 2019 Satellite Tracks. CEUR Workshop Proceedings, vol. 2456, pp. 209–
212. CEUR-WS.org (2019)
5. Cutrona, V., Puleri, G., Bianchi, F., Palmonari, M.: Nest: Neural soft type con-
straints to improve entity linking in tables. In: SEMANTiCS. pp. 29–43 (2021)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805
(2018)
7. Hachey, B., Radford, W., Nothman, J., Honnibal, M., Curran, J.R.: Evaluating
entity linking with wikipedia. Artificial Intelligence 194, 130–150 (2013), artificial
Intelligence, Wikipedia and Semi-Structured Resources
8. Jimenez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K., Cutrona,
V.: Results of semtab 2020. CEUR Workshop Proceedings 2775, 1–8 (January
2020)
9. Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K.: Semtab
2019: Resources to benchmark tabular data to knowledge graph matching sys-
tems. In: The Semantic Web. pp. 514–530. Springer International Publishing, Cham
(2020)
10. Kejriwal, M., Knoblock, C.A., Szekely, P.: Knowledge graphs: Fundamentals, tech-
niques, and applications. MIT Press (2021)
11. Lai, T.M., Ji, H., Zhai, C.: Improving candidate retrieval with entity profile gen-
eration for wikidata entity linking. arXiv preprint arXiv:2202.13404 (2022)
12. Li, X., Li, Z., Zhang, Z., Liu, N., Yuan, H., Zhang, W., Liu, Z., Wang, J.: Effective
few-shot named entity linking by meta-learning (2022)
13. Nguyen, P., Yamada, I., Kertkeidkachorn, N., Ichise, R., Takeda, H.: Semtab 2021:
Tabular data annotation with mtab tool. In: SemTab@ ISWC. pp. 92–101 (2021)
14. Onoe, Y., Durrett, G.: Fine-grained entity typing for domain independent en-
tity linking. Proceedings of the AAAI Conference on Artificial Intelligence 34(05),
8576–8583 (Apr 2020)
15. Palmonari, M., Ciavotta, M., De Paoli, F., Košmerlj, A., Nikolov, N.: Ew-shopp
project: Supporting event and weather-based data analytics and marketing along
the shopper journey. In: Advances in Service-Oriented and Cloud Computing. pp.
187–191. Springer International Publishing, Cham (2020)
16. Raiman, J., Raiman, O.: Deeptype: Multilingual entity linking by neural type
system evolution. Proceedings of the AAAI Conference on Artificial Intelligence
32(1) (Apr 2018)
17. Rao, D., McNamee, P., Dredze, M.: Entity Linking: Finding Extracted Entities
in a Knowledge Base, pp. 93–115. Springer Berlin Heidelberg, Berlin, Heidelberg
(2013)
18. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recog-
nition. In: Proceedings of the Thirteenth Conference on Computational Natural
Language Learning (CoNLL-2009). pp. 147–155. Association for Computational
Linguistics, Boulder, Colorado (Jun 2009)
19. Sarthou-Camy, C., Jourdain, G., Chabot, Y., Monnin, P., Deuzé, F., Huynh, V.P.,
Liu, J., Labbé, T., Troncy, R.: Dagobah ui: A new hope for semantic table inter-
pretation. In: European Semantic Web Conference. pp. 107–111. Springer (2022)
20. Shen, W., Li, Y., Liu, Y., Han, J., Wang, J., Yuan, X.: Entity linking meets deep
learning: Techniques and solutions. IEEE Transactions on Knowledge and Data
Engineering pp. 1–1 (2021)
21. Shen, W., Wang, J., Han, J.: Entity linking with a knowledge base: Issues, tech-
niques, and solutions. IEEE Transactions on Knowledge and Data Engineering
27(2), 443–460 (2015)
22. Weikum, G., Dong, X.L., Razniewski, S., Suchanek, F.M.: Machine knowledge:
Creation and curation of comprehensive knowledge bases. Found. Trends Databases
10(2-4), 108–490 (2021)
23. Wu, L., Petroni, F., Josifoski, M., Riedel, S., Zettlemoyer, L.: Zero-shot entity
linking with dense entity retrieval. In: EMNLP (2020)