LamAPI: a Comprehensive Tool for String-based
Entity Retrieval with Type-base Filters
Roberto Avogadro1, Marco Cremaschi1, Fabio D’adda1, Flavio De Paoli1, and
Matteo Palmonari1
Università degli Studi di Milano - Bicocca, 20126 Milano, Italy
{roberto.avogadro,marco.cremaschi,fabio.dadda,
flavio.depaoli,matteo.palmonari}@unimib.it
Abstract. When information is available in unstructured or semi-structured formats, e.g., tables or texts, finding links between strings appearing in these sources and the entities they refer to in some background Knowledge Graphs (KGs) is a key step to integrate, enrich and extend the data and/or KGs. This Entity Linking task is usually decomposed into Entity Retrieval and Entity Disambiguation because of the large entity search space. In this paper, we present an Entity Retrieval service (LamAPI) and discuss the impact of different retrieval configurations, i.e., query and filtering strategies, on the retrieval of entities. The approach is to augment the search activity with extra information, like types, associated with the strings in the original datasets. The results have been empirically validated against public datasets.
Keywords: Entity Linking · Entity Retrieval · Entity Disambiguation · Knowledge Graph
1 Introduction
A key advantage of developing Knowledge Graphs (KGs) consists in effectively supporting the integration of data coming in different formats and structures [4]. In semantic data integration, KGs provide identifiers and descriptions of entities, thus supporting the integration of data such as tables or texts. The table-to-KG matching problem, also referred to as semantic table interpretation, has recently attracted much attention in the research community [9,8,2] and is a key step to enrich data [4,15], and to construct and extend KGs from semi-structured data [22,10]. When information is available in unstructured or semi-structured formats, e.g., tables or texts, finding links between strings (or mentions) appearing in these sources and the entities they refer to in some background KGs is a key step to integrate, enrich and extend the data and/or KGs. We name this task Entity Linking (EL); it comes in different flavours depending on the considered data formats, but with some shared features.
For example, because of the ample entity search space, most approaches to EL include a first step where candidate entities for the input string are collected, i.e., Entity Retrieval (ER) [18], and a second step where the string is
disambiguated by eventually selecting one or none of the candidate entities, i.e., Entity Disambiguation (ED) [17]. In most approaches, ER returns a ranked list of candidates, while disambiguation consists of re-ranking the input list. Entity Disambiguation is at the heart of EL, with different approaches that leverage different kinds of evidence depending on the format and features of the input text [21]. However, the ER step is also significant considering that its results define an upper bound for the performance of the end-to-end linking: if an entity is not among the set of candidates, it cannot be selected as the target for the link. Also, while it is, in principle, possible to scroll the list of candidates at arbitrary levels of depth, maintaining acceptable efficiency levels requires cutting off the results of ER at a reasonable depth.
Approaches to entity search can either resort to existing lookup APIs, e.g., the DBpedia SPARQL Query Editor1, DBpedia Spotlight2 or the Wikidata Query Service3, or use recent approaches to dense ER [23], where entities are searched in a pre-trained dense space, an approach becoming especially popular in EL for textual data. The APIs reported above provide access to a SPARQL endpoint because the elements are stored in Resource Description Framework (RDF) format. Such endpoints are usually offered on local dumps of the original KGs to avoid network latency and increase efficiency. For instance, DBpedia can be accessed via OpenLink Virtuoso, a row-wise transaction-oriented RDBMS with a SPARQL query engine4, and Wikidata via Blazegraph5, a high-performance graph database providing RDF/SPARQL-based APIs. An issue faced with these solutions is the time required for downloading and setting up the datasets: Wikidata 2019 requires some days to set up6, since the full dump is about 1.1 TB (uncompressed). Moreover, writing SPARQL queries may be an issue, since specific knowledge of the searched Knowledge Graph (KG) is required, besides knowledge of the required syntax. Some limitations related to the use of these endpoints are:
- the SPARQL endpoint response time is directly proportional to the size of the returned data; as a consequence, sometimes it is not even possible to get a result because the endpoint fails with a timeout;
- the number of requests per second may be severely limited for online endpoints (to ensure feasibility) or computationally too expensive for local endpoints (a reasonable configuration requires at least 64 GB of RAM and many CPU cycles);
- there are intrinsic limits in the expressiveness of the SPARQL language (e.g., full-text search capability, which is required for matching table mentions, can be obtained only with extremely slow "contains" or "regex" queries7).
1 dbpedia.org/sparql
2 www.dbpedia-spotlight.org
3 query.wikidata.org
4 virtuoso.openlinksw.com
5 blazegraph.com
6 addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits-part-1/
7 docs.openlinksw.com/virtuoso/rdfsparqlrulefulltext/
Regarding the approaches to dense ER, some limitations can be mentioned [20,12]:
- the results are strictly related to the type of representation used; consequently, careful and tedious feature engineering is required when designing these systems;
- generalising a trained entity linking model to other KGs or domains is challenging due to the strong dependence on the specific KG and domain knowledge in the process of designing features;
- these systems depend excessively on external data: the effectiveness of the algorithms is directly affected by the quality of the training data, which in turn restricts their utility.
Information Retrieval (IR) approaches based on search engines still provide valuable solutions to support entity search, mainly because they do not require training, work with any KG, and easily adapt to changes in the reference KG. Although IR-based entity search has been used extensively, especially in table-to-KG matching [13,11,1,19], its use has frequently been left to custom optimisations and not adequately discussed or documented in scientific papers. As a result, researchers willing to apply such solutions must develop them from scratch, including data indexing techniques, query formulation and service set-up.
In this paper, we present: i) LamAPI, a comprehensive tool for IR-based ER, augmented with type-based filtering features, and ii) a study of the impact of different retrieval configurations, i.e., query and filtering strategies, on the retrieval of entities. The tool supports string-based retrieval, but also hard and soft filters [5] based on an input entity type (i.e., rdf:type for DBpedia and Property:P31 for Wikidata). Hard type filters remove non-matching results, while soft type filters promote or demote results when an exact match is not feasible. These filters are useful to support EL either in texts (e.g., by exploiting entity types returned by a classifier [16,14]) or in tables (e.g., by exploiting a known column type (rdf:type) to filter out irrelevant entities). While the tool is general, it is especially designed to support EL for semi-structured data. In our study, we therefore focus on evaluating different retrieval strategies with/without filters on EL in the table-to-KG matching setting, considering two large KGs, Wikidata and DBpedia. Finally, the tool also contains mappings among the latter two KGs and Wikipedia, thus supporting cross-KG bridges. The tool and all the resources used for the experiments are released following the FAIR Guiding Principles8. LamAPI is released under the Apache 2.0 licence.
The rest of this article is organised as follows. Section 2 presents a brief analysis of the state of the art on string-based entity retrieval techniques. Section 3 describes the services offered by LamAPI. Section 4 introduces the Gold Standards and the configuration parameters, and discusses the evaluation results. Section 5 details the full set of LamAPI services. Finally, we conclude the paper and discuss future directions in Section 6.
8 www.nature.com/articles/sdata201618
2 String-based entity retrieval
Given a KG containing a set of entities E and a collection of named-entity mentions M, the goal of EL is to map each entity mention m ∈ M to its corresponding entity e ∈ E in the KG. As described above, a typical EL service consists of the following modules [21]:
1. Entity Retrieval (ER). In this module, for each entity mention m ∈ M, irrelevant entities in the KG are filtered out to return a set E_m of candidate entities: entities that the mention m may refer to. To achieve this goal, state-of-the-art techniques have been used, such as name dictionary-based techniques, surface form expansion from the local document, and methods based on search engines.
2. Entity Disambiguation (ED). In this module, the entities in the set E_m are more accurately ranked to select the correct entity among the candidates. In practice, this is a re-ranking activity that considers other information (e.g., contextual information) besides the simple textual mention m used in the ER module.
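In code, this two-stage decomposition can be sketched as follows (a minimal Python illustration of a generic ER + ED pipeline, not LamAPI's actual implementation; retrieve and disambiguate are placeholder functions):

    # Minimal sketch of the ER + ED decomposition (illustrative only).
    def entity_linking(mention, context, retrieve, disambiguate, k=100):
        candidates = retrieve(mention)[:k]                 # ER: top-k candidate set E_m
        return disambiguate(mention, context, candidates)  # ED: re-rank and select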
According to the experiments conducted in [7], the role of the ER module is critical, since it should ensure the presence of the correct entity in the returned set to let the ED module find it. Hence, the main contribution of this work is to discuss retrieval configurations, i.e., query and filtering strategies, for retrieving entities.
Name dictionary-based techniques are the main approaches to ER; such techniques leverage different combinations of features (e.g., labels, aliases, Wikipedia hyperlinks) to build an offline dictionary D of links between string names and the entities they map to, which is used to generate the set of candidate entities. The most straightforward approach considers exact matching between the textual mention m and the string names inside D. Partial matching (e.g., fuzzy and/or n-gram search) can also be considered.
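A minimal sketch of such dictionary-based retrieval, combining exact matching with character-trigram partial matching, is shown below (illustrative Python with a toy dictionary; not LamAPI's actual code):

    def ngrams(s, n=3):
        # Character n-grams of a lower-cased string, e.g. "alb", "lbe", ...
        s = s.lower()
        return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

    # Toy dictionary D: surface string -> candidate entity identifiers.
    D = {
        "manchester": ["Manchester", "Manchester_Parish"],
        "manchester united f.c.": ["Manchester_United_F.C."],
    }

    def retrieve(mention, threshold=0.3):
        m = mention.lower()
        if m in D:                        # exact match first
            return list(D[m])
        q = ngrams(m)                     # fall back to trigram overlap (Jaccard)
        scored = []
        for name, ids in D.items():
            g = ngrams(name)
            sim = len(q & g) / len(q | g)
            if sim >= threshold:
                scored += [(sim, e) for e in ids]
        return [e for _, e in sorted(scored, reverse=True)]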
Besides pure string matching, type constraints (using types/classes of the KG) associated with string mentions can be exploited to filter candidate entities. In such a case, the dictionary needs to be augmented with the types associated with linked entities to enable hard or soft filtering. Listings 1.1 and 1.2 report an example of how type constraints can influence the result of candidate entity retrieval for the "manchester" textual mention. The former shows the result without constraints: cities like Manchester in England or Manchester Parish in Jamaica are reported (note the similarity score equal to 1.00). The latter shows the result when type constraints are applied: types like "SoccerClub" and "SportsClub" allow for the promotion of soccer clubs such as "Manchester United F.C.", which is now ranked first (similarity score 0.83).
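Hard and soft type filtering over retrieved candidates can be sketched as follows (a simplified Python illustration; the candidate records follow the shape of the Lookup results, with the type field treated as a list, and the boost value is an arbitrary assumption):

    def hard_filter(candidates, allowed_types):
        # Keep only candidates carrying at least one of the allowed types.
        return [c for c in candidates if set(c["type"]) & set(allowed_types)]

    def soft_filter(candidates, preferred_types, boost=0.2):
        # Promote candidates whose types match, then re-rank by score.
        for c in candidates:
            if set(c["type"]) & set(preferred_types):
                c["score"] += boost
        return sorted(candidates, key=lambda c: c["score"], reverse=True)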
In this domain, similar approaches have been proposed, such as the MTab [13] entity search, which provides keyword search, fuzzy search and aggregation search. Another relevant approach is EPGEL [11], where candidate entity generation uses both a keyword and a fuzzy search; this approach also uses BERT [6] to create a profile for each entity to improve the search results. The LinkingPark [1] method proposes a weighted combination of keyword, trigram and fuzzy search to maximise recall during the candidate generation process; in addition, this approach involves verifying the presence of typos before generating candidates. Compared with these works, LamAPI provides an n-gram search and the possibility to include type constraints in the candidate search to apply type/concept filtering in the ER. Furthermore, LamAPI provides several services to help researchers in tasks like EL.
Listing 1.1. DBpedia lookup without type constraints.

{
  "id": "Manchester",
  "label": "Manchester",
  "type": "City Settlement ...",
  "ed_score": 1.0
},
{
  "id": "Manchester_Parish",
  "label": "Manchester",
  "type": "Settlement PopulatedPlace",
  "ed_score": 1.0
}

Listing 1.2. DBpedia lookup with type constraints.

{
  "id": "Manchester_United_F.C.",
  "label": "Manchester U",
  "type": "SoccerClub SportsClub ...",
  "ed_score": 0.833
},
{
  "id": "Manchester_City_F.C.",
  "label": "Manchester C",
  "type": "SoccerClub SportsClub ...",
  "ed_score": 0.833
}
3 LamAPI
The current version of LamAPI integrates DBpedia (v. 2016-10 and v. 2022.03.01) and Wikidata (v. 20220708), which are the most popular free KGs. However, any KG, even a private and domain-specific one, could be integrated. The only constraint is to support indexing as described in Section 3.1.
3.1 Knowledge Graphs indexing
DBpedia, Wikidata and the like are very large KGs that require an enormous amount of time and resources to perform ER, so we created a more compact representation of these data suitable for ER tasks. For each KG, we downloaded a dump (e.g., 'latest-all.json.bz2' for Wikidata, which is about 71 GB across multiple files), and created a local copy in a single file by extracting and storing all triples (e.g., 96,580,491 entities for Wikidata). We then created an index with ElasticSearch9, an engine that can search and analyse huge volumes of data in near real-time. These customised local copies of the KGs are then used to create endpoints that provide EL retrieval services. The advantage is that these services can work on partitions of the original KGs to improve performance, saving time and using fewer resources.
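As an illustration, indexing one entity document with the official ElasticSearch Python client might look like the following sketch (the index name and field layout are assumptions, not LamAPI's actual schema):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # One compact document per entity: identifier, label and types.
    doc = {
        "id": "Albert_Einstein",
        "label": "Albert Einstein",
        "types": ["Scientist", "Person"],
    }
    es.index(index="dbpedia_2022_03_01", document=doc)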
3.2 LamAPI services
Among the several services provided by LamAPI to search and retrieve information in a KG, we discuss here the Lookup service, which is the one relevant for entity retrieval.
9 www.elastic.co
Lookup: given an input string, it retrieves a set of candidate entities from the reference KG. The request can be qualified by setting some attributes:
limit: an integer value that specifies the number of entities to retrieve. The default value is 100; empirical results show that this limit allows a good level of coverage.
kg: specifies which KG and version to use. The default is dbpedia_2022_03_01; another possible value is dbpedia_2016_10.
fuzzy: a boolean value. When true, it matches tokens inside a string with an edit distance (Levenshtein distance) less than or equal to 2, which gives greater tolerance for spelling errors. When false, the fuzzy operator is not applied to the input.
ngrams: a boolean value. When true, it enables n-gram search. After many empirical experiments, we set the 'n' of n-grams equal to 3: a lower value can introduce bias in the search, while a higher value is not very effective against spelling errors. With n equal to 3, "albert einstein" is split into ['alb', 'lbe', 'ber', 'ert', ...]. When false, n-gram search is not applied to the input.
types: this parameter allows the specification of a list of types (e.g., rdf:type for DBpedia and Property:P31 for Wikidata) associated with the input string to filter the retrieved entities. This attribute plays a key role in re-ranking the candidates, allowing a more accurate search based on the input types.
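Programmatically, a Lookup request with these attributes can be issued over HTTP, as in the following sketch (Python requests; the host and token follow the demonstration setting of Listing 1.4 and Section 5, and may differ in other deployments):

    import requests

    params = {
        "name": "Albert Einstein",
        "limit": 100,
        "kg": "dbpedia_2022_03_01",
        "fuzzy": False,
        "ngrams": False,
        "token": "insideslab-lamapi-2022",
    }
    resp = requests.get("https://lamapi.ml/lookup/entity-retrieval", params=params)
    candidates = resp.json()   # list of {id, label, type, score, ...}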
The following example discusses the difference between a SPARQL query and the LamAPI Lookup service. Listings 1.3 and 1.4 show a search using the mention "Albert Einstein" as the input. LamAPI allows queries to be performed through a simpler syntax than the equivalent query in SPARQL. The Lookup service also handles misspelt mentions. Finally, another limitation of SPARQL is the difficulty of creating a sensible ranking of candidates.
Listing 1.3. Search example using a SPARQL query.

select distinct ?s where {
  ?s ?p ?o .
  FILTER (?p IN (rdfs:label)) .
  ?o bif:contains "Albert Einstein" .
}
order by strlen(str(?s))
LIMIT 100

Listing 1.4. Example of the LamAPI Lookup service.

/lookup/entity-retrieval?
  name="Albert Einstein" &
  limit=100 &
  token=insideslab-lamapi-2022 &
  kg=dbpedia_2022_03_01 &
  fuzzy=False &
  ngrams=False
Examples of the results returned for the input string "Albert Einstein" are shown in Listing 1.5 and Listing 1.6, referring to Wikidata and DBpedia, respectively. Each candidate entity is described, in the W3C specification10 format, by the unique identifier id in the chosen KG, a string label name reporting the official name of the entity, a set of types associated with the entity, each one described by its unique identifier id and the corresponding string label name, and an optional description of the entity (e.g., DBpedia does not provide descriptions, while Wikidata does). Moreover, a score with the edit distance measure (Levenshtein distance) between the input textual mention and the entity label is reported.
10 reconciliation-api.github.io/specs/latest/
The score provides a candidate ranking that can be used by the Entity Disambiguation (ED) module for a straightforward selection of the actual link. The intuition is that when there is one candidate with a score above a certain threshold, it can be selected, whereas when multiple candidates share the same score, or the highest score is very low, further investigation is needed to find the correct entity. LamAPI provides additional services that enrich the filter on the types. Thanks to the specific APIs described below, it is possible to identify additional types, given a type (rdf:type) as input. This technique, which we call types extension, allows for relaxing the constraints (in this case on the type) in case of uncertainty about which type to use as a filter.
Type-similarity: given the unique id of a type as input, it retrieves the top k most similar types by calculating a ranking based on cosine similarity.
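A sketch of this ranking, assuming pre-computed type embeddings (the vectors below are toy values; the actual service relies on RDF2Vec-style embeddings, see Section 4):

    import numpy as np

    type_vectors = {                      # toy embeddings, for illustration only
        "Philosopher": np.array([0.9, 0.1, 0.3]),
        "Economist":   np.array([0.7, 0.2, 0.4]),
        "Scientist":   np.array([0.2, 0.9, 0.5]),
    }

    def top_k_similar(type_id, k=2):
        q = type_vectors[type_id]
        def cos(v):
            return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        ranked = sorted(((cos(v), t) for t, v in type_vectors.items()), reverse=True)
        return ranked[:k]                 # e.g. [(1.0, 'Philosopher'), ...]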
Examples of the results returned for the input types Philosopher and Scientist are shown in Listing 1.7 and Listing 1.8, referring to Wikidata and DBpedia, respectively.
Listing 1.5. Lookup: returned data from Wikidata.

{
  "id": "Q937",
  "label": "Albert Einstein",
  "description": "German-born ...",
  "type": "Q19350898 Q16389557 ... Q5",
  "score": 1.0
},
{
  "id": "Q356303",
  "label": "Albert Einstein",
  "description": "American actor ...",
  "type": "Q33999 Q2526255 ... Q5",
  "score": 1.0
}

Listing 1.6. Lookup: returned data from DBpedia.

{
  "id": "Albert_Einstein",
  "label": "Albert Einstein",
  "description": "...",
  "type": "Scientist Animal ...",
  "score": 1.0
},
{
  "id": "Albert_Einstein_ATV",
  "label": "Albert Einstein ATV",
  "description": "...",
  "type": "SpaceMission Event ...",
  "score": 0.789
}
Listing 1.7. Type-similarity: returned data from Wikidata.

Q4964182 (philosopher)
{
  "type": "Q4964182 (philosopher)",
  "cosine_similarity": 1.0
},
{
  "type": "Q2306091 (sociologist)",
  "cosine_similarity": 0.865
}
Q901 (scientist)
{
  "type": "Q901 (scientist)",
  "cosine_similarity": 1.0
},
{
  "type": "Q19350898 (theoretical ...)",
  "cosine_similarity": 0.912
}
...

Listing 1.8. Type-similarity: returned data from DBpedia.

Philosopher
{
  "type": "Philosopher",
  "cosine_similarity": 1.0
},
{
  "type": "Economist",
  "cosine_similarity": 0.684
}
Scientist
{
  "type": "Scientist",
  "cosine_similarity": 1.0
},
{
  "type": "Medician",
  "cosine_similarity": 0.723
},
...
4 Validation
In this Section, different retrieval configurations, i.e., query and filtering strate-
gies, are illustrated and validated.
The dataset used for validation is 2T 2020 [3]: 2T comprises 180 tables with around 70,000 unique cells. It is characterised by cells with intentional orthographic errors, so that ER with misspelt words can be tested. The dataset is available for both the Wikidata and DBpedia KGs, making it possible to compare the results for both KGs using the same tables.
The validation process starts with a set of mentions M and a number k of candidates associated with each mention. The Lookup service returns a set of candidates E_m that includes all the candidates found. The returned set is then checked against 2T to verify which of the correct entities are present, and in what position in the ranked results in E_m. We compute the coverage with the following formula:

coverage = #candidates found / #total candidates to find    (1)
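In code, the coverage computation reduces to the following sketch (the gold links and retrieved candidates are toy data):

    def coverage(gold, retrieved, k=100):
        """Fraction of gold entities found among the top-k candidates (Eq. 1)."""
        found = sum(1 for m, e in gold.items() if e in retrieved.get(m, [])[:k])
        return found / len(gold)

    # Toy example: 1 of 2 gold links is covered -> coverage = 0.5
    gold = {"manchester": "Manchester_United_F.C.", "milan": "A.C._Milan"}
    retrieved = {"manchester": ["Manchester", "Manchester_United_F.C."], "milan": []}
    print(coverage(gold, retrieved))      # 0.5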
Table 1 presents the coverage values for lookups based on label matching over mentions, enabling fuzzy and n-gram searches. The experiments were conducted using 20 parallel processes on a server with 40 Intel Xeon Silver 4114 CPUs @ 2.20 GHz and 40 GB of RAM.
Tables 2 and 3 show the coverage using the constraint on types. To select and expand types, four methods were applied (a sketch of the second method follows this list).
1. Type: this method considers only the type or set of types (seed types) indicated in the call to the Lookup service, and does not carry out any expansion of types.
2. Type Co-occurrence: for the seed types, it extracts additional types based on the co-occurrence of types in the KG.
3. Type Cosine Similarity: the seed types are extended by using the cosine similarity of RDF2Vec11 embeddings.
4. Soft Inference: the seed types are extended using a Feed Forward Neural Network that takes as input the RDF2Vec vector of an entity, linked to a mention, and predicts the possible types for the input entity [5].
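As an illustration, the Type Co-occurrence expansion (method 2) can be sketched as follows (the co-occurrence counts below are hypothetical, not the statistics actually used in the experiments):

    from collections import Counter

    # Hypothetical counts of how often two types co-occur on the same entities.
    cooccurrence = {
        "SoccerClub": Counter({"SportsClub": 950, "Organisation": 800}),
    }

    def expand_types(seed_types, n_extra=2):
        extra = Counter()
        for t in seed_types:
            extra.update(cooccurrence.get(t, Counter()))
        expanded = [t for t, _ in extra.most_common(n_extra) if t not in seed_types]
        return list(seed_types) + expanded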
In Table 2, it is possible to notice that the first method achieves the highest coverage, with the best result obtained by adding two types. Type Co-occurrence and Type Cosine Similarity are both 'idempotent' methods. The Soft Inference technique uses the entities obtained by a pre-linking step; not all entity vectors are available, so we cannot always extend the set of types. In Table 3 we report the results for Wikidata. Also in this case, the best results are achieved using the first method. The achieved coverage is higher because this KG has a comprehensive hierarchy with more detailed types.
Even if lower, the coverage values obtained with the type expansion methods are promising. We must consider that the exact type to use as a filter is often not known a priori in a real scenario: to select a type, a user should know the profile of a KG and how it is used to describe entities. Thanks to the methods described above, the search results will also contain entities belonging to other types that are still related to the input.
11 rdf2vec.org
Table 1. Coverage results and response times for different searches in Wikidata and DBpedia v. 2022.03.01.

Method                  | DBpedia           | Wikidata
                        | Coverage | Time   | Coverage | Time
N-gram                  | 0.842    | 228 s  | 0.787    | 649 s
Fuzzy                   | 0.806    | 226 s  | 0.805    | 766 s
Token                   | 0.561    | 227 s  | 0.530    | 230 s
N-gram + Fuzzy          | 0.891    | 267 s  | 0.926    | 1649 s
N-gram + Token          | 0.883    | 229 s  | 0.891    | 807 s
Fuzzy + Token           | 0.812    | 226 s  | 0.825    | 773 s
N-gram + Fuzzy + Token  | 0.895    | 270 s  | 0.929    | 1577 s
Table 2. Coverage results for 2T DBpedia.

Method                  | w/o type | 1 type | 2 types | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10
Type                    | 0.892    | 0.904  | 0.905   | 0.904 | 0.889 | 0.884 | 0.879 | 0.872 | 0.870 | 0.867 | 0.848
Type Co-occurrence      | 0.892    | 0.886  | 0.896   | 0.886 | 0.856 | 0.884 | 0.830 | 0.834 | 0.834 | 0.833 | 0.823
Type Cosine Similarity  | 0.892    | 0.892  | 0.886   | 0.889 | 0.885 | 0.881 | 0.873 | 0.869 | 0.825 | 0.825 | 0.830
Soft Inference          | 0.892    | 0.885  | 0.872   | 0.884 | 0.882 | 0.879 | 0.885 | 0.886 | 0.878 | 0.874 | 0.869
Table 3. Coverage results for 2T Wikidata.

Method                  | w/o type | 1 type | 2 types | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10
Type                    | 0.929    | 0.941  | 0.939   | 0.946 | 0.946 | 0.947 | 0.947 | 0.945 | 0.945 | 0.943 | 0.944
Type Co-occurrence      | 0.929    | 0.854  | 0.808   | 0.796 | 0.793 | 0.795 | 0.797 | 0.797 | 0.795 | 0.796 | 0.795
Type Cosine Similarity  | 0.929    | 0.853  | 0.853   | 0.852 | 0.851 | 0.850 | 0.849 | 0.849 | 0.848 | 0.847 | 0.845
5 The LamAPI retrieval service
LamAPI is implemented in Python using ElasticSearch and MongoDB. A demonstration setting is publicly available12 through a Swagger documentation page for testing purposes (Figures 1 and 2). The LamAPI repository13 is also publicly available, so the code can be downloaded and customised if needed.
For completeness, a list of the LamAPI services, with a description of each, is provided below.
Types: given the unique id of an entity as input, it retrieves all the types of which the entity is an instance. The service relies on vector similarity measures among the types in the KG to compute the answer. For DBpedia entities, the service returns the direct types, the transitive types, and the Wikidata types of the related entity, while for Wikidata it returns only the list of concepts/types for the input entity.
Literals: given the unique id of an entity as input, it retrieves all relationships
(predicates) and literal values (objects) associated with that entity.
Predicates: given the unique ids of two entities as input, it retrieves all the relationships (predicates) between them.
Objects: given the unique id of an entity as input, it retrieves all related objects
and predicates.
12 lamapi.ml
13 bitbucket.org/discounimib/lamapi
Fig. 1. LamAPI documentation page.
Fig. 2. LamAPI Lookup service.
Type-predicates: given the unique ids of two types as input, it retrieves all the predicates that relate entities of the input types, with a frequency score associated with each predicate.
Labels: given the unique id of an entity as input, it retrieves all the related
labels and aliases (rdfs:label).
WikiPageWikiLinks: given the unique id of an entity as input, it retrieves the links from its Wikipedia page to other Wikipedia pages.
Same-as: given the unique id of an entity as input, it returns the corresponding
entity for both Wikidata and DBpedia (schema:sameAs).
Wikipedia-mapping: given the unique id or curid of a Wikipedia entity, it
returns the corresponding entity for Wikidata and DBpedia.
Literal-recogniser: given an array of strings as input, the endpoint returns the type of each literal by applying a set of regex rules, as in the sketch below. The recognised literal types are dates (e.g., 1997-08-26, 1997.08.26, 1997/08/26), numbers (e.g., 2.797.800.564, 25 thousand, +/- 34657, 2 km), URLs, emails and times (e.g., 12.30pm, 12pm).
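A simplified sketch of such regex-based recognition is shown below (the patterns are illustrative assumptions, far coarser than the actual LamAPI rules):

    import re

    PATTERNS = [
        ("DATE",   re.compile(r"^\d{4}[-./]\d{2}[-./]\d{2}$")),
        ("TIME",   re.compile(r"^\d{1,2}(\.\d{2})?\s?[ap]m$", re.I)),
        ("URL",    re.compile(r"^https?://\S+$")),
        ("EMAIL",  re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
        ("NUMBER", re.compile(r"^[+-]?[\d.,]+(\s?(thousand|km))?$", re.I)),
    ]

    def recognise(values):
        out = []
        for v in values:
            for name, pattern in PATTERNS:
                if pattern.match(v.strip()):
                    out.append(name)
                    break
            else:
                out.append("STRING")      # fallback when no rule matches
        return out

    print(recognise(["1997-08-26", "12.30pm", "2 km"]))  # ['DATE', 'TIME', 'NUMBER']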
6 Conclusions
Effective Entity Retrieval services are crucial to effectively support the task of Entity Linking for unstructured and semi-structured datasets. In this paper, we discussed how different strategies can be beneficial to reduce the search space and therefore deliver more accurate results, saving time, computing power and storage capacity. The results have been empirically validated against public datasets of tabular data. Preliminary experiments with textual data are encouraging. We are planning to complete such validation activities and further develop LamAPI to provide full support for any format of input dataset. In addition, other search and filtering strategies will be implemented and tested to provide users with a complete set of alternatives, along with information on when and how each can be usefully adopted.
References
1. Chen, S., Karaoglu, A., Negreanu, C., Ma, T., Yao, J.G., Williams, J., Jiang, F.,
Gordon, A., Lin, C.Y.: Linkingpark: An automatic semantic table interpretation
system. Journal of Web Semantics 74, 100733 (2022)
2. Cutrona, V., Chen, J., Efthymiou, V., Hassanzadeh, O., Jimenez-Ruiz, E., Sequeda,
J., Srinivas, K., Abdelmageed, N., Hulsebos, M., Oliveira, D., Pesquita, C.: Results
of semtab 2021. In: 20th International Semantic Web Conference. vol. 3103, pp.
1–12. CEUR Workshop Proceedings (March 2022)
3. Cutrona, V., Bianchi, F., Jiménez-Ruiz, E., Palmonari, M.: Tough tables: Carefully evaluating entity linking for tabular data. In: The Semantic Web – ISWC 2020. pp. 328–343. Springer International Publishing, Cham (2020)
4. Cutrona, V., Ciavotta, M., Paoli, F.D., Palmonari, M.: ASIA: a tool for assisted
semantic interpretation and annotation of tabular data. In: Proceedings of the
ISWC 2019 Satellite Tracks. CEUR Workshop Proceedings, vol. 2456, pp. 209–
212. CEUR-WS.org (2019)
5. Cutrona, V., Puleri, G., Bianchi, F., Palmonari, M.: Nest: Neural soft type con-
straints to improve entity linking in tables. In: SEMANTiCS. pp. 29–43 (2021)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805
(2018)
7. Hachey, B., Radford, W., Nothman, J., Honnibal, M., Curran, J.R.: Evaluating
entity linking with wikipedia. Artificial Intelligence 194, 130–150 (2013), artificial
Intelligence, Wikipedia and Semi-Structured Resources
8. Jimenez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K., Cutrona,
V.: Results of semtab 2020. CEUR Workshop Proceedings 2775, 1–8 (January
2020)
9. Jiménez-Ruiz, E., Hassanzadeh, O., Efthymiou, V., Chen, J., Srinivas, K.: Semtab 2019: Resources to benchmark tabular data to knowledge graph matching systems. In: The Semantic Web. pp. 514–530. Springer International Publishing, Cham (2020)
10. Kejriwal, M., Knoblock, C.A., Szekely, P.: Knowledge graphs: Fundamentals, tech-
niques, and applications. MIT Press (2021)
11. Lai, T.M., Ji, H., Zhai, C.: Improving candidate retrieval with entity profile gen-
eration for wikidata entity linking. arXiv preprint arXiv:2202.13404 (2022)
12. Li, X., Li, Z., Zhang, Z., Liu, N., Yuan, H., Zhang, W., Liu, Z., Wang, J.: Effective
few-shot named entity linking by meta-learning (2022)
13. Nguyen, P., Yamada, I., Kertkeidkachorn, N., Ichise, R., Takeda, H.: Semtab 2021:
Tabular data annotation with mtab tool. In: SemTab@ ISWC. pp. 92–101 (2021)
14. Onoe, Y., Durrett, G.: Fine-grained entity typing for domain independent en-
tity linking. Proceedings of the AAAI Conference on Artificial Intelligence 34(05),
8576–8583 (Apr 2020)
15. Palmonari, M., Ciavotta, M., De Paoli, F., Košmerlj, A., Nikolov, N.: Ew-shopp project: Supporting event and weather-based data analytics and marketing along the shopper journey. In: Advances in Service-Oriented and Cloud Computing. pp. 187–191. Springer International Publishing, Cham (2020)
16. Raiman, J., Raiman, O.: Deeptype: Multilingual entity linking by neural type
system evolution. Proceedings of the AAAI Conference on Artificial Intelligence
32(1) (Apr 2018)
17. Rao, D., McNamee, P., Dredze, M.: Entity Linking: Finding Extracted Entities
in a Knowledge Base, pp. 93–115. Springer Berlin Heidelberg, Berlin, Heidelberg
(2013)
18. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recog-
nition. In: Proceedings of the Thirteenth Conference on Computational Natural
Language Learning (CoNLL-2009). pp. 147–155. Association for Computational
Linguistics, Boulder, Colorado (Jun 2009)
19. Sarthou-Camy, C., Jourdain, G., Chabot, Y., Monnin, P., Deuzé, F., Huynh, V.P., Liu, J., Labbé, T., Troncy, R.: Dagobah ui: A new hope for semantic table interpretation. In: European Semantic Web Conference. pp. 107–111. Springer (2022)
20. Shen, W., Li, Y., Liu, Y., Han, J., Wang, J., Yuan, X.: Entity linking meets deep
learning: Techniques and solutions. IEEE Transactions on Knowledge and Data
Engineering pp. 1–1 (2021)
21. Shen, W., Wang, J., Han, J.: Entity linking with a knowledge base: Issues, tech-
niques, and solutions. IEEE Transactions on Knowledge and Data Engineering
27(2), 443–460 (2015)
22. Weikum, G., Dong, X.L., Razniewski, S., Suchanek, F.M.: Machine knowledge:
Creation and curation of comprehensive knowledge bases. Found. Trends Databases
10(2-4), 108–490 (2021)
23. Wu, L., Petroni, F., Josifoski, M., Riedel, S., Zettlemoyer, L.: Zero-shot entity
linking with dense entity retrieval. In: EMNLP (2020)
... For example, some effort has been arXiv:2408.06423v1 [cs.CL] 12 Aug 2024 dedicated to optimising entity retrieval, considering the limitations of approaches based on SPARQL queries or Wikidata lookup services [8]. A few methods from this research community have included embeddings based on LLMs and graphs to support some tasks [22,32], including CEA, yet at a limited scale. ...
... (1) Test the performance of different genres of STI models when used in combination with a realistic candidate retrieval step, performed by the engineered tool LamAPI [8]; (2) Test the performance of these models on several datasets used to evaluate SOTA STI approaches, through a commonground comparison with approaches trained on these data; (3) Assess the performance improvement by adaptation with additional moderately-out-of-domain fine-tuning (Sec. 4.2); (4) Evaluate the computational efficiency and provide implications for actual usage in different application settings. ...
... The candidates for the mentions contained in the TURL dataset [23], along with its sub-sampled version TURL-2K [80], were retrieved through the Wikidata-Lookup-Service 5 , which is known to have a low coverage w.r.t. other ERs [8]. For this reason, we researched to identify a state-of-the-art approach/tool specific to the ER. ...
Preprint
Full-text available
Tables are crucial containers of information, but understanding their meaning may be challenging. Indeed, recently, there has been a focus on Semantic Table Interpretation (STI), i.e., the task that involves the semantic annotation of tabular data to disambiguate their meaning. Over the years, there has been a surge in interest in data-driven approaches based on deep learning that have increasingly been combined with heuristic-based approaches. In the last period, the advent of Large Language Models (LLMs) has led to a new category of approaches for table annotation. The interest in this research field, characterised by multiple challenges, has led to a proliferation of approaches employing different techniques. However, these approaches have not been consistently evaluated on a common ground, making evaluation and comparison difficult. This work proposes an extensive evaluation of four state-of-the-art (SOTA) approaches - Alligator (formerly s-elBat), Dagobah, TURL, and TableLlama; the first two belong to the family of heuristic-based algorithms, while the others are respectively encoder-only and decoder-only LLMs. The primary objective is to measure the ability of these approaches to solve the entity disambiguation task, with the ultimate aim of charting new research paths in the field.
... However, some approaches also add abbreviations (dbo:abbreviation), descriptions (rdfs:comment) [77,75,158], or, indexes for specific entity types, name (foaf:name), surname (foaf:surname), and given name (foaf:givenName) [114]. The MantisTable team builds a separate system named LamAPI 19 [10,11] which is used across multiple versions of this system. LamAPI 20 tool retrieves entities with the highest similarity between the mention in the cell and the entity's label by combining different search strategies, such as full-text search based on tokens, n-grams and fuzzy search. ...
Preprint
Full-text available
Tabular data plays a pivotal role in various fields, making it a popular format for data manipulation and exchange, particularly on the web. The interpretation, extraction, and processing of tabular information are invaluable for knowledge-intensive applications. Notably, significant efforts have been invested in annotating tabular data with ontologies and entities from background knowledge graphs, a process known as Semantic Table Interpretation (STI). STI automation aids in building knowledge graphs, enriching data, and enhancing web-based question answering. This survey aims to provide a comprehensive overview of the STI landscape. It starts by categorizing approaches using a taxonomy of 31 attributes, allowing for comparisons and evaluations. It also examines available tools, assessing them based on 12 criteria. Furthermore, the survey offers an in-depth analysis of the Gold Standards used for evaluating STI approaches. Finally, it provides practical guidance to help end-users choose the most suitable approach for their specific tasks while also discussing unresolved issues and suggesting potential future research directions.
... The paginated system allows Koala-UI to handle large datasets over time by breaking down responses into manageable parts, which can be processed one page at a time. • LAMAPI [17] provides functionality for manually looking up entities by typing a string. This feature enables users to search for potential matches, offering greater flexibility when automated suggestions are insufficient. ...
Conference Paper
Full-text available
This paper introduces Koala-UI-a user interface system aimed at simplifying the entity linking process within data enrichment pipelines. Koala-UI provides an intuitive mechanism for linking entities across datasets, combining automation with human feedback to ensure accurate and consistent data. Koala-UI was successfully applied in use cases such as public procurement, where it enabled enrichment of a tenders dataset by linking entities to external knowledge graphs. Future developments will focus on expanding its backend to support additional models and enhance its human-in-the-loop capabilities.
... Once these links are established, users can retrieve additional data from reference sources. SemTUI extends its functionality beyond linked data sources by incorporating services from private company knowledge graphs and including geocoding functionalities for address resolution [31,32]. Figure 1 illustrates the architecture of SemTUI, which comprises a web user interface and a back-end serving as an advanced gateway to provide access to a variety of services, including storage capabilities for enriching datasets. ...
Chapter
Full-text available
Matching tables against Knowledge Graphs is a crucial task in many applications. A widely adopted solution to improve the precision of matching algorithms is to refine the set of candidate entities by their type in the Knowledge Graph. However, it is not rare that a type is missing for a given entity. In this paper, we propose a methodology to improve the refinement phase of matching algorithms based on type prediction and soft constraints. We apply our methodology to state-of-the-art algorithms, showing a performance boost on different datasets.
Chapter
Full-text available
Table annotation is a key task to improve querying the Web and support the Knowledge Graph population from legacy sources (tables). Last year, the SemTab challenge was introduced to unify different efforts to evaluate table annotation algorithms by providing a common interface and several general-purpose datasets as a ground truth. The SemTab dataset is useful to have a general understanding of how these algorithms work, and the organizers of the challenge included some artificial noise to the data to make the annotation trickier. However, it is hard to analyze specific aspects in an automatic way. For example, the ambiguity of names at the entity-level can largely affect the quality of the annotation. In this paper, we propose a novel dataset to complement the datasets proposed by SemTab. The dataset consists of a set of high-quality manually-curated tables with non-obviously linkable cells, i.e., where values are ambiguous names, typos, and misspelled entity names not appearing in the current version of the SemTab dataset. These challenges are particularly relevant for the ingestion of structured legacy sources into existing knowledge graphs. Evaluations run on this dataset show that ambiguity is a key problem for entity linking algorithms and encourage a promising direction for future work in the field.
Conference Paper
Full-text available
Tabular data to Knowledge Graph matching is the process of assigning semantic tags from knowledge graphs (e.g., Wikidata or DB-pedia) to the elements of a table. This task is a challenging problem for various reasons, including the lack of metadata (e.g., table and column names), the noisiness, heterogeneity, incompleteness and ambiguity in the data. The results of this task provide significant insights about potentially highly valuable tabular data, as recent works have shown, enabling a new family of data analytics and data science applications. Despite significant amount of work on various flavors of this problem, there is a lack of a common framework to conduct a systematic evaluation of state-of-the-art systems. The creation of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab) aims at filling this gap. In this paper, we report about the datasets, infrastructure and lessons learned from the first edition of the SemTab challenge.
Article
Entity linking (EL) is the process of linking entity mentions appearing in web text with their corresponding entities in a knowledge base. EL plays an important role in the fields of knowledge engineering and data mining, underlying a variety of downstream applications such as knowledge base population, content analysis, relation extraction, and question answering. In recent years, deep learning (DL), which has achieved tremendous success in various domains, has also been leveraged in EL methods to surpass traditional machine learning based methods and yield the state-of-the-art performance. In this survey, we present a comprehensive review and analysis of existing DL based EL methods. First of all, we propose a new taxonomy, which organizes existing DL based EL methods using three axes: embedding, feature, and algorithm. Then we systematically survey the representative EL methods along the three axes of the taxonomy. Later, we introduce ten commonly used EL data sets and give a quantitative performance analysis of DL based EL methods over these data sets. Finally, we discuss the remaining limitations of existing methods and highlight some promising future directions.
Article
In this paper, we present LinkingPark, an automatic semantic annotation system for tabular data to knowledge graph matching. LinkingPark is designed as a modular framework which can handle Cell-Entity Annotation (CEA), Column-Type Annotation (CTA), and Columns-Property Annotation (CPA) altogether. It is built upon our previous SemTab 2020 system, which won the 2nd prize among 28 different teams after four rounds of evaluations. Moreover, the system is unsupervised, stand-alone, and flexible for multilingual support. Its backend offers an efficient RESTful API for programmatic access, as well as an Excel Add-in for ease of use. Users can interact with LinkingPark in near real-time, further demonstrating its efficiency.
Chapter
EW-Shopp is an innovation project, the aim of which is to build a platform for support of data linking, integration, and analytics in companies from the e-commerce, retail, and marketing industries. The project consortium joins several business partners from different sectors of e-commerce including marketing, price comparison, and both web and brick-and-mortar stores. The project is developing several pilot services to test the platform and inform its further development.
Article
The large number of potential applications from bridging web data with knowledge bases have led to an increase in the entity linking research. Entity linking is the task to link entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity. In this survey, we present a thorough overview and analysis of the main approaches to entity linking, and discuss various applications, the evaluation of entity linking systems, and future directions.