DBpedia Spotlight: Shedding Light on the Web of Documents
Pablo N. Mendes1, Max Jakob1, Andrés García-Silva2, Christian Bizer1
1Web-based Systems Group, Freie Universität Berlin, Germany
2Ontology Engineering Group, Universidad Politécnica de Madrid, Spain
ABSTRACT
Interlinking text documents with Linked Open Data enables the Web of Data to be used as background knowledge within document-oriented applications such as search and faceted browsing. As a step towards interconnecting the Web of Documents with the Web of Data, we developed DBpedia Spotlight, a system for automatically annotating text documents with DBpedia URIs. DBpedia Spotlight allows users to configure the annotations to their specific needs through the DBpedia Ontology and quality measures such as prominence, topical pertinence, contextual ambiguity and disambiguation confidence. We compare our approach with the state of the art in disambiguation, and evaluate our results in light of three baselines and six publicly available annotation systems, demonstrating the competitiveness of our system. DBpedia Spotlight is shared as open source and deployed as a Web Service freely available for public use.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous;
I.2.7 [Artificial Intelligence]: Natural Language Processing—Language parsing and understanding; I.7 [Document and Text Processing]: Miscellaneous
General Terms
Algorithms, Experimentation
Keywords
Text Annotation, Linked Data, DBpedia, Named Entity Disambiguation

1. INTRODUCTION
As the Linked Data ecosystem develops [3], so do the mutual rewards for structured and unstructured data providers alike. Higher interconnectivity between information sources has the potential of increasing discoverability, reusability, and hence the utility of information. By connecting unstructured information in text documents with Linked Data, facts from the Web of Data can be used, for instance, as complementary information on web pages to enhance information retrieval, to enable faceted document browsing [15] and customization of web feeds based on semantics [18].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
I-SEMANTICS 2011, 7th Int. Conf. on Semantic Systems, Sept. 7-9, 2011, Graz, Austria
Copyright 2011 ACM 978-1-4503-0621-8 ...$10.00.
DBpedia [4] is developing into an interlinking hub in the Web of Data that enables access to many data sources in the Linked Open Data cloud. It contains encyclopedic knowledge from Wikipedia for about 3.5 million resources. About half of the knowledge base is classified in a consistent cross-domain ontology with classes such as persons, organisations or populated places, as well as more fine-grained classifications like basketball player or flowering plant. Furthermore, it provides a rich pool of resource attributes and relations between the resources, connecting products to their makers, or CEOs to their companies, for example.
In order to enable the linkage of Web documents with this hub, we developed DBpedia Spotlight, a system to perform annotation of DBpedia resources mentioned in text. In the annotation task, the user provides text fragments (documents, paragraphs, sentences) and wishes to identify URIs for resources mentioned within that text. One of the main challenges in annotation is ambiguity: an entity name, or surface form, may be used in different contexts to refer to different DBpedia resources. For example, the surface form ‘Washington’ can be used to refer to the resources dbpedia:George_Washington, dbpedia:Washington,_D.C. and dbpedia:Washington_(U.S._state), among others. For human readers, the disambiguation, i.e. the decision between candidates for an ambiguous surface form, is usually performed based on the readers’ knowledge and the context of a concrete mention. However, the automatic disambiguation of entity mentions remains a challenging problem.
The goal of DBpedia Spotlight is to provide an adaptable system to find and disambiguate natural language mentions of DBpedia resources. Much research has been devoted to the problem of automatic disambiguation, as we discuss in Section 5. In comparison with previous work, DBpedia Spotlight aims at a more comprehensive and flexible solution. First, while other annotation systems are often restricted to a small number of resource types, such as people, organisations and places, our system attempts to annotate DBpedia resources of any of the 272 classes (more than 30 top level) in the DBpedia Ontology. Second, since a single generic solution is unlikely to fit all task-specific requirements, our system enables user-provided configurations for different use cases with different needs. Users can flexibly specify the domain of interest, as well as the desired coverage and error tolerance for each of their specific annotation tasks.
DBpedia Spotlight can take full advantage of the DBpedia ontology for specifying which concepts should be annotated. Annotations can be restricted to instances of specific classes (or sets of classes) including subclasses. Alternatively, arbitrary SPARQL queries over the DBpedia knowledge base can be provided in order to determine the set of instances that should be annotated. For instance, consider use cases where users have prior knowledge of some aspects of the text (e.g. dates), and have specific needs for the annotations (e.g. only politicians). A SPARQL query can be sent to DBpedia Spotlight in order to constrain the annotated resources to only politicians in office between 1995 and 2000, for instance. In general, users can create restrictions using any part of the DBpedia knowledge base.
Moreover, DBpedia Spotlight computes scores such as prominence (how many times a resource is mentioned in Wikipedia), topical relevance (how close a paragraph is to a DBpedia resource’s context) and contextual ambiguity (is there more than one candidate resource with similarly high topical relevance for this surface form in its current context?). Users can configure these parameters according to their task-specific needs.
We evaluate DBpedia Spotlight in two experiments. First, we test our disambiguation strategy on thousands of unseen (held-out) DBpedia resource mentions from Wikipedia. Second, we use a set of manually annotated news articles in order to compare our annotation with publicly available annotation services.
DBpedia Spotlight is deployed as a Web Service, and features a user interface for demonstration. The source code is publicly available under the Apache license V2, and the documentation is available at the project page.
In Section 2 we describe our approach, followed by an explanation of how our system can be used (Section 3). In Section 4 we present our evaluation methodology and results. In Section 5 we discuss related work and in Section 6 we present our conclusions and future work.
2. APPROACH

Our approach works in four stages. The spotting stage recognizes in a sentence the phrases that may indicate a mention of a DBpedia resource. Candidate selection is subsequently employed to map the spotted phrase to resources that are candidate disambiguations for that phrase. The disambiguation stage, in turn, uses the context around the spotted phrase to decide for the best choice amongst the candidates. The annotation can be customized by users to their specific needs through configuration parameters explained in Subsection 2.5. In the remainder of this section we describe the datasets and techniques used to enable our annotation system.
2.1 Dataset and Notation
We utilize the graph of labels, redirects and disambiguations in DBpedia to extract a lexicon that associates multiple surface forms to a resource and interconnects multiple resources to an ambiguous label. Labels of the DBpedia resources are created from Wikipedia page titles, which can be seen as community-approved surface forms. Redirects to URIs indicate synonyms or alternative surface forms, including common misspellings and acronyms. Their labels also become surface forms. Disambiguations provide ambiguous surface forms that are “confusable” with all resources they link to. Their labels become surface forms for all target resources in the disambiguation page. Note that we erase trailing parentheses from the labels when constructing surface forms. For example, the label ‘Copyright (band)’ produces the surface form ‘Copyright’. This means that labels of resources and of redirects can also introduce ambiguous surface forms, in addition to the labels coming from titles of disambiguation pages. The collection of surface forms created as a result constitutes a controlled set of commonly used labels for the target resources.
Another source of textual references to DBpedia resources are wikilinks, i.e. the page links in Wikipedia that interconnect the articles. We pre-processed Wikipedia articles, extracting every wikilink l = (s, r) with surface form s as anchor text and resource r as link target, along with the paragraph representing the context of that wikilink occurrence. Each wikilink was stored as an evidence of occurrence o = (r, s, C). Each occurrence o records the fact that the DBpedia resource r represented by the link target has been mentioned in the context of the paragraph through the use of the surface form s. Before storage, the context paragraph was tokenized, stopworded and stemmed, generating a vector of terms W = (w1, ..., wn). The collection of occurrences for each resource was then stored as a document in a Lucene index for retrieval in the disambiguation stage.
Wikilinks can also be used to estimate the likelihood of a surface form s referring to a specific candidate resource r ∈ Rs. We consider each wikilink as evidence that the anchor text is a commonly used surface form for the DBpedia resource represented by the link target. By counting the number of times a surface form s occurred with a specific resource r, n(s, r), and overall, n(s), we can empirically estimate the prior probability of seeing a resource r given that the surface form s was used: P(r|s) = n(s, r)/n(s).
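The counting behind this prior can be sketched in a few lines. This is a toy illustration with made-up wikilinks, not the actual extraction pipeline:

```python
from collections import Counter

def estimate_priors(wikilinks):
    """Estimate the prior P(r|s) = n(s, r) / n(s) from wikilink evidence.

    `wikilinks` is an iterable of (surface_form, resource) pairs, one per
    wikilink occurrence extracted from Wikipedia.
    """
    wikilinks = list(wikilinks)
    pair_counts = Counter(wikilinks)                # n(s, r)
    form_counts = Counter(s for s, _ in wikilinks)  # n(s)
    return {(s, r): c / form_counts[s] for (s, r), c in pair_counts.items()}

# Toy evidence: 'Washington' links twice to the city, once to the person.
priors = estimate_priors([
    ("Washington", "dbpedia:Washington,_D.C."),
    ("Washington", "dbpedia:Washington,_D.C."),
    ("Washington", "dbpedia:George_Washington"),
])
```

On this toy evidence the estimated prior for the city is 2/3 and for the person 1/3, mirroring the relative link frequencies.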
2.2 Spotting Algorithm
We use the extended set of labels in the lexicalization dataset to create a lexicon for spotting. The implementation used was the LingPipe Exact Dictionary-Based Chunker [2], which relies on the Aho-Corasick string matching algorithm [1] with longest case-insensitive match.

Since for many use cases it is unnecessary to annotate common words, a configuration flag can instruct the system to disregard in this stage any spots that are only composed of verbs, adjectives, adverbs and prepositions. The part-of-speech tagger used was the LingPipe implementation based on Hidden Markov Models.
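The actual spotter is LingPipe's Java chunker; the Python sketch below only illustrates the longest case-insensitive match behavior on a toy lexicon. It is not the Aho-Corasick automaton itself and, for brevity, does not enforce token boundaries:

```python
def spot(text, lexicon):
    """Greedy longest case-insensitive dictionary matching: a toy stand-in
    for the LingPipe Exact Dictionary-Based Chunker used in the system."""
    lower = text.lower()
    # Try longer surface forms first so the longest match wins.
    forms = sorted((s.lower() for s in lexicon), key=len, reverse=True)
    spots, i = [], 0
    while i < len(text):
        for form in forms:
            if lower.startswith(form, i):
                spots.append((i, text[i:i + len(form)]))
                i += len(form)
                break
        else:
            i += 1
    return spots

spots = spot("First Lady Michelle Obama visited Washington.",
             {"Michelle Obama", "Washington", "Obama"})
```

Note how 'Michelle Obama' is preferred over the shorter 'Obama' at the same offset, which is the longest-match policy described above.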
2.3 Candidate Selection
We follow the spotting with a candidate selection stage in order to map resource names to candidate disambiguations (e.g. Washington as reference to a city, to a person or to a state). We use the DBpedia Lexicalization dataset for determining candidate disambiguations for each surface form.

The candidate selection offers a chance to narrow down the space of disambiguation possibilities. Selecting fewer candidates can increase time performance, but it may reduce recall if performed too aggressively. Due to our generality and flexibility requirements, we decided to employ minimal pre-filtering and postpone the selection to a user-configured post-disambiguation configuration stage. Other approaches for candidate selection are within our plans for future work.
The candidate selection phase can also be viewed as a way to pre-rank the candidates for disambiguation before observing a surface form in the context of a paragraph. Choosing the DBpedia resource with the highest prior probability for a surface form is the equivalent of selecting the “default sense” of some phrase according to its usage in Wikipedia. The prior probability scores of the lexicalizations dataset, for example, can be utilized at this point. We report the results for this approach as a baseline in Section 4.
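Given priors of the kind estimated in Section 2.1, the default-sense pre-ranking amounts to an arg max. The probabilities below are made-up values for illustration:

```python
def default_sense(surface_form, priors):
    """Pick arg max over r in R_s of P(r|s): the most common Wikipedia
    target for a surface form, ignoring context (the 'default sense')."""
    candidates = {r: p for (s, r), p in priors.items() if s == surface_form}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Illustrative priors (not real counts):
priors = {
    ("Washington", "dbpedia:Washington,_D.C."): 0.55,
    ("Washington", "dbpedia:George_Washington"): 0.30,
    ("Washington", "dbpedia:Washington_(U.S._state)"): 0.15,
}
best = default_sense("Washington", priors)
```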
2.4 Disambiguation
After selecting candidate resources for each surface form, our system uses the context around the surface forms, e.g. paragraphs, as information to find the most likely disambiguations.
We modeled DBpedia resource occurrences in a Vector Space Model (VSM) [22] where each DBpedia resource is a point in a multidimensional space of words. In light of the most common use of VSMs in Information Retrieval (IR), our representation of a DBpedia resource is analogous to a document containing the aggregation of all paragraphs mentioning that concept in Wikipedia. Similarly, the TF (Term Frequency) weight is commonly used in IR to measure the local relevance of a term in a document. In our model, TF represents the relevance of a word for a given resource. In addition, the Inverse Document Frequency (IDF) weight [16] represents the general importance of the word in the collection of DBpedia resources.
Albeit successful for document retrieval, the IDF weight fails to adequately capture the importance of a word for disambiguation. For the sake of illustration, suppose that the term ‘U.S.A.’ occurs in only 3 concepts in a collection of 1 million concepts. Its IDF will be very high, as its document frequency is very low (3/1,000,000). Now suppose that the three concepts with which it occurs are dbpedia:Washington,_D.C., dbpedia:George_Washington, and dbpedia:Washington_(U.S._State). As it turns out, despite the high IDF weight, the word ‘U.S.A.’ would be of little value to disambiguate the surface form ‘Washington’, as all three potential disambiguations would be associated with that word. IDF gives an insight into the global importance of a word (given all resources), but fails to capture the importance of a word for a specific set of candidate resources.
In order to weigh words based on their ability to distinguish between candidates for a given surface form, we introduce the Inverse Candidate Frequency (ICF) weight. The intuition behind ICF is that the discriminative power of a word is inversely proportional to the number of DBpedia resources it is associated with. Let Rs be the set of candidate resources for a surface form s, and let n(wj) be the total number of resources in Rs that are associated with the word wj. Then we define:

ICF(wj) = log(|Rs| / n(wj)) = log |Rs| − log n(wj)    (1)
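Equation 1 translates directly into code; note that a word associated with every candidate (like ‘U.S.A.’ in the example above) gets weight zero:

```python
import math

def icf(num_candidates, word_candidate_freq):
    """Inverse Candidate Frequency (Eq. 1): log(|Rs|) - log(n(wj))."""
    return math.log(num_candidates) - math.log(word_candidate_freq)

# 'U.S.A.' co-occurs with all 3 'Washington' candidates: zero discriminative power.
useless = icf(3, 3)
# A word seen with only 1 of the 3 candidates is maximally discriminative here.
useful = icf(3, 1)
```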
The theoretical explanation for ICF is analogous to that of Deng et al. [9], based on Information Theory. Entropy [23] has been commonly used to measure uncertainty in probability distributions. It is argued that the discriminative ability of a context word should be inversely proportional to the entropy, i.e. a word commonly co-occurring with many resources is less discriminative overall. With regard to a word’s association with DBpedia resources, the entropy of a word can be defined as E(w) = −Σi∈Rs P(ri|w) log P(ri|w). Supposing that the word w is connected to those resources with equal probability P(r|w) = 1/n(w), the maximum entropy becomes E(w) = log n(w). Since generally the entropy tends to be proportional to the frequency n(w), we use the maximum entropy to approximate the exact entropy in the ICF formula. This approximation has worked well in our case, simplifying the calculations and reducing storage and search time requirements.
Given the VSM representation of DBpedia resources with TF*ICF weights, the disambiguation task can be cast as a ranking problem where the objective is to rank the correct DBpedia resource at position 1. Our approach is to rank candidate resources according to the similarity score between their context vectors and the context surrounding the surface form. In this work we use cosine as the similarity measure.
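The ranking step can be sketched with sparse vectors as dictionaries. The candidate vectors below carry illustrative weights, not real TF*ICF values from the Lucene index:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts word -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def disambiguate(context_vec, candidate_vecs):
    """Rank candidate resources by cosine similarity to the mention context
    and return the top-ranked one (the ranking formulation above)."""
    return max(candidate_vecs,
               key=lambda r: cosine(context_vec, candidate_vecs[r]))

# Toy TF*ICF vectors for two 'Washington' candidates:
candidates = {
    "dbpedia:George_Washington": {"president": 2.1, "army": 1.3},
    "dbpedia:Washington,_D.C.": {"city": 1.8, "capital": 2.0},
}
context = {"capital": 1.0, "city": 0.5}
best = disambiguate(context, candidates)
```

A context mentioning ‘capital’ and ‘city’ shares no terms with the George Washington vector, so the city wins the ranking.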
2.5 Configuration
Many of the current approaches for annotation tune their parameters to a specific task, leaving little flexibility for users to adapt their solution to other use cases. Our approach is to generate a number of metrics to inform the users and let them decide on the policy that best fits their needs. In order to decide whether to annotate a given resource, there are several aspects to consider: can this resource be confused easily with another one in the given context? Is this a commonly mentioned resource in general? Was the disambiguation decision made with high confidence? Is the resource of the desired type? Is the resource in a complex relationship within the knowledge base that rules it out for annotation? The offered configuration parameters are described next.
Resource Set to Annotate. The use of DBpedia resources as targets for annotation enables interesting flexibility. The simplest and probably most widely used case is to annotate only resources of a certain type or set of types. In our case the available types are derived from the class hierarchy provided by the DBpedia Ontology. Users can provide whitelists (allowed) or blacklists (forbidden) of URIs for annotation. Whitelisting a class will allow the annotation of all direct instances of that class, as well as all instances of subclasses. Support for SPARQL queries allows even more flexibility by enabling the specification of arbitrary graph patterns. There is no restriction to the complexity of relationships that a resource must fulfil in this configuration step. For instance, the user could choose to only annotate concepts that are related to a specific geographic area, a time period in history, or that are closely connected within the Wikipedia category system.
Resource Prominence. For many applications, the annotation of rare or exotic resources is not desirable. For example, the Saxon_genitive (’s) is very commonly found in English texts to indicate possession (e.g. Austria’s mountains are beautiful), but it can be argued that for many use cases its annotation is rather uninformative. An indicator for that is that it has only seven Wikipedia inlinks. With the support parameter, users can specify the minimum number of inlinks a DBpedia resource has to have in order to be annotated.

Figure 1: DBpedia Spotlight Web Application.
Topic Pertinence. The topical relevance of the annotated resource for the given context can be measured by the similarity score returned by the disambiguation step. The score is higher for paragraphs that match more closely the recorded observations for a DBpedia resource. In order to constrain annotations to topically related resources, a higher threshold for the topic pertinence can be set.
Contextual Ambiguity. If more than one candidate resource has high topical pertinence to a paragraph, it may be harder to disambiguate between those resources because they remain partly ambiguous in that context. The difference in the topical relevance of two candidate resources to a paragraph gives us an insight on how “confused” the disambiguation step was in choosing between these resources. The score is computed as the relative difference in topic score between the first- and the second-ranked resource. Applications that require high precision may decide to reduce risks by not annotating resources when the contextual ambiguity is high.
Disambiguation Confidence. We define a confidence parameter, ranging from 0 to 1, for the annotations performed by DBpedia Spotlight. This parameter takes into account factors such as the topical pertinence and the contextual ambiguity. Setting a high confidence threshold instructs DBpedia Spotlight to avoid incorrect annotations as much as possible at the risk of losing some correct ones. We estimated this parameter on a development set of 100,000 Wikipedia samples. The rationale is that a confidence value of 0.7 should eliminate 70% of incorrectly disambiguated test cases: given a confidence of 0.7, we take the topical pertinence threshold below which 70% of the wrong test samples fall. We integrate that with the contextual ambiguity score by requiring a low ambiguity when the confidence is high. A confidence of 0.7, therefore, will only annotate resources if the contextual ambiguity is less than (1 − confidence) = 0.3. We address the adequacy of this parameter in our evaluation.
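The filtering policy described above can be sketched as a simple predicate. The threshold values are illustrative; in the system they are estimated from the 100,000-sample development set:

```python
def accept_annotation(confidence, topic_score, ambiguity, pertinence_threshold):
    """Keep an annotation only if its topical pertinence clears the
    threshold tied to the requested confidence, and its contextual
    ambiguity stays below (1 - confidence)."""
    return topic_score >= pertinence_threshold and ambiguity < (1.0 - confidence)

# With confidence 0.7, the contextual ambiguity must stay below 0.3:
keep = accept_annotation(0.7, topic_score=0.8, ambiguity=0.2,
                         pertinence_threshold=0.5)
drop = accept_annotation(0.7, topic_score=0.8, ambiguity=0.4,
                         pertinence_threshold=0.5)
```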
3. USING DBPEDIA SPOTLIGHT

DBpedia Spotlight is available both as a Web Service and via a Web Application. In addition, we have published the lexicalization dataset in RDF so that the community can benefit from the collected surface forms and the DBpedia resources representing their possible meanings.
3.1 Web Application
By using the Web application, users can test and visualize the results of the different service functions. The interface allows users to configure confidence, support, and to select the classes of interest from the DBpedia ontology. Text can be entered in a text box and, at the user’s request, DBpedia Spotlight will highlight the surface forms and create associations with their corresponding DBpedia resources. Figure 1 shows an example of a news article snippet after being annotated by our system. In addition to Annotate, we offer a Disambiguate operation where users can request the disambiguation of selected phrases (enclosed in double square brackets). In this case, our system bypasses the spotting stage and annotates only the selected phrases with DBpedia resources. This function is useful for user interfaces that allow users to mouse-select text, as well as for the easy incorporation of our disambiguation step into third-party applications that already perform spotting.
3.2 Web Service
In order to facilitate the integration of DBpedia Spotlight into external web processes, we implemented RESTful and SOAP web services for the annotation and disambiguation processes. The web service interface allows access to both the Annotate and the Disambiguate operations and to all the configuration parameters of our approach. Thus, in addition to confidence, support and DBpedia classes, we accept SPARQL queries for the DBpedia knowledge base to select the set of resources that are going to be used when annotating. These web services return HTML, XML, JSON or XHTML+RDFa documents where each DBpedia resource identified in the text is related to the text chunk where it was found. The XML fragment presented below shows part of the annotation of the news snippet shown in Figure 1.
<Annotation text="Pop star Michael Jackson..."
            confidence="0.3" support="30">
  <Resource URI="dbpedia:Michael_Jackson"
            surfaceForm="Michael Jackson" offset="9"
            similarityScore="0.31504717469215393" />
</Annotation>

Figure 2: Example XML fragment resulting from the annotation service.
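A client consuming the XML response can extract the annotated resources with a few lines of standard-library code. The sketch below restores the closing tags that the abbreviated fragment in Figure 2 omits:

```python
import xml.etree.ElementTree as ET

# The (completed) service response from Figure 2:
response = """<Annotation text="Pop star Michael Jackson..."
            confidence="0.3" support="30">
  <Resource URI="dbpedia:Michael_Jackson"
            surfaceForm="Michael Jackson" offset="9"
            similarityScore="0.31504717469215393" />
</Annotation>"""

root = ET.fromstring(response)
# One entry per annotated resource, keyed by the attributes shown above.
annotations = [
    {
        "uri": r.get("URI"),
        "surface_form": r.get("surfaceForm"),
        "offset": int(r.get("offset")),
        "score": float(r.get("similarityScore")),
    }
    for r in root.iter("Resource")
]
```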
3.3 Lexicalization dataset
Besides the DBpedia Spotlight system, the data produced in this work is also shared in a format that eases its consumption in a number of use cases. The dataset described in Section 2.1 was encoded in RDF using the Lexvo vocabulary [8] and is provided for download as a DBpedia dataset. We use the property lexvo:label rather than rdfs:label or skos:altLabel to associate a resource with surface form strings. The rdfs:label property intends to represent “a human-readable version of a resource’s name”. The SKOS Vocabulary “enables a distinction to be made between the preferred, alternative and ‘hidden’ lexical labels” through skos:prefLabel and skos:altLabel. The DBpedia Spotlight dataset does not claim that a surface form is the name of a resource, and neither intends to assert preference between labels. Hence, we use lexvo:label in order to describe the resource-surface form association with regard to actual language use. Association scores (e.g. prior probabilities) are attached to lexvo:label relationships through named graphs.
Users interested in finding names, alternative or preferred labels can use the provided information in order to make an informed task-specific choice. Imagine a user attempting to find DBpedia URIs for presidents and colleges in his company’s legacy database. The table called President contains two columns: last name and alma mater. Users may use a SPARQL query, for example, to select the default sense for the surface form ‘Bush’, given that it is known to have a relationship with the surface form ‘Harvard Business’. (The SPARQL formulation for this query and other examples are available from the project page.) The lexicalizations dataset will provide links between alternative spellings (e.g. ‘Bush’ and dbpedia:George_W._Bush) and the knowledge base (DBpedia) will provide the background knowledge connecting the resource dbpedia:George_W._Bush to his alma mater dbpedia:Harvard_Business_School. The association scores will help to rank the most likely of the candidates in this context.

The dataset can also be used to get information about the strength of association between a surface form and a resource, term ambiguity or the default sense of a surface form, to cite a few use cases.
4. EVALUATION

We carried out two evaluations of DBpedia Spotlight. A large-scale automatic evaluation tested the performance of the disambiguation component in choosing the correct candidate resources for a given surface form. In order to provide an evaluation of the whole system in a specific annotation scenario, we also carried out an experiment using a manually annotated test corpus. In that evaluation we compare our results with those of several publicly available annotation systems.
4.1 Disambiguation Evaluation
Wikipedia provides a wealth of annotated data that can be used to evaluate our system on a large scale. We randomly selected 155,000 wikilink samples and set them aside as test data. In order to really capture the ability of our system to distinguish between multiple senses of a surface form, we made sure that all these instances have ambiguous surface forms. We used the remainder of the samples collected from Wikipedia (about 69 million) as DBpedia resource occurrences providing context for disambiguation as described in Section 2.
In this evaluation, we were interested in the performance of the disambiguation stage. A spotted surface form, taken from the anchor text of a wikilink, is given to the disambiguation function along with the paragraph that it was mentioned in. The task of the disambiguation service is to select candidate resources for this surface form and decide between them based on the context.
In order to better assess the contribution of our approach, we included three baseline methods:

Random Baseline performs candidate selection and picks one of the candidates with uniform probability. This baseline serves as a control for easy disambiguations, since for low-ambiguity terms, even random choice should perform reasonably.

Default Sense Baseline performs candidate selection and chooses the candidate with the highest prior probability (without using the context). More formally: arg max over r ∈ Rs of P(r|s). This baseline helps to assess how common were the DBpedia resources included in the annotation dataset.

Default Similarity uses TF*IDF term scoring as a reference to evaluate the influence of our TF*ICF approach.
4.1.1 Results
The results for the baselines and DBpedia Spotlight are presented in Table 1. The performance of the baseline that makes random disambiguation choices confirms the high ambiguity in our dataset (less than 1/4 of the disambiguations were correct at random). Using the prior probability to choose the default sense performs reasonably well, being accurate in 55.12% of the disambiguations. This is an indication that our evaluation set was composed of a good balance of common DBpedia resources and less prominent ones. The use of context for disambiguation through the default scoring of TF*IDF obtained 55.91%, while the TF*ICF score introduced in this work improved the results to 73.39%. (In our implementation, for convenience, the candidate selection can be called from the disambiguation function.)

Disambiguation Approach            Accuracy
Baseline: Random                   17.77%
Baseline: Default Sense            55.12%
Baseline: TF*IDF                   55.91%
DBpedia Spotlight: TF*ICF          73.39%
DBpedia Spotlight: Mixed           80.52%

Table 1: Accuracies for each of the approaches tested in the disambiguation evaluation.
The performance of TF*ICF is an encouraging indication
that a simple ranking-based disambiguation algorithm can
be successful if enough contextual evidence is provided.
We also attempted a simple combination of the prior (default sense) and TF*ICF scores, which we called DBpedia Spotlight Mixed. The mixing weights were estimated through a small experiment using linear regression over held-out training data. The results reported in this work used mixed scores computed through the formula:

Mixed(r, s, C) = 1234.3989 · P(r|s) + 0.9968 · contextualScore(r, s, C) − 0.0275    (2)

The prior probability P(r|s) was calculated as described in Section 2.1. The contextual score used was the cosine similarity of term vectors weighted by TF*ICF as described in Section 2.4. Further research is needed to carefully examine the contribution of each component to the final score.
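The mixed score of Eq. 2 is a one-line linear combination. The coefficients are the regression weights reported above; the sign of the constant term follows our reading of the (garbled) original equation:

```python
def mixed_score(prior, contextual_score):
    """Linear combination of the prior P(r|s) and the TF*ICF contextual
    score (Eq. 2), with coefficients fitted by linear regression."""
    return 1234.3989 * prior + 0.9968 * contextual_score - 0.0275
```

The large weight on the prior reflects its much smaller scale: P(r|s) is typically tiny compared to a cosine similarity, so the regression compensates.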
4.2 Annotation Evaluation
Although we used an unseen dataset for evaluating our disambiguation approach, it is possible that the type of discourse and the annotation style of Wikipedia would bias the results in favor of systems trained with that kind of data. The Wikipedia guidelines for link creation focus on non-obvious references. If a wikilink would not contribute to the understanding of a specific article, the Wikipedia Manual of Style discourages its creation. Therefore, we created a manual evaluation dataset from a news corpus in order to complement that evaluation. In this second evaluation, we would like to assess completeness of linking as well. We created an annotation scenario in which the annotators were asked to add links to DBpedia resources for all phrases that would add information to the provided text.
Our test corpus consisted of 30 randomly selected paragraphs from New York Times documents from 10 different categories. In order to construct a gold standard, each evaluator first independently annotated the corpus, after which they met and agreed upon the ground truth evaluation choices. The ratio of annotated to non-annotated tokens was 33%. This corpus is available for download on the project homepage.
We compared our results on this test corpus with the per-
formance of publicly available annotation services: Open-
Calais6, Zemanta7, Ontos Semantic API8, The Wiki Ma-
chine9, Alchemy API10 and M&W’s wikifier [20]. Linking
to DBpedia is supported in those services in different lev-
els. Alchemy API provides links to DBpedia and Freebase
among other sources. Open Calais and Ontos provide some
limited linkage between their private identifiers and DBpe-
dia resources. As of the time of writing, Ontos only links
people and companies to DBpedia. For the cases where the
systems were able to extract resources but do not give DB-
pedia URIs, we used a simple transformation on the ex-
tracted resources that constructed DBpedia URIs from la-
bels - e.g. ‘apple’ becomes dbpedia:Apple. We report re-
sults with and without this transformation. The results that
used the transformation are labeled Ontos+Naïve and Open Calais+Naïve. The service APIs of Zemanta, The Wiki Machine and M&W do not explicitly return DBpedia URIs, but the URIs can be inferred from the Wikipedia links that they return.

Figure 3: DBpedia Spotlight with different configurations (lines) in comparison with other systems (points).
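The naïve label-to-URI transformation described above can be sketched as follows; this is an illustrative guess at the rules (capitalize the first letter, replace spaces with underscores, following DBpedia resource naming), not the exact code used in the evaluation:

```python
def naive_dbpedia_uri(label):
    """Naively construct a DBpedia resource URI from a surface label:
    capitalize the first letter and replace spaces with underscores.
    No disambiguation is performed, so ambiguous labels (e.g. 'apple')
    may map to the wrong resource."""
    label = label.strip()
    if not label:
        raise ValueError("empty label")
    name = (label[0].upper() + label[1:]).replace(" ", "_")
    return "http://dbpedia.org/resource/" + name

print(naive_dbpedia_uri("apple"))  # http://dbpedia.org/resource/Apple
```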
4.2.1 Results
Retrieval as well as classification tasks exhibit an inherent
precision-recall trade-off [5]. The configuration of DBpedia
Spotlight allows users to customize the level of annotation
to their specific application needs. Figure 3 shows the evalu-
ation results. Each point in the plot represents the precision
(vertical axis) and recall (horizontal axis) of each evaluation run.

System                                  F1
DBpedia Spotlight (best configuration)  56.0%
DBpedia Spotlight (no configuration)    45.2%
The Wiki Machine                        59.5%
Zemanta                                 39.1%
Open Calais+Naïve                       16.7%
Alchemy                                 14.7%
Ontos+Naïve                             10.6%
Open Calais                              6.7%
Ontos                                    1.5%

Table 2: F1 scores for each of the approaches tested in the annotation evaluation.

The lines show the trade-off between precision and recall as we vary the confidence and support parameters in our
service. Each line represents one value of support (varying
from 0 to 1000). Each point in the line is a value of confi-
dence (0.1 to 0.9) for the corresponding support. It can be
observed that higher confidence values (with higher support)
produce higher precision at the cost of some recall and vice
versa. This is an encouraging indication that our parameters achieve their objectives.
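The effect of the two parameters can be pictured as a simple post-filter over candidate annotations; the dictionary fields below are hypothetical and do not reflect the service's actual response schema:

```python
def filter_annotations(annotations, confidence=0.0, support=0):
    """Keep only annotations whose disambiguation confidence and whose
    resource prominence (support) meet the configured thresholds.
    Raising either threshold trades recall for precision."""
    return [a for a in annotations
            if a["confidence"] >= confidence and a["support"] >= support]

# Invented candidates for illustration.
candidates = [
    {"uri": "dbpedia:Apple_Inc.", "confidence": 0.9, "support": 5000},
    {"uri": "dbpedia:Apple",      "confidence": 0.3, "support": 8000},
    {"uri": "dbpedia:Berlin",     "confidence": 0.7, "support": 120},
]

strict = filter_annotations(candidates, confidence=0.6, support=1000)
lenient = filter_annotations(candidates, confidence=0.1, support=0)
```

A strict configuration keeps only high-confidence annotations of prominent resources, while a lenient one keeps everything, mirroring the curves in Figure 3.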
The shape of the displayed graph shows that the per-
formance of DBpedia Spotlight is in a competitive range.
Most annotation services lie beneath the F1-score of our system for every confidence value. Table 2 shows the best F1-scores of each approach. The best F1-score of DBpedia
Spotlight was reached with a confidence value of 0.6. The Wiki Machine has the highest F1-score, but tends to over-annotate the articles, which results in a high recall at the cost of low precision. Meanwhile, Zemanta dominates in
precision, but has low recall. With different confidence and
support parameters, DBpedia Spotlight is able to approximate the results of both the Wiki Machine and Zemanta, while
offering many other configurations with different precision-
recall trade-offs in between.
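Since Table 2 reports F1, the harmonic mean of precision and recall, lopsided systems are penalized relative to balanced ones; a quick sketch with invented operating points:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical operating points: an over-annotator (high recall, low
# precision) and a conservative annotator (high precision, low recall).
over_annotator = f1(0.45, 0.90)
conservative = f1(0.90, 0.25)
```

Note how the conservative system's strong precision cannot compensate for its low recall under the harmonic mean.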
5 Related Work

Many existing approaches for entity annotation have fo-
cused on annotating salient entity references, commonly only
entities of specific types (Person, Organization, Location)
[14, 21, 24, 12] or entities that are in the subject of sen-
tences [11]. Hassell et al. [14] exploit the structure of a call
for papers corpus for relation extraction and later disam-
biguation of academic researchers. Rowe [21] concentrates
on disambiguating person names with social graphs, while
Volz et al. [24] present a disambiguation algorithm for the
geographic domain that is based on popularity scores and
textual patterns. Gruhl et al. [12] also constrain their anno-
tation efforts to cultural entities in a specific domain. Our
objective is to be able to annotate any entities in DBpedia.
Other approaches have attempted the non-type-specific
annotation of entities. However, several optimize their ap-
proaches for precision, leaving little flexibility for users with
use cases where recall is important, or they have not evalu-
ated the applicability of their approaches with more general
use cases [10, 6, 7, 19].
SemTag [10] was the first Web-scale named entity disam-
biguation system. They used metadata associated with each
entity in an entity catalog derived from TAP [13] as context
for disambiguation. SemTag specialized in precision at the
cost of recall, producing an average of less than two anno-
tations per page.
Bunescu and Pasca [6], Cucerzan [7], Mihalcea and Csomai (Wikify!) [19] and Milne and Witten (M&W) [20], like
us, also used text from Wikipedia in order to learn how to
annotate. Bunescu and Pasca only evaluate articles under
the “people by occupation” category, while Cucerzan’s and
Wikify!’s conservative spotting only annotate 4.5% and 6%
of all tokens in the input text, respectively. In Wikify!, this
spotting yields surface forms with low ambiguity for which
even a random disambiguator achieves an F1 score of 0.6.
Fader et al. [11] choose the candidate with the highest
prior probability unless the contextual evidence is higher
than a threshold. In their dataset 27.94% of the surface
forms are unambiguous and 46.53% of the ambiguous ones
can be correctly disambiguated by just choosing the default
sense (according to our index).
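A simplified sketch of such a default-sense strategy (illustrative only, with invented scores; not Fader et al.'s actual system):

```python
def disambiguate(candidates, threshold=0.5):
    """Pick the candidate with the strongest contextual evidence if it
    exceeds the threshold; otherwise fall back to the candidate with
    the highest prior probability (the default sense).

    `candidates` maps URI -> (prior, contextual) scores."""
    best_ctx = max(candidates, key=lambda u: candidates[u][1])
    if candidates[best_ctx][1] >= threshold:
        return best_ctx
    return max(candidates, key=lambda u: candidates[u][0])

# Hypothetical candidates for the surface form "Washington".
surface_form = {
    "dbpedia:Washington,_D.C.":  (0.60, 0.20),
    "dbpedia:George_Washington": (0.30, 0.75),
}
```

With a low threshold the contextual evidence wins; with a high threshold the default sense is chosen instead.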
Kulkarni et al. [17] attempt the joint optimization of all
spotted surface forms in order to realize the collective anno-
tation of entities. The inference problem formulated by the
authors is NP-hard, leading to their proposition of a Linear Programming and a hill-climbing approach for optimization.
We propose instead a simple, inexpensive approach that can
be easily configured and adapted to task-specific needs, facilitated by the DBpedia Ontology and configuration parameters.

6 Conclusion

In this paper we presented DBpedia Spotlight, a tool to
detect mentions of DBpedia resources in text. It enables
users to link text documents to the Linked Open Data cloud
through the DBpedia interlinking hub. The annotations pro-
vided by DBpedia Spotlight enable the enrichment of web-
sites with background knowledge, faceted browsing in text
documents and enhanced search capabilities. The main ad-
vantage of our system is its comprehensiveness and flexibil-
ity, allowing one to configure it based on the DBpedia on-
tology, as well as prominence, contextual ambiguity, topical
pertinence and confidence scores. The resources that should
be annotated can be specified by a list of resource types or
by more complex relationships within the knowledge base.
We compared our system with other publicly available services and showed that it remains competitive while offering a more configurable approach. In the future, we plan to
incorporate more knowledge from the Linked Open Data
cloud in order to enhance the annotation algorithm.
A project page with news, documentation, downloads,
demonstrations and other information is available at http:
Acknowledgments

The development of DBpedia Spotlight was supported by
the European Commission through the project LOD2 – Cre-
ating Knowledge out of Linked Data and by Neofonie GmbH,
a Berlin-based company offering technologies in the area of
Web search, social media and mobile applications.
Thanks to Andreas Schultz and Paul Kreis for their help
with setting up our servers and evaluation, and to Joachim
Daiber for his contributions in dataset creation, evaluation
clients and preprocessing code that was partially utilized in
the finalizing stages of this work.
References

[1] A. V. Aho and M. J. Corasick. Efficient string
matching: an aid to bibliographic search. Commun.
ACM, 18:333–340, June 1975.
[2] Alias-i. LingPipe 4.0.0., retrieved on
24.08.2010, 2008.
[3] C. Bizer, T. Heath, and T. Berners-Lee. Linked data -
the story so far. Int. J. Semantic Web Inf. Syst.,
5(3):1–22, 2009.
[4] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer,
C. Becker, R. Cyganiak, and S. Hellmann. DBpedia -
A crystallization point for the Web of Data. Web
Semantics: Science, Services and Agents on the World
Wide Web, 7:154–165, September 2009.
[5] M. Buckland and F. Gey. The relationship between
Recall and Precision. J. Am. Soc. Inf. Sci.,
45(1):12–19, January 1994.
[6] R. C. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. In EACL, 2006.
[7] S. Cucerzan. Large-scale named entity disambiguation
based on wikipedia data. In EMNLP-CoNLL, pages
708–716, 2007.
[8] G. de Melo and G. Weikum. Language as a foundation
of the Semantic Web. In C. Bizer and A. Joshi,
editors, Proceedings of the Poster and Demonstration
Session at the 7th International Semantic Web
Conference (ISWC 2008), volume 401 of CEUR WS,
Karlsruhe, Germany, 2008. CEUR.
[9] H. Deng, I. King, and M. R. Lyu. Entropy-biased
models for query representation on the click graph. In
SIGIR ’09: Proceedings of the 32nd international
ACM SIGIR conference on Research and development
in information retrieval, pages 339–346, New York,
NY, USA, 2009. ACM.
[10] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha,
A. Jhingran, T. Kanungo, S. Rajagopalan,
A. Tomkins, J. A. Tomlin, and J. Y. Zien. Semtag and
seeker: bootstrapping the semantic web via automated
semantic annotation. In Proceedings of the 12th
international conference on World Wide Web, WWW
’03, pages 178–186, New York, NY, USA, 2003. ACM.
[11] A. Fader, S. Soderland, and O. Etzioni. Scaling
wikipedia-based named entity disambiguation to
arbitrary web text. In Proceedings of the WikiAI 09 -
IJCAI Workshop: User Contributed Knowledge and
Artificial Intelligence: An Evolving Synergy, Pasadena,
CA, USA, July 2009.
[12] D. Gruhl, M. Nagarajan, J. Pieper, C. Robson, and
A. P. Sheth. Context and domain knowledge enhanced
entity spotting in informal text. In International
Semantic Web Conference, pages 260–276, 2009.
[13] R. V. Guha and R. McCool. Tap: A semantic web
test-bed. J. Web Sem., 1(1):81–87, 2003.
[14] J. Hassell, B. Aleman-Meza, and I. Arpinar.
Ontology-driven automatic entity disambiguation in
unstructured text. In I. Cruz, S. Decker, D. Allemang,
C. Preist, D. Schwabe, P. Mika, M. Uschold, and
L. Aroyo, editors, The Semantic Web - ISWC 2006,
volume 4273 of Lecture Notes in Computer Science,
pages 44–57. Springer Berlin / Heidelberg, 2006.
[15] M. Hearst. UIs for Faceted Navigation: Recent
Advances and Remaining Open Problems. In
Workshop on Computer Interaction and Information
Retrieval, HCIR, Redmond, WA, Oct. 2008.
[16] K. S. Jones. A statistical interpretation of term
specificity and its application in retrieval. Journal of
Documentation, 28:11–21, 1972.
[17] S. Kulkarni, A. Singh, G. Ramakrishnan, and
S. Chakrabarti. Collective annotation of wikipedia
entities in web text. In Proceedings of the 15th ACM
SIGKDD international conference on Knowledge
discovery and data mining, KDD ’09, pages 457–466,
New York, NY, USA, 2009. ACM.
[18] P. N. Mendes, A. Passant, P. Kapanipathi, and A. P.
Sheth. Linked open social signals. In Web Intelligence
and Intelligent Agent Technology, 2010. WI-IAT ’10.
IEEE/WIC/ACM International Conference on, 2010.
[19] R. Mihalcea and A. Csomai. Wikify!: linking
documents to encyclopedic knowledge. In CIKM ’07:
Proceedings of the sixteenth ACM conference on
Conference on information and knowledge
management, pages 233–242, New York, NY, USA,
2007. ACM.
[20] D. Milne and I. H. Witten. Learning to link with
wikipedia. In Proceeding of the 17th ACM conference
on Information and knowledge management, CIKM
’08, pages 509–518, New York, NY, USA, 2008. ACM.
[21] M. Rowe. Applying semantic social graphs to
disambiguate identity references. In L. Aroyo,
P. Traverso, F. Ciravegna, P. Cimiano, T. Heath,
E. Hyvönen, R. Mizoguchi, E. Oren, M. Sabou, and
E. Simperl, editors, The Semantic Web: Research and
Applications, volume 5554 of Lecture Notes in
Computer Science, pages 461–475. Springer Berlin /
Heidelberg, 2009.
[22] G. Salton, A. Wong, and C. S. Yang. A vector space
model for automatic indexing. Communications of the
ACM, 18:613–620, November 1975.
[23] C. E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30:50–64, 1951.
[24] R. Volz, J. Kleb, and W. Mueller. Towards
ontology-based disambiguation of geographical
identifiers. In I3, 2007.
... It contains 15 manually annotated news articles on politics in 5 different languages. Other widely-used datasets for entity linking are available within the GERBIL platform (Röder et al., 2018): AQUAINT (Milne & Witten, 2008), ACE2004 (Ratinov et al., 2011), DBpedia Spotlight (Mendes et al., 2011), and others. ...
... However, there are a few exceptions. The SemEval 2015 Task 13 (Moro & Navigli, 2015) and DBpedia Spotlight (Mendes et al., 2011) datasets allow for nested entities. VoxEL (Rosales-Méndez et al., 2018) provides two versions of the dataset: strict and relaxed. ...
Full-text available
This paper describes NEREL—a Russian news dataset suited for three tasks: nested named entity recognition, relation extraction, and entity linking. Compared to flat entities, nested named entities provide a richer and more complete annotation while also increasing the coverage of relations annotation and entity linking. Relations between nested named entities may cross entity boundaries to connect to shorter entities nested within longer ones, which makes it harder to detect such relations. NEREL is currently the largest Russian dataset annotated with entities and relations: it comprises 29 named entity types and 49 relation types. At the time of writing, the dataset contains 56 K named entities and 39 K relations annotated in 933 person-oriented news articles. NEREL is annotated with relations at three levels: (1) within nested named entities, (2) within sentences, and (3) with relations crossing sentence boundaries. We provide benchmark evaluation of current state-of-the-art methods in all three tasks. The dataset is freely available at
... Entity linking is the task concerned with linking terms of a given text to appropriate entities extracted from knowledge bases; in other words, it gives an entity-based representation which suits the given text. There are many entity linking tools, among which DBpedia Spotlight [11], TagMe [12], REL [13], WAT [32], and FEL [33] are the most widely used. Most of these entity-linking tools are designed for general text annotation purposes. ...
... Therefore, query annotation is the critical factor in a purely entity-based retrieval system, which is why we consider only completely annotated queries in our experiments. We tested our retrieval approach by using two arbitrary entity linking methods for query annotation, including DBpedia Spotlight [11] and REL [13]. Moreover, the two Python APIs provided in Table 4 are the used implementations corresponding to each of these two entity linking methods. ...
Full-text available
Over the past decade, knowledge bases (KB) have been increasingly utilized to complete and enrich the representation of queries and documents in order to improve the document retrieval task. Although many approaches have used KB for such purposes, the problem of how to effectively leverage entity-based representation still needs to be resolved. This paper proposes a Purely Entity-based Semantic Search Approach for Information Retrieval (PESS4IR) as a novel solution. The approach includes (i) its own entity linking method and (ii) an inverted indexing method, and for document retrieval and ranking, (iii) an appropriate ranking method is designed to take advantage of all the strengths of the approach. We report the findings on the performance of our approach, which is tested by queries annotated by two known entity linking tools, REL and DBpedia-Spotlight. The experiments are performed on the standard TREC 2004 Robust and MSMARCO collections. By using the REL method on the Robust collection, for the queries whose terms are all annotated and whose average annotation scores are greater than or equal to 0.75, our approach achieves the maximum nDCG@5 score (1.00). Also, it is shown that using PESS4IR alongside another document retrieval method would improve performance, unless that method alone achieves the maximum nDCG@5 score for those highly annotated queries.
... Semantic Annotation (5): Tools for collaboratively annotating semantic data. Digital Library (6): Specific tools for the management and exploration of a collection of books. ...
... It is available as a demo web application, REST service, or downloadable package. [6] ( Figure 5) is a tool for automatically annotating mentions of DBpedia resources in text. It is available as a demo web application, REST service, or downloadable package. ...
Full-text available
In the era of big data, linked data interfaces play a critical role in enabling access to and management of large-scale, heterogeneous datasets. This survey investigates forty-seven interfaces developed by the semantic web community in the context of the Web of Linked Data, displaying information about general topics and digital library contents. The interfaces are classified based on their interaction paradigm, the type of information they display, and the complexity reduction strategies they employ. The main purpose to be addressed is the possibility of categorizing a great number of available tools so that comparison among them becomes feasible and valuable. The analysis reveals that most interfaces use a hybrid interaction paradigm combining browsing, searching, and displaying information in lists or tables. Complexity reduction strategies, such as faceted search and summary visualization, are also identified. Emerging trends in linked data interface focus on user-centric design and advancements in semantic annotation methods, leveraging machine learning techniques for data enrichment and retrieval. Additionally, an interactive platform is provided to explore and compare data on the analyzed tools. Overall, there is no one-size-fits-all solution for developing linked data interfaces and tailoring the interaction paradigm and complexity reduction strategies to specific user needs is essential.
... In fact, in order to identify relationships among texts, we must first identify the significant concepts in such texts to follow their properties in ontologies and establish the necessary connections, and one helpful tool for this is the DBpedia Spotlight (DB-SL). DB-SL analyzes raw text data to discover existing mentions of real DBpedia entities; that is, it translates simple text into structured data about specific DBpedia entities [57]. The user provides the text, and DB-SL examines it and finds some strings in it (known as surface form) with the goal of finding the associated mentions of existing DBpedia entities, which is a specific concept defined in the realm of the DBpedia dataset (referred usually to as named entity). ...
Full-text available
Cultural heritage is one of many fields that has seen a significant digital transformation in the form of digitization and asset annotations for heritage preservation, inheritance, and dissemination. However, a lack of accurate and descriptive metadata in this field has an impact on the usability and discoverability of digital content, affecting cultural heritage platform visitors and resulting in an unsatisfactory user experience as well as limiting processing capabilities to add new functionalities. Over time, cultural heritage institutions were responsible for providing metadata for their collection items with the help of professionals, which is expensive and requires significant effort and time. In this sense, crowdsourcing can play a significant role in digital transformation or massive data processing, which can be useful for leveraging the crowd and enriching the metadata quality of digital cultural content. This paper focuses on a very important challenge faced by cultural heritage crowdsourcing platforms, which is how to attract users and make such activities enjoyable for them in order to achieve higher-quality annotations. One way to address this is to offer personalized interesting items based on each user preference, rather than making the user experience random and demanding. Thus, we present an image annotation recommendation system for users of cultural heritage platforms. The recommendation system design incorporates various technologies intending to help users in selecting the best matching images for annotations based on their interests and characteristics. Different classification methods were implemented to validate the accuracy of our work on Egyptian heritage.
... If a matching Wikipedia article's title could be found, the term was included in the interest model; otherwise, it was removed. To connect keyphrases to concepts in the DBpedia knowledge base [28], we utilized DBpedia Spotlight [119] as an entity linking service. ...
Full-text available
The fast growth of data in the academic field has contributed to making recommendation systems for scientific papers more popular. Content-based filtering (CBF), a pivotal technique in recommender systems (RS), holds particular significance in the realm of scientific publication recommendations. In a content-based scientific publication RS, recommendations are composed by observing the features of users and papers. Content-based recommendation encompasses three primary steps, namely, item representation, user modeling, and recommendation generation. A crucial part of generating recommendations is the user modeling process. Nevertheless, this step is often neglected in existing content-based scientific publication RS. Moreover, most existing approaches do not capture the semantics of user models and papers. To address these limitations, in this paper we present a transparent Recommendation and Interest Modeling Application (RIMA), a content-based scientific publication RS that implicitly derives user interest models from their authored papers. To address the semantic issues, RIMA combines word embedding-based keyphrase extraction techniques with knowledge bases to generate semantically-enriched user interest models, and additionally leverages pretrained transformer sentence encoders to represent user models and papers and compute their similarities. The effectiveness of our approach was assessed through an offline evaluation by conducting extensive experiments on various datasets along with user study (N = 22), demonstrating that (a) combining SIFRank and SqueezeBERT as an embedding-based keyphrase extraction method with DBpedia as a knowledge base improved the quality of the user interest modeling step, and (b) using the msmarco-distilbert-base-tas-b sentence transformer model achieved better results in the recommendation generation step.
... In the literature, named entity disambiguation (NED) is commonly defined as the process of determining the precise meaning or sense of a named entity within a given context. These named entities can be identified by well-known named entity recognition tools like DBpedia-Spotlight (Mendes et al., 2011). More specifically, the goal of NED is to resolve ambiguity by associating the named entity with a specific concept within a semantic knowledge base. ...
Full-text available
Embedding models turn words/documents into real-number vectors via co-occurrence data from unrelated texts. Crafting domain-specific embeddings from general corpora with limited domain vocabulary is challenging. Existing solutions retrain models on small domain datasets, overlooking potential of gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec for autonomous in-domain corpus creation. Our experiments compare models from general and in-domain corpora, highlighting that domain-specific training attains the best outcome.
...  Dbpedia ontology is an open knowledge graph crowd-sourced from the web contents to create structured knowledge such as the Dbpedia-Person dataset (Mendes et al., 2011). ...
Full-text available
As the size and complexity of data continue to increase, the need for efficient and effective analysis methods becomes ever more crucial. Tensorization, the process of converting 2-dimensional datasets into multidimensional structures, has emerged as a promising approach for multiway analysis methods. This paper explores the steps involved in tensorization, multidimensional data sources, various multiway analysis methods employed, and the benefits of these approaches. A small example of Blind Source Separation (BSS) is presented comparing 2-dimensional algorithms and a multiway algorithm in Python. Results indicate that multiway analysis is more expressive. Additionally, tensorization techniques aid in compressing deep learning models by reducing the number of required parameters while enhancing the expression of relationships across dimensions. A survey of the multi-away analysis methods and integration with various Deep Neural Networks models is presented using case studies in different domains.
... Pubtator 5 [25] et al. [26]. CORD19-NEKG is an RDF dataset describing named entities in the CORD-19 dataset, which have been extracted using: i) the DBPedia Spotlight [27] named entity extraction tool, which uses DBPedia entities to annotate text automatically; ii) Entity-fishing 6, which uses Wikidata entities to annotate text automatically; and iii) the NCBO BioPortal Annotator [28], which annotates text automatically with user-selected ontologies and vocabularies. COVID-KG [29] is another KG based on the CORD-19 dataset. ...
Full-text available
Background This paper proposes Cyrus, a new transparency evaluation framework, for Open Knowledge Extraction (OKE) systems. Cyrus is based on the state-of-the-art transparency models and linked data quality assessment dimensions. It brings together a comprehensive view of transparency dimensions for OKE systems. The Cyrus framework is used to evaluate the transparency of three linked datasets, which are built from the same corpus by three state-of-the-art OKE systems. The evaluation is automatically performed using a combination of three state-of-the-art FAIRness (Findability, Accessibility, Interoperability, Reusability) assessment tools and a linked data quality evaluation framework, called Luzzu. This evaluation includes six Cyrus data transparency dimensions for which existing assessment tools could be identified. OKE systems extract structured knowledge from unstructured or semi-structured text in the form of linked data. These systems are fundamental components of advanced knowledge services. However, due to the lack of a transparency framework for OKE, most OKE systems are not transparent. This means that their processes and outcomes are not understandable and interpretable. A comprehensive framework sheds light on different aspects of transparency, allows comparison between the transparency of different systems by supporting the development of transparency scores, gives insight into the transparency weaknesses of the system, and ways to improve them. Automatic transparency evaluation helps with scalability and facilitates transparency assessment. The transparency problem has been identified as critical by the European Union Trustworthy Artificial Intelligence (AI) guidelines. In this paper, Cyrus provides the first comprehensive view of transparency dimensions for OKE systems by merging the perspectives of the FAccT (Fairness, Accountability, and Transparency), FAIR, and linked data quality research communities. 
Results In Cyrus, data transparency includes ten dimensions which are grouped in two categories. In this paper, six of these dimensions, i.e., provenance, interpretability, understandability, licensing, availability, interlinking have been evaluated automatically for three state-of-the-art OKE systems, using the state-of-the-art metrics and tools. Covid-on-the-Web is identified to have the highest mean transparency. Conclusions This is the first research to study the transparency of OKE systems that provides a comprehensive set of transparency dimensions spanning ethics, trustworthy AI, and data quality approaches to transparency. It also demonstrates how to perform automated transparency evaluation that combines existing FAIRness and linked data quality assessment tools for the first time. We show that state-of-the-art OKE systems vary in the transparency of the linked data generated and that these differences can be automatically quantified leading to potential applications in trustworthy AI, compliance, data protection, data governance, and future OKE system design and testing.
There is a recent trend for using the novel Artificial Intelligence ChatGPT chatbox, which provides detailed responses and articulate answers across many domains of knowledge. However, in many cases it returns plausible-sounding but incorrect or inaccurate responses, whereas it does not provide evidence. Therefore, any user has to further search for checking the accuracy of the answer or/and for finding more information about the entities of the response. At the same time there is a high proliferation of RDF Knowledge Graphs (KGs) over any real domain, that offer high quality structured data. For enabling the combination of ChatGPT and RDF KGs, we present a research prototype, called \(\texttt{GPT}{\bullet }\texttt{LODS} \), which is able to enrich any ChatGPT response with more information from hundreds of RDF KGs. In particular, it identifies and annotates each entity of the response with statistics and hyperlinks to LODsyndesis KG (which contains integrated data from 400 RDF KGs and over 412 million entities). In this way, it is feasible to enrich the content of entities and to perform fact checking and validation for the facts of the response at real time. URL: Demo Video:
Full-text available
Despite its size, Wikidata remains incomplete and inaccurate in many areas. Hundreds of thousands of articles on English Wikipedia have zero or limited meaningful structure on Wikidata. Much work has been done in the literature to partially or fully automate the process of completing knowledge graphs, but little of it has been practically applied to Wikidata. This paper presents two interconnected practical approaches to speeding up the Wikidata completion task. The first is Wwwyzzerdd, a browser extension that allows users to quickly import statements from Wikipedia to Wikidata. Wwwyzzerdd has been used to make over 100 thousand edits to Wikidata. The second is Psychiq, a new model for predicting instance and subclass statements based on English Wikipedia articles. Psychiq’s performance and characteristics make it well suited to solving a variety of problems for the Wikidata community. One initial use is integrating the Psychiq model into the Wwwyzzerdd browser extension.
Full-text available
The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions-the Web of Data. In this article we present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. We describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
Full-text available
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest scale semantic tagging effort to date.We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.
This paper explores the application of restricted relationship graphs (RDF) and statistical NLP techniques to improve named entity annotation in challenging Informal English domains. We validate our approach using on-line forums discussing popular music. Named entity annotation is particularly difficult in this domain because it is characterized by a large number of ambiguous entities, such as the Madonna album “Music” or Lilly Allen’s pop hit “Smile”. We evaluate improvements in annotation accuracy that can be obtained by restricting the set of possible entities using real-world constraints. We find that constrained domain entity extraction raises the annotation accuracy significantly, making an infeasible task practical. We then show that we can further improve annotation accuracy by over 50% by applying SVM based NLP systems trained on word-usages in this domain.
The term "Linked Data" refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions-the Web of Data. In this article, the authors present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
This paper investigates the "named-entity disambiguation" task on the Web: identifying the referent of a string found on an arbitrary Web page. The GROUNDER system, introduced in this paper, addresses two challenges not considered by previous work: how to utilize a priori information (e.g., Bill Clinton is more prominent on the Web than Clinton County) to improve disambiguation, and how to compose this prior information with contextual evidence. GROUNDER addresses both challenges by leveraging the user-contributed knowledge in Wikipedia and providing a novel formulation of the task. On a sample of strings drawn from the Web, GROUNDER achieves precision of 1.0 at recall 0.34, and precision 0.90 at recall 0.60.
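The combination of a prominence prior with contextual evidence that this abstract describes can be illustrated with a toy scorer. Both the additive scoring rule and the candidate data below are illustrative assumptions, not GROUNDER's actual model:

```python
def disambiguate(surface_form, context_terms, candidates):
    """Pick the candidate entity with the best combined score.

    candidates maps entity name -> (prominence prior, set of context words).
    The additive rule is a toy stand-in: the prior decides when the context
    is uninformative, while strong term overlap can override it.
    """
    def score(entity):
        prior, entity_context = candidates[entity]
        overlap = len(entity_context & set(context_terms))
        return prior + overlap
    return max(candidates, key=score)

# Hypothetical candidate senses for the surface form "Clinton".
candidates = {
    "Bill Clinton":   (0.9, {"president", "white", "house"}),
    "Clinton County": (0.1, {"ohio", "county", "census"}),
}

# With no contextual clues, the more prominent sense wins.
assert disambiguate("Clinton", [], candidates) == "Bill Clinton"
# Strong local context overrides the prominence prior.
assert disambiguate("Clinton", ["ohio", "county", "census"], candidates) == "Clinton County"
```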
The exhaustivity of document descriptions and the specificity of index terms are usually regarded as independent. It is suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning. The effects on retrieval of variations in term specificity are examined, experiments with three test collections showing in particular that frequently-occurring terms are required for good overall performance. It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms. Results for the test collections show that considerable improvements in performance are obtained with this very simple procedure.
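The collection-frequency weighting this abstract argues for is the idea behind inverse document frequency (IDF). A minimal sketch in Python (the corpus and the particular log formula are illustrative assumptions):

```python
import math
from collections import Counter

def idf_weights(documents):
    """Compute inverse document frequency for each term.

    Terms occurring in fewer documents receive higher weights, so matches
    on rare, more specific terms are of greater value than matches on
    frequent terms.
    """
    n_docs = len(documents)
    # Count in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc.split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

docs = [
    "berlin is the capital of germany",
    "madrid is the capital of spain",
    "dbpedia extracts structured data from wikipedia",
]
weights = idf_weights(docs)
# "capital" appears in 2 of 3 documents, "dbpedia" in only 1,
# so "dbpedia" receives the higher weight.
assert weights["dbpedia"] > weights["capital"]
```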
A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known. Results of experiments in prediction are given, and some properties of an ideal predictor are developed.
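Shannon's method relies on human predictors guessing the next letter; a much cruder zeroth-order estimate can be computed directly from letter frequencies. The sketch below is that simplification, not the prediction experiment itself:

```python
import math
from collections import Counter

def letter_entropy(text):
    """Zeroth-order entropy estimate in bits per letter:
    H = -sum(p(c) * log2(p(c))) over the sample's letter distribution."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

sample = "the quick brown fox jumps over the lazy dog"
h = letter_entropy(sample)
# A uniform distribution over 26 letters would give log2(26) ~ 4.70 bits;
# the gap between that bound and the estimate reflects redundancy.
assert 0 < h <= math.log2(26)
```

Higher-order estimates (conditioning on preceding letters) drive the entropy down further, which is what Shannon's prediction experiments capture.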
Faceted navigation is a proven technique for supporting exploration and discovery within an information collection. The underlying data model is simple enough to make navigation understandable while at the same time rich enough to make navigation flexible in a wide range of domains. Nonetheless, there remain issues in both the presentation of navigation options in the interface and in how to extend the model to allow more flexible discovery while still retaining understandability. This paper explores both of these issues.
The DBpedia project is a community effort to extract structured information from Wikipedia and to make this information accessible on the Web. The resulting DBpedia knowledge base currently describes over 2.6 million entities. For each of these entities, DBpedia defines a globally unique identifier that can be dereferenced over the Web into a rich RDF description of the entity, including human-readable definitions in 30 languages, relationships to other resources, classifications in four concept hierarchies, various facts as well as data-level links to other Web data sources describing the entity. Over the last year, an increasing number of data publishers have begun to set data-level links to DBpedia resources, making DBpedia a central interlinking hub for the emerging Web of Data. Currently, the Web of interlinked data sources around DBpedia provides approximately 4.7 billion pieces of information and covers domains such as geographic information, people, companies, films, music, genes, drugs, books, and scientific publications. This article describes the extraction of the DBpedia knowledge base, the current status of interlinking DBpedia with other data sources on the Web, and gives an overview of applications that facilitate the Web of Data around DBpedia.