DBpedia Spotlight: Shedding Light on the Web of
Documents
Pablo N. Mendes¹, Max Jakob¹, Andrés García-Silva², Christian Bizer¹
¹ Web-based Systems Group, Freie Universität Berlin, Germany
first.last@fu-berlin.de
² Ontology Engineering Group, Universidad Politécnica de Madrid, Spain
hgarcia@fi.upm.es
ABSTRACT
Interlinking text documents with Linked Open Data enables
the Web of Data to be used as background knowledge within
document-oriented applications such as search and faceted
browsing. As a step towards interconnecting the Web of
Documents with the Web of Data, we developed DBpedia
Spotlight, a system for automatically annotating text docu-
ments with DBpedia URIs. DBpedia Spotlight allows users
to configure the annotations to their specific needs through
the DBpedia Ontology and quality measures such as promi-
nence, topical pertinence, contextual ambiguity and disam-
biguation confidence. We compare our approach with the
state of the art in disambiguation, and evaluate our results
in light of three baselines and six publicly available anno-
tation systems, demonstrating the competitiveness of our
system. DBpedia Spotlight is shared as open source and
deployed as a Web Service freely available for public use.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous;
I.2.7 [Artificial Intelligence]: Natural Language Process-
ing—Language parsing and understanding; I.7 [Document
and Text Processing]: [Miscellaneous]
General Terms
Algorithms, Experimentation
Keywords
Text Annotation, Linked Data, DBpedia, Named Entity
Disambiguation
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
I-SEMANTICS 2011, 7th Int. Conf. on Semantic Systems, Sept. 7-9, 2011,
Graz, Austria
Copyright 2011 ACM 978-1-4503-0621-8 ...$10.00.
1. INTRODUCTION
As the Linked Data ecosystem develops [3], so do the mutual
rewards for structured and unstructured data providers
alike. Higher interconnectivity between information sources
has the potential of increasing discoverability, reusability,
and hence the utility of information. By connecting unstruc-
tured information in text documents with Linked Data, facts
from the Web of Data can be used, for instance, as comple-
mentary information on web pages to enhance information
retrieval, to enable faceted document browsing [15] and cus-
tomization of web feeds based on semantics [18].
DBpedia [4] is developing into an interlinking hub in the
Web of Data that enables access to many data sources in the
Linked Open Data cloud. It contains encyclopedic knowl-
edge from Wikipedia for about 3.5 million resources. About
half of the knowledge base is classified in a consistent cross-
domain ontology with classes such as persons, organisations
or populated places; as well as more fine-grained classifica-
tions like basketball player or flowering plant. Furthermore,
it provides a rich pool of resource attributes and relations
between the resources, connecting products to their makers,
or CEOs to their companies, for example.
In order to enable the linkage of Web documents with
this hub, we developed DBpedia Spotlight, a system to per-
form annotation of DBpedia resources mentioned in text.
In the annotation task, the user provides text fragments
(documents, paragraphs, sentences) and wishes to identify
URIs for resources mentioned within that text. One of the
main challenges in annotation is ambiguity: an entity name,
or surface form, may be used in different contexts to refer
to different DBpedia resources. For example, the surface
form ‘Washington’ can be used to refer to the resources
dbpedia:George_Washington, dbpedia:Washington,_D.C. and
dbpedia:Washington_(U.S._state) (among others). For
human readers, the disambiguation, i.e. the decision be-
tween candidates for an ambiguous surface form, is usually
performed based on the readers’ knowledge and the context
of a concrete mention. However, the automatic disambigua-
tion of entity mentions remains a challenging problem.
The goal of DBpedia Spotlight is to provide an adaptable
system to find and disambiguate natural language mentions
of DBpedia resources. Much research has been devoted to
the problem of automatic disambiguation - as we discuss
in Section 5. In comparison with previous work, DBpedia
Spotlight aims at a more comprehensive and flexible solu-
tion. First, while other annotation systems are often re-
stricted to a small number of resource types, such as people,
organisations and places, our system attempts to annotate
DBpedia resources of any of the 272 classes (more than 30
top level) in the DBpedia Ontology. Second, since a single
generic solution is unlikely to fit all task-specific require-
ments, our system enables user-provided configurations for
different use cases with different needs. Users can flexibly
specify the domain of interest, as well as the desired cover-
age and error tolerance for each of their specific annotation
tasks.
DBpedia Spotlight can take full advantage of the DBpedia
ontology for specifying which concepts should be annotated.
Annotations can be restricted to instances of specific classes
(or sets of classes) including subclasses. Alternatively, ar-
bitrary SPARQL queries over the DBpedia knowledge base
can be provided in order to determine the set of instances
that should be annotated. For instance, consider use cases
where users have prior knowledge of some aspects of the text
(e.g. dates), and have specific needs for the annotations (e.g.
only Politicians). A SPARQL query can be sent to DBpe-
dia Spotlight in order to constrain the annotated resources
to only politicians in office between 1995 and 2000, for in-
stance. In general, users can create restrictions using any
part of the DBpedia knowledge base.
Moreover, DBpedia Spotlight computes scores such as promi-
nence (how many times a resource is mentioned in Wikipedia),
topical relevance (how close a paragraph is to a DBpedia re-
source’s context) and contextual ambiguity (is there more
than one candidate resource with similarly high topical rel-
evance for this surface form in its current context?). Users
can configure these parameters according to their task-specific
requirements.
We evaluate DBpedia Spotlight in two experiments. First
we test our disambiguation strategy on thousands of un-
seen (held out) DBpedia resource mentions from Wikipedia.
Second, we use a set of manually annotated news articles
in order to compare our annotation with publicly available
annotation services.
DBpedia Spotlight is deployed as a Web Service, and fea-
tures a user interface for demonstration. The source code is
publicly available under the Apache license V2, and the doc-
umentation is available at http://dbpedia.org/spotlight.
In Section 2 we describe our approach, followed by an ex-
planation of how our system can be used (Section 3). In
Section 4 we present our evaluation methodology and re-
sults. In Section 5 we discuss related work and in Section 6
we present our conclusions and future work.
2. APPROACH
Our approach works in four stages. The spotting stage
recognizes in a sentence the phrases that may indicate a
mention of a DBpedia resource. Candidate selection is sub-
sequently employed to map the spotted phrase to resources
that are candidate disambiguations for that phrase. The
disambiguation stage, in turn, uses the context around the
spotted phrase to decide for the best choice amongst the can-
didates. The annotation can be customized by users to their
specific needs through configuration parameters explained in
subsection 2.5. In the remainder of this section we describe
the datasets and techniques used to enable our annotation
process.
2.1 Dataset and Notation
We utilize the graph of labels, redirects and disambigua-
tions in DBpedia to extract a lexicon that associates mul-
tiple surface forms to a resource and interconnects multiple
resources to an ambiguous label. Labels of the DBpedia re-
sources are created from Wikipedia page titles, which can
be seen as community-approved surface forms. Redirects to
URIs indicate synonyms or alternative surface forms, includ-
ing common misspellings and acronyms. Their labels also
become surface forms. Disambiguations provide ambiguous
surface forms that are “confusable” with all resources they
link to. Their labels become surface forms for all target
resources in the disambiguation page. Note that we erase
trailing parentheses from the labels when constructing surface
forms. For example, the label ‘Copyright (band)’ produces
the surface form ‘Copyright’. This means that labels
of resources and of redirects can also introduce ambiguous
surface forms, in addition to the labels coming from titles
of disambiguation pages. The collection of surface forms
created as a result constitutes a controlled set of commonly
used labels for the target resources.
Another source of textual references to DBpedia resources
is the set of wikilinks, i.e. the page links in Wikipedia that
interconnect the articles. We pre-processed Wikipedia articles,
extracting every wikilink $l = (s, r)$ with surface form $s$ as
anchor text and resource $r$ as link target, along with the
paragraph representing the context of that wikilink occurrence.
Each wikilink was stored as evidence of an occurrence
$o = (r, s, C)$. Each occurrence $o$ records the fact that the
DBpedia resource $r$ represented by the link target has been
mentioned in the context of the paragraph $C$ through the use
of the surface form $s$. Before storage, the context paragraph
was tokenized, stopworded and stemmed, generating a vector
of terms $W = \langle w_1, ..., w_n \rangle$. The collection of occurrences
for each resource was then stored as a document in a Lucene
index¹ for retrieval in the disambiguation stage.
Wikilinks can also be used to estimate the likelihood of
a surface form $s$ referring to a specific candidate resource
$r \in R_s$. We consider each wikilink as evidence that the
anchor text is a commonly used surface form for the DBpedia
resource represented by the link target. By counting the
number of times $n(s, r)$ that a surface form occurred with a
given DBpedia resource, and the total number of times $n(s)$
that the surface form occurred, we can empirically estimate the
prior probability of seeing a resource $r$ given that surface
form $s$ was used: $P(r|s) = n(s, r)/n(s)$.
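The estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: it assumes that n(s) is simply the number of wikilinks whose anchor text is s, and the function name `estimate_priors` and the toy wikilinks are invented for the example.

```python
from collections import Counter, defaultdict


def estimate_priors(wikilinks):
    """Estimate P(r|s) = n(s, r) / n(s) from a list of (surface_form, resource) pairs."""
    pair_counts = Counter(wikilinks)                 # n(s, r)
    form_counts = Counter(s for s, _ in wikilinks)   # n(s), assumed to count anchors only
    priors = defaultdict(dict)
    for (s, r), n_sr in pair_counts.items():
        priors[s][r] = n_sr / form_counts[s]
    return priors


# Toy wikilinks, invented for illustration only.
links = [
    ("Washington", "dbpedia:Washington,_D.C."),
    ("Washington", "dbpedia:Washington,_D.C."),
    ("Washington", "dbpedia:George_Washington"),
    ("Washington", "dbpedia:Washington_(U.S._state)"),
]
priors = estimate_priors(links)
```

With this toy data, half of the ‘Washington’ anchors point to dbpedia:Washington,_D.C., so that resource receives the highest prior for the surface form.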
2.2 Spotting Algorithm
We use the extended set of labels in the lexicalization
dataset to create a lexicon for spotting. The implementation
used was the LingPipe Exact Dictionary-Based Chunker [2]
which relies on the Aho-Corasick string matching algorithm
[1] with longest case-insensitive match.
Since for many use cases it is unnecessary to annotate
common words, a configuration flag can instruct the system
to disregard in this stage any spots that are only composed
of verbs, adjectives, adverbs and prepositions. The part of
speech tagger used was the LingPipe implementation based
on Hidden Markov Models.
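To illustrate longest case-insensitive matching against a lexicon, the following sketch uses a naive greedy scan over whitespace-separated tokens. It is not the Aho-Corasick algorithm and does not use LingPipe; the function `spot`, its `max_len` parameter and the example lexicon are assumptions made for this example only.

```python
def spot(text, lexicon, max_len=5):
    """Return (char_offset, surface_form) pairs for longest case-insensitive matches."""
    lowered = {entry.lower() for entry in lexicon}
    # Record each whitespace-separated token with its character offset.
    tokens = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append((start, tok))
        pos = start + len(tok)

    spots = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest token window first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            start = tokens[i][0]
            last_start, last_tok = tokens[i + n - 1]
            candidate = text[start:last_start + len(last_tok)]
            if candidate.lower() in lowered:
                match = (start, candidate, n)
                break
        if match:
            spots.append((match[0], match[1]))
            i += match[2]  # skip past the matched tokens
        else:
            i += 1
    return spots


example_spots = spot(
    "Pop star Michael Jackson visited Washington",
    ["Michael Jackson", "Jackson", "Washington"],
)
```

Because the longest window is tried first, ‘Michael Jackson’ is preferred over the shorter lexicon entry ‘Jackson’. A production spotter would also need to handle punctuation and tokenization, which this sketch ignores.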
2.3 Candidate Selection
We follow the spotting with a candidate selection stage
in order to map resource names to candidate disambigua-
tions (e.g. Washington as reference to a city, to a person
or to a state). We use the DBpedia Lexicalization dataset
for determining candidate disambiguations for each surface
form.
The candidate selection offers a chance to narrow down
the space of disambiguation possibilities. Selecting fewer
candidates can increase time performance, but it may reduce
recall if performed too aggressively. Due to our generality
and flexibility requirements, we decided to employ minimal
pre-filtering and postpone the selection to a user-configured
post-disambiguation configuration stage. Other approaches
for candidate selection are within our plans for future work.
¹ http://lucene.apache.org
The candidate selection phase can also be viewed as a way
to pre-rank the candidates for disambiguation before observ-
ing a surface form in the context of a paragraph. Choosing
the DBpedia resource with highest prior probability for a
surface form is the equivalent of selecting the “default sense”
of some phrase according to its usage in Wikipedia. The
prior probability scores of the lexicalizations dataset, for ex-
ample, can be utilized at this point. We report the results
for this approach as a baseline in Section 4.
2.4 Disambiguation
After selecting candidate resources for each surface form,
our system uses the context around the surface forms, e.g.
paragraphs, as information to find the most likely disam-
biguations.
We modeled DBpedia resource occurrences in a Vector
Space Model (VSM) [22] where each DBpedia resource is a
point in a multidimensional space of words. In light of the
most common use of VSMs in Information Retrieval (IR),
our representation of a DBpedia resource is analogous to
a document containing the aggregation of all paragraphs
mentioning that concept in Wikipedia. Similarly, the TF
(Term Frequency) weight is commonly used in IR to measure
the local relevance of a term in a document. In our model,
TF represents the relevance of a word for a given resource.
In addition, the Inverse Document Frequency (IDF) weight
[16] represents the general importance of the word in the
collection of DBpedia resources.
Albeit successful for document retrieval, the IDF weight
fails to adequately capture the importance of a word for
disambiguation. For the sake of illustration, suppose that
the term ‘U.S.A’ occurs in only 3 concepts in a collection
of 1 million concepts. Its IDF will be very high, as its
document frequency is very low (3/1,000,000). Now suppose
that the three concepts with which it occurs are
dbpedia:Washington,_D.C., dbpedia:George_Washington, and
dbpedia:Washington_(U.S._State). As it turns out, despite
the high IDF weight, the word ‘U.S.A’ would be of
little value to disambiguate the surface form ‘Washington’,
as all three potential disambiguations would be associated
with that word. IDF gives an insight into the global impor-
tance of a word (given all resources), but fails to capture
the importance of a word for a specific set of candidate re-
sources.
In order to weigh words based on their ability to distinguish
between candidates for a given surface form, we introduce
the Inverse Candidate Frequency (ICF) weight. The
intuition behind ICF is that the discriminative power of a
word is inversely proportional to the number of DBpedia
resources it is associated with. Let $R_s$ be the set of candidate
resources for a surface form $s$. Let $n(w_j)$ be the total number
of resources in $R_s$ that are associated with the word $w_j$.
Then we define:
$$ICF(w_j) = \log \frac{|R_s|}{n(w_j)} = \log |R_s| - \log n(w_j) \quad (1)$$
The theoretical explanation for ICF is analogous to
Deng et al. [9], based on Information Theory. Entropy [23]
has been commonly used to measure uncertainty in probability
distributions. It is argued that the discriminative
ability of a context word should be inversely proportional to
the entropy, i.e. a word commonly co-occurring with many
resources is less discriminative overall. With regard to a
word’s association with DBpedia resources, the entropy of a
word can be defined as:
$$E(w) = -\sum_{r_i \in R_s} P(r_i|w) \log P(r_i|w)$$
If the word $w$ is connected to those resources
with equal probability $P(r|w) = 1/n(w)$, the entropy reaches
its maximum, $E(w) = \log n(w)$. Since the entropy generally
tends to be proportional to the frequency $n(w)$,
we use the maximum entropy to approximate the exact entropy
in the ICF formula. This simplification has worked
well in our case, simplifying the calculations and reducing
storage and search time requirements.
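The maximum-entropy step can be made explicit. As a sketch, assume that the word $w$ is associated with exactly $n(w)$ of the candidate resources and that each of them is equally likely given $w$; the symbol $E_{\max}$ is introduced here only for illustration:

$$E_{\max}(w) = -\sum_{i=1}^{n(w)} \frac{1}{n(w)} \log \frac{1}{n(w)} = \log n(w)$$

Under this approximation, Equation 1 can be read as $ICF(w) = \log |R_s| - E_{\max}(w)$, so a word associated with every candidate ($n(w) = |R_s|$) receives weight 0, while a word associated with a single candidate receives the maximum weight $\log |R_s|$.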
Given the VSM representation of DBpedia resources with
TF*ICF weights, the disambiguation task can be cast as
a ranking problem where the objective is to rank the cor-
rect DBpedia resource at position 1. Our approach is to
rank candidate resources according to the similarity score
between their context vectors and the context surrounding
the surface form. In this work we use cosine as the similarity
measure.
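The ranking step can be sketched as follows. This is an illustrative simplification under stated assumptions: candidate context vectors are plain term-frequency dictionaries, the query (the paragraph around the surface form) is weighted with the same ICF values, and all names and toy counts are invented for the example.

```python
import math
from collections import Counter


def icf(word, candidate_vectors):
    """ICF(w) = log(|R_s| / n(w)); n(w) counts the candidates whose context contains w."""
    n_w = sum(1 for vec in candidate_vectors.values() if word in vec)
    return math.log(len(candidate_vectors) / n_w) if n_w else 0.0


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(context_terms, candidate_vectors):
    """Rank candidate resources by cosine similarity of TF*ICF weighted vectors."""
    context_tf = Counter(context_terms)
    query = {w: tf * icf(w, candidate_vectors) for w, tf in context_tf.items()}
    scores = {}
    for resource, tf_vec in candidate_vectors.items():
        doc = {w: tf * icf(w, candidate_vectors) for w, tf in tf_vec.items()}
        scores[resource] = cosine(query, doc)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy term frequencies for the candidates of 'Washington' (invented numbers).
candidates = {
    "dbpedia:George_Washington": {"usa": 3, "president": 5, "army": 2},
    "dbpedia:Washington,_D.C.": {"usa": 4, "capital": 6, "city": 3},
    "dbpedia:Washington_(U.S._state)": {"usa": 2, "seattle": 5, "state": 4},
}
ranking = rank_candidates(["usa", "president", "army"], candidates)
```

In this toy setting the word ‘usa’ occurs with all three candidates, so its ICF is 0 and it contributes nothing to the ranking, which mirrors the ‘U.S.A’ example discussed earlier.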
2.5 Configuration
Many of the current approaches for annotation tune their
parameters to a specific task, leaving little flexibility for
users to adapt their solution to other use cases. Our ap-
proach is to generate a number of metrics to inform the users
and let them decide on the policy that best fits their needs.
In order to decide whether to annotate a given resource,
there are several aspects to consider: can this resource be
confused easily with another one in the given context? Is
this a commonly mentioned resource in general? Was the
disambiguation decision made with high confidence? Is the
resource of the desired type? Is the resource in a complex
relationship within the knowledge base that rules it out for
annotation? The offered configuration parameters are de-
scribed next.
Resource Set to Annotate. The use of DBpedia resources
as targets for annotation enables interesting flexibility. The
simplest and probably most widely used case is to anno-
tate only resources of a certain type or set of types. In our
case the available types are derived from the class hierar-
chy provided by the DBpedia Ontology. Users can provide
whitelists (allowed) or blacklists (forbidden) of URIs for an-
notation. Whitelisting a class will allow the annotation of
all direct instances of that class, as well as all instances of
subclasses. Support for SPARQL queries allows even more
flexibility by enabling the specification of arbitrary graph
patterns. There is no restriction to the complexity of rela-
tionships that a resource must fulfil in this configuration
step. For instance, the user could choose to only anno-
tate concepts that are related to a specific geographic area,
time period in history, or are closely connected within the
Wikipedia category system.
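The type-based restriction can be pictured with a small sketch. The class names below are taken from the example annotation of Michael Jackson, but the hierarchy excerpt, the function names, and the decision to let blacklists also cover subclasses are assumptions made for illustration; they do not describe the actual implementation.

```python
# Toy excerpt of a class hierarchy (child -> parent); assumed for illustration.
SUBCLASS_OF = {"MusicalArtist": "Artist", "Artist": "Person"}


def ancestors(cls):
    """Return the class itself followed by all of its superclasses."""
    chain = [cls]
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_allowed(resource_types, whitelist=(), blacklist=()):
    """Decide whether a resource with the given types may be annotated."""
    expanded = {a for t in resource_types for a in ancestors(t)}
    if expanded & set(blacklist):
        return False
    # An empty whitelist means that no type restriction is applied.
    return not whitelist or bool(expanded & set(whitelist))
```

For example, whitelisting `Person` admits a resource typed only as `MusicalArtist`, because the type is expanded to its superclasses before the check.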
Resource Prominence. For many applications, the annotation
of rare or exotic resources is not desirable. For example,
the Saxon_genitive (’s) is very commonly found in English
texts to indicate possession (e.g. Austria’s mountains are
beautiful), but it can be argued that for many use cases its
annotation is rather uninformative. An indicator for that is
that it has only seven Wikipedia inlinks. With the support
parameter, users can specify the minimum number of inlinks
a DBpedia resource has to have in order to be annotated.
Figure 1: DBpedia Spotlight Web Application.
Topic Pertinence. The topical relevance of the anno-
tated resource for the given context can be measured by the
similarity score returned by the disambiguation step. The
score is higher for paragraphs that match more closely the
recorded observations for a DBpedia resource. In order to
constrain annotations to topically related resources, a higher
threshold for the topic pertinence can be set.
Contextual Ambiguity. If more than one candidate re-
source has high topical pertinence to a paragraph, it may
be harder to disambiguate between those resources because
they remain partly ambiguous in that context. The difference
in the topical relevance of two candidate resources to a
paragraph gives us an insight into how “confused” the
disambiguation step was in choosing between these resources. The
score is computed as the relative difference in topic score
between the first and the second ranked resource. Applications
that require high precision may decide to reduce risks by not
annotating resources when the contextual ambiguity is high.
Disambiguation Confidence. We define a confidence pa-
rameter, ranging from 0 to 1, of the annotation performed
by DBpedia Spotlight. This parameter takes into account
factors such as the topical pertinence and the contextual am-
biguity. Setting a high confidence threshold instructs DBpe-
dia Spotlight to avoid incorrect annotations as much as pos-
sible at the risk of losing some correct ones. We estimated
this parameter on a development set of 100,000 Wikipedia
samples. The rationale is that a confidence value of 0.7 will
eliminate 70% of incorrectly disambiguated test cases. For
example, given a confidence of 0.7, we select the topical
pertinence threshold below which 70% of the incorrectly
disambiguated samples fall. We integrate that with the contextual
ambiguity score by requiring a low ambiguity when the confidence
is high. With a confidence of 0.7, therefore, resources are only
annotated if the contextual ambiguity is less than (1 − confidence) = 0.3.
We address the adequacy of this parameter in our evalua-
tion.
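The interplay between the confidence parameter, the topical pertinence threshold and the contextual ambiguity can be sketched as follows. The exact formulas are not given in the text, so this is only one plausible reading: the threshold is taken as an empirical quantile of the scores of wrongly disambiguated development samples, and the ambiguity is assumed to be the ratio of the second-ranked score to the first-ranked score (i.e. one minus their relative difference). All names and numbers are hypothetical.

```python
def pertinence_threshold(wrong_scores, confidence):
    """Score below which roughly a `confidence` fraction of wrong samples fall."""
    ordered = sorted(wrong_scores)
    if not ordered:
        return 0.0
    index = min(int(confidence * len(ordered)), len(ordered) - 1)
    return ordered[index]


def contextual_ambiguity(first_score, second_score):
    """Assumed definition: 1 - relative difference between the two top scores."""
    if first_score <= 0:
        return 1.0
    return second_score / first_score


def should_annotate(first_score, second_score, confidence, wrong_scores):
    """Annotate only if pertinence is high enough and ambiguity is low enough."""
    pertinent = first_score >= pertinence_threshold(wrong_scores, confidence)
    unambiguous = contextual_ambiguity(first_score, second_score) < 1.0 - confidence
    return pertinent and unambiguous


# Toy similarity scores of wrongly disambiguated development samples (invented).
DEV_WRONG_SCORES = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
```

Raising the confidence raises the pertinence threshold and lowers the tolerated ambiguity at the same time, which is why high confidence values trade recall for precision.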
3. USING DBPEDIA SPOTLIGHT
DBpedia Spotlight is available both as a Web Service and
via a Web Application. In addition, we have published the
lexicalization dataset in RDF so that the community can
benefit from the collected surface forms and the DBpedia
resources representing their possible meanings.
3.1 Web Application
By using the Web application, users can test and visualize
the results of the different service functions. The interface
allows users to configure confidence, support, and to select
the classes of interest from the DBpedia ontology. Text can
be entered in a text box and, at user’s request, DBpedia
Spotlight will highlight the surface forms and create associ-
ations with their corresponding DBpedia resources. Figure
1 shows an example of a news article snippet after being
annotated by our system. In addition to Annotate, we offer
a Disambiguate operation where users can request the
disambiguation of selected phrases (enclosed in double square
brackets). In this case, our system bypasses the spotting
stage and annotates only the selected phrases with DBpe-
dia resources. This function is useful for user interfaces that
allow users to mouse-select text, as well as for the easy in-
corporation of our disambiguation step into third-party ap-
plications that already perform spotting.
3.2 Web Service
In order to facilitate the integration of DBpedia Spotlight
into external web processes, we implemented RESTful and
SOAP web services for the annotation and disambiguation
processes. The web service interface allows access to both
the Annotate and the Disambiguate operations and to all
the configuration parameters of our approach. Thus, in ad-
dition to confidence, support and DBpedia classes, we accept
SPARQL queries for the DBpedia knowledge base to select
the set of resources that are going to be used when anno-
tating. These web services return HTML, XML, JSON or
XHTML+RDFa documents where each DBpedia resource
identified in the text is related to the text chunk where it
was found. The XML fragment presented below shows part
of the annotation of the news snippet shown in Figure 1.
<Annotation text="Pop star Michael Jackson..."
confidence="0.3" support="30"
types="Person,Place,...">
<Resources>
<Resource URI="dbpedia:Michael_Jackson"
support="5761"
types="MusicalArtist,Artist,Person"
surfaceForm="Michael Jackson" offset="9"
similarityScore="0.31504717469215393" />
...
</Resources>
</Annotation>
Figure 2: Example XML fragment resulting from the anno-
tation service.
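A client could read such a response with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a completed copy of the fragment from Figure 2 (the elided parts are simply left out); the function name and the returned dictionary keys are choices made for this example, and no service endpoint is assumed.

```python
import xml.etree.ElementTree as ET

# Fragment modeled on Figure 2, with the elided resources omitted.
RESPONSE = """<Annotation text="Pop star Michael Jackson..."
    confidence="0.3" support="30" types="Person,Place,...">
  <Resources>
    <Resource URI="dbpedia:Michael_Jackson"
        support="5761"
        types="MusicalArtist,Artist,Person"
        surfaceForm="Michael Jackson" offset="9"
        similarityScore="0.31504717469215393" />
  </Resources>
</Annotation>"""


def parse_annotations(xml_text):
    """Extract one dictionary per <Resource> element of an annotation response."""
    root = ET.fromstring(xml_text)
    return [
        {
            "uri": res.get("URI"),
            "surface_form": res.get("surfaceForm"),
            "offset": int(res.get("offset")),
            "support": int(res.get("support")),
            "types": res.get("types").split(","),
            "similarity": float(res.get("similarityScore")),
        }
        for res in root.iter("Resource")
    ]


annotations = parse_annotations(RESPONSE)
```

Each resulting record relates a DBpedia URI to the character offset of the text chunk where it was found, so the annotations can be mapped back onto the input text.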
3.3 Lexicalization dataset
Besides the DBpedia Spotlight system, the data produced
in this work is also shared in a format to ease its consump-
tion in a number of use cases. The dataset described in
Section 2.1 was encoded in RDF using the Lexvo vocabu-
lary [8] and is provided for download as a DBpedia dataset.
We use the property lexvo:label rather than rdfs:label
or skos:altLabel to associate a resource with surface form
strings. The rdfs:label property intends to represent “a
human-readable version of a resource’s name”². The SKOS
Vocabulary “enables a distinction to be made between the
preferred, alternative and ‘hidden’ lexical labels” through
their skos:prefLabel and skos:altLabel. The DBpedia
Spotlight dataset does not claim that a surface form is the
name of a resource, nor does it intend to assert a preference
between labels. Hence, we use lexvo:label in order to
describe the association between a resource and a surface form
with regard to actual language use. Association scores (e.g.
prior probabilities) are attached to lexvo:label relationships
through named graphs.
Users interested in finding names, alternative or preferred
labels can use the provided information in order to make
an informed task-specific choice. Imagine a user attempting
to find DBpedia URIs for presidents and colleges in their
company’s legacy database. The table called President contains
two columns: last name, alma mater. Users may use a
SPARQL query, for example, to select the default sense for
the surface form ‘Bush’, given that it is known to have a
relationship with the surface form ‘Harvard Business’³. The
lexicalizations dataset will provide links between alternative
spellings (e.g. Bush → dbpedia:George_W._Bush), and
the knowledge base (DBpedia) will provide the background
knowledge connecting the resource dbpedia:George_W._Bush
to his alma mater dbpedia:Harvard_Business_School. The
association scores will help to rank the most likely of the
candidates in this context.
² http://www.w3.org/TR/rdf-schema/
³ The SPARQL formulation for this query and other examples
are available from the project page.
The dataset can also be used to get information about
the strength of association between a surface form and a
resource, term ambiguity or the default sense of a surface
form, just to cite a few use cases.
4. EVALUATION
We carried out two evaluations of DBpedia Spotlight. A
large scale automatic evaluation tested the performance of
the disambiguation component in choosing the correct can-
didate resources for a given surface form. In order to provide
an evaluation of the whole system in a specific annotation
scenario, we also carried out an experiment using a manually
annotated test corpus. In that evaluation we compare our
results with those of several publicly available annotation
services.
4.1 Disambiguation Evaluation
Wikipedia provides a wealth of annotated data that can
be used to evaluate our system on a large scale. We randomly
selected 155,000 wikilink samples and set them aside as
test data. In order to capture the ability of our system
to distinguish between multiple senses of a surface form,
we made sure that all these instances have ambiguous surface
forms. We used the remainder of the samples collected
from Wikipedia (about 69 million) as DBpedia resource
occurrences providing context for disambiguation as described
in Section 2.
In this evaluation, we were interested in the performance
of the disambiguation stage. A spotted surface form, taken
from the anchor text of a wikilink, is given to the
disambiguation function⁴ along with the paragraph that it was
mentioned in. The task of the disambiguation service is to
select candidate resources for this surface form and decide
between them based on the context.
In order to better assess the contribution of our approach,
we included three baseline methods:
Random Baseline performs candidate selection and picks
one of the candidates with uniform probability. This
baseline serves as a control for easy disambiguations,
since for low ambiguity terms, even random choice
should perform reasonably.
Default Sense Baseline performs candidate selection
and chooses the candidate with the highest prior probability
(without using the context). More formally:
$\arg\max_{r \in R_s} P(r|s)$. This baseline helps to assess how
common were the DBpedia resources included in the
annotation dataset.
Default Similarity uses TF*IDF term scoring as a ref-
erence to evaluate the influence of our TF*ICF ap-
proach.
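The first two baselines and the accuracy measure are simple enough to sketch directly. The function names, the toy priors and the fixed random seed are assumptions for illustration; the third baseline differs from our approach only in the term weighting and is therefore not repeated here.

```python
import random


def random_baseline(candidates, rng=random):
    """Pick one candidate resource with uniform probability."""
    return rng.choice(sorted(candidates))


def default_sense_baseline(priors_for_form):
    """Pick the candidate with the highest prior P(r|s), ignoring the context."""
    return max(priors_for_form, key=priors_for_form.get)


def accuracy(predictions, gold):
    """Fraction of test cases in which the predicted resource equals the gold one."""
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold) if gold else 0.0


# Toy priors for the surface form 'Washington' (invented numbers).
washington_priors = {
    "dbpedia:Washington,_D.C.": 0.5,
    "dbpedia:George_Washington": 0.3,
    "dbpedia:Washington_(U.S._state)": 0.2,
}
default_choice = default_sense_baseline(washington_priors)
random_choice = random_baseline(washington_priors, rng=random.Random(0))
```

With k equally likely candidates, the random baseline is expected to be correct in 1/k of the cases, which is why it serves as a control for how ambiguous the test data is.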
4.1.1 Results
The results for the baselines and DBpedia Spotlight are
presented in Table 1. The performance of the baseline that
makes random disambiguation choices confirms the high
ambiguity in our dataset (less than 1/4 of the disambiguations
were correct at random). Using the prior probability to
choose the default sense performs reasonably well, being
accurate in 55.12% of the disambiguations. This is an indication
that our evaluation set was composed of a good balance of
common DBpedia resources and less prominent ones. The
use of context for disambiguation through the default scoring
of TF*IDF obtained 55.91%, while the TF*ICF score
introduced in this work improved the results to 73.39%.
The performance of TF*ICF is an encouraging indication
that a simple ranking-based disambiguation algorithm can
be successful if enough contextual evidence is provided.
⁴ In our implementation, for convenience, the candidate
selection can be called from the disambiguation.
Disambiguation Approach         Accuracy
Baseline Random                 17.77%
Baseline Default Sense          55.12%
Baseline TF*IDF                 55.91%
DBpedia Spotlight TF*ICF        73.39%
DBpedia Spotlight Mixed         80.52%
Table 1: Accuracies for each of the approaches tested in the
disambiguation evaluation.
We also attempted a simple combination of the prior (de-
fault sense) and TF*ICF scores, which we called DBpe-
dia Spotlight Mixed. The mixing weights were estimated
through a small experiment using linear regression over held
out training data. The results reported in this work used
mixed scores computed through the formula:
$$\mathrm{Mixed}(r, s, C) = 1234.3989 \cdot P(r|s) + 0.9968 \cdot \mathrm{contextualScore}(r, s, C) - 0.0275 \quad (2)$$
The prior probability P(r|s) was calculated as described
in Section 2.1. The contextual score used was the cosine sim-
ilarity of term vectors weighted by TF*ICF as described in
Section 2.4. Further research is needed to carefully examine
the contribution of each component to the final score.
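For concreteness, Equation 2 can be written as a small function. The coefficients are those reported above; the sign of the constant term is not legible in every copy of the formula and is assumed here to be negative. The function and constant names are chosen for this example.

```python
PRIOR_WEIGHT = 1234.3989
CONTEXT_WEIGHT = 0.9968
INTERCEPT = -0.0275  # sign assumed to be negative


def mixed_score(prior, contextual_score):
    """Linear combination of the prior P(r|s) and the TF*ICF cosine similarity."""
    return PRIOR_WEIGHT * prior + CONTEXT_WEIGHT * contextual_score + INTERCEPT


def rank_by_mixed_score(candidates):
    """Rank (resource, prior, contextual_score) triples by their mixed score."""
    return sorted(candidates, key=lambda c: mixed_score(c[1], c[2]), reverse=True)
```

Since the constant term is the same for every candidate, it does not affect the ranking among candidates of a single surface form; it only shifts the absolute score.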
4.2 Annotation Evaluation
Although we used an unseen dataset for evaluating our
disambiguation approach, it is possible that the type of dis-
course and the annotation style of Wikipedia would bias the
results in favor of systems trained with that kind of data.
The Wikipedia guidelines for link creation focus on
non-obvious references⁵. If a wikilink would not contribute to
the understanding of a specific article, the Wikipedia Man-
ual of Style discourages its creation. Therefore, we created
a manual evaluation dataset from a news corpus in order
to complement that evaluation. In this second evaluation,
we would like to assess completeness of linking as well. We
created an annotation scenario in which the annotators were
asked to add links to DBpedia resources for all phrases that
would add information to the provided text.
Our test corpus consisted of 30 randomly selected para-
graphs from New York Times documents from 10 different
categories. In order to construct a gold standard, each
evaluator first independently annotated the corpus, after
which they met and agreed upon the ground truth evalu-
ation choices. The ratio of annotated to not-annotated to-
kens was 33%. This corpus is available for download on the
project homepage.
⁵ http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style_(linking)
We compared our results on this test corpus with the performance
of publicly available annotation services: OpenCalais⁶,
Zemanta⁷, Ontos Semantic API⁸, The Wiki Machine⁹,
Alchemy API¹⁰ and M&W’s wikifier [20]. These services
support linking to DBpedia to different degrees. Alchemy API
provides links to DBpedia and Freebase among other sources.
OpenCalais and Ontos provide some limited linkage between
their private identifiers and DBpedia resources. At the time
of writing, Ontos only links people and companies to DBpedia.
For the cases where the systems were able to extract resources
but did not return DBpedia URIs, we used a simple transformation
that constructed DBpedia URIs from the labels of the extracted
resources, e.g. ‘apple’ becomes dbpedia:Apple. We report results
with and without this transformation. The results that
used the transformation are labeled Ontos+Naïve and
OpenCalais+Naïve. The service APIs of Zemanta, The Wiki
Machine and M&W do not explicitly return DBpedia URIs, but
the URIs can be inferred from the Wikipedia links that they
return.
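The label-to-URI transformation mentioned above is only described by a single example in the text. The following Python sketch shows one plausible version of it. The exact rules (trimming, joining words with underscores, upper-casing the first character) and the function name `naive_dbpedia_uri` are assumptions, not the procedure actually used in the evaluation.

```python
import re


def naive_dbpedia_uri(label: str) -> str:
    """Build a DBpedia-style identifier from a plain label.

    Assumed rules (not specified in the text): trim the label, join words
    with underscores, and upper-case the first character.
    """
    words = re.split(r"\s+", label.strip())
    name = "_".join(w for w in words if w)
    if not name:
        raise ValueError("empty label")
    return "dbpedia:" + name[0].upper() + name[1:]


print(naive_dbpedia_uri("apple"))
```

Such a transformation ignores ambiguity entirely: every label maps to exactly one identifier, whether or not that resource exists or is the intended sense.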
[Figure 3 plot "Annotation Evaluation": Precision (vertical axis, 0.0 to 1.0) against Recall (horizontal axis, 0.0 to 1.0). Legend: Alchemy, Ontos, Ontos+Naïve, OpenCalais, OpenCalais+Naïve, Spotlight (no config), SpotlightRandom, WMWikify, WikiMachine, Zemanta; Support=0, Support=100, Support=500, Support=1000.]
Figure 3: DBpedia Spotlight with different configurations
(lines) in comparison with other systems (points).
4.2.1 Results
Retrieval as well as classification tasks exhibit an inherent
precision-recall trade-off [5]. The configuration of DBpedia
Spotlight allows users to customize the level of annotation
to their specific application needs. Figure 3 shows the evaluation
results. Each point in the plot represents the precision
(vertical axis) and recall (horizontal axis) of one evaluation
run. The lines show the trade-off between precision and recall
as we vary the confidence and support parameters in our
service. Each line represents one value of support (varying
from 0 to 1000). Each point on a line is a value of confidence
(0.1 to 0.9) for the corresponding support. It can be
observed that higher confidence values (with higher support)
produce higher precision at the cost of some recall and vice
versa. This is an encouraging indication that our parameters
achieve their objectives.

System                                    F1
DBpedia Spotlight (best configuration)    56.0%
DBpedia Spotlight (no configuration)      45.2%
The Wiki Machine                          59.5%
Zemanta                                   39.1%
OpenCalais+Naïve                          16.7%
Alchemy                                   14.7%
Ontos+Naïve                               10.6%
OpenCalais                                 6.7%
Ontos                                      1.5%
Table 2: F1 scores for each of the approaches tested in the
annotation evaluation.

6 http://www.opencalais.com
7 http://www.zemanta.com
8 http://www.ontos.com
9 http://thewikimachine.fbk.eu
10 http://www.alchemyapi.com
The shape of the displayed graph shows that the performance
of DBpedia Spotlight is in a competitive range.
Most annotation services lie below the F1-score of our
system at every confidence value. Table 2 shows the best
F1-scores of each approach. The best F1-score of DBpedia
Spotlight was reached with a confidence value of 0.6. The
WikiMachine has the highest F1-score, but tends to over-annotate
the articles, which results in high recall at the
cost of low precision. Meanwhile, Zemanta dominates in
precision, but has low recall. With different confidence and
support parameters, DBpedia Spotlight is able to approximate
the results of both WikiMachine and Zemanta, while
offering many other configurations with different
precision-recall trade-offs in between.
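The precision, recall and F1 figures discussed above can be computed as in the following Python sketch. It treats confidence and support as per-annotation values filtered by thresholds, which simplifies how the service applies them. The data, the function names and the tuple layout are hypothetical.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Compare predicted (offset, URI) annotations against a gold standard."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def filter_annotations(scored, min_confidence, min_support):
    """Keep annotations at or above the confidence and support thresholds."""
    return {
        (offset, uri)
        for offset, uri, confidence, support in scored
        if confidence >= min_confidence and support >= min_support
    }


# Hypothetical system output: (offset, URI, confidence, support).
scored = [
    (0, "dbpedia:Berlin", 0.9, 5000),
    (17, "dbpedia:Germany", 0.7, 9000),
    (30, "dbpedia:Wall", 0.3, 40),
    (42, "dbpedia:Spree", 0.5, 300),
]
# Hypothetical gold standard agreed upon by the annotators.
gold = {
    (0, "dbpedia:Berlin"),
    (17, "dbpedia:Germany"),
    (42, "dbpedia:Spree"),
    (60, "dbpedia:Museum_Island"),
}

for confidence in (0.1, 0.5, 0.9):
    p, r, f1 = precision_recall_f1(filter_annotations(scored, confidence, 0), gold)
    print(f"confidence={confidence}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Sweeping the thresholds over a grid and recording one (recall, precision) pair per setting yields curves of the kind shown in Figure 3. The setting with the highest F1 corresponds to a "best configuration" entry as in Table 2.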
5. RELATED WORK
Many existing approaches for entity annotation have fo-
cused on annotating salient entity references, commonly only
entities of specific types (Person, Organization, Location)
[14, 21, 24, 12] or entities that are in the subject of sen-
tences [11]. Hassell et al. [14] exploit the structure of a call
for papers corpus for relation extraction and later disam-
biguation of academic researchers. Rowe [21] concentrates
on disambiguating person names with social graphs, while
Volz et al. [24] present a disambiguation algorithm for the
geographic domain that is based on popularity scores and
textual patterns. Gruhl et al. [12] also constrain their anno-
tation efforts to cultural entities in a specific domain. Our
objective is to be able to annotate any entities in DBpedia.
Other approaches have attempted the non-type-specific
annotation of entities. However, several optimize their ap-
proaches for precision, leaving little flexibility for users with
use cases where recall is important, or they have not evalu-
ated the applicability of their approaches with more general
use cases [10, 6, 7, 19].
SemTag [10] was the first Web-scale named entity disambiguation
system. It used metadata associated with each
entity in an entity catalog derived from TAP [13] as context
for disambiguation. SemTag specialized in precision at the
cost of recall, producing an average of less than two annotations
per page.
Bunescu and Pasca [6], Cucerzan [7], Mihalcea and Csomai
(Wikify!) [19] and Milne and Witten (M&W) [20], like
us, also used text from Wikipedia in order to learn how to
annotate. Bunescu and Pasca only evaluate articles under
the “people by occupation” category, while Cucerzan’s and
Wikify!’s conservative spotting only annotates 4.5% and 6%
of all tokens in the input text, respectively. In Wikify!, this
spotting yields surface forms with low ambiguity for which
even a random disambiguator achieves an F1 score of 0.6.
Fader et al. [11] choose the candidate with the highest
prior probability unless the contextual evidence is higher
than a threshold. In their dataset 27.94% of the surface
forms are unambiguous and 46.53% of the ambiguous ones
can be correctly disambiguated by just choosing the default
sense (according to our index).
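One possible reading of the "default sense unless the context is strong enough" rule paraphrased above is sketched below in Python. This is an illustration of the general idea only, not the implementation of Fader et al. The function name, the tuple layout and the threshold values are assumptions.

```python
def choose_candidate(candidates: dict[str, tuple[float, float]], threshold: float) -> str:
    """Pick a resource for one surface form.

    `candidates` maps a resource identifier to (prior, contextual_score).
    The candidate with the highest prior (the default sense) is returned
    unless the best contextual score exceeds `threshold`, in which case the
    contextually best candidate is returned instead.
    """
    default_sense = max(candidates, key=lambda r: candidates[r][0])
    best_in_context = max(candidates, key=lambda r: candidates[r][1])
    if candidates[best_in_context][1] > threshold:
        return best_in_context
    return default_sense


# Hypothetical candidates for the surface form "apple".
candidates = {
    "dbpedia:Apple_Inc.": (0.6, 0.2),
    "dbpedia:Apple": (0.4, 0.8),
}
print(choose_candidate(candidates, threshold=0.5))
print(choose_candidate(candidates, threshold=0.9))
```

A low threshold lets context override the prior often, while a high threshold falls back to the default sense in most cases.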
Kulkarni et al. [17] attempt the joint optimization of all
spotted surface forms in order to realize the collective annotation
of entities. The inference problem formulated by the
authors is NP-hard, leading them to propose a Linear
Programming and a Hill-climbing approach for optimization.
We propose instead a simple, inexpensive approach that can
be easily configured and adapted to task-specific needs,
facilitated by the DBpedia Ontology and configuration
parameters.
6. CONCLUSION
In this paper we presented DBpedia Spotlight, a tool to
detect mentions of DBpedia resources in text. It enables
users to link text documents to the Linked Open Data cloud
through the DBpedia interlinking hub. The annotations provided
by DBpedia Spotlight enable the enrichment of websites
with background knowledge, faceted browsing in text
documents and enhanced search capabilities. The main advantages
of our system are its comprehensiveness and flexibility:
it can be configured based on the DBpedia ontology,
as well as prominence, contextual ambiguity, topical
pertinence and confidence scores. The resources that should
be annotated can be specified by a list of resource types or
by more complex relationships within the knowledge base.
We compared our system with other publicly available
services and showed how we retained competitiveness with
a more configurable approach. In the future we plan to
incorporate more knowledge from the Linked Open Data
cloud in order to enhance the annotation algorithm.
A project page with news, documentation, downloads,
demonstrations and other information is available at
http://dbpedia.org/spotlight.
7. ACKNOWLEDGEMENTS
The development of DBpedia Spotlight was supported by
the European Commission through the project LOD2 – Cre-
ating Knowledge out of Linked Data and by Neofonie GmbH,
a Berlin-based company offering technologies in the area of
Web search, social media and mobile applications.
Thanks to Andreas Schultz and Paul Kreis for their help
with setting up our servers and evaluation, and to Joachim
Daiber for his contributions in dataset creation, evaluation
clients and preprocessing code that was partially utilized in
the finalizing stages of this work.
8. REFERENCES
[1] A. V. Aho and M. J. Corasick. Efficient string
matching: an aid to bibliographic search. Commun.
ACM, 18:333–340, June 1975.
[2] Alias-i. LingPipe 4.0.0.
http://alias-i.com/lingpipe, retrieved on
24.08.2010, 2008.
[3] C. Bizer, T. Heath, and T. Berners-Lee. Linked data -
the story so far. Int. J. Semantic Web Inf. Syst.,
5(3):1–22, 2009.
[4] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer,
C. Becker, R. Cyganiak, and S. Hellmann. DBpedia -
A crystallization point for the Web of Data. Web
Semantics: Science, Services and Agents on the World
Wide Web, 7:154–165, September 2009.
[5] M. Buckland and F. Gey. The relationship between
Recall and Precision. J. Am. Soc. Inf. Sci.,
45(1):12–19, January 1994.
[6] R. C. Bunescu and M. Pasca. Using encyclopedic
knowledge for named entity disambiguation. In EACL,
2006.
[7] S. Cucerzan. Large-scale named entity disambiguation
based on wikipedia data. In EMNLP-CoNLL, pages
708–716, 2007.
[8] G. de Melo and G. Weikum. Language as a foundation
of the Semantic Web. In C. Bizer and A. Joshi,
editors, Proceedings of the Poster and Demonstration
Session at the 7th International Semantic Web
Conference (ISWC 2008), volume 401 of CEUR WS,
Karlsruhe, Germany, 2008. CEUR.
[9] H. Deng, I. King, and M. R. Lyu. Entropy-biased
models for query representation on the click graph. In
SIGIR ’09: Proceedings of the 32nd international
ACM SIGIR conference on Research and development
in information retrieval, pages 339–346, New York,
NY, USA, 2009. ACM.
[10] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha,
A. Jhingran, T. Kanungo, S. Rajagopalan,
A. Tomkins, J. A. Tomlin, and J. Y. Zien. Semtag and
seeker: bootstrapping the semantic web via automated
semantic annotation. In Proceedings of the 12th
international conference on World Wide Web, WWW
’03, pages 178–186, New York, NY, USA, 2003. ACM.
[11] A. Fader, S. Soderland, and O. Etzioni. Scaling
wikipedia-based named entity disambiguation to
arbitrary web text. In Proceedings of the WikiAI 09 -
IJCAI Workshop: User Contributed Knowledge and
Artificial Intelligence: An Evolving Synergy, Pasadena,
CA, USA, July 2009.
[12] D. Gruhl, M. Nagarajan, J. Pieper, C. Robson, and
A. P. Sheth. Context and domain knowledge enhanced
entity spotting in informal text. In International
Semantic Web Conference, pages 260–276, 2009.
[13] R. V. Guha and R. McCool. Tap: A semantic web
test-bed. J. Web Sem., 1(1):81–87, 2003.
[14] J. Hassell, B. Aleman-Meza, and I. Arpinar.
Ontology-driven automatic entity disambiguation in
unstructured text. In I. Cruz, S. Decker, D. Allemang,
C. Preist, D. Schwabe, P. Mika, M. Uschold, and
L. Aroyo, editors, The Semantic Web - ISWC 2006,
volume 4273 of Lecture Notes in Computer Science,
pages 44–57. Springer Berlin / Heidelberg, 2006.
[15] M. Hearst. UIs for Faceted Navigation: Recent
Advances and Remaining Open Problems. In
Workshop on Computer Interaction and Information
Retrieval, HCIR, Redmond, WA, Oct. 2008.
[16] K. S. Jones. A statistical interpretation of term
specificity and its application in retrieval. Journal of
Documentation, 28:11–21, 1972.
[17] S. Kulkarni, A. Singh, G. Ramakrishnan, and
S. Chakrabarti. Collective annotation of wikipedia
entities in web text. In Proceedings of the 15th ACM
SIGKDD international conference on Knowledge
discovery and data mining, KDD ’09, pages 457–466,
New York, NY, USA, 2009. ACM.
[18] P. N. Mendes, A. Passant, P. Kapanipathi, and A. P.
Sheth. Linked open social signals. In Web Intelligence
and Intelligent Agent Technology, 2010. WI-IAT ’10.
IEEE/WIC/ACM International Conference on, 2010.
[19] R. Mihalcea and A. Csomai. Wikify!: linking
documents to encyclopedic knowledge. In CIKM ’07:
Proceedings of the sixteenth ACM conference on
Conference on information and knowledge
management, pages 233–242, New York, NY, USA,
2007. ACM.
[20] D. Milne and I. H. Witten. Learning to link with
wikipedia. In Proceeding of the 17th ACM conference
on Information and knowledge management, CIKM
’08, pages 509–518, New York, NY, USA, 2008. ACM.
[21] M. Rowe. Applying semantic social graphs to
disambiguate identity references. In L. Aroyo,
P. Traverso, F. Ciravegna, P. Cimiano, T. Heath,
E. Hyvönen, R. Mizoguchi, E. Oren, M. Sabou, and
E. Simperl, editors, The Semantic Web: Research and
Applications, volume 5554 of Lecture Notes in
Computer Science, pages 461–475. Springer Berlin /
Heidelberg, 2009.
[22] G. Salton, A. Wong, and C. S. Yang. A vector space
model for automatic indexing. Communications of the
ACM, 18:613–620, November 1975.
[23] C. E. Shannon. Prediction and entropy of printed
english. Bell System Technical Journal, pages 50–64,
1951.
[24] R. Volz, J. Kleb, and W. Mueller. Towards
ontology-based disambiguation of geographical
identifiers. In I3, 2007.
... For extracting graph triples from Wikidata, as mentioned in Figure 1, we have adopted two methods for the approach: (1) Vocabulary-based and (2) Based on Unstructured text. Former includes fetching Wikidata items using a collection of manufacturing vocabulary terms through the utilization of textbook index words, keywords from research papers, and named entity recognition using FabNER (Kumar and Starly, 2021), followed by the use of DBpedia (Mendes et al., 2011) to find Wikidata items. The latter is a semi-supervised approach that utilizes students' notes, considering standard textbooks as the reference. ...
Preprint
Full-text available
As the demands for large-scale information processing have grown, knowledge graph-based approaches have gained prominence for representing general and domain knowledge. The development of such general representations is essential, particularly in domains such as manufacturing which intelligent processes and adaptive education can enhance. Despite the continuous accumulation of text in these domains, the lack of structured data has created information extraction and knowledge transfer barriers. In this paper, we report on work towards developing robust knowledge graphs based upon entity and relation data for both commercial and educational uses. To create the FabKG (Manufacturing knowledge graph), we have utilized textbook index words, research paper keywords, FabNER (manufacturing NER), to extract a sub knowledge base contained within Wikidata. Moreover, we propose a novel crowdsourcing method for KG creation by leveraging student notes, which contain invaluable information but are not captured as meaningful information, excluding their use in personal preparation for learning and written exams. We have created a knowledge graph containing 65000+ triples using all data sources. We have also shown the use case of domain-specific question answering and expression/formula-based question answering for educational purposes.
... With the Entity Linking, the entities identified in phase 5, such as persons, organizations, and locations are linked to resources (URI) available in Linked Datasets. For example, DBpedia Spotlight [17] can be used to link to resources of DBpedia and Linked Geo Data. Besides, an Italian version of Blink 2 [18] can be used to link entities to Wikipedia or to populate a new KB with entities not linked to external resources. ...
Conference Paper
Full-text available
Event Extraction is a complex and interesting topic in Information Extraction that includes methods for the identification of event's type, participants, location, and date from free text or web data. The result of event extraction systems can be used in several fields, such as online monitoring systems or decision support tools. In this paper, we introduce a framework that combines several techniques (lexical, semantic, machine learning, neural networks) to extract events from Italian news articles for crime analysis purposes. Furthermore, we concentrate to represent the extracted events in a Knowledge Graph. An evaluation on crimes in the province of Modena is reported.
... DBpedia is used in a large number of NLP applications, especially for named entity annotation and disambiguation(García-Silva, Szomszor, Alani, & Corcho, 2009;Kobilarov et al., 2009;Mendes, Jakob, García-Silva, & Bizer, 2011;Hulpus, Hayes, Karnstedt, & Greene, 2013) and question answering(Unger et al., 2012;Lopez, Fernández, Motta, & Stieler, 2012;Damljanovic, Agatonovic, & Cunningham, 2011;Cabrio et al., 2012), among which the most remarkable is the IBM Watson system(Ferrucci et al., 2010). Many datasets, such as DrugBank(Wishart et al., 2017), LinkedGeoData(Stadler, Lehmann, Höffner, & Auer, 2012), the CIA World Factbook (Central Intelligence Agency, 2009), and Book Mashup(Bizer, Cyganiak, & Gauß, 2007) among others, also link to DBpedia for uniquely identifying their entities.YAGO (Yet Another Great Ontology) is another knowledge graph built from Web content which combines information from Wikipedia and WordNet. ...
Thesis
Full-text available
Natural Language Processing has an important role in Artificial Intelligence for easing human-machine interaction. Processing human language, though, poses many challenges, among which is the semantics-related phenomenon known as language variability, the fact that the same thing can be said in several ways. NLP applications' inputs and outputs can be expressed in different forms, whose equivalence can be verified through inference. The textual entailment paradigm was established to enable the creation of a unifying framework for applied inference, providing a means of delivering other NLP task from handling inference issues in an ad-hoc manner, using instead the outputs of an inference-dedicated mechanism. Text entailment, the task of determining whether a piece of text logically follows from another piece of text, involves different scenarios, which can range from a simple syntactic variation to more complex semantic relationships between sentences. However, most approaches try a one-size-fits-all solution that usually favors some scenario to the detriment of another. The commonsense world knowledge necessary to support more complex inferences is also usually employed in a limited way, with most approaches sticking to shallow semantic information, leaving more elaborate semantic relationships aside. Furthermore, most systems still work as a "black box", providing a yes/no answer that does not explain the underlying reasoning process. This thesis aims at addressing these issues by proposing a composite interpretable approach for recognizing text entailment where the entailment pair is analyzed so the most relevant phenomenon is detected and the suitable method can be used to solve it. Syntactic variations are dealt with through the analysis of the sentences' syntactic structures, and semantic relationships are detected with the aid of a knowledge graph built from natural language dictionary definitions. 
Also, if a semantic matching is involved, the answer is made interpretable through the generation of natural language justifications that explain the semantic relationship between the pieces of text. The result is the XTE - Explainable Text Entailment - a system that outperforms well-established tools based on single-technique entailment algorithms, and that also gives an important step towards Explainable AI, allowing the inference model interpretation, making the semantic reasoning process explicit and understandable.
... For extracting graph triples from Wikidata, as mentioned in Figure 1, we have adopted two methods for the approach: (1) Vocabulary-based and (2) Based on Unstructured text. Former includes fetching Wikidata items using a collection of manufacturing vocabulary terms through the utilization of textbook index words, keywords from research papers, and named entity recognition using FabNER (Kumar and Starly, 2021), followed by the use of DBpedia (Mendes et al., 2011) to find Wikidata items. The latter is a semi-supervised approach that utilizes students' notes, considering standard textbooks as the reference. ...
Conference Paper
Full-text available
As the demands for large-scale information processing have grown, knowledge graph-based approaches have gained prominence for representing general and domain knowledge. The development of such general representations is essential, particularly in domains such as manufacturing which intelligent processes and adap-tive education can enhance. Despite the continuous accumulation of text in these domains, the lack of structured data has created information extraction and knowledge transfer barriers. In this paper, we report on work towards developing robust knowledge graphs based upon entity and relation data for both commercial and educational uses. To create the FabKG (Manufac-turing knowledge graph), we have utilized textbook index words, research paper keywords, FabNER (manufacturing NER), to extract a sub knowledge base contained within Wikidata. Moreover, we propose a novel crowdsourcing method for KG creation by leveraging student notes, which contain invaluable information but are not captured as meaningful information, excluding their use in personal preparation for learning and written exams. We have created a knowledge graph containing 65000+ triples using all data sources. We have also shown the use case of domain-specific question answering and expression/formula-based question answering for educational purposes.
... A semantic network is a directed graph where each node denotes a concept or an attribute and each arc denotes a relationship between a pair of nodes. They can be constructed for specific applications or large general purpose networks can be harnessed such as WordNet which was developed for organizing over 100K words into sets of synonyms (called synsets) according to context plus hypernym and hyponym relationships (Miller 1995), or DBpedia which was developed to capture a wide range of concepts from Wikipedia including people, places, and organizations (Mendes et al. 2011). There are distance measures based on the structure of the graph (Budanitsky and Hirst 2006). ...
Conference Paper
An argument can be regarded as some premises and a claim following from those premises. Normally, arguments exchanged by human agents are enthymemes, which generally means that some premises are implicit. So when an enthymeme is presented, the presenter expects that the recipient can identify the missing premises. An important kind of implicitness arises when a presenter assumes that two symbols denote the same, or nearly the same, concept (e.g. dad and father), and uses the symbols interchangeably. To model this process, we propose the use of semantic distance measures (e.g. based on a vector representation of word embeddings or a semantic network representation of words) to determine whether one symbol can be substituted by another. We present a theoretical framework for using substitutions, together with abduction of default knowledge, for understanding enthymemes based on deductive argumentation, and investigate how this could be used in practice.
Chapter
Note-taking apps on tablets are increasingly becoming the go-to space for managing learning material as a student. In particular, digital note-taking presents certain advantages over traditional pen-and-paper approaches when it comes to organizing and retrieving a library of notes thanks to various search functionalities. This paper presents improvements to the classic textual-input-based search field, by introducing a semantic search that considers the meaning of a user’s search terms and an automatic question-answering process that extracts the answer to the user’s question from their notes for more efficient information retrieval. Additionally, visual methods for finding specific notes are proposed, which do not require the input of text by the user: through the integration of a semantic similarity metric, notes similar to a selected document can be displayed based on common topics. Furthermore, a fully interactive process allows one to search for notes by selecting different types of dynamically generated filters, thus eliminating the need for textual input. Finally, a graph-based visualization is explored for the search results, which clusters semantically similar notes closer together to relay additional information to the user besides the raw search results.
Article
ICT platforms for news production, distribution, and consumption must exploit the ever-growing availability of digital data. These data originate from different sources and in different formats; they arrive at different velocities and in different volumes. Semantic knowledge graphs (KGs) is an established technique for integrating such heterogeneous information. It is therefore well-aligned with the needs of news producers and distributors, and it is likely to become increasingly important for the news industry. This paper reviews the research on using semantic knowledge graphs for production, distribution, and consumption of news. The purpose is to present an overview of the field; to investigate what it means; and to suggest opportunities and needs for further research and development.
Preprint
Full-text available
We are faced with an unprecedented production in scholarly publications worldwide. Stakeholders in the digital libraries posit that the document-based publishing paradigm has reached the limits of adequacy. Instead, structured, machine-interpretable, fine-grained scholarly knowledge publishing as Knowledge Graphs (KG) is strongly advocated. In this work, we develop and analyze a large-scale structured dataset of STEM articles across 10 different disciplines, viz. Agriculture, Astronomy, Biology, Chemistry, Computer Science, Earth Science, Engineering, Material Science, Mathematics, and Medicine. Our analysis is defined over a large-scale corpus comprising 60K abstracts structured as four scientific entities process, method, material, and data. Thus our study presents, for the first-time, an analysis of a large-scale multidisciplinary corpus under the construct of four named entity labels that are specifically defined and selected to be domain-independent as opposed to domain-specific. The work is then inadvertently a feasibility test of characterizing multidisciplinary science with domain-independent concepts. Further, to summarize the distinct facets of scientific knowledge per concept per discipline, a set of word cloud visualizations are offered. The STEM-NER-60k corpus, created in this work, comprises over 1M extracted entities from 60k STEM articles obtained from a major publishing platform and is publicly released https://github.com/jd-coderepos/stem-ner-60k.
Article
Full-text available
We are faced with an unprecedented production in scholarly publications worldwide. Stakeholders in the digital libraries posit that the document-based publishing paradigm has reached the limits of adequacy. Instead, structured, machine-interpretable, fine-grained scholarly knowledge publishing as Knowledge Graphs (KG) is strongly advocated. In this work, we develop and analyze a large-scale structured dataset of STEM articles across 10 different disciplines, viz. Agriculture, Astronomy, Biology, Chemistry, Computer Science, Earth Science, Engineering, Material Science, Mathematics, and Medicine. Our analysis is defined over a large-scale corpus comprising 60K abstracts structured as four scientific entities process, method, material, and data. Thus our study presents, for the first-time, an analysis of a large-scale multidisciplinary corpus under the construct of four named entity labels that are specifically defined and selected to be domain-independent as opposed to domain-specific. The work is then inadvertently a feasibility test of characterizing multidisciplinary science with domain-independent concepts. Further, to summarize the distinct facets of scientific knowledge per concept per discipline, a set of word cloud visualizations are offered. The STEM-NER-60k corpus, created in this work, comprises over 1M extracted entities from 60k STEM articles obtained from a major publishing platform and is publicly released https://github.com/jd-coderepos/stem-ner-60k.
Article
Full-text available
The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions-the Web of Data. In this article we present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. We describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
Article
Full-text available
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest scale semantic tagging effort to date.We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.
Conference Paper
Full-text available
This paper explores the application of restricted relationship graphs (RDF) and statistical NLP techniques to improve named entity annotation in challenging Informal English domains. We validate our approach using on-line forums discussing popular music. Named entity annotation is particularly difficult in this domain because it is characterized by a large number of ambiguous entities, such as the Madonna album “Music” or Lilly Allen’s pop hit “Smile”. We evaluate improvements in annotation accuracy that can be obtained by restricting the set of possible entities using real-world constraints. We find that constrained domain entity extraction raises the annotation accuracy significantly, making an infeasible task practical. We then show that we can further improve annotation accuracy by over 50% by applying SVM based NLP systems trained on word-usages in this domain.
Article
Full-text available
The term "Linked Data" refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions-the Web of Data. In this article, the authors present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
Article
This paper investigates the "named-entity disam- biguation" task on the Web—identifying the refer- ent of a string, found on an arbitrary Web page. The GROUNDER system, introduced in this paper, ad- dresses two challenges not considered by previous work: how to utilize a priori information (e.g., Bill Clinton is more prominent on the Web than Clin- ton County) to improve disambiguation, and how to compose this prior information with contextual evidence. GROUNDER addresses both challenges by leverag- ing the user-contributed knowledge in Wikipedia and providing a novel formulation of the task. On a sample of strings drawn from the Web, GROUNDER achieves precision of 1.0 at recall 0.34, and preci- sion 0.90 at recall 0.60.
The exhaustivity of document descriptions and the specificity of index terms are usually regarded as independent. It is suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning. The effects on retrieval of variations in term specificity are examined, experiments with three test collections showing in particular that frequently-occurring terms are required for good overall performance. It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms. Results for the test collections show that considerable improvements in performance are obtained with this very simple procedure.
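The collection-frequency weighting described in this abstract can be illustrated with a minimal sketch. It assumes the now-common log(N / n) form of the weight, which may differ from the exact formula used in the original article. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Weight each term by log(N / n_t), where N is the number of documents
    and n_t is the number of documents containing the term.
    (Assumed formula; the cited article may use a different variant.)"""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term at most once per document.
        doc_freq.update(set(doc.lower().split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def match_score(query, document, weights):
    """Sum the weights of the query terms that also occur in the document."""
    doc_terms = set(document.lower().split())
    query_terms = set(query.lower().split())
    return sum(weights.get(t, 0.0) for t in query_terms if t in doc_terms)


# Toy collection, for illustration only.
docs = [
    "the web of data",
    "the web of documents",
    "the entropy of printed english",
]
weights = idf_weights(docs)
```

In this toy collection, a term that appears in every document ("the") receives weight zero, while a term that appears in only one document ("entropy") receives the largest weight, so a match on the rarer term contributes more to the score.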
A new method of estimating the entropy and redundancy of a language is described. This method exploits the knowledge of the language statistics possessed by those who speak the language, and depends on experimental results in prediction of the next letter when the preceding text is known. Results of experiments in prediction are given, and some properties of an ideal predictor are developed.
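For orientation, the quantities named in this abstract can be sketched with a much simpler estimate than the prediction experiments it describes. The sketch below computes entropy from single-letter frequencies only and ignores all context, so it is not the article's method. The function names and the 26-letter alphabet size are assumptions made for illustration.

```python
import math
from collections import Counter


def letter_entropy(text):
    """Entropy in bits per letter, estimated from single-letter frequencies.
    This ignores context, unlike the prediction-based method in the abstract."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters)
    counts = Counter(letters)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redundancy(text, alphabet_size=26):
    """Redundancy as 1 - H / H_max, with H_max = log2(alphabet_size).
    The alphabet size of 26 is an assumption for English letters."""
    return 1.0 - letter_entropy(text) / math.log2(alphabet_size)
```

A text that uses two letters equally often has an entropy of one bit per letter under this estimate, and a text that repeats a single letter has an entropy of zero.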
Faceted navigation is a proven technique for supporting exploration and discovery within an information collection. The underlying data model is simple enough to make navigation understandable while at the same time rich enough to make navigation flexible in a wide range of domains. Nonetheless, there remain issues in both the presentation of navigation options in the interface and in how to extend the model to allow more flexible discovery while still retaining understandability. This paper explores both of these issues.
The DBpedia project is a community effort to extract structured information from Wikipedia and to make this information accessible on the Web. The resulting DBpedia knowledge base currently describes over 2.6 million entities. For each of these entities, DBpedia defines a globally unique identifier that can be dereferenced over the Web into a rich RDF description of the entity, including human-readable definitions in 30 languages, relationships to other resources, classifications in four concept hierarchies, various facts as well as data-level links to other Web data sources describing the entity. Over the last year, an increasing number of data publishers have begun to set data-level links to DBpedia resources, making DBpedia a central interlinking hub for the emerging Web of Data. Currently, the Web of interlinked data sources around DBpedia provides approximately 4.7 billion pieces of information and covers domains such as geographic information, people, companies, films, music, genes, drugs, books, and scientific publications. This article describes the extraction of the DBpedia knowledge base, the current status of interlinking DBpedia with other data sources on the Web, and gives an overview of applications that facilitate the Web of Data around DBpedia.