DBpedia: A Multilingual Cross-Domain Knowledge Base
Pablo N. Mendes1, Max Jakob2, Christian Bizer1
1Web Based Systems Group, Freie Universität Berlin, Germany
2Neofonie GmbH, Berlin, Germany
first.last@fu-berlin.de, first.last@neofonie.de
Abstract
The DBpedia project extracts structured information from Wikipedia editions in 97 different languages and combines this information
into a large multi-lingual knowledge base covering many specific domains and general world knowledge. The knowledge base contains
textual descriptions (titles and abstracts) of concepts in up to 97 languages. It also contains structured knowledge that has been extracted
from the infobox systems of Wikipedias in 15 different languages and is mapped onto a single consistent ontology by a community
effort. The knowledge base can be queried using the SPARQL query language and all its data sets are freely available for download.
In this paper, we describe the general DBpedia knowledge base as well as the DBpedia data sets that specifically aim at supporting
computational linguistics tasks. These tasks include Entity Linking, Word Sense Disambiguation, Question Answering, Slot Filling and
Relationship Extraction. These use cases are outlined, pointing out the added value that the structured data of DBpedia provides.
Keywords: Knowledge Base, Semantic Web, Ontology
1. Introduction
Wikipedia has grown into one of the central knowledge
sources of mankind and is maintained by thousands of con-
tributors. Wikipedia articles consist mostly of natural lan-
guage text, but also contain different types of structured in-
formation, such as infobox templates, categorization infor-
mation, images, geo-coordinates, and links to external Web
pages. The DBpedia project (Bizer et al., 2009) extracts
various kinds of structured information from Wikipedia
editions in multiple languages through an open source ex-
traction framework. It combines all this information into a
multilingual multidomain knowledge base. For every page
in Wikipedia, a Uniform Resource Identifier (URI) is cre-
ated in DBpedia to identify an entity or concept being de-
scribed by the corresponding Wikipedia page. During the
extraction process, structured information from the wiki
such as infobox fields, categories and page links is ex-
tracted as RDF triples and added to the knowledge base
as properties of the corresponding URI.
In order to homogenize the description of information in the
knowledge base, a community effort has been initiated to
develop an ontology schema and mappings from Wikipedia
infobox properties to this ontology. This significantly in-
creases the quality of the raw Wikipedia infobox data by
typing resources, merging name variations and assigning
specific datatypes to the values. As of March 2012, there
are mapping communities for 23 languages¹. The En-
glish Language Wikipedia, as well as the Greek, Polish,
Portuguese and Spanish language editions have mapped (to
the DBpedia Ontology) templates covering approximately
80% of template occurrences². Other languages such as
Catalan, Slovenian, German, Georgian and Hungarian have
¹See: http://mappings.dbpedia.org
²See: http://mappings.dbpedia.org/index.php/Mapping_Statistics
covered nearly 60% of template occurrences. As a con-
sequence, most of the facts displayed in Wikipedia pages
via templates are being extracted and mapped to a unified
schema.
In this paper, we describe the DBpedia knowledge base and
the DBpedia data sets that specifically aim at supporting
computational linguistics tasks. These include the Lexical-
ization, Topic Signatures, Topical Concepts and Grammat-
ical Gender data sets.
2. Resources
2.1. The DBpedia Ontology
The DBpedia Ontology organizes the knowledge on
Wikipedia in 320 classes which form a subsumption hierar-
chy and are described by 1,650 different properties. It fea-
tures labels and abstracts for 3.64 million things in up to 97
different languages of which 1.83 million are classified in
a consistent ontology, including 416,000 persons, 526,000
places, 106,000 music albums, 60,000 films, 17,500 video
games, 169,000 organizations, 183,000 species and 5,400
diseases. Additionally, there are 6,300,000 links to external
web pages, 2,724,000 links to images, 740,000 Wikipedia
categories and 690,000 geographic coordinates for places.
The alignment between Wikipedia infoboxes and the ontol-
ogy is done via community-provided mappings that help to
normalize name variation in properties and classes. Hetero-
geneities in the Wikipedia infobox system, like using dif-
ferent infoboxes for the same type of entity (class) or using
different property names for the same property, can be al-
leviated in this way. For example, ‘date of birth’ and ‘birth
date’ are both mapped to the same property birthDate,
and infoboxes ‘Infobox Person’ and ‘Infobox FoundingPer-
son’ have been mapped by the DBpedia community to the
class Person. DBpedia Mappings currently exist for 23
languages, which means that other infobox properties such
as ‘data de nascimento’ or ‘Geburtstag’ – date of birth in
Portuguese and German, respectively – also get mapped to
the global identifier birthDate. That means, in turn,
that information from all these language versions of DB-
pedia can be merged. Knowledge bases for smaller lan-
guages can therefore be augmented with knowledge from
larger sources such as the English edition. Conversely, the
larger DBpedia editions can benefit from more specialized
knowledge from localized editions (Tacchini et al., 2009).
2.2. The Lexicalization Data Set
DBpedia also provides data sets explicitly created to sup-
port natural language processing tasks. The DBpedia Lex-
icalization Data Set provides access to alternative names
for entities and concepts, associated with several scores es-
timating the association strength between name and URI.
Currently, it contains 6.6 million scores for alternative
names.
Three DBpedia data sets are used as sources of name vari-
ation: Titles, Redirects and Disambiguation Links³. La-
bels of the DBpedia resources are created from Wikipedia
page titles, which can be seen as community-approved sur-
face forms. Redirects to URIs indicate synonyms or alter-
native surface forms, including common misspellings and
acronyms. As redirects may point to other redirects, we
compute the transitive closure of a graph built from redi-
rects. Their labels also become surface forms. Disam-
biguation Links provide ambiguous surface forms that are
“confusable” with all resources they link to. Their labels
become surface forms for all target resources in the dis-
ambiguation page. Note that we erase trailing parentheses
from the labels when constructing surface forms. For exam-
ple, the label ‘Copyright (band)’ produces the surface form
‘Copyright’. This means that labels of resources and of
redirects can also introduce ambiguous surface forms, in ad-
dition to the labels coming from titles of disambiguation
pages. The collection of surface forms created as a result of
this step constitutes an initial set of name variations for the
target resources.
We augment the name variations extracted from titles, redi-
rects and disambiguations by collecting the anchor texts
of page links on Wikipedia. Anchor texts are the visible,
clickable text of wiki page links that are specified after a
pipe symbol in the MediaWiki syntax (e.g. [[Apple_
Inc.|Apple]]). By collecting all occurrences of page
links, we can create statistics of co-occurrence for enti-
ties and their name variations. We perform this task by
counting how many times a certain surface form sf has
been used to link to a page uri. We calculate the condi-
tional probabilities p(uri|sf) and p(sf|uri) using maxi-
mum likelihood estimates (MLE). The pointwise mutual in-
formation pmi(sf, uri) is also given as a measure of asso-
ciation strength. Finally, as a measure of the prominence of
a DBpedia resource within Wikipedia, p(uri) is estimated
by the normalized count of incoming page links of a uri in
Wikipedia.
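Purely as an illustration of these estimates (the released data set was in fact produced with the Pig Latin script mentioned later in this section; the function and variable names here are ours), the following Python sketch derives p(uri|sf), p(sf|uri), pmi(sf, uri) and p(uri) from a list of (surface form, URI) pairs collected from page-link anchors:

import math
from collections import Counter

def lexicalization_scores(anchor_links):
    """anchor_links: list of (surface_form, uri) pairs, one per wiki page link."""
    pair_counts = Counter(anchor_links)                   # c(sf, uri)
    sf_counts = Counter(sf for sf, _ in anchor_links)     # c(sf)
    uri_counts = Counter(uri for _, uri in anchor_links)  # c(uri): incoming page links
    total = len(anchor_links)                             # total number of page links

    scores = {}
    for (sf, uri), c in pair_counts.items():
        p_uri_given_sf = c / sf_counts[sf]                # MLE of p(uri|sf)
        p_sf_given_uri = c / uri_counts[uri]              # MLE of p(sf|uri)
        p_uri = uri_counts[uri] / total                   # prominence of the resource
        p_sf = sf_counts[sf] / total
        pmi = math.log((c / total) / (p_sf * p_uri))      # pointwise mutual information
        scores[(sf, uri)] = {"p(uri|sf)": p_uri_given_sf,
                             "p(sf|uri)": p_sf_given_uri,
                             "pmi": pmi, "p(uri)": p_uri}
    return scores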
This data set can be used to estimate ambiguity of phrases,
to help select unambiguous identifiers for ambiguous
phrases, or to provide alternative names for entities, just
to cite a few examples. The DBpedia Lexicalization Data
Set has been used as one of the data sources for developing
DBpedia Spotlight, a general-purpose entity disambigua-
tion system (Mendes et al., 2011b).
³http://wiki.dbpedia.org/Downloads37
dbpedia:Alkane          carbon alkanes atoms
dbpedia:Astronaut       space nasa
dbpedia:Apollo_8        first moon week
dbpedia:Actinopterygii  fish species genus
dbpedia:Anthophyta      forests temperate plants
Figure 1: A snippet of the Topic Signatures Data Set.
By analyzing the DBpedia Lexicalization Data Set, one can
note that approximately 4.4 million surface forms are un-
ambiguous and 392,000 are ambiguous. The overall av-
erage ambiguity per surface form is 1.22 – i.e. the aver-
age number of possible disambiguations per surface form.
Considering only the ambiguous surface forms, the average
ambiguity per surface form is 2.52. Each DBpedia resource
has an average of 2.32 alternative names. These statistics
were obtained from Wikipedia dumps using a script⁴ writ-
ten in Pig Latin (Olston et al., 2008) which allows its exe-
cution in a distributed environment using Hadoop⁵.
2.3. The Topic Signatures Data Set
The Topic Signatures Data Set enables the description of
DBpedia Resources in a more unstructured fashion, as
compared to the structured factual data provided by the
Mapping-based properties. We extract paragraphs that con-
tain wiki links to the corresponding Wikipedia page of each
DBpedia entity or concept. We consider each paragraph as
contextual information to model the semantics of that en-
tity under the Distributional Hypothesis (Harris, 1954). The
intuition behind this hypothesis is that entities or concepts
that occur in similar contexts tend to have similar meanings.
We tokenize and aggregate all paragraphs in a Vector Space
Model (Salton et al., 1975) of terms weighted by their co-
occurrence with the target entity. In our VSM, each entity
is represented by a vector, and each term is a dimension of
this vector. Term scores are computed using the TF*IDF
weight.
We use those weights to select the strongest related terms
for each entity and build topic signatures (Lin and Hovy,
2000). Figure 1 shows examples of topic signatures in our
data set.
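As a minimal sketch of this procedure (naive whitespace tokenization and plain TF*IDF weighting; the actual data set was generated by the extraction framework, and all names below are ours), the topic signatures could be computed as follows:

import math
from collections import Counter

def topic_signatures(entity_paragraphs, k=5):
    """entity_paragraphs: dict mapping a DBpedia URI to the list of paragraphs
    that link to it. Returns the k highest-weighted terms (TF*IDF) per entity."""
    # One bag of words per entity, aggregated over all of its paragraphs.
    term_freq = {uri: Counter(" ".join(paragraphs).lower().split())
                 for uri, paragraphs in entity_paragraphs.items()}

    # Document frequency: in how many entity contexts does a term appear?
    doc_freq = Counter()
    for counts in term_freq.values():
        doc_freq.update(counts.keys())

    n = len(term_freq)
    signatures = {}
    for uri, counts in term_freq.items():
        weights = {t: tf * math.log(n / doc_freq[t]) for t, tf in counts.items()}
        signatures[uri] = sorted(weights, key=weights.get, reverse=True)[:k]
    return signatures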
Topic signatures can be useful in tasks such as Query Ex-
pansion and Document Summarization (Nastase, 2008).
An earlier version of this data set has been successfully em-
ployed to classify ambiguously described images as good
depictions of DBpedia entities (García-Silva et al., 2011).
2.4. The Thematic Concepts Data Set
Wikipedia relies on a category system to capture the idea of
a ‘theme’, a subject that is discussed in its articles. Many
of the categories in Wikipedia are linked to an article that
describes the main topic of that category. We rely on this
information to mark DBpedia entities and concepts that are
‘thematic’, that is, they are the center of discussion for a
category.
⁴Script available at https://github.com/dicode-project/pignlproc
⁵http://hadoop.apache.org
SELECT ?resource
WHERE {
  ?resource dcterms:subject <http://dbpedia.org/resource/Category:Biology> .
}
Figure 2: SPARQL query demonstrating how to retrieve entities and concepts under a certain category.
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
SELECT ?resource
WHERE {
  ?resource dbpedia-owl:wikiPageWikiLink dbpedia:Biology .
}
Figure 3: SPARQL query demonstrating how to retrieve pages linking to topical concepts.
A simple SPARQL query can retrieve all DBpedia re-
sources within a given Wikipedia category (Figure 2). A
variation of this query can use the Thematic Concepts Data
Set to retrieve other DBpedia resources related to a certain
theme (Figure 3). The two queries can be combined with
trivial use of SPARQL UNION. This set of resources can
be used, for instance, for creating a corpus from Wikipedia
to be used as training data for topic classifiers.
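A hedged sketch of this combination in Python, using the SPARQLWrapper library; the endpoint URL is a placeholder for a local mirror, since the page-links data set is not loaded on the public endpoint (see footnote 7):

from SPARQLWrapper import SPARQLWrapper, JSON

# Figures 2 and 3 combined with UNION: resources filed under Category:Biology
# plus resources whose pages link to the thematic concept dbpedia:Biology.
QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
SELECT DISTINCT ?resource WHERE {
  { ?resource dcterms:subject <http://dbpedia.org/resource/Category:Biology> . }
  UNION
  { ?resource dbpedia-owl:wikiPageWikiLink dbpedia:Biology . }
}
"""

endpoint = SPARQLWrapper("http://localhost:8890/sparql")  # placeholder: a local mirror
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
bindings = endpoint.query().convert()["results"]["bindings"]
resources = [b["resource"]["value"] for b in bindings]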
2.5. The Grammatical Gender Data Set
DBpedia contains 416,000 instances of the class Person.
We have created a DBpedia Extractor that uses a simple
heuristic to decide on a grammatical gender for each per-
son extracted. While parsing an article in the English
Wikipedia, if there is a mapping from an infobox in this
article to the class dbpedia-owl:Person, we record the fre-
quency of gender-specific pronouns in their declined forms
(Subject, Object, Possessive Adjective, Possessive Pronoun
and Reflexive) – i.e. he, him, his, himself (masculine) and
she, her, hers, herself (feminine).
dbpedia:Aristotle foaf:gender "male"@en .
dbpedia:Abraham_Lincoln foaf:gender "male"@en .
dbpedia:Ayn_Rand foaf:gender "female"@en .
dbpedia:Andre_Agassi foaf:gender "male"@en .
dbpedia:Anna_Kournikova foaf:gender "female"@en .
dbpedia:Agatha_Christie foaf:gender "female"@en .
Figure 4: A snippet of the Grammatical Gender Data Set.
We assert the grammatical gender for the instance being
extracted if the number of occurrences of masculine pro-
nouns exceeds the number of occurrences of feminine pronouns
by a margin, and vice versa. In order to increase the confi-
dence in the extracted grammatical gender, the current ver-
sion of the data set requires a difference in frequency of at
least 200%. Furthermore, we experimented with a minimum
occurrence of gender-specific pronouns on one page of 5,
4 and 3. The resulting data covers 68%, 75% and 81%,
respectively, of the known instances of persons in DBpe-
dia. Our extraction process assigned the grammatical gen-
der "male" to roughly 85% and "female" roughly 15% of
the people. Figure 4 shows example data.
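The following Python sketch mirrors the heuristic described above; it is not the extractor's actual code, the tokenization is naive, and margin=3.0 encodes one possible reading of the 200% difference requirement:

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def grammatical_gender(article_text, min_pronouns=5, margin=3.0):
    """Count declined gender-specific pronouns and assert a gender only if
    enough pronouns were observed on the page and one count exceeds the
    other by the required margin."""
    tokens = [t.strip(".,;:!?\"'()") for t in article_text.lower().split()]
    m = sum(1 for t in tokens if t in MASCULINE)
    f = sum(1 for t in tokens if t in FEMININE)
    if m + f < min_pronouns:
        return None          # not enough evidence on this page
    if m >= margin * f and m > f:
        return "male"
    if f >= margin * m and f > m:
        return "female"
    return None              # difference below the required margin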
2.6. RDF Links to other Data Sets
DBpedia provides 6.2 million RDF links pointing at records
in other data sets. For instance, links to WordNet synsets
(Fellbaum, 1998) were generated by relating Wikipedia in-
fobox templates and WordNet synsets and adding a cor-
responding link to each entity that uses a specific tem-
plate. DBpedia also includes links to other ontologies and
knowledge bases, including Cyc (Lenat, 1995), Umbel.org,
Schema.org and Freebase.com.
Other useful linked sources are Project Gutenberg⁶, which
offers thousands of free e-books, and the New York Times,
which has begun to publish its inventory of articles collected
over the past 150 years. As of January 2010, 10,000 subject
headings had been shared. The links from DBpedia to au-
thors and texts in Project Gutenberg could be used for back-
ing author identification methods, for instance. Meanwhile,
the links to concepts in the New York Times database en-
able its usage as an evaluation corpus (Sandhaus, 2008) for
Named Entity Recognition and Disambiguation algorithms,
amongst others.
3. Use Cases
In this section, we outline four use cases of the DBpedia
knowledge base in tasks related to computational linguis-
tics and natural language processing.
3.1. Reference Knowledge Base for Disambiguation
Tasks
The existence of a homogenized schema for describing
data in DBpedia, coupled with its origins in the largest
source of multilingual encyclopaedic text available, makes
this knowledge base particularly interesting as a resource
for natural language processing. DBpedia can be used,
for instance, as a reference knowledge base for Entity
Linking (McNamee et al., 2010), and other Word Sense
Disambiguation-related tasks.
For example, the Entity Linking task at TAC-KBP 2011 (Ji
et al., 2011) uses a target knowledge base that can be
automatically mapped to DBpedia via Wikipedia links.
It has been shown that simple entity linking algorithms
can leverage this mapping to obtain a µAVG of 0.827
in the TAC-KBP 2010 and 0.727 in the TAC-KBP 2011 data
sets (Mendes et al., 2011a). A number of academic and
commercial projects already perform Entity Linking di-
rectly to DBpedia (Mendes et al., 2011b; Zemanta Ltd., 2009;
Thomson Reuters, 2008), and others can be mapped to DBpedia via
Wikipedia (Orchestr8 LLC, 2009; Giuliano et al., 2009;
Ferragina and Scaiella, 2010; Ratinov et al., 2011; Han and
Sun, 2011).
One advantage of using DBpedia over Wikipedia as target
knowledge base for evaluations is the DBpedia Ontology.
By providing a hierarchical classification of concepts, DB-
pedia allows one to select a subset of classes on which to
⁶See: http://www.gutenberg.org/
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?person
WHERE {
  ?person rdf:type dbpedia-owl:Person .
}
Figure 5: SPARQL query demonstrating how to select all instances of type Person.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?person ?link
WHERE {
  ?person rdf:type dbpedia-owl:Person .
  ?person dbpedia-owl:wikiPageWikiLink ?link .
}
Figure 6: SPARQL query demonstrating how to select all pages linking to entities of type Person.
focus a particular disambiguation task. With a simple Web
query (Figure 5) one can obtain a list of entities of type
Person or Organization (or even more specific types such
as Politician or School).
Simple extensions to those queries can also retrieve a list
of all Wikipedia pages that link to entities matching those
queries. An example of such a query⁷ is shown in Figure 6.
These pages, along with the in-text links, can be used as
training data for Named Entity Recognition or Entity Link-
ing algorithms, for example. A similar approach is used by
DBpedia Spotlight.
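As a rough illustration of how such training pairs might be harvested from wiki markup (using the mwparserfromhell library; the helper name, the simplified title-to-URI conversion and the way the Person URIs are supplied are our assumptions, not part of DBpedia Spotlight):

import mwparserfromhell

def entity_training_spans(wikitext, person_uris):
    """Return (anchor text, DBpedia URI) pairs for wiki links whose target is
    one of the Person resources selected with the query in Figure 5."""
    spans = []
    for link in mwparserfromhell.parse(wikitext).filter_wikilinks():
        target = str(link.title).strip().replace(" ", "_")  # simplified mapping
        uri = "http://dbpedia.org/resource/" + target
        anchor = str(link.text) if link.text else str(link.title)
        if uri in person_uris:
            spans.append((anchor, uri))
    return spans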
3.2. Question Answering: World Knowledge
Automatic answering of natural language questions gains
importance as information needs of non-technical users
grow in complexity. Complex questions have been tra-
ditionally approached through the usage of databases and
query languages. However, such query languages may
not be a viable option for non-technical users. Moreover,
alongside structured information in databases, the amount
of information available in natural language increases at a
fast pace. The complexity of retrieving required informa-
tion and the complexity of interpreting results call for more
than classical document retrieval.
DBpedia contains structured information about a variety of
fields and domains from Wikipedia. This information can
be leveraged in question answering systems, for example,
to map natural language to a target query language. The
QALD-1 Challenge (qal, 2011) was an evaluation cam-
paign where natural language questions were translated to
SPARQL queries, aiming at retrieving factual answers for
those questions. As part of this task, it is necessary to con-
strain on certain ontology properties (e.g. the gender and
age of a person) and it can be beneficial to use the DBpedia
⁷Please note that the wikiPageLinks data set is not loaded in
the public SPARQL endpoint, but is available for download and
local usage.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
SELECT DISTINCT ?widow
WHERE {
  ?politician rdf:type dbpedia-owl:Person .
  ?politician dbpedia-owl:occupation dbpedia:Politician .
  ?politician dbpedia-owl:deathPlace dbpedia:Texas .
  ?politician dbpedia-owl:spouse ?widow .
}
Figure 7: SPARQL query for the QALD-1 question ‘Who is widow to a politician that died in Texas?’
ontology. For example, Figure 7 shows the SPARQL query
for the question ‘Who is widow to a politician that died in
Texas?’ (qal, 2011).
3.3. Slot Filling and Relationship Extraction
Since the DBpedia knowledge base also contains structured
information extracted from infoboxes, it can be used as a ref-
erence knowledge base for other tasks such as slot filling
and relationship extraction. Through mappings of several
infobox fields to one ontology property, a more harmonized
view of the data is provided, allowing researchers to exploit
Wikipedia to a larger extent, e.g. attempting multilingual
relationship extraction.
3.4. Information Retrieval: Query Expansion
Understanding keyword queries is a difficult task, espe-
cially due to the fact that such queries usually contain very
few keywords that could be used for disambiguating am-
biguous words. While users are typing keywords, current
search engines offer a drop-down box with suggestions of
common keyword combinations that relate to what the user
is typing.
For an ontology-based system that interfaces with the users
through keyword searches, such ‘auto-suggest’ functional-
ity can be achieved through the use of the data in the DBpe-
dia Lexicalization Data Set. Figure 8 shows how to retrieve
all resources that are candidate disambiguations for a sur-
face form, along with a score of association strength. The
available scores are described in Section 2.2.
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?resource ?score WHERE {
  GRAPH ?g {
    ?resource skos:altLabel ?label .
  }
  ?g <http://dbpedia.org/spotlight/score> ?score .
  FILTER (REGEX(?label, "apple", "i"))
}
Figure 8: SPARQL query for retrieving candidate disambiguations for the string ‘apple’.
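A small, hypothetical helper (the names below are ours) shows how the rows returned by such a query could drive an auto-suggest box: filter the candidate labels by the prefix typed so far and rank them by the association-strength score from the Lexicalization Data Set:

def autosuggest(candidates, typed, k=5):
    """candidates: (label, resource, score) rows, e.g. the bindings returned
    by the query in Figure 8. Returns the k most strongly associated
    resources whose label starts with the text typed so far."""
    typed = typed.lower()
    matching = [(label, resource, score)
                for label, resource, score in candidates
                if label.lower().startswith(typed)]
    return sorted(matching, key=lambda row: row[2], reverse=True)[:k]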
4. Conclusion
DBpedia is a multilingual multidomain knowledge base
that can be directly used in many tasks in natural language
processing. All DBpedia data sets are freely available under
the terms of the Creative Commons Attribution-ShareAlike
3.0 License and the GNU Free Documentation License and
can be downloaded from the project website⁸. Furthermore,
through the use of W3C-recommended Web technologies,
a subset of the DBpedia knowledge base is also available
for online usage through Web queries⁹.
5. Acknowledgements
We wish to thank Robert Isele and the developers of the
DBpedia Extraction Framework, Paul Kreis and the in-
ternational team of DBpedia Mapping Editors, as well as
Dimitris Kontokostas and the DBpedia Internationalization
team for their invaluable work on the DBpedia project.
This work was partially funded by the European Commis-
sion through FP7 grants LOD2 - Creating Knowledge out of
Interlinked Data (Grant No. 257943) and DICODE - Mas-
tering Data-Intensive Collaboration and Decision Making
(Grant No. 257184).
6. References
Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören
Auer, Christian Becker, Richard Cyganiak, and Sebas-
tian Hellmann. 2009. DBpedia - A crystallization point
for the Web of Data. Journal of Web Semantics: Science,
Services and Agents on the World Wide Web, (7):154–
165.
Christiane Fellbaum, editor. 1998. WordNet An Electronic
Lexical Database. The MIT Press, Cambridge, MA ;
London, May.
Paolo Ferragina and Ugo Scaiella. 2010. Tagme: on-the-
fly annotation of short text fragments (by wikipedia en-
tities). In Proceedings of the 19th ACM international
conference on Information and knowledge management,
CIKM ’10, pages 1625–1628, New York, NY, USA.
ACM.
Andrés García-Silva, Max Jakob, Pablo N. Mendes, and
Christian Bizer. 2011. Multipedia: enriching DBpedia
with multimedia information. In Proceedings of the sixth
international conference on Knowledge capture, K-CAP
’11, pages 137–144, New York, NY, USA. ACM.
Claudio Giuliano, Alfio Massimiliano Gliozzo, and Carlo
Strapparava. 2009. Kernel methods for minimally super-
vised wsd. Comput. Linguist., 35:513–528, December.
Xianpei Han and Le Sun. 2011. A generative entity-
mention model for linking entities with knowledge base.
In Proceedings of the 49th Annual Meeting of the Associ-
ation for Computational Linguistics: Human Language
Technologies, pages 945–954, Portland, Oregon, USA,
June. Association for Computational Linguistics.
Zellig Harris. 1954. Distributional structure. Word,
10(23):146–162.
Heng Ji, Ralph Grishman, and Hoa Dang. 2011. Overview
of the TAC2011 Knowledge Base Population Track. In
Proceedings of the Text Analysis Conference (TAC 2011).
⁸http://dbpedia.org/downloads
⁹http://dbpedia.org/sparql
Douglas Lenat. 1995. Cyc: A large-scale investment in
knowledge infrastructure. Communications of the ACM,
38(11):33–38, November.
Chin-Yew Lin and Eduard H. Hovy. 2000. The automated
acquisition of topic signatures for text summarization. In
COLING, pages 495–501.
Zemanta Ltd. 2009. Zemanta api overview.
http://www.zemanta.com/api/.
Paul McNamee, Hoa Trang Dang, Heather Simpson,
Patrick Schone, and Stephanie M. Strassel. 2010. An
evaluation of technologies for knowledge base popula-
tion. In LREC. European Language Resources Associa-
tion.
Pablo N. Mendes, Joachim Daiber, Max Jakob, and Chris-
tian Bizer. 2011a. Evaluating dbpedia spotlight for the
tac-kbp entity linking task. In Proceedings of the TAC-
KBP 2011 Workshop.
Pablo N. Mendes, Max Jakob, Andrés García-Silva, and
Christian Bizer. 2011b. DBpedia Spotlight: Shedding
Light on the Web of Documents. In Proceedings of the
7th International Conference on Semantic Systems (I-
Semantics).
Vivi Nastase. 2008. Topic-driven multi-document summa-
rization with encyclopedic knowledge and spreading ac-
tivation. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing, EMNLP ’08,
pages 763–772, Stroudsburg, PA, USA. Association for
Computational Linguistics.
Christopher Olston, Benjamin Reed, Utkarsh Srivastava,
Ravi Kumar, and Andrew Tomkins. 2008. Pig latin:
a not-so-foreign language for data processing. In Pro-
ceedings of the 2008 ACM SIGMOD international con-
ference on Management of data, SIGMOD ’08, pages
1099–1110, New York, NY, USA. ACM.
Orchestr8 LLC. 2009. AlchemyAPI. http://www.
alchemyapi.com/, retrieved on 11.12.2010.
2011. Proceedings of 1st Workshop on Question Answer-
ing over Linked Data (QALD-1), collocated with the 8th
Extended Semantic Web Conference (ESWC 2011), Her-
aklion, Greece, 6.
Lev-Arie Ratinov, Dan Roth, Doug Downey, and Mike An-
derson. 2011. Local and global algorithms for disam-
biguation to wikipedia. In ACL, pages 1375–1384.
Thomson Reuters. 2008. OpenCalais: Connect. Every-
thing. http://www.opencalais.com/about,
retrieved on 11.12.2010.
G. Salton, A. Wong, and C. S. Yang. 1975. A vector space
model for automatic indexing. Communications of the
ACM, 18:613–620, November.
Evan Sandhaus. 2008. The New York Times Annotated
Corpus.
Eugenio Tacchini, Andreas Schultz, and Christian Bizer.
2009. Experiments with wikipedia cross-language data
fusion. volume 449 of CEUR Workshop Proceedings
ISSN 1613-0073, June.