DBpedia Ontology Enrichment for Inconsistency Detection

Gerald Töpper
Hasso Plattner Institute
Prof.-Dr.-Helmert-Str. 2–3
14482 Potsdam, Germany
gerald.toepper@student.hpi.uni-potsdam.de

Magnus Knuth
Hasso Plattner Institute
Prof.-Dr.-Helmert-Str. 2–3
14482 Potsdam, Germany
magnus.knuth@hpi.uni-potsdam.de

Harald Sack
Hasso Plattner Institute
Prof.-Dr.-Helmert-Str. 2–3
14482 Potsdam, Germany
harald.sack@hpi.uni-potsdam.de
ABSTRACT
In recent years the Web of Data has experienced extraordinary growth: an increasing amount of Linked Data is available on the World Wide Web (WWW) and new use cases emerge continually. However, the provided data is only valuable if it is accurate and free of contradictions. One essential part of the Web of Data is DBpedia, which covers the structured data of Wikipedia. Due to its automatic extraction from Wikipedia resources that have been created by various contributors, DBpedia data is often error-prone. In order to enable the detection of inconsistencies, this work focuses on enriching the DBpedia ontology by statistical methods. Taking the enriched ontology as a basis, the Wikipedia data extraction process is adapted so that inconsistencies are detected during extraction. Suitable correction suggestions should encourage users to resolve existing errors and thus create a knowledge base of higher quality.
Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: On-line Information Services; I.2.4 [Computing Methodologies]: Knowledge Representation Formalisms and Methods—Semantic Networks

Keywords
DBpedia, Linked Data, Data Cleansing, Ontology Enrichment
1. INTRODUCTION
With the continuous growth of Linked Data on the World Wide Web (WWW) and the increase of web applications that consume Linked Data, the quality of Linked Data resources has become a relevant issue. Recent initiatives (e. g. the Pedantic Web group, http://pedantic-web.org/) uncovered various defects and flaws in Linked Data resources.
As one essential multi-domain part of the Web of Data, DBpedia is broadly used and its data quality affects a wide range of web applications. But due to the automatic extraction from Wikipedia resources that have been created by a large number of non-expert users, its data is partly incorrect or incomplete.
In this work we propose an approach to identify inconsistencies in the DBpedia dataset based on an improved DBpedia ontology. To this end, the ontology is enriched with axioms that have been identified with methods from the field of Inductive Logic Programming (ILP). By applying the enriched ontology during the extraction process it is possible to induce contradictions that point to incorrect facts, which demand correction either in the original Wikipedia page, in the DBpedia ontology, or in the mappings (DBpedia mappings wiki: http://mappings.dbpedia.org/) that have been used for extraction.
The paper is organized as follows: Section 2 recapitulates previous work in the field of error detection and correction in Linked Data as well as ontology enrichment. The proposed methods applied for enriching the DBpedia ontology are presented in Section 3. This ontology is then used for the detection of inconsistencies, which has been integrated into the DBpedia Extraction Framework (http://sourceforge.net/projects/dbpedia/develop) as presented in Section 4. Section 5 assesses the results achieved for ontology enrichment and error detection. Section 6 summarizes the achievements of this work and provides an outlook for further research.
2. RELATED WORK
The detection of inconsistencies and errors within Linked Data has been the subject of various recent studies. Hogan et al. have focused on different types of errors referring to accessibility, syntactical correctness, and consistency of published RDF data [1]. For these error types, proposals have been made on how publishers can avoid and users can deal with such errors. Similarly, in [2] a system has been presented which efficiently creates statistics on the accessibility of RDF documents and SPARQL endpoints as well as on syntactical errors. The system periodically updates the statistics and publishes them on the web (http://stats.lod2.eu/).
A related approach to detect inconsistencies in Linked Data by means of logical reasoning is presented by Péron et al. in [3], which searches for inconsistency patterns using SPARQL queries applied to the deductive closure of the graph of all inferred rules. Their definition of an inconsistency regards an absent type assignment as inconsistent, whereas our approach additionally demands classes to be explicitly disjoint. Furthermore, we automatically enhance the underlying ontology beforehand with such disjointness constraints and missing domain and range restrictions.
In this work, however, the focus lies on the detection of inconsistencies within DBpedia, which is achieved by automatic semantic enrichment of the underlying ontology. Apart from the manual enrichment of ontologies, many approaches have been developed for the purpose of automatic enrichment. Popular methods for the automatic construction of ontologies from semi- and unstructured data are Natural Language Processing (NLP) and Machine Learning (ML). Our approach follows existing methods from Inductive Logic Programming (ILP), a subfield of ML, which induces a hypothesis from background knowledge as well as positive and negative examples.
In [4], which uses ILP, association rules well known in the field of data mining are constructed and translated into axioms. In that context, axioms are created which express a subclass relationship or specify the domain, range, or transitivity of a role. Existing rdf:type statements serve as a basis for the generation of the transaction tables, from which the association rules are produced. Association rules are exploited as well in [5], where disjoint classes are identified. ILP is also applied in the DL-Learner framework (http://dl-learner.org), which learns complex class descriptions on the basis of instances [6]. ORE (http://ore-tool.net) uses that framework as a foundation and learns axioms which express a subclass relationship or a class equivalence. An integrated reasoner identifies resulting inconsistencies within the ontology. Both the inconsistency and suggestions on how to solve it are presented to the user, where one possible solution is the deletion of an acquired axiom [7].
Haase et al. describe an approach that keeps generated ontologies consistent [8]. For this purpose the automatically acquired knowledge is enriched with annotations, which comprise a declaration about the correctness and the relevance of the axioms. Subsequently arising inconsistencies are handled immediately. The result is a knowledge base that is consistent in any case and semantically correct.
In [9] four different approaches for handling inconsistencies in changing ontologies have been surveyed: the evolution of a consistent ontology, the reparation of inconsistencies, reasoning in the presence of inconsistencies, and multi-version reasoning. With respect to the DBpedia ontology, the second variant, repairing the inconsistencies through a continuous process consisting of the identification and the resolution of inconsistencies, seems to be feasible. Due to the huge amount of data in the DBpedia knowledge base, ignoring the inconsistencies is unavoidable in some cases, which is why the third variant would apply as well.
Crowdsourcing is another approach that allows human intelligence to be applied for the detection of errors in knowledge bases. The quiz WhoKnows? [10] generates questions out of the mapping-based dataset of DBpedia. In case a question appears odd, the player has the chance to report this. Facts, i. e. RDF triples, applied in frequently reported questions indicate erroneous data. In [11] a technique is presented which uses this information for the creation of patch requests. As a use case, WhoKnows? was extended in a way that a player is able to state which fact in DBpedia is obviously wrong or possibly missing. According to this information a suitable patch request is generated and submitted to a central repository. The collected patch requests can be applied to the corresponding dataset and thus improve the quality of the data.
3. UPGRADING THE DBPEDIA ONTOLOGY
Multiple levels of errors might occur in automatically generated RDF knowledge bases. The first and most obvious class are syntactic errors in RDF that can be detected with the help of a simple RDF parser/validator. There are also some syntactic errors in DBpedia, e. g. incorrect date information.
One level above are logical errors. They are caused by contradicting RDF triples and can be identified with the help of a reasoner, such as Pellet (http://clarkparsia.com/pellet/) or FaCT++ (http://owl.man.ac.uk/factplusplus/). To produce a logical contradiction, the negation of facts or the construction of binding constraints must be possible. With RDF(S) alone the construction of plain logical contradictions is not possible.
On the upper level, semantic errors comprise facts that do not correspond to facts in the real world. These inconsistencies are hardest to detect since they are usually logically correct and demand real-world knowledge for identification. Semantic errors represent the majority of flaws in the DBpedia dataset. They emerge from the difficulty of parsing the broad range of Wikipedia infoboxes correctly. An example of such an error is the fact

dbp:2666_%28novel%29 dbo:publisher dbp:Barcelona .

This RDF triple refers to the fact that the novel “2666” has been published by Barcelona, which is obviously wrong, because Barcelona is a city while a publisher should be a person or a publishing company. As shown in Figure 1, this fact results from the composition of the Wikipedia infobox (cf. http://en.wikipedia.org/wiki/2666?oldid=480743891) that quotes “Editorial Anagrama” as the actual publisher of the book; in absence of a Wikipedia entry for “Editorial Anagrama”, an additional link to Barcelona, the residence of the publishing company, is given. Since links are preferentially treated by the automatic extractor, dbp:Barcelona is chosen for the extracted triple’s object.
Our approach to detect such semantic inconsistencies automatically is to transform semantic errors into logical ones by extending the axioms of the underlying ontology in order to cause a logical contradiction that can then be recognised by a reasoner.

Figure 1: Extract of the infobox in the article 2666
How this can be achieved will be demonstrated using the example mentioned above. Apart from the extracted triple, additional information is known about the property dbo:publisher and the object entity dbp:Barcelona:

dbp:2666_%28novel%29 dbo:publisher dbp:Barcelona .
dbo:publisher rdfs:range dbo:Company .
dbp:Barcelona rdf:type dbo:Settlement .

So far, this tripleset is not inconsistent. Instead, a reasoner would deduce dbo:Company as a new rdf:type for dbp:Barcelona, which is no contradiction to the original type dbo:Settlement, as long as both classes are not marked disjoint. By adding the disjointness axiom

dbo:Company owl:disjointWith dbo:Settlement .

the inconsistency of the tripleset can be deduced.
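
The mechanism can be made concrete with the following minimal sketch. It is not the authors' implementation, just an illustration in Python that replays the example: the range restriction lets a new type be inferred for the object, and the disjointness axiom turns the semantic error into a detectable contradiction.

# Minimal illustrative sketch (not the authors' implementation) of the example above.

# ABox: declared rdf:type statements
types = {"dbp:Barcelona": {"dbo:Settlement"}}

# TBox: range restriction and disjointness axiom from the example
range_of = {"dbo:publisher": "dbo:Company"}
disjoint = {frozenset({"dbo:Company", "dbo:Settlement"})}

# The extracted (erroneous) triple
s, p, o = "dbp:2666_%28novel%29", "dbo:publisher", "dbp:Barcelona"

# Step 1: a reasoner would infer rdf:type dbo:Company for the object via rdfs:range.
inferred = {range_of[p]} if p in range_of else set()

# Step 2: the inferred type contradicts a declared type iff the class pair is disjoint.
conflicts = [(t, i) for t in types.get(o, set()) for i in inferred
             if frozenset({t, i}) in disjoint]

print("inconsistent" if conflicts else "consistent", conflicts)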
Concerning formal restrictions, the current state of the DBpedia ontology is a stumbling block, since there are no domain and range restrictions for a number of properties and no class disjointness axioms included. In the following, methods are presented to derive this missing information from the DBpedia dataset. For this task the English DBpedia 3.7 dataset [12] has been used. In order to improve the DBpedia ontology, those methods determine new domain and range restrictions as well as class disjointness axioms that can be added to the ontology.
3.1 Property domain restrictions
To be able to deduce inconsistencies related to property domains, domain restrictions have to be specified explicitly. However, approximately 16% of all properties in the DBpedia ontology lack the declaration of an rdfs:domain value. Additionally, a few properties exist whose domain does not correspond to the common usage of the property in the ABox. Thus, for all properties of the DBpedia ontology new domains have been determined. Subsequently, a metric is presented that is based on the ABox of the ontology.
Let $KB = \{(s\,p\,o) : s \in E \cup C \wedge p \in P \wedge o \in E \cup L \cup C\}$ be the knowledge base, $E$ the set of all entities, $P$ the set of all properties, $p \in P$, $L$ the set of all literals, $C$ the set of all classes, and $c \in C$. The metric $md_{p,c}$ indicates whether class $c$ is the domain of property $p$. As shown in Equation 1, it is calculated from the number of triples $(s\,p\,o) \in KB$ whose subject $s \in E$ belongs to the class $c$, relative to the number of triples $(s\,p\,o) \in KB$ whose subject $s \in E$ belongs to any class more specific than owl:Thing (abbreviated $\top$ in the equations). The equations use the abbreviation $(s\ \mathrm{a}\ c)$ for expressing the fact that the entity $s \in E$ belongs to the class $c$.

$$md_{p,c} = \frac{|\{(s\,p\,o) : (s\,p\,o) \in KB \wedge (s\ \mathrm{a}\ c) \in KB\}|}{|\{(s\,p\,o) : (s\,p\,o) \in KB \wedge (s\ \mathrm{a}\ d) \in KB \wedge d \neq \top\}|} \qquad (1)$$
Because the subjects that occur with the property $p$ belong to more than one rdf:type, a set of classes with their respective $md_{p,c}$ values is generated. The higher the value of $md_{p,c}$, the more frequently property $p$ is used for subjects that belong to class $c$. The class with the highest value ($\max md_p$) is defined as the domain of the property. If there are multiple classes possessing the highest value and those classes are in a subclass relationship, the most specific class will be declared as the domain. Some properties are applied universally in different domains, which is why a certain class is not determinable as rdfs:domain. Due to the fact that only atomic classes can be specified in the DBpedia mappings wiki, owl:Thing is the only possible domain for those generic properties. In case $\max md_p$ lies under a certain threshold $\tau_{md}$, owl:Thing is defined as the domain of the property.
The appropriate threshold $\tau_{md} = 0.96$ has been determined via a randomized analysis. For most properties $\max md_p$ equals 1.0, meaning that such a property is non-generic and a concrete class is determinable as a domain. In order to obtain a meaningful sample, all properties with $\max md_p = 1.0$ are ignored for the creation of the sample. Out of the remaining properties 10%, i. e. 15 properties, are randomly chosen. After the manual classification of whether these properties are generic or a concrete class is determinable, the threshold $\tau_{md}$ is determined taking the classification and the $\max md_p$ value of these properties into account. Using this threshold the domain for 1,363 DBpedia properties has been determined, of which 107 are used rather generically, which is why owl:Thing is selected as their domain.
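
As an illustration of Equation 1 and the threshold rule, the following sketch derives a domain suggestion for a property. It is an assumed, simplified reimplementation operating on hypothetical toy data, not the code used in the extraction framework.

from collections import Counter, defaultdict

# Hypothetical toy data: extracted triples and rdf:type assignments.
triples = [
    ("dbp:Some_Novel", "dbo:publisher", "dbp:Some_Company"),
    ("dbp:Another_Novel", "dbo:publisher", "dbp:Other_Company"),
]
types = defaultdict(set, {
    "dbp:Some_Novel": {"dbo:Book", "dbo:WrittenWork"},
    "dbp:Another_Novel": {"dbo:Book"},
})

TAU_MD = 0.96  # threshold reported in the paper

def domain_of(prop):
    """Choose an rdfs:domain for `prop` via the md_{p,c} metric (Equation 1)."""
    class_counts = Counter()
    typed_total = 0  # triples whose subject has a type more specific than owl:Thing
    for s, p, o in triples:
        if p != prop or not types[s]:
            continue
        typed_total += 1
        for c in types[s]:
            class_counts[c] += 1
    if typed_total == 0:
        return "owl:Thing"
    md = {c: n / typed_total for c, n in class_counts.items()}
    best_class, best_value = max(md.items(), key=lambda kv: kv[1])
    # Tie-breaking towards the most specific class of a subclass chain is omitted here.
    return best_class if best_value >= TAU_MD else "owl:Thing"

print(domain_of("dbo:publisher"))  # -> dbo:Book (md = 1.0 for dbo:Book)

The range metric of Section 3.2 works analogously, with the object's types counted instead of the subject's.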
3.2 Property range restrictions
The metric $mr_{p,c}$, which indicates whether $c$ is the range of an object property $p$, is quite similar to the approach shown before that identifies the domain of a property.

Let $KB$ again be the knowledge base, $E$ the set of all entities, $OP$ the set of all object properties, $p \in OP$, $C$ the set of all classes and $c \in C$. The metric $mr_{p,c}$ is calculated from the number of triples $(s\,p\,o) \in KB$ whose object $o \in E$ belongs to the class $c$, relative to the number of triples $(s\,p\,o) \in KB$ whose object $o \in E$ belongs to any class more specific than owl:Thing:

$$mr_{p,c} = \frac{|\{(s\,p\,o) : (s\,p\,o) \in KB \wedge (o\ \mathrm{a}\ c) \in KB\}|}{|\{(s\,p\,o) : (s\,p\,o) \in KB \wedge (o\ \mathrm{a}\ d) \in KB \wedge d \neq \top\}|} \qquad (2)$$

On the basis of the highest value ($\max mr_p$) and the threshold $\tau_{mr}$, a range of the property $p$ is specified, which can either be a class of the DBpedia ontology or owl:Thing.
The threshold $\tau_{mr} = 0.77$, which has been investigated by means of a randomized analysis, seems appropriate. The procedure for finding the threshold is comparable to the one applied to the domain of properties. With this threshold 592 DBpedia properties have been classified in terms of their range. Due to their generality, 82 of them have received the range owl:Thing only.
3.3 Class disjointness axioms
For the recognition of disjoint classes, the Vector Space Model (VSM) [13] from Information Retrieval (IR) has been applied. It can be used for measuring the relevance of a document with respect to a query as well as the similarity of two documents. Likewise, the similarity of two ontology classes can be computed, from which the disjointness of both classes is deduced in case their similarity value lies below a given threshold.

In IR, similar documents contain equivalent terms. Analogously, entities of similar classes occur more frequently with the same properties. An RDF triple containing an object property as a predicate has both an entity as a subject and an entity as an object. Semantically, there is a difference whether an entity occurs as a subject or as an object in association with a property. Thus, for every object property the corresponding inverse property has to be considered.
Let $DP$ be the set of all datatype properties and $OP$ the set of all object properties. Consequently, $IOP = \{p' : p' \equiv p^{-1} \wedge p \in OP\}$ denotes the set of all inverse object properties. The set of all properties $P$ results from the union of all datatype, object, and inverse object properties: $P = DP \cup OP \cup IOP = \{p_1, p_2, \ldots, p_k, \ldots, p_n\}$. Furthermore, let $C = \{c_1, c_2, \ldots, c_i, \ldots, c_m\}$ be the set of all classes. The class vector $\vec{v}_{c_i} = (w_{c_i,p_1}, w_{c_i,p_2}, \ldots, w_{c_i,p_k}, \ldots, w_{c_i,p_n})$ is an $n$-dimensional vector that corresponds to the class $c_i$, whereby $w_{c_i,p_k}$ represents the weight of a property $p_k$ in a class $c_i$.

In IR, the weight of a term in a document for a given set of documents can be determined based on the term frequency–inverse document frequency (TF-IDF). Comparably, the weight $w_{c,p}$ of a property $p$ in a class $c$ can be calculated from the product of the property frequency (PF) and the inverse class frequency (ICF):

$$w_{c,p} = pf_{c,p} \cdot icf_p \qquad (3)$$
In this context, the PF $pf_{c,p}$ is the absolute frequency of a property $p$ along with the entities of the class $c$. Let $KB$ be the knowledge base, then the PF equals the number of triples $(s\,p\,o) \in KB$ whose subject $s$ belongs to the class $c$:

$$pf_{c,p} = |\{(s\,p\,o) : (s\,p\,o) \in KB \wedge (s\ \mathrm{a}\ c) \in KB\}| \qquad (4)$$

In IR, sublinear TF scaling is used to take into account that multiple occurrences of the same term in a document do not imply a proportionally higher relevance than a single occurrence. Similarly, a property that occurs with many entities of the same class is not proportionally more relevant for the class than a property that occurs with only one entity of the class. For that reason, sublinear PF scaling is applied, which is calculated from the logarithm of the PF $pf_{c,p}$:

$$wpf_{c,p} = \begin{cases} 1 + \log pf_{c,p}, & \text{if } pf_{c,p} > 0 \\ 0, & \text{else} \end{cases} \qquad (5)$$
The ICF $icf_p$ measures the general relevance of a property $p$ for the complete knowledge base. Let $KB$ be the knowledge base, then the ICF equals the logarithm of the ratio between the number of all classes $|C|$ and the number of classes $c$ whose entities occur with the property $p$. In case a property $p$ occurs with entities of only a few classes, the ICF is accordingly higher:

$$icf_p = \log \frac{|C|}{|\{c : c \in C \wedge (s\,p\,o) \in KB \wedge (s\ \mathrm{a}\ c) \in KB\}|} \qquad (6)$$
Analogous to document similarity in IR, the similarity of two classes $c_i$ and $c_j$ can be computed by means of the cosine similarity of their class vectors $\vec{v}_{c_i}$ and $\vec{v}_{c_j}$:

$$sim_{c_i,c_j} = \frac{\vec{v}_{c_i} \cdot \vec{v}_{c_j}}{|\vec{v}_{c_i}|\,|\vec{v}_{c_j}|} = \frac{\sum_{k=1}^{n} w_{c_i,p_k}\, w_{c_j,p_k}}{\sqrt{\sum_{k=1}^{n} w_{c_i,p_k}^2}\ \sqrt{\sum_{k=1}^{n} w_{c_j,p_k}^2}} \qquad (7)$$
The similarity value $sim_{c_i,c_j}$ of two classes $c_i$ and $c_j$ is normalized between 0 and 1, since $w_{c_i,p_k} \geq 0$ and $w_{c_j,p_k} \geq 0$ holds for all $p_k$. A random sample has pointed out that $\tau_{sim} = 0.17$ seems to be an appropriate threshold.
Consequently, all class pairs $(c_i, c_j)$ with $sim_{c_i,c_j} \leq 0.17$ can be considered as disjoint, which results in 37,091 disjoint pairs of classes identified in the DBpedia ontology.
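
The weighting and similarity computation of Equations 3–7 can be sketched as follows. This is an illustrative reimplementation on hypothetical toy counts, not the code used for the reported experiments; inverse object properties are assumed here to be encoded as separate property names such as "dbo:publisher^-1".

import math

TAU_SIM = 0.17  # disjointness threshold reported in the paper

# Hypothetical toy input: absolute property frequencies pf_{c,p} per class.
pf = {
    "dbo:Settlement": {"dbo:country": 120, "dbo:populationTotal": 100, "dbo:publisher^-1": 3},
    "dbo:Company":    {"dbo:foundingYear": 80, "dbo:publisher^-1": 60},
    "dbo:Book":       {"dbo:author": 90, "dbo:publisher": 90},
}

classes = list(pf)
properties = sorted({p for counts in pf.values() for p in counts})

def icf(p):
    """Inverse class frequency (Equation 6)."""
    df = sum(1 for c in classes if pf[c].get(p, 0) > 0)
    return math.log(len(classes) / df)

def weight(c, p):
    """Sublinear PF scaling times ICF (Equations 3 and 5)."""
    f = pf[c].get(p, 0)
    wpf = 1 + math.log(f) if f > 0 else 0.0
    return wpf * icf(p)

def similarity(ci, cj):
    """Cosine similarity of the class vectors (Equation 7)."""
    vi = [weight(ci, p) for p in properties]
    vj = [weight(cj, p) for p in properties]
    dot = sum(a * b for a, b in zip(vi, vj))
    norm = math.sqrt(sum(a * a for a in vi)) * math.sqrt(sum(b * b for b in vj))
    return dot / norm if norm else 0.0

disjoint_pairs = [(ci, cj)
                  for i, ci in enumerate(classes)
                  for cj in classes[i + 1:]
                  if similarity(ci, cj) <= TAU_SIM]
print(disjoint_pairs)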
4. DETECTION OF INCONSISTENCIES
The MappingExtractor, as part of the DBpedia Extraction Framework, extracts the content of infoboxes and tables of a Wikipedia article and maps the resulting data onto the DBpedia ontology. As an outcome, RDF triples are generated which represent the statements in the infoboxes and tables as well as the ontology class of a distinct entity (via rdf:type). The identification of inconsistencies with respect to property rdfs:domain and rdfs:range has been implemented and integrated within the extraction process of a Wikipedia dump. The extracted data are examined after all the pages of the Wikipedia dump have been processed. That is because checking the consistency of the range of a property in an RDF triple needs information about the rdf:type of the object, which might be extracted later.
During extraction all the triples generated by the MappingExtractor are stored within main memory. After the extraction has finished, consistency checking is performed in parallel. For every single triple it has to be verified whether an rdf:type of the subject and the rdfs:domain value of the property are disjoint. Analogously, it is verified for all triples containing an object property whether an rdf:type of the object and the rdfs:range value of the predicate are disjoint. In case a violation of this rule is detected, a suggestion to correct the inconsistency is created. Currently, the suggestions are stored within a database for further manual processing. Due to the huge amount of data it is not feasible to apply a reasoner such as Pellet for the consistency checks. Instead, violations are identified expediently by an own implementation in the way described above. The complete consistency check of all triples extracted from the Wikipedia dump requires six minutes on a server with 16 cores (16x Intel(R) Core(TM) i7 CPU 930 @ 2.80 GHz) and 96 GB main memory.
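
A simplified sketch of this check is given below. It is an illustrative Python version rather than the framework's actual implementation, and it assumes the extracted triples, rdf:type assignments, domain/range restrictions, and disjointness axioms are already held in memory; the usage example is modeled on the Kamen Rider J case discussed in Section 5.2, with a hypothetical company resource name.

def find_violations(triples, types, domain_of, range_of, disjoint):
    """Report domain and range restriction violations among extracted triples.

    triples:   iterable of (s, p, o) tuples
    types:     dict entity -> set of rdf:type classes
    domain_of: dict property -> rdfs:domain class
    range_of:  dict object property -> rdfs:range class
    disjoint:  set of frozenset({classA, classB}) disjointness axioms
    """
    violations = []
    for s, p, o in triples:
        dom = domain_of.get(p)
        if dom and any(frozenset({t, dom}) in disjoint for t in types.get(s, ())):
            violations.append(("domain", s, p, o))
        rng = range_of.get(p)
        if rng and any(frozenset({t, rng}) in disjoint for t in types.get(o, ())):
            violations.append(("range", s, p, o))
    return violations

# Usage example (hypothetical resource name for the studio):
triples = [("dbp:Kamen_Rider_J", "dbo:producer", "dbp:Some_Film_Studio")]
types = {"dbp:Kamen_Rider_J": {"dbo:Film"}, "dbp:Some_Film_Studio": {"dbo:Company"}}
range_of = {"dbo:producer": "dbo:Person"}
disjoint = {frozenset({"dbo:Person", "dbo:Company"})}
print(find_violations(triples, types, {}, range_of, disjoint))
# -> [('range', 'dbp:Kamen_Rider_J', 'dbo:producer', 'dbp:Some_Film_Studio')]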
For the violation of the domain of an RDF property the following solution variants are conceivable:
1. (D1) Map the template property onto another ontology property.
2. (D2) Remove the classes’ disjointness axiom.
3. (D3) Change the domain of the ontology property to owl:Thing.
In some cases the inconsistency results from an incorrectly used property (D1), i. e. the property was intended for and is usually used in a different context. In these cases the template property of an infobox has been mapped in the DBpedia mappings wiki onto an ontology property whose meaning does not conform to the meaning of the template property. In order to solve these inconsistencies, appropriate properties have to be determined whose domain and range fit the rdf:type of the subject and the rdf:type of the object of the inconsistent triple.

Moreover, the class that represents the rdf:type of the subject and the class that represents the domain of the property might not actually be disjoint (D2). Consequently, the removal of the disjointness axiom will be suggested.

Another reason for the inconsistencies might be that the ontology property is generic (D3), i. e. it can be applied in various domains and no distinct class can be determined as rdfs:domain. Due to the fact that only atomic classes can be specified in the DBpedia mappings wiki, only owl:Thing qualifies as a feasible rdfs:domain value.
For the violation of the range of a property the following suggestions are generated:
1. (R1) Create a link to the appropriate article in the article’s infobox.
2. (R2) Delete the value or associate the value with another template property.
3. (R3) Map the template property onto another ontology property.
4. (R4) Remove the classes’ disjointness axiom.
5. (R5) Change the range of the ontology property to owl:Thing.
One reason for a range violation is that a linked article in a value of an infobox is confused with the actually intended article (R1). This happens due to the existence of homonyms or similar spellings of terms which identify different real-world entities or abstract concepts. In order to resolve these ambiguities, disambiguation pages (http://en.wikipedia.org/wiki/Wikipedia:Disambiguation) exist that list all articles with the same or a similar spelling. This information, which can be obtained by applying the DisambiguationExtractor, is used for the validation of a potentially correct article. For each article whose title has a similar spelling to the one provoking the range violation, it is verified whether the rdf:type of the corresponding DBpedia entity suits the range of the property that belongs to the validated triple. In case of a successful verification the article is considered correct.
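
The selection of link candidates for suggestion R1 can be illustrated as follows. This is a hypothetical sketch: the disambiguation map stands in for the output of the DisambiguationExtractor, and "suits the range" is approximated here by "not disjoint from the range", which is an assumption of this sketch rather than the exact check used by the authors.

def suggest_link_correction(obj, prop, disambiguation, types, range_of, disjoint):
    """Suggest alternative link targets for a range-violating infobox value (R1).

    disambiguation: dict entity -> list of candidate entities from its disambiguation page
    Returns candidates whose rdf:type is compatible with (not disjoint from) the
    property's rdfs:range.
    """
    rng = range_of.get(prop)
    suggestions = []
    for candidate in disambiguation.get(obj, []):
        candidate_types = types.get(candidate, set())
        compatible = rng is None or not any(
            frozenset({t, rng}) in disjoint for t in candidate_types
        )
        if compatible and candidate_types:
            suggestions.append(candidate)
    return suggestions

# Hypothetical example modeled on the John Jarratt case from Section 5.2:
disambiguation = {"dbp:Australia_%282008_film%29": ["dbp:Australia", "dbp:Australia_%282008_film%29"]}
types = {"dbp:Australia": {"dbo:Country"}, "dbp:Australia_%282008_film%29": {"dbo:Film"}}
range_of = {"dbo:birthPlace": "dbo:Place"}
disjoint = {frozenset({"dbo:Film", "dbo:Place"})}
print(suggest_link_correction("dbp:Australia_%282008_film%29", "dbo:birthPlace",
                              disambiguation, types, range_of, disjoint))
# -> ['dbp:Australia']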
Furthermore, the case might occur that information does not fit the template property within an infobox (R2). Deleting the value or associating the value with another, more appropriate template property might eliminate the inconsistency.

The three remaining correction suggestions (R3, R4, R5) are created in a way comparable to the violation of the domain of a property.
5. EVALUATION
According to the two-step approach, the evaluation has been performed separately for the enrichment task and the detection task.

5.1 Enrichment
The enrichment of the DBpedia ontology as described in Section 3 embraces the domain and range of properties and the disjointness of classes. The evaluation of the suggested metrics has been carried out by means of random samples. Statistical features provide information on whether the random samples are sufficient to express representative statements about the population. The mapping-based dataset of the English DBpedia in version 3.7 has been used as a basis for the evaluation.
For all properties occurring in the ABox, a domain restriction has been determined with the aid of the metric $md_{p,c}$: either a class of the DBpedia ontology or the class owl:Thing. Out of the 1,363 properties, 5% are randomly chosen and the correctness of their classification is checked. (All validations have been performed by the authors. Agreements have been reached by referring to the respective template and property description pages in Wikipedia and DBpedia, and by majority vote.) Table 1 shows the results of the random sample.

Table 1: Results of the classification of the domain of DBpedia properties

  N      n    t_p   f_p   pr      95% confidence interval
  1,363  68   67    1     0.985   [0.957, 0.999]
In this context, N represents the population, n the sample size, t_p the number of correctly classified properties, f_p the number of wrongly classified properties, and pr the estimated value for the precision. The confidence interval indicates that 95 times out of 100 the actual precision, which refers to the population, lies within the stated interval boundaries. The results show that if a domain is suggested, only approximately one out of 100 suggestions is incorrect.
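
The paper does not state which interval construction was used for the reported bounds. Purely as an illustration of how such a binomial confidence interval can be obtained, the sketch below computes a 95% Wilson score interval for the Table 1 sample (67 correct out of 68); this particular construction yields bounds that differ from those reported in the table.

import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion (illustrative choice;
    not necessarily the construction used in the paper)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Sample from Table 1: 67 correctly classified properties out of 68.
low, high = wilson_interval(67, 68)
print(f"precision = {67/68:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")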
Such a result is due to the fact that the class of the entity that occurs in the subject of a triple arises from the infobox of the corresponding Wikipedia article. The only false positive example within the sample is caused by a general problem of both the metrics $md_{p,c}$ and $mr_{p,c}$. The property dbo:productionStartYear denotes the year in which the production of a thing started. While it is commonly used in relation to entities of the class dbo:MeanOfTransportation, it occurs infrequently with entities of the class dbo:AutomobileEngine. Since $\max md_p \geq \tau_{md}$ holds for the property, the domain is defined as dbo:MeanOfTransportation. In practice, the property is a generic one and its domain should be owl:Thing.
In Section 3 a range restriction has been assigned to 592 properties. Approximately 10% of these properties have been randomly chosen for manual verification of the classification. Table 2 shows the results of the random sample.

Table 2: Results of the classification of the range of DBpedia properties

  N    n    t_p   f_p   pr      95% confidence interval
  592  59   51    8     0.864   [0.781, 0.948]
The estimated precision is pr ≈ 0.864. The lower result in comparison with the precision regarding the domain can be explained by the diversity of information stated in the values of the Wikipedia infoboxes. In some cases it is not determinable which kind of information should be associated with a certain template property. Often there is no information on the description page of an infobox. Another problem is that some entities do not belong to a more specific rdf:type than owl:Thing. Thus, the range defined by the metric $mr_{p,c}$ is somewhat skewed.
Out of the 37,091 class pairs that have been declared as disjoint classes, approximately 0.5% of the pairs are taken as a random sample. Subsequently, their classification is checked. Table 3 shows the result of the random sample.

Table 3: Results of the classification of the disjointness of two classes

  N       n     t_p   f_p   pr      95% confidence interval
  37,091  185   183   2     0.989   [0.974, 1.0]
The estimated precision for the classification of class disjointness amounts to pr ≈ 0.989, which depends heavily on the chosen threshold $\tau_{sim}$. A higher threshold results in the declaration of more actually disjoint classes, but also has negative consequences for the precision value. The two false positive examples relate to potential exceptions. For instance, it is conceivable that a dbo:FigureSkater aims at a career as a dbo:Politician after the career as a sportsman. The fact that the entities of both classes infrequently occur with the same properties in the ABox leads to a low similarity value $sim_{c_i,c_j} \approx 0.056$.
5.2 Detection
The consistency check examined 3,110,392 entities, of which the majority of 3,060,898 resources proved to be consistent. The remaining 49,494 entities have been classified as inconsistent, exhibiting 60,602 inconsistencies. In 12,218 cases the inconsistency results from a domain restriction violation of a used property, while 40,404 inconsistencies result from range restriction violations.

For the violations of both the domain and the range restrictions, a sample of 100 inconsistencies each has been reviewed manually in order to determine which of the proposed correction suggestions leads to a reasonable removal of the inconsistency. Figure 2 shows the ratio of the different suggestions which have been applied to eliminate domain restriction violations.

Figure 2: Proportion of the correction suggestions leading to the removal of the domain restriction violations
The actual reason for the majority of the inconsistencies is an erroneous definition of the TBox of the ontology. In two thirds of all cases a more specific class than owl:Thing is declared as a domain, whereas owl:Thing would be the correct one. In another four cases the axiom declaring a class pair as disjoint is wrong. Explanations concerning flaws in the DBpedia ontology have been discussed earlier in the evaluation. In 30 cases the correction suggestion which recommends changing the mapping from the template property to the ontology property leads to the removal of the inconsistency. For example, the property dbo:division has the class dbo:Species as a domain and therefore should specify the division of a species. But the property is similarly used for representing the branch of a company. Consequently the domain of the property could be set to owl:Thing, but there exists a semantic difference whether the division of a species or the branch of a company is meant. Thus the semantically correct solution is to map each template property onto a separate ontology property.
Figure 3 shows the ratio of the different suggestions which are applied in order to eliminate range restriction violations.

Figure 3: Proportion of the correction suggestions leading to the removal of the range restriction violations

Comparable to domain restriction violations, some inconsistencies originate from a flawed DBpedia ontology. In 28
cases the mapping from a template property onto an ontology property is wrong; a corresponding example has been discussed in the evaluation of the domain restriction violations. In 27 cases the deletion of the value or the association of the value with another template property leads to the elimination of the inconsistency. For example, the movie Kamen Rider J uses the infobox film, in which the template property producer refers to a company. The template property producer maps onto the ontology property dbo:producer, whose range is dbo:Person. Due to the fact that the class of all companies is disjoint with the class of all persons, an inconsistency occurs. Additionally, the infobox also contains the template property studio, which serves for stating the companies that produce a movie. Accordingly, this is a fitting template property with which the company could be associated. Sometimes such a well-fitting template property does not exist. In this case only the deletion of the value is conducive, which can involve the loss of relevant data. In two cases the linking of a suggested article in the infobox value leads to the removal of the inconsistency. The suggested article has been determined by the disambiguation page of the article linked in the infobox value. For example, the Australian actor dbp:John_Jarratt has the movie dbp:Australia_%282008_film%29 as a birthplace, whereas the country dbp:Australia would be the correct value. The following explanations refer to corrections that cannot be accomplished by means of the generated suggestions. In Figure 3 they are denoted by the category other. Frequently the actually correct article cannot be suggested for linking in the infobox, since no corresponding disambiguation page exists. In other cases the actual article cannot be linked as an infobox value, since it simply does not exist in Wikipedia. Instead, a related article is linked, which can lead to an inconsistency.
6. CONCLUSION AND OUTLOOK
In this paper an approach has been presented to identify inconsistent triples during the extraction process of the DBpedia dataset, which may lead to a higher-quality extraction. To this end, the DBpedia ontology has been improved by extracting further axioms concerning properties’ domain and range restrictions as well as class disjointness from the original DBpedia dataset. The applied methods performed with reasonably high precision, which allows the enriched ontology (accessible via http://purl.org/hpi/dbpediaenriched.owl) to be used for other purposes, too.
For now, the correction of inconsistencies needs to be performed manually according to the suggestions. Therefore, a user interface for managing the correction suggestions will be provided that might report supplementary information, for instance the solution that minimizes the number of remaining inconsistencies, to support the decision of the user. To decide automatically which triple is responsible for inducing an inconsistency, statistical methods are conceivable, but further research is needed.

The realized experiments are based on the dump of DBpedia version 3.7, but are also applicable to succeeding versions. For future application the error detection needs to be adapted to the DBpedia Live extraction process [14], which provides online updates of DBpedia according to the changes in the Wikipedia articles and mappings. In such a scenario, an iterative approach can be followed to obtain a consistent DBpedia incrementally.
7. REFERENCES
[1] Hogan, A., Harth, A., Passant, A., Decker, S., Polleres, A.: Weaving the pedantic web. In: Linked Data on the Web Workshop (LDOW 2010) at WWW 2010. Volume 628 of CEUR Workshop Proceedings (2010) 30–34
[2] Demter, J., Auer, S., Martin, M., Lehmann, J.: LODStats – an extensible framework for high-performance dataset analytics. (To appear)
[3] Péron, Y., Raimbault, F., Ménier, G., Marteau, P.F.: On the detection of inconsistencies in RDF data sets and their correction at ontological level. In: Proceedings of the 10th International Semantic Web Conference (ISWC 2011). (2011)
[4] Völker, J., Niepert, M.: Statistical schema induction. In Grobelnik, M., Simperl, E., eds.: Proceedings of the 8th Extended Semantic Web Conference (ESWC 2011). ESWC’11, Heraklion, Crete, Greece, Springer (2011) 124–138
[5] Fleischhacker, D., Völker, J.: Inductive learning of disjointness axioms. In: On the Move to Meaningful Internet Systems: OTM 2011. Volume 7045. Springer (2011) 680–697
[6] Lehmann, J.: DL-Learner: learning concepts in description logics. Journal of Machine Learning Research (JMLR) 10 (2009) 2639–2642
[7] Lehmann, J., Bühmann, L.: ORE – a tool for repairing and enriching knowledge bases. In: Proceedings of the 9th International Semantic Web Conference (ISWC 2010). Volume 6497 of Lecture Notes in Computer Science, Berlin/Heidelberg, Springer (2010) 177–193
[8] Haase, P., Völker, J.: Ontology learning and reasoning – dealing with uncertainty and inconsistency. In Costa, P.C., D’Amato, C., Fanizzi, N., Laskey, K.B., Laskey, K.J., Lukasiewicz, T., Nickles, M., Pool, M., eds.: Uncertainty Reasoning for the Semantic Web I. Volume 5327. Springer, Berlin, Heidelberg (2008) 366–384
[9] Haase, P., van Harmelen, F., Huang, Z., Stuckenschmidt, H., Sure, Y.: A framework for handling inconsistency in changing ontologies. In: Proceedings of the 4th International Semantic Web Conference (ISWC 2005). Volume 3729. Springer (2005) 353–367
[10] Waitelonis, J., Ludwig, N., Knuth, M., Sack, H.: WhoKnows? Evaluating linked data heuristics with a quiz that cleans up DBpedia. International Journal of Interactive Technology and Smart Education (ITSE) 8(3) (2011) 236–248
[11] Knuth, M., Hercher, J., Sack, H.: Collaboratively patching linked data. In: Proceedings of the 2nd International Workshop on Usage Analysis and the Web of Data (USEWOD 2012), co-located with the 21st International World Wide Web Conference (WWW 2012), Lyon, France (April 2012)
[12] DBpedia: DBpedia 3.7 downloads (September 2011). Accessed Nov 28, 2011.
[13] Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (2008)
[14] Hellmann, S., Stadler, C., Lehmann, J., Auer, S.: DBpedia Live extraction. In: On the Move to Meaningful Internet Systems: OTM 2009. Volume 5871. Springer (2009) 1209–1223