Sieve: Linked Data Quality Assessment and Fusion
Pablo N. Mendes, Hannes Mühleisen, Christian Bizer
Web Based Systems Group
Freie Universität Berlin
Berlin, Germany, 14195
first.last@fu-berlin.de
ABSTRACT
The Web of Linked Data grows rapidly and already contains
data originating from hundreds of data sources. The quality
of data from those sources is very diverse, as values may be
out of date, incomplete or incorrect. Moreover, data sources
may provide conflicting values for a single real-world object.
In order for Linked Data applications to consume data
from this global data space in an integrated fashion, a num-
ber of challenges have to be overcome. One of these chal-
lenges is to rate and to integrate data based on their quality.
However, quality is a very subjective matter, and finding a
canonic judgement that is suitable for each and every task
is not feasible.
To simplify the task of consuming high-quality data, we
present Sieve, a framework for flexibly expressing quality as-
sessment methods as well as fusion methods. Sieve is inte-
grated into the Linked Data Integration Framework (LDIF),
which handles Data Access, Schema Mapping and Identity
Resolution, all crucial preliminaries for quality assessment
and fusion.
We demonstrate Sieve in a data integration scenario im-
porting data from the English and Portuguese versions of
DBpedia, and discuss how we increase completeness, con-
ciseness and consistency through the use of our framework.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous;
H.2.5 [Information Systems]: Database Management—
Heterogeneous databases
Keywords
Linked Data, RDF, Data Integration, Data Quality, Data
Fusion, Semantic Web
1. INTRODUCTION
The Web of Linked Data has seen an exponential growth
over the past five years (http://lod-cloud.net). From 12 Linked Data sets catalogued in 2007, the Linked Data cloud has grown to almost 300 data sets encompassing approximately 31 billion triples, according to the most recent survey conducted in September 2011 [10].
The information contained in each of these sources often
overlaps. In fact, there are approximately 500 million ex-
plicit links between data sets [10], where each link indicates
that one data set ‘talks about’ a data item from another
data set. Further overlapping information may exist, even
though no explicit links have been established yet. For in-
stance, two data sets may use different identifiers (URIs) for
the same real world objects (e.g. Bill Clinton has an identi-
fier in the English and the Portuguese DBpedia). Similarly,
two different attribute identifiers may be used for equivalent
attributes (e.g. both foaf:name and dbprop:name contain
the name ‘Bill Clinton’.)
Applications that consume data from the Linked Data
cloud are confronted with the challenge of obtaining a ho-
mogenized view of this global data space [8]. The Linked
Data Integration Framework (LDIF) was created with the
objective of supporting users in this task. LDIF is able to
conflate multiple identifiers of the same object into a canon-
ical URI (identity resolution), while mapping equivalent at-
tributes and class names into a homogeneous target repre-
sentation (schema mapping).
As a result of such a data integration process, multiple
values for the same attribute may be observed, e.g. originating from multiple sources. For attributes that only admit one value (e.g. the total area or population of a city), this represents a conflict for the consumer application to resolve.
With the objective of supporting user applications in dealing
with such conflicts, we created Sieve - Linked Data Quality
Assessment and Data Fusion.
Sieve is included as a module in LDIF, and can be cus-
tomized for user applications programmatically (through an
open source Scala API) and through configuration parame-
ters that describe users’ task-specific needs. Sieve includes a
Quality Assessment module and a Data Fusion module. The
Quality Assessment module leverages user-selected meta-
data as quality indicators to produce quality assessment
scores through user-configured scoring functions. The Data
Fusion module is able to use quality scores in order to per-
form user-configurable conflict resolution tasks.
In this paper we demonstrate Sieve through a data inte-
gration scenario involving the internationalized editions of
DBpedia, which extracts structured data from Wikipedia.
In this scenario we consume data from the English and Por-
tuguese DBpedia editions and wish to obtain data about
Brazilian municipalities that is more complete, concise, con-
sistent and up to date than in their original sources.
In Section 2 we describe the LDIF architecture, placing
the newly created Sieve modules in the larger context of
data integration. Subsequently, we explain the Quality As-
sessment (Section 3) and Data Fusion (Section 4) modules
in more detail. In Section 5 we describe our demonstra-
tion setup and in Section 6 we discuss the results. Finally,
in Section 7 we make final comments and point to future
directions.
2. LDIF ARCHITECTURE
The Linked Data Integration Framework (LDIF) [13] of-
fers a modular architecture to support a wide range of appli-
cations in producing a homogenized view over heterogeneous
data originating from diverse sources. The architecture of
LDIF is displayed in Figure 1.
Data sets are imported into LDIF through Web Data Ac-
cess Modules. Currently supported data access modules in-
clude the Triple/Quad Dump Import, which loads files en-
coded in all major RDF formats from the disk or the Web,
a Crawler Import which relies on the dereferenceability of
URIs to obtain RDF by navigating through the Web of
Linked Data, and finally the SPARQL Import, which allows
SQL-like queries to import data from SPARQL servers [7]
on the Web.
As data is imported, its provenance or lineage is also
recorded. Provenance data contains information about the
history of data items, for example their origins. For prove-
nance tracking, LDIF relies on the Named Graphs data model [6]. Information is stored as quadruples (quads) in the form <subject, predicate, object, graph>.

_:i0 ldif:importId "DBpedia.0" ldif:provenance .
_:i0 ldif:lastUpdate "2011-09-21T19:01:00-05:00"^^xsd:dateTime ldif:provenance .
_:i0 ldif:hasDatasource "DBpedia" ldif:provenance .
_:i0 ldif:hasImportType "quad" ldif:provenance .
_:i0 ldif:hasOriginalLocation "dbp_dump.nt.bz2" ldif:provenance .
Figure 2: Provenance metadata output by LDIF.
The fourth element of the quad is interpreted as a graph
identifier, a way to refer to a group of triples. LDIF is ag-
nostic to the provenance model used, allowing users to at-
tach arbitrary provenance metadata to their named graphs.
While performing import jobs, LDIF also automatically adds
its own provenance metadata that record time of import,
URL source, etc., as displayed in Figure 2. Note that URIs in the figure have been shortened for brevity.
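To make the quad model concrete, the following sketch shows how provenance metadata like that of Figure 2 can be represented as plain quadruples. The Quad case class and the helper below are illustrative assumptions for this discussion, not part of the LDIF or Sieve code base.

object ProvenanceSketch {
  // Minimal sketch of the Named Graphs quad model used for provenance tracking.
  case class Quad(subject: String, predicate: String, obj: String, graph: String)

  // Provenance statements are themselves quads, stored in a dedicated
  // provenance graph (assumed here to be named "ldif:provenance").
  def importMetadata(importId: String, source: String, lastUpdate: String): Seq[Quad] = {
    val node = "_:" + importId
    Seq(
      Quad(node, "ldif:importId", importId, "ldif:provenance"),
      Quad(node, "ldif:hasDatasource", source, "ldif:provenance"),
      Quad(node, "ldif:lastUpdate", lastUpdate, "ldif:provenance")
    )
  }

  def main(args: Array[String]): Unit =
    importMetadata("DBpedia.0", "DBpedia", "2011-09-21T19:01:00-05:00").foreach(println)
}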
Since data from multiple sources may use different vocab-
ularies to describe overlapping information, LDIF includes
a Schema Mapping step, which relies on the R2R Frame-
work [2]. R2R alleviates schema heterogeneity by translat-
ing source schema element names, structure and values into
a user-configured target representation. Through its R2R
Mapping Language, the framework supports simple or more
complex (1-to-n and n-to-1) transformations. Moreover, it
is able to normalize different units of measurement, perform
string transformations or manipulate data types and lan-
guage tags.
Figure 1: LDIF Architecture. (The figure shows the three layers of LDIF: a Web of Data publication layer with sources such as RDF/XML dumps, databases exposed through LD wrappers and RDFa-annotated CMS pages, accessed via HTTP; a data access, integration and storage layer consisting of the Web Data Access, Vocabulary Mapping, Identity Resolution, Data Quality Assessment (Sieve) and Data Fusion (Sieve) modules; and an application layer that consumes the integrated Web data via SPARQL or an RDF API.)

Furthermore, since data may also include multiple identi-
fiers for the same real-world entity, LDIF also supports an
Identity Resolution step through Silk [14][9]. The Silk Link
Discovery Framework allows flexibility in the identity resolu-
tion heuristics through the declarative Silk Link Specification Language (Silk-LSL). In order to specify the condition
which must hold true for two entities to be considered a du-
plicate, the user may apply different similarity metrics to
property values of an entity or related entities. The cho-
sen similarity scores can also be combined and weighted
using various similarity aggregation functions. The result
of the identity resolution step is a set of equivalence links
(owl:sameAs) between URIs.
As an additional normalization step, LDIF’s URITrans-
lator module can homogenize all URI aliases identified as
duplicates by Silk, grouping all properties of those URIs
into one canonical target URI. For instance, suppose that
we have 3 URIs identified as duplicates, with 2 properties
each. The URITranslator will be able to produce one canon-
ical URI with 6 property values. The choice of which URI
will represent the cluster of duplicates can be done through a
string pattern (URI minting) or through heuristics to choose
between the existing URIs. The current implementation
uses a ‘readability’ heuristic that prefers URIs with a higher proportion of letters. After replacing the original URIs with a
canonical one, URITranslator adds owl:sameAs links point-
ing at the original URIs, which makes it possible for applica-
tions to refer back to the original data sources on the Web.
If the LDIF input data already contains owl:sameAs links,
the referenced URIs are also normalized accordingly.
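The following sketch illustrates the effect of this normalization on a set of quads, assuming a precomputed mapping from URI aliases to the chosen canonical URI. It is an illustration of the idea only, not the URITranslator implementation; the graph name used for the added owl:sameAs links is likewise an assumption.

object UriTranslationSketch {
  case class Quad(s: String, p: String, o: String, g: String)

  // Rewrite all subjects that are known aliases to their canonical URI and add
  // owl:sameAs links pointing back at the original URIs.
  def canonicalize(quads: Seq[Quad], aliasToCanonical: Map[String, String]): Seq[Quad] = {
    val rewritten = quads.map(q => q.copy(s = aliasToCanonical.getOrElse(q.s, q.s)))
    val backLinks = aliasToCanonical.toSeq.map { case (alias, canonical) =>
      Quad(canonical, "owl:sameAs", alias, "ldif:provenance") // graph name assumed
    }
    rewritten ++ backLinks
  }
}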
As a result of the aforementioned data integration steps,
the schema and object identifiers will be normalized accord-
ing to user configuration, but the property values originating
from different sources may be of low quality, redundant or
conflicting. In order to alleviate this problem, we developed
Sieve as a LDIF module to perform quality assessment and
data fusion. Sieve assumes that data has been normalized
through schema mapping and identity resolution. Therefore,
as a result of this normalization process:
– if two descriptions refer to the same object in the real world, then these descriptions should have been assigned the same URI;
– if two sources provide values for the same real-world attribute, then these values should have been assigned to a single property of the canonical URI, resulting in one property with two values.
3. DATA QUALITY
A popular definition for quality is “fitness for use” [11].
Therefore, the interpretation of the quality of some data
item depends on who will use this information, and what is
the task for which they intend to employ it. While one user
may consider the data quality sufficient for a given task,
it may not be sufficient for another task or another user.
Moreover, quality is commonly perceived as multifaceted,
as the “fitness for use” may depend on several dimensions
such as accuracy, timeliness, completeness, relevancy, objec-
tivity, believability, understandability, consistency, concise-
ness, availability, and verifiability [1].
These information quality dimensions are not independent
of each other and typically only a subset of the dimensions
is relevant in a specific situation. Which quality dimensions
are relevant and which levels of quality are required for each
dimension is determined by the specific task at hand and
the subjective preferences of the information consumer [12].
In Sieve, the quality assessment task is realized through a
flexible module, where the user can choose which character-
istics of the data indicate higher quality, how this quality is
quantified and how it should be stored in the system. This
is enabled by a conceptual model composed of assessment
metrics, indicators and scoring functions [1].
Assessment Metrics are procedures for measuring an
information quality dimension. In our model, each assess-
ment metric relies on a set of quality indicators and cal-
culates an assessment score from these indicators using a
scoring function.
A Data Quality Indicator is an aspect of a data item
or data set that may give an indication to the user of the
suitability of the data for some intended use. The types of
information which may be used as quality indicators are very
diverse. Besides the information to be assessed itself, indi-
cators may stem from meta-information about the circumstances in which information was created, from background information about the information provider, or from ratings provided by the information consumer themselves, other information consumers, or domain experts.
A Scoring Function computes a score from one or more data quality indicators, supporting the user in the process of deciding on the suitability of the data for some intended use.
There may be a choice of several scoring functions for pro-
ducing a score based on a given indicator. Depending on
the quality dimension to be assessed and the chosen quality
indicators, scoring functions range from simple comparisons,
like “assign true if the quality indicator has a value greater
than X”, through set functions, like “assign true if the indicator is in the set Y”, and aggregation functions, like “count or sum up all indicator values”, to more complex statistical functions,
text-analysis, or network-analysis methods.
An Aggregate Metric is a user-specified aggregate as-
sessment metric built out of individual assessment metrics.
These aggregations produce new assessment values through
the average, sum, max, min or threshold functions applied to
a set of assessment metrics. Aggregate assessment metrics
are better visualized as trees, where an aggregation function
is applied to the leaves and combined up the tree until a
single value is obtained. The functions to be applied at each
branch are specified by the users.
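One way to picture this conceptual model is as a small type hierarchy in which assessment metrics are the leaves of a tree and aggregate metrics are its inner nodes. The sketch below is a minimal illustration under that reading; the names and signatures are assumptions, not the Sieve Scala API.

object QualityModelSketch {
  // A scoring function turns the raw values of a quality indicator into a real-valued score.
  trait ScoringFunction {
    def score(indicatorValues: Seq[String]): Double
  }

  sealed trait Metric {
    def assess(indicators: Map[String, Seq[String]]): Double
  }

  // Leaf: reads one indicator (e.g. provenance:lastUpdated) and applies a scoring function.
  case class AssessmentMetric(indicator: String, fn: ScoringFunction) extends Metric {
    def assess(indicators: Map[String, Seq[String]]): Double =
      fn.score(indicators.getOrElse(indicator, Seq.empty))
  }

  // Inner node: combines child scores with a function such as average, sum, max or min.
  case class AggregateMetric(children: Seq[Metric], combine: Seq[Double] => Double) extends Metric {
    def assess(indicators: Map[String, Seq[String]]): Double =
      combine(children.map(_.assess(indicators)))
  }
}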
3.1 Sieve Quality Assessment Module
In Sieve, users have the flexibility of defining relevant in-
dicators and respective scoring functions for their specific
quality assessment task. A number of scoring functions are
provided with Sieve which can be configured for a specific
task through an XML file. Moreover, users can extend the
current list of scoring functions and configuration options by
implementing a simple programmatic interface that takes in
a list of metadata values and outputs a real-valued score.
The complete list of supported scoring functions is available from the Sieve specification website (http://www4.wiwiss.fu-berlin.de/bizer/sieve/). In this paper we have used:

TimeCloseness – measures the distance between the input date from the provenance graph and the current date, with more recent data receiving scores closer to 1

Preference – assigns decreasing scores to each graph URI provided in the configuration
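To make the behaviour of these two functions concrete, the sketch below gives one possible reading of their scoring logic. The linear decay over the configured time span and the evenly spaced preference scores are assumptions for illustration, not Sieve's exact normalization.

import java.time.{Duration, ZonedDateTime}

object ScoringSketch {
  // TimeCloseness: more recently updated graphs score closer to 1; dates older
  // than the configured time span (in days) score 0.
  def timeCloseness(lastUpdated: String, timeSpanDays: Long, now: ZonedDateTime): Double = {
    val ageDays = Duration.between(ZonedDateTime.parse(lastUpdated), now).toDays.toDouble
    math.max(0.0, 1.0 - ageDays / timeSpanDays)
  }

  // Preference: graph URIs listed earlier in the priority list score higher.
  def preference(graphUri: String, priority: Seq[String]): Double = {
    val i = priority.indexWhere(p => graphUri.startsWith(p))
    if (i < 0) 0.0 else 1.0 - i.toDouble / priority.size
  }

  def main(args: Array[String]): Unit = {
    val now = ZonedDateTime.parse("2012-03-26T00:00:00Z")
    println(timeCloseness("2011-09-21T19:01:00-05:00", timeSpanDays = 365, now = now))
    println(preference("http://pt.wikipedia.org", Seq("http://pt.wikipedia.org", "http://en.wikipedia.org")))
  }
}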
Through the Sieve Quality Assessment Specification Lan-
guage, users can define AssessmentMetric elements that use a
specific ScoringFunction implementation to generate a score
based on a given indicator and parameters provided in the
Input element. Multiple AssessmentMetric elements can be
composed into an AggregatedMetric element, yielding a score
which is an aggregation of the component scores.
Listing 1 shows a quality assessment policy that outputs
scores for the dimensions “Recency” and “Reputation”. In
the Recency dimension, Sieve will use the TimeCloseness
function in order to measure the distance between two dates:
a date input as an indicator, and the current date. In this
case, the configuration used the property lastUpdated as in-
dicator (line 6), and a timeSpan parameter in days (line 5) to normalize the scores. The function outputs a score between 0
and 1, where values closer to 1 indicate that the two dates
are very close. The computed score will be output as value
for the sieve:recency property. Consequently, the more re-
cently updated graphs are ranked higher by this function.
In the Reputation dimension, Sieve will take a parameter
list (line 11) with a space-separated list of graphs to trust.
These graphs will be scored by the function Preference from
higher to lower priority in order of appearance in the list
parameter. Consequently, in the example at hand, values
originating from the Portuguese DBpedia version will take
higher priority over those originating from the English ver-
sion (line 12).
1  <Sieve>
2   <QualityAssessment>
3    <AssessmentMetric id="sieve:recency">
4     <ScoringFunction class="TimeCloseness">
5      <Param name="timeSpan" value="7"/>
6      <Input path="?GRAPH/provenance:lastUpdated"/>
7     </ScoringFunction>
8    </AssessmentMetric>
9    <AssessmentMetric id="sieve:reputation">
10    <ScoringFunction class="ScoredList">
11     <Param name="priority"
12       value="http://pt.wikipedia.org http://en.wikipedia.org"/>
13    </ScoringFunction>
14   </AssessmentMetric>
15  </QualityAssessment>
16 </Sieve>
Listing 1: Sieve Data Quality Assessment Specification Example
Users can also aggregate multiple scores into a new ag-
gregated metric. In our running example, one could have
chosen to aggregate sieve:recency and sieve:reputation into
an indicator, say sieve:believability, which may use the func-
tion TieBreaker(sieve:recency, sieve:reputation) to rank scores
by recency first, and in case two articles have the same re-
cency, assign a higher score to the most reputable source.
We refer the reader to a detailed explanation of the quality
assessment specification language at our project website (http://www4.wiwiss.fu-berlin.de/bizer/sieve/).
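As an illustration, the sketch below ranks graphs by recency first and breaks ties by reputation. This is an assumed reading of the TieBreaker behaviour sketched above, not the Sieve implementation.

object TieBreakerSketch {
  // scores: graph URI -> (recency, reputation), both in [0,1]
  def rank(scores: Map[String, (Double, Double)]): Seq[String] =
    scores.toSeq
      .sortBy { case (_, (recency, reputation)) => (-recency, -reputation) }
      .map(_._1)

  def main(args: Array[String]): Unit = {
    val scores = Map(
      "enwiki:Juiz_de_Fora" -> (0.4, 0.9),
      "ptwiki:Juiz_de_Fora" -> (0.8, 0.45)
    )
    // ptwiki is ranked first: its higher recency wins before reputation is consulted.
    println(rank(scores))
  }
}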
The output of the quality assessment module is a set
of quads, where the calculated scores are associated with
each graph. An example is shown in Figure 3, where enwiki:Juiz_de_Fora is the URI for a graph grouping all triples extracted from the English Wikipedia page for the Brazilian municipality of Juiz de Fora.

enwiki:Juiz_de_Fora sieve:recency "0.4" ldif:provenance .
ptwiki:Juiz_de_Fora sieve:recency "0.8" ldif:provenance .
enwiki:Juiz_de_Fora sieve:reputation "0.9" ldif:provenance .
ptwiki:Juiz_de_Fora sieve:reputation "0.45" ldif:provenance .
Figure 3: Quality assessment metadata output by Sieve

These scores represent the user-configured interpretation of quality and can be consumed downstream by applications that will rank, filter or
transform data based on their judged quality. For instance,
these scores will be used subsequently by the Data Fusion
module when deciding how to fuse quads based on the meta-
data associated with each graph.
4. DATA FUSION
In the context of data integration, Data Fusion is defined
as the “process of fusing multiple records representing the
same real-world object into a single, consistent, and clean
representation” [5]. Data Fusion is commonly seen as a third
step following schema mapping and identity resolution, as a
way to deal with conflicts that either already existed in the
original sources or were generated by integrating them. For
instance, by mapping two equivalent attributes from differ-
ent schemata, the system may generate one canonical at-
tribute with two different values. Similarly, the identity res-
olution step may collapse two object identifiers into one,
requiring that applications deal with the multiple attribute
values originating from each data source.
Our data fusion framework is inspired by the work of Blei-
holder and Naumann [3]. They described a framework for
data fusion in the context of relational databases that in-
cludes three major categories of conflict handling strategies:
Conflict-ignoring strategies, which defer conflict reso-
lution to the user. For instance, the strategy PassItOn
simply relays conflicts to the user or application con-
suming integrated data.
Conflict-avoiding strategies, which apply a unique de-
cision to all data. For instance, the strategy TrustY-
ourFriends prefers data from specific data sources.
Conflict-resolution strategies, which decide between ex-
isting data (e.g. KeepUpToDate, which takes the most
recent value), or mediate the creation of a new value
from the existing ones (e.g. Average).
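The sketch below illustrates the three categories on a list of conflicting values, each paired with its source graph and last-update time. The function bodies are simplified assumptions used only to contrast the strategies; they are not taken from [3] or from Sieve.

import java.time.ZonedDateTime

object ConflictStrategiesSketch {
  case class SourcedValue(value: Double, graph: String, lastUpdated: ZonedDateTime)

  // Conflict-ignoring: PassItOn relays the conflict to the consuming application.
  def passItOn(values: Seq[SourcedValue]): Seq[SourcedValue] = values

  // Conflict-avoiding: TrustYourFriends keeps only values from preferred sources.
  def trustYourFriends(values: Seq[SourcedValue], trusted: Set[String]): Seq[SourcedValue] =
    values.filter(v => trusted.contains(v.graph))

  // Conflict-resolving (deciding): KeepUpToDate picks the most recent value.
  def keepUpToDate(values: Seq[SourcedValue]): Option[SourcedValue] =
    if (values.isEmpty) None else Some(values.maxBy(_.lastUpdated.toEpochSecond))

  // Conflict-resolving (mediating): Average creates a new value from the existing ones.
  def average(values: Seq[SourcedValue]): Option[Double] =
    if (values.isEmpty) None else Some(values.map(_.value).sum / values.size)
}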
In contrast to their framework, we provide a stricter sepa-
ration of data quality assessment functions and fusion func-
tions. In our framework, the function TrustYourFriends is expressed by combining two aspects: one aspect of quality assessment, the assignment of higher ‘reputation’ scores to some sources; and one aspect of data fusion, preferring values with the highest scores for a given indicator (in this case, reputation). Sim-
ilarly, the fusion function KeepUpToDate can be expressed
in our framework by preferring values with higher scores in
the “Recency” indicator.
In Sieve, fusion functions are basically of two types. Fil-
ter Functions (deciding strategies) remove some or all values
from the input, according to some quality metric. One ex-
ample filter function is: keep the value with the highest score
for a given metric. Transform Functions (mediating strate-
gies) operate over each value in the input, generating a new
list of values built from the initially provided ones.

Figure 4: Illustration of the Data Fusion process. Example data originating from the DBpedia Extraction from the English and Portuguese Wikipedia editions are sent through Sieve, generating a cleaner representation. (In the depicted example, the two editions disagree on dbpedia-owl:areaTotal for Juiz_de_Fora (1437 km2 in English vs. 1436.850 km2 in Portuguese), only the English edition provides dbpedia-owl:elevation (678 m), and the Portuguese article was edited more recently (08/Nov/2011 vs. 05/Oct/2011) and more often in the last year (167 vs. 17 changes); the fused dbpedia:Juiz_de_Fora keeps areaTotal 1437 km2 and elevation 678 m.)

A wide range of fusion functions has been proposed [4] and is being implemented in Sieve, including:
Filter removes all values for which the input quality
assessment metric is below a given threshold
KeepSingleValueByQualityScore keeps only the value
with the highest quality assessment
Average, Max, Min takes the average, chooses the
maximum, or minimum of all input values for a given
numeric property
First, Last, Random – takes the first, last or the element
at some random position for a given property
PickMostFrequent selects the value that appears most frequently in the list of conflicting values
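To make the distinction between deciding and mediating functions concrete, the sketch below implements simplified versions of Filter, KeepSingleValueByQualityScore and Average over source-annotated values, with quality scores keyed by graph URI as in Figure 3. The data structures and signatures are assumptions for illustration, not the Sieve API.

object FusionFunctionsSketch {
  case class SourcedValue(value: String, graph: String)

  // Filter (deciding): drop values whose graph scored below a threshold for the chosen metric.
  def filter(values: Seq[SourcedValue], scores: Map[String, Double], threshold: Double): Seq[SourcedValue] =
    values.filter(v => scores.getOrElse(v.graph, 0.0) >= threshold)

  // KeepSingleValueByQualityScore (deciding): keep only the value from the best-scored graph.
  def keepSingleValueByQualityScore(values: Seq[SourcedValue], scores: Map[String, Double]): Option[SourcedValue] =
    if (values.isEmpty) None else Some(values.maxBy(v => scores.getOrElse(v.graph, 0.0)))

  // Average (mediating): build a new numeric value from all parsable inputs.
  def average(values: Seq[SourcedValue]): Option[Double] = {
    val nums = values.flatMap(v => scala.util.Try(v.value.toDouble).toOption)
    if (nums.isEmpty) None else Some(nums.sum / nums.size)
  }

  def main(args: Array[String]): Unit = {
    val reputation = Map("enwiki:Juiz_de_Fora" -> 0.9, "ptwiki:Juiz_de_Fora" -> 0.45)
    val areaTotal = Seq(SourcedValue("1437.0", "enwiki:Juiz_de_Fora"), SourcedValue("1436.850", "ptwiki:Juiz_de_Fora"))
    println(keepSingleValueByQualityScore(areaTotal, reputation)) // keeps the value from the higher-scored graph
  }
}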
The complete list of supported fusion functions is available
from the Sieve specification website (http://www4.wiwiss.fu-berlin.de/bizer/sieve/).
4.1 Sieve Data Fusion Module
Similarly to the quality assessment module, the data fu-
sion module can also be configured through XML. The Sieve
Data Fusion Specification language takes a property-centric
perspective. The user has the flexibility of deciding what ac-
tion to perform for each property of a class that requires data
fusion. Actions range from ignoring conflicts (e.g. PassItOn)
to filtering out information (e.g. Filter) based on quality in-
dicators, and can include also value transformations (e.g.
Average).
The quality indicators used for deciding on which data
fusion operation to perform are provided by the quality as-
sessment module. The data fusion component takes as input
a set of metadata quads (containing quality information),
a set of entity descriptions containing the properties to be
fused and a data fusion specification.
1  <Sieve>
2   <Fusion>
3    <Class name="dbpedia:Settlement">
4     <Property name="rdfs:label">
5      <FusionFunction class="PassItOn"/>
6     </Property>
7     <Property name="dbpedia-owl:areaTotal">
8      <FusionFunction class="KeepSingleValueByQualityScore"
9        metric="sieve:reputation"/>
10    </Property>
11    <Property name="dbpedia-owl:populationTotal">
12     <FusionFunction class="KeepSingleValueByQualityScore"
13       metric="sieve:recency"/>
14    </Property>
15   </Class>
16  </Fusion>
17 </Sieve>
Listing 2: Sieve Data Fusion Specification Example
Listing 2 shows a data fusion specification that acts on
every instance of dbpedia:Settlement and consolidates infor-
mation for the properties areaTotal and populationTotal from
the DBpedia Ontology. Both properties use the fusion func-
tion KeepSingleValueByQualityScore, which removes all but
the highest ranked value, where the ranking is determined
by a quality indicator calculated in the previous step. The
fusion for the areaTotal property takes the value with the highest sieve:reputation (line 9), while populationTotal takes the value with the highest sieve:recency (line 13). Values for rdfs:label are simply repeated to the output, as multiple values for this property are often useful for language customization in client applications.

Table 1: Properties and their occurrences for Brazilian municipalities in different DBpedia editions
property            only en      only pt       redundant    conflicting
areaTotal           2/5565       3562/5565     27/5565      378/5565
foundingDate        234/5565     58/5565       1/5565       0/5565
populationTotal     5/5565       3552/5565     47/5565      370/5565
5. DEMONSTRATION
In order to demonstrate the capabilities of Sieve, we have
collected data about municipalities in Brazil from different
Wikipedia editions, namely the English- and Portuguese-
language Wikipedia editions. According to the Brazilian
Institute for Geography and Statistics, there are 5,565 municipalities in Brazil (http://www.ibge.gov.br/cidadesat). However, this information may be ab-
sent from DBpedia due to incomplete or missing Wikipedia
infoboxes, missing mappings in the DBpedia Mapping Wiki
or irregularities in the data format that were not resolved by
the DBpedia Extraction Framework. Furthermore, a partic-
ular Wikipedia edition may be out of date or incorrect. See
Figure 4 for the schematics of our demonstration.
In this section we show how such problems can be al-
leviated by fusing data from multiple sources. We start by
describing the data sets employed, and subsequently present
the results obtained for (slight modifications of) the specifi-
cations presented in Listing 1 and Listing 2.
5.1 Data Sets
The data sets employed in our use case were obtained from
the English and Portuguese dumps of DBpedia 3.7 (http://dbpedia.org/Downloads37). We
collected mapping-based properties, inter-language links, in-
stance types and provenance information, which were then
fed as quads into Sieve through LDIF.
Target attributes.
For the sake of simplicity of explanation, we will focus our
discussion on a few properties. Namely, we have collected
the total population, total area and the founding date of
each municipality. Table 1 shows the distribution of val-
ues that occur only in DBpedia English but not in DBpedia
Portuguese (second column: only en), and vice-versa (third
column: only pt). The table also shows properties that oc-
curred in both DBpedia dumps (fourth column: redundant)
with the same values, and properties that occurred in both
data sets, but with different values (fifth column: conflict-
ing).
Inter-language Links.
Wikipedia pages offer links to other language editions in
the left-hand side menu in each article. We collected those
links and represented them as owl:sameAs links that were
provided as input to the LDIF engine. Through the identity
resolution module (Silk) and the URI translation module,
these identity links will be used to merge object descriptions
into one URI per object.
Instance Types.
Wikipedia editors commonly create so-called “list pages”,
which are used to organize collections of links to pages match-
ing certain criteria. We have used a page listing all Brazilian
municipalities (http://pt.wikipedia.org/wiki/Anexo:Lista_de_municípios_do_Brasil) to derive an extra source of instance type and
country location from the Wikipedia page. We used this set
as our target universe, that is, the complete set of URIs to
be obtained after integration.
Pre-processing provenance.
The DBpedia Extraction Framework tracks the source ar-
ticle from which each property was extracted. For the pur-
pose of this demonstration, we normalized the provenance
information from the extraction framework to associate every extracted property value with its original source page in Wikipedia. For each source page, we collected the last up-
date from the Wikipedia dump, and included it in the LDIF
provenance graph.
All of the data collected was serialized in the NQuads
format and fed into LDIF for processing, resulting in one
integrated NQuads output file. The next section discusses
the results of this integration process.
6. RESULTS
Data Integration is commonly applied in order to increase
data quality along at least three dimensions: completeness,
conciseness and consistency [5].
Completeness.
On the schema level, a data set is complete if it contains
all of the attributes needed for a given task. On the data
(instance) level, a data set is complete if it contains all of
the necessary objects for a given task. Naturally, the com-
pleteness of a data set can only be judged in the presence
of a task where the ideal set of attributes and objects are
known.
In the use case described in this paper, the task required
retrieving 3 attributes (areaTotal, foundingDate, population-
Total) for all 5565 objects (Brazilian municipalities). Ac-
cording to Bleiholder and Naumann [5], the extensional com-
pleteness (data level) can be measured in terms of the pro-
portion of target URIs found in the output (Equation 1),
while the intensional completeness (schema level) can be
measured by the proportion of target properties found in
the output (Equation 2).
extensional completeness = ||uniq. obj. in data set|| / ||all uniq. obj. in universe||    (1)

intensional completeness = ||uniq. attr. in data set|| / ||all uniq. attr. in universe||    (2)
Table 2: Impact of the data integration process on quality indicators.
                    Completeness(p)                    Conciseness(p)    Consistency(p)
Property p          en         pt         final        gain              gain
areaTotal           7.31%      71.28%     71.32%       +10.20%           +9.52%
foundingDate        4.22%      1.06%      5.27%        +0.34%            -
populationTotal     7.58%      71.32%     71.41%       +10.49%           +9.31%
For the original data sets, the English DBpedia contained
2097/5565 municipalities, which means that the extensional
completeness was 37% before the integration process, while
DBpedia Portuguese contained 3970/5565 municipalities (ex-
tensional completeness of 71%). After integration, 3979/5565
municipalities were found, increasing extensional completeness by 36% and 0.5% for DBpedia English and DBpedia Portuguese, respectively.
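As a quick numeric check of these figures, extensional completeness per Equation 1 is simply the ratio of objects found to objects in the target universe:

object ExtensionalCompletenessCheck {
  def extensionalCompleteness(objectsFound: Int, universeSize: Int): Double =
    objectsFound.toDouble / universeSize

  def main(args: Array[String]): Unit = {
    println(extensionalCompleteness(2097, 5565)) // English DBpedia: ~0.377 (37%)
    println(extensionalCompleteness(3970, 5565)) // Portuguese DBpedia: ~0.713 (71%)
    println(extensionalCompleteness(3979, 5565)) // integrated data set: ~0.715
  }
}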
In order to provide a more fine-grained analysis of the
impact of the integrated data set with regard to the original
sources, we have defined another measure of completeness
that takes into consideration the instantiations of properties.
That is, it measures the proportion of objects that contain
a value for a given property p, in relation to the universe of
objects (Equation 3).
Completeness(p) = ||obj. with property p in data set|| / ||all uniq. obj. in universe||    (3)
The Completeness(p) for both the English and Portuguese-language DBpedia editions is shown in Table 2 for each property in our use case (areaTotal, foundingDate and populationTotal). The percentages shown represent the
completeness of DBpedia English before integration (en),
DBpedia Portuguese before integration (pt), and complete-
ness after integration (final). As expected, the integration
process increased completeness for both data sets, with a particularly high increase (more than 9x) for DBpedia English
in the properties areaTotal and populationTotal. The
property foundingDate was actually more complete in DB-
pedia English, and provided an increase of roughly 4% in
completeness for DBpedia Portuguese.
Conciseness.
On the schema level, a data set is concise if it does not
contain redundant attributes (two equivalent attributes with
different names). On the data (instance) level, a data set is
concise if it does not contain redundant objects (two equiva-
lent objects with different identifiers). The extensional con-
ciseness measures the number of unique objects in relation
to the overall number of object representations in the data
set [5]. Similarly, intensional conciseness measures the num-
ber of unique attributes of a dataset in relation to the overall
number of attributes in a target schema [5]. The intensional
conciseness in our use case was 1, since both datasets used
the same schema. The extensional conciseness was also 1,
since both DBpedia editions only contained one URI per ob-
ject. Similarly to the case of completeness, we have defined
a finer-grained conciseness metric for a given property p: the proportion of objects that do not contain more than one identical (redundant) value for p, relative to all objects that have a value for p (Equation 4).
Conciseness(p) = ||obj. with uniq. values for p in data set|| / ||all uniq. obj. with p in data set||    (4)
The increase in Conciseness(p) for each of the proper-
ties p in our use case is shown in Table 2. The properties
areaTotal and populationTotal were 89.80% and 89.51%
concise, respectively, while foundingDate was 99.66% con-
cise. The integration process yielded an increase in concise-
ness of roughly 10% for areaTotal and populationTotal,
with only a minor increase for foundingDate, which was already very concise in the original data sets.
Consistency.
A data set is consistent if it is free of conflicting informa-
tion. In the context of this paper, the consistency of a data
set is measured by considering properties with cardinality 1
that contain more than one (distinct) value.
We have defined the consistency of a data set for a given
property p to measure the proportion of objects that do not
contain more than one distinct value for p, relative to all objects that have a value for p (Equation 5).
Consistency(p) = ||obj. without conflicts for p in data set|| / ||all uniq. obj. with p in data set||    (5)
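The sketch below shows one way to compute these property-level measures (Equations 4 and 5) over a multi-valued view of the fused data; the Map-based layout (object URI to property to values) is an assumption for illustration.

object PropertyQualityMeasures {
  type DataSet = Map[String, Map[String, Seq[String]]] // object URI -> property -> values

  // Conciseness(p): objects whose values for p contain no identical duplicates,
  // relative to all objects that have a value for p (Equation 4).
  def conciseness(p: String, data: DataSet): Double = {
    val withP = data.values.map(_.getOrElse(p, Seq.empty)).filter(_.nonEmpty).toSeq
    if (withP.isEmpty) 1.0 else withP.count(vs => vs.distinct.size == vs.size).toDouble / withP.size
  }

  // Consistency(p): objects with at most one distinct value for p,
  // relative to all objects that have a value for p (Equation 5).
  def consistency(p: String, data: DataSet): Double = {
    val withP = data.values.map(_.getOrElse(p, Seq.empty)).filter(_.nonEmpty).toSeq
    if (withP.isEmpty) 1.0 else withP.count(vs => vs.distinct.size == 1).toDouble / withP.size
  }
}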
The increase in Consistency(p) for each of the properties
p in our use case is shown in Table 2. The property found-
ingDate had no conflicts observed in the original data. How-
ever, we observed an increase in consistency of roughly 10%
for areaTotal and populationTotal which were 90.48% and
90.69% consistent in the original data.
7. CONCLUSION
We have described Sieve, a Linked Data Quality Assess-
ment and Data Fusion module. Sieve is employed by the
Linked Data Integration Framework (LDIF) after the Schema
Mapping and Identity Resolution steps. Sieve’s role is to
assess the quality of the integrated data and subsequently
decide on which values to keep, discard or transform accord-
ing to user-configured quality assessment metrics and fusion
functions. Sieve is agnostic to provenance and quality vocab-
ularies, allowing users to configure which metadata to read,
and which functions to apply via a declarative specification
language.
Through a use case that imported data about Brazilian
municipalities from international DBpedia editions, we have
demonstrated the usage of Sieve in a simple scenario that
yielded an integrated data set that was more complete, con-
cise, consistent and up to date than the original sources. The
English DBpedia, although considered the most mature of
the international DBpedia editions, did not have a particu-
larly high coverage of Brazilian municipalities, and benefited
from the higher coverage offered by the Portuguese DBpedia
for those particular items. Furthermore, as the population
of a city is prone to change, it is important to keep the most
recent values. Blending values from multiple DBpedia edi-
tions allowed us to include the most recent value among the
sources. However, identity resolution and schema mapping
introduced multiple values for the same properties (values
originating from different sources). The application of data
fusion through Sieve allowed us to remove redundant and
conflicting values, increasing the conciseness and the consis-
tency of the data set.
Future work includes the development of more quality as-
sessment scoring functions and data fusion functions, as well
as performance and scalability experiments.
8. ACKNOWLEDGEMENTS
The authors would like to thank Andreas Schultz, Robert
Isele and Anja Jentzsch for their valuable comments on this
manuscript. We also thank them, as well as Andrea Matteini
and Christian Becker for their work on other components of
LDIF.
This work was supported by the EU FP7 grants LOD2
- Creating Knowledge out of Interlinked Data (Grant No.
257943) and PlanetData - A European Network of Excel-
lence on Large-Scale Data Management (Grant No. 257641).
9. REFERENCES
[1] C. Bizer and R. Cyganiak. Quality-driven information
filtering using the WIQA policy framework. Web
Semant., 7:1–10, January 2009.
[2] C. Bizer and A. Schultz. The R2R Framework: Publishing and discovering mappings on the Web. 2010.
[3] J. Bleiholder and F. Naumann. Declarative Data
Fusion: Syntax, Semantics, and Implementation.
pages 58–73. 2005.
[4] J. Bleiholder and F. Naumann. Conflict handling
strategies in an integrated information system. In
Proceedings of the International Workshop on
Information Integration on the Web (IIWeb),
Edinburgh, UK, 2006.
[5] J. Bleiholder and F. Naumann. Data fusion. ACM
Comput. Surv., 41:1:1–1:41, January 2009.
[6] J. J. Carroll, C. Bizer, P. J. Hayes, and P. Stickler.
Named graphs. J. Web Sem., 3(4):247–267, 2005.
[7] K. G. Clark, L. Feigenbaum, and E. Torres. SPARQL
Protocol for RDF. W3C Recommendation, January 2008.
[8] T. Heath and C. Bizer. Linked data: evolving the web
into a global data space. Morgan and Claypool, [San
Rafael, Calif.], 2011.
[9] R. Isele, A. Jentzsch, and C. Bizer. Silk Server -
Adding missing Links while consuming Linked Data.
In 1st International Workshop on Consuming Linked
Data (COLD 2010), Shanghai, 2010.
[10] A. Jentzsch, C. Bizer, and R. Cyganiak. State of the
LOD Cloud, September 2011.
[11] J. Juran. The Quality Control Handbook.
McGraw-Hill, New York, 3rd edition, 1974.
[12] F. Naumann. Quality-Driven Query Answering for
Integrated Information Systems. Springer, Berlin
Heidelberg New York, 2002.
[13] A. Schultz, A. Matteini, R. Isele, C. Bizer, and
C. Becker. LDIF - Linked Data Integration Framework.
2011.
[14] J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov.
Discovering and maintaining links on the web of data.
In The Semantic Web ISWC 2009: 8th International
Semantic Web Conference, Chantilly, VA, USA, pages
650–665. 2009.