Reifying RDF: What Works Well With Wikidata?
Daniel Hernández1, Aidan Hogan1, and Markus Krötzsch2
1Department of Computer Science, University of Chile
2Technische Universität Dresden, Germany
Abstract. In this paper, we compare various options for reifying RDF
triples. We are motivated by the goal of representing Wikidata as RDF,
which would allow legacy Semantic Web languages, techniques and tools
– for example, SPARQL engines – to be used for Wikidata. However,
Wikidata annotates statements with qualifiers and references, which re-
quire some notion of reification to model in RDF. We thus investigate
four such options: (1) standard reification, (2) n-ary relations, (3) single-
ton properties, and (4) named graphs. Taking a recent dump of Wikidata,
we generate the four RDF datasets pertaining to each model and discuss
high-level aspects relating to data sizes, etc. To empirically compare the
effect of the different models on query times, we collect a set of bench-
mark queries with four model-specific versions of each query. We present
the results of running these queries against five popular SPARQL imple-
mentations: 4store, BlazeGraph, GraphDB, Jena TDB and Virtuoso.
1 Introduction
Wikidata is a collaboratively edited knowledge-base under development by the
Wikimedia foundation whose aim is to curate and represent the factual informa-
tion of Wikipedia (across all languages) in an interoperable, machine-readable
format [20]. Until now, such factual information has been embedded within mil-
lions of Wikipedia articles spanning hundreds of languages, often with a high
degree of overlap. Although initiatives like DBpedia [15] and YAGO [13] have
generated influential knowledge-bases by applying custom extraction frameworks
over Wikipedia, the often ad hoc way in which structured data is embedded in
Wikipedia limits the amount of data that can be cleanly captured. Likewise,
when information is gathered from multiple articles or multiple language versions
of Wikipedia, the results may not always be coherent: facts that are mirrored
in multiple places must be manually curated and updated by human editors,
meaning that they may not always correspond at a given point in time.
Therefore, by allowing human editors to collaboratively add, edit and curate
a centralised, structured knowledge-base directly, the goal of Wikidata is to keep
a single consistent version of factual data relevant to Wikipedia. The resulting
knowledge-base is not only useful to Wikipedia – for example, for automatically
generating articles that list entities conforming to a certain restriction (e.g.,
female heads of state) or for generating infobox data consistently across all lan-
guages – but also to the Web community in general. Since the launch of Wikidata
in October 2012, more than 80 thousand editors have contributed information on
18 million entities (data items and properties). In comparison, English Wikipedia
has around 6 million pages, 4.5 million of which are considered proper articles.
As of July 2015, Wikidata has gathered over 65 million statements.
One of the next goals of the Wikidata project is to explore methods by which
the public can query the knowledge-base. The Wikidata developers are currently
investigating the use of RDF as an exchange format for Wikidata with SPARQL
query functionality. Indeed, the factual information of Wikidata corresponds
quite closely with the RDF data model, where the main data item (entity) can
be viewed as the subject of a triple and the attribute–value pairs associated with
that item can be mapped naturally to predicates and objects associated with the
subject. However, Wikidata also allows editors to annotate attribute–value pairs
with additional information, such as qualifiers and references. Qualifiers provide
context for the validity of the statement in question, for example providing a time
period during which the statement was true. References point to authoritative
sources from which the statement can be verified. About half of the statements
in Wikidata (32.5 million) already provide a reference, and it is an important
goal of the project to further increase this number.
Hence, to represent Wikidata in RDF while capturing meta-information such
as qualifiers and references, we need some way in RDF to describe the RDF
triples themselves (which herein we will refer to as “reification” in the general
sense of the term, as distinct from the specific proposal for reification defined in
the 2004 RDF standard [3], which we refer to as “standard reification”).
In relation to Wikidata, we need a method that is compatible with existing
Semantic Web standards and tools, and that does not consider the domain of
triple annotation as fixed in any way: in other words, it does not fix the domain
of annotations to time, or provenance, or so forth [22]. With respect to general
methods for reification within RDF, we identify four main options:
standard reification (sr), whereby an RDF resource is used to denote the
triple itself, denoting its subject, predicate and object as attributes and
allowing additional meta-information to be added [16,4].
n-ary relations (nr), whereby an intermediate resource is used to denote the
relationship, allowing it to be annotated with meta-information [16,8].
singleton properties (sp), whereby a predicate unique to the statement is
created, which can be linked to the high-level predicate indicating the
relationship, and onto which can be added additional meta-information [18].
named graphs (ng), whereby triples (or sets thereof) can be identified in a
fourth field using, e.g., an IRI, onto which meta-information is added [10,5].
Any of these four options would allow the qualified statements in the Wiki-
data knowledge-base to be represented and processed using current Semantic
Web norms. In fact, Erxleben et al. [8] previously proposed an RDF represen-
tation of the Wikidata knowledge-base using a form of n-ary relations. It is
important to note that while the first three formats rely solely on the core RDF
model, Named Graphs represents an extension of the traditional triple model,
adding a fourth element; however, the notion of Named Graphs is well-supported
in the SPARQL standard [10], and as “RDF Datasets” in RDF 1.1 [5].
Thus arises the core question tackled in this paper: what are the relative
strengths and weaknesses of each of the four formats? We are particularly inter-
ested in exploring this question quantitatively with respect to Wikidata. We thus
take a recent dump of Wikidata and create four RDF datasets: one for each of
the formats listed above. The focus of this preliminary paper is to gain empirical
insights on how these formats affect query execution times for off-the-shelf tools.
We thus (attempt to) load these four datasets into five popular SPARQL engines
– namely 4store [9], BlazeGraph (formerly BigData) [19], GraphDB (formerly
(Big)OWLIM) [2], Jena TDB [21], and Virtuoso [7] – and apply four versions of
a query benchmark containing 14 queries to each dataset in each engine.
2 The Wikidata Data-model
Figure 1a provides an example statement taken from Wikidata describing the
entity Abraham Lincoln. We show internal identifiers in grey, where those begin-
ning with Q refer to entities, and those beginning with P refer to properties. These
identifiers map to IRIs, where information about that entity or relationship can
be found. All entities and relationships are also associated with labels, where
the English versions are shown for readability. Values of properties may also be
datatype literals, as exemplified with the dates.
The snippet contains a primary relation, with Abraham Lincoln as subject, po-
sition held as predicate, and President of the United States of America as object. Such
binary relations are naturally representable in RDF. However, the statement is
also associated with some qualifiers and their values. Qualifiers are property
terms such as start time, follows, etc., whose values may scope the validity of the
statement and/or provide additional context. Additionally, statements are often
associated with one or more references that support the claims and with a rank
that marks the most important statements for a given property. The details are
not relevant to our research: we can treat references and ranks as special types
of qualifiers. We use the term statement to refer to a primary relation and its
associated qualifications; e.g., Fig. 1a illustrates a single statement.
Conceptually, one could view Wikidata as a “Property Graph”: a directed
labelled graph where edges themselves can have attributes and values [11,6]. A
related idea would be to consider Wikidata as consisting of quins of the form
(s, p, o, q, v), where (s, p, o) refers to the primary relation, q is a qualifier prop-
erty, and v is a qualifier value [12]. Referring to Fig. 1a again, we could encode a
quin (:Q91, :P39, :Q11696, :P580, "1861-03-04"^^xsd:date), which states that
Abraham Lincoln had the relation position held to President of the United States of Amer-
ica under the qualifier property start time with the qualifier value 4 March 1861. All
quins with a common primary relation would constitute a statement. However,
quins of this form are not a suitable format for Wikidata since a given primary
relation may be associated with different groupings of qualifiers. For example,
Grover Cleveland was President of the United States for two non-consecutive
terms (i.e., with different start and end times, different predecessors and succes-
sors). In Wikidata, this is represented as two separate statements whose primary
relations are both identical, but where the qualifiers (start time, end time, follows,
followed by) differ. For this reason, reification schemes based conceptually on
quins – such as RDF* [12,11] – may not be directly suitable for Wikidata.
A tempting fix might be to add an additional column and represent Wikidata
using sextuples of the form (s, p, o, q, v, i), where i is an added statement identifier.
Thus in the case of Grover Cleveland, his two non-consecutive terms would be
represented as two statements with two distinct statement identifiers. While in
principle sextuples would be sufficient, in practice (i) the relation itself may
contain nulls, since some statements do not currently have qualifiers or may
not even support qualifiers (as is the case with labels, for example), (ii) qualifier
values may themselves be complex and require some description: for example,
dates may be associated with time precisions or calendars.3
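For illustration, the statement of Fig. 1a might be written under this (ultimately
rejected) sextuple scheme as follows – a sketch on our part, reusing the statement
identifier :X1 of Table 1:
    :Q91 :P39 :Q11696 :P580 "1861-03-04"^^xsd:date :X1
    :Q91 :P39 :Q11696 :P582 "1865-04-15"^^xsd:date :X1
    :Q91 :P39 :Q11696 :P155 :Q12325 :X1
    :Q91 :P39 :Q11696 :P156 :Q8612 :X1
Note how the primary relation is repeated for every qualifier, and how complex
values such as dates cannot be further described.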
For this reason, we propose to view Wikidata conceptually in terms of two
tables: one containing quads of the form (s, p, o, i), where (s, p, o) is a primary
relation and i is an identifier for that statement; the other a triple table storing
(i) primary relations that can never be qualified (e.g., labels) and thus do not
need to be identified, (ii) triples of the form (i, q, v) that specify the qualifiers as-
sociated to a statement, and (iii) triples of the form (v, x, y) that further describe
the properties of qualifier values. Table 1 provides an example that encodes some
of the data seen in Fig. 1a – as well as some further type information not shown
– into two such tables: the quads table on the left encodes qualifiable primary
relations with an identifier, and the triples table on the right encodes (i) qual-
ifications using the statement identifiers, (ii) non-qualifiable primary relations,
such as those that specify labels, and (iii) type information for complex values,
such as to provide a precision, calendar, etc.
Compared to sextuples, the quad/triple schema only costs one additional
tuple per statement, will lead to dense instances (even if some qualifiable primary
relations are not currently qualified), and will not repeat the primary relation
for each qualifier; conversely, the quad/triple schema may require more joins for
certain query patterns (e.g., find primary relations with a follows qualifier).
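To make the trade-off concrete, the query “find primary relations with a follows
qualifier” could be sketched as follows, using the abstract quad syntax of the
benchmark queries in Section 5 (:P155 is the follows property; the formulation
is illustrative):
    SELECT ?s ?p ?o WHERE { <?s ?p ?o ?i> . ?i :P155 ?prev . }
The extra join through the statement identifier ?i is the step that a flat sextuple
relation would avoid.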
Likewise, the quad/triple schema is quite close to an RDF-compatible encod-
ing. As per Fig. 1b, the triples from Table 1 are already an RDF graph; we can
thus focus on encoding quads of the form (s, p, o, i) in RDF.
3 From Higher Arity Data to RDF (and back)
The question now is: how can we represent the statements of Wikidata as triples
in RDF? Furthermore: how many triples would we need per statement? And
how might we know for certain that we don’t lose something in the translation?
The transformation from Wikidata to RDF can be seen as an instance of
schema translation, where these questions then directly relate to the area of
3See https://www.wikidata.org/wiki/Special:ListDatatypes; retr. 2015/07/11.
Abraham Lincoln [Q91]
position held [P39] President of the United States of America [Q11696]
start time [P580] “4 March 1861”
end time [P582] “15 April 1865”
follows [P155] James Buchanan [Q12325]
followed by [P156] Andrew Johnson [Q8612]
(a) Raw Wikidata format
(b) Qualifier information common to all formats
(c) Standard reification
(d) n-ary relations
(e) Singleton properties
(f) Named Graphs
Fig. 1: Reification examples (graph diagrams not reproduced; the corresponding
triples are given in Table 1 and Section 4)
Table 1: Using quads and triples to encode Wikidata

Quads:
  s      p      o          i
  :Q91   :P39   :Q11696    :X1
  ...    ...    ...        ...

Triples:
  s      p                    o
  :X1    :P580                :D1
  :X1    :P582                :D2
  :X1    :P155                :Q12325
  :X1    :P156                :Q8612
  ...    ...                  ...
  :D1    :time                "1861-03-04"^^xsd:date
  :D1    :timePrecision       "11"
  :D1    :preferredCalendar   :Q1985727
  ...    ...                  ...
  :Q91   rdfs:label           "Abraham Lincoln"@en
  ...    ...                  ...
relative information capacity in the database community [14,17], which studies
how one can translate from one database schema to another, and what sorts
of guarantees can be made based on such a translation. Miller et al. [17] relate
some of the theoretical notions of information capacity to practical guarantees
for common schema translation tasks. In this view, the task of translating from
Wikidata to RDF is a unidirectional scenario: we want to query a Wikidata
instance through an RDF view without loss of information, and to recover the
Wikidata instance from the RDF view, but we do not require, e.g., updates on
the RDF view to be reflected in Wikidata. We therefore need a transforma-
tion whereby the RDF view dominates the Wikidata schema, meaning that the
transformation must map a unique instance of Wikidata to a unique RDF view.
We can formalise this by considering an instance of Wikidata as a database
with the schema shown in Table 1. We require that any instance of Wikidata
can be mapped to an RDF graph, and that any conjunctive query (CQ; select-
project-join query in SQL) over the Wikidata instance can be translated to a
conjunctive query over the RDF graph that returns the same answers. We call
such a translation query dominating. We do not further restrict how translations
are to be specified so long as they are well-defined and computable.
A query dominating translation of a relation of arity 4 (R4) to a unique
instance of a relation of arity 3 (R3) must lead to a higher number of tuples in
the target instance. In general, we can always achieve this by encoding a quad
into four triples of the form (s, p_y, o_y), where s is a unique ID for the quad, p_y
denotes position y in the quad (where 1 ≤ y ≤ 4), and o_y is the term appearing
in that position of the quad in R4.
Example 1. The quad :Q91 :P39 :Q11696 :X1 can be mapped to four triples:
:S1 :P1 :Q91
:S1 :P2 :P39
:S1 :P3 :Q11696
:S1 :P4 :X1
Any conjunctive query over any set of quads can be translated into a conjunctive
query over the corresponding triple instance that returns the same answer: for
each tuple in the former query, add four tuples to the latter query with a fresh
common subject variable, with :P1 . . . :P4 as predicate, and with the correspond-
ing terms from the quad (be they constants or variables) as objects. The fresh
subject variables can be made existential/projected away.
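As an illustrative sketch (the query is ours), the conjunctive query “return all o
and i such that (:Q91, :P39, o, i) is a quad” translates to the following SPARQL
query over the four-triple encoding:
    SELECT ?o ?i WHERE { ?s :P1 :Q91 . ?s :P2 :P39 . ?s :P3 ?o . ?s :P4 ?i . }
Here the fresh subject variable ?s plays the role of the existential variable that
is projected away.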
With this scheme, we require 4k triples to encode k quads. This encoding can
be generalised to encode database tables of arbitrary arity, which is essentially
the approach taken by the Direct Mapping of relational databases to RDF [1].
Quad encodings that use fewer than four triples usually require additional as-
sumptions. The above encoding requires the availability of an unlimited amount
of identifiers, which is always given in RDF, where there is an infinite supply of
IRIs. However, the technique does not assume that auxiliary identifiers such as
:S1 or :P4 do not occur elsewhere: even if these IRIs are used in the given set of
quads, their use in the object position of triples would not cause any confusion.
If we make the additional assumption that some “reserved” identifiers are not
used in the input quad data, we can find encodings with fewer triples per quad.
Example 2. If we assume that IRIs :P1 and :P2 are not used in the input
instance, the quad of Example 1 can be encoded in the following three triples:
:S1 :P1 :Q91
:S1 :P2 :P39
:S1 :Q11696 :X1
The translation is not faithful when the initial assumption is violated. For ex-
ample, the encoding of the quad :Q91 :P39 :P1 :Q92 would contain triples :S2
:P1 :Q91 and :S2 :P1 :Q92, which would be ambiguous.
The assumption of some reserved IRIs is still viable in practice, and indeed
three of the encoding approaches we look at in the following section assume
some reserved vocabulary to be available. Using suitable naming strategies, one
can prevent ambiguities. Other domain-specific assumptions are also sometimes
used to define suitable encodings. For example, when constructing quads from
Wikidata, the statement identifiers that are the fourth component of each quad
functionally determine the first three components, and this can be used to sim-
plify the encoding. We will highlight these assumptions as appropriate.
4 Existing Reification Approaches
In this section, we discuss how legacy reification-style approaches can be lever-
aged to model Wikidata in RDF, where we have seen that all that remains is to
model quads in RDF. We also discuss the natural option of using named graphs,
which support quads directly. The various approaches are illustrated in Fig. 1,
where 1b shows the qualifier information common to all approaches, i.e., the
triple data, while 1c–1f show alternative encodings of quads.
Standard Reification The first approach we look at is standard RDF reifica-
tion [16,4], where a resource is used to denote the statement, and where addi-
tional information about the statement can be added. The scheme is depicted in
Fig. 1c. To represent a quad of the form (s, p, o, i), we add the following triples:
(i, r:subject, s), (i, r:predicate, p), (i, r:object, o), where r: is the RDF vo-
cabulary namespace. We omit the redundant declaration as type r:Statement,
which can be inferred from the domain of r:subject. Moreover, we simplify
the encoding by using the Wikidata statement identifier as subject, rather than
using a blank node. We can therefore represent n quadruples with 3n triples.
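For concreteness, a sketch of the primary relation of Fig. 1a under this scheme,
with the statement identifier :X1 as subject, alongside one of its qualifiers:
    :X1 r:subject :Q91 ; r:predicate :P39 ; r:object :Q11696 .
    :X1 :P580 :D1 .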
n-ary Relation The second approach is to use an n-ary relation style of mod-
elling, where a resource is used to identify a relationship. Such a scheme is
depicted in Fig. 1d, which follows the proposal by Erxleben et al. [8]. Instead
of stating that a subject has a given value, the model states that the subject
is involved in a relationship, and that that relationship has a value and some
qualifiers. The :statementProperty and :valueProperty edges are important to
be able to query for the original name of the property holding between a given
subject and object.4 For identifying the relationship, we can simply use the state-
ment identifier. To represent a quadruple of the form (s, p, o, i), we must add the
triples (s, p_s, i), (i, p_v, o), (p_v, :valueProperty, p), (p_s, :statementProperty, p),
where p_v and p_s are fresh properties created from p. To represent n quadruples
of the form (s, p, o, i), with m unique values for p, we thus need 2(n+m) triples.
Note that for the translation to be query dominating, we must assume that
predicates such as :P39s and :P39v in Fig. 1d, as well as the reserved terms
:statementProperty and :valueProperty, do not appear in the input.
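A sketch of the same statement under this scheme, where :P39s and :P39v are
the fresh properties derived from :P39:
    :Q91 :P39s :X1 . :X1 :P39v :Q11696 .
    :P39s :statementProperty :P39 . :P39v :valueProperty :P39 .
    :X1 :P580 :D1 .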
Singleton Properties Originally proposed by Nguyen et al. [18], the core idea
behind singleton properties is to create a property that is only used for a sin-
gle statement, which can then be used to annotate more information about the
statement. The idea is captured in Fig. 1e. To represent a quadruple of the
form (s, p, o, i), we must add the triples (s, i, o), (i, :singletonPropertyOf, p).
Thus to represent n quadruples, we need 2n triples, making this the most con-
cise scheme so far. To be query dominating, we must assume that the term
:singletonPropertyOf cannot appear as a statement identifier.
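Again as a sketch for the same statement, the statement identifier :X1 itself acts
as the singleton predicate:
    :Q91 :X1 :Q11696 .
    :X1 :singletonPropertyOf :P39 .
    :X1 :P580 :D1 .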
Named Graphs Unlike the previous three schemes, Named Graphs extends
the RDF triple model and considers sets of pairs of the form (G, n), where G is
an RDF graph and n is an IRI (or a blank node in some cases, or can even be
omitted for a default graph). We can flatten this representation by taking the
union over G × {n} for each such pair, resulting in quadruples. Thus we can
encode a quadruple (s, p, o, i) directly using N-Quads, as illustrated in Fig. 1f.
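A sketch of the same statement in N-Quads syntax, with the statement identifier
:X1 as the graph name (we show the qualifier triple in the default graph, which
is one possible placement):
    :Q91 :P39 :Q11696 :X1 .
    :X1 :P580 :D1 .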
4 Referring to Fig. 1d, another option to save some triples might be to use the original
property :P39 (position held) instead of :P39s or :P39v, but this could be conceptually
ugly since, e.g., if we replaced :P39s, the resulting triple would be effectively stating
that (Abraham Lincoln, position held, [a statement identifier]).
Table 2: Number of triples needed to model quads (n = 57,088,184, p = 1,311)

  Schema:  sr (3n)       nr (2(n+p))   sp (2n)       ng (n)
  Tuples:  171,264,552   114,178,990   114,176,368   57,088,184
Other possibilities? Having introduced the most well-known options for reifi-
cation, one may ask if these are all the reasonable alternatives for representing
quadruples of the form (s, p, o, i) – where i functionally determines (s, p, o) – in
a manner compatible with the RDF standards. As demonstrated by singleton
properties, we can encode such quads into two triples, where i appears somewhere
in both triples, and where s, p and o each appear in one of the four remaining
positions, and where a reserved term is used to fill the last position. This gives
us 108 possible schemes5 that use two triples to represent a quad in a similar
manner to the singleton properties proposal. Just to take one example, we could
model such a quad in two triples as (i, r:subject, s), (i, p, o) – an abbreviated
form of standard reification. As before, we should assume that the properties p
and r:subject are distinct from qualifier properties. Likewise, if we are not so
concerned with conciseness and allow a third triple, the possibilities increase fur-
ther. To reiterate, our current goal is to evaluate existing, well-known proposals,
but we wish to mention that many other such possibilities do exist in theory.
5 SPARQL Querying Experiments
We have looked at four well-known approaches to annotate triples, in terms of
how they are formed, what assumptions they make, and how many triples they
require. In this section, we aim to see how these proposals work in practice,
particularly in the context of querying Wikidata. Experiments were run on an
Intel E5-2407 Quad-Core 2.2GHz machine with a standard SATA hard-drive and
32 GB of RAM. More details about the configuration of these experiments are
available from http://users.dcc.uchile.cl/~dhernand/wrdf/.
To start with, we took the RDF export of Wikidata from Erxleben et al. [8]
(2015-02-23), which was natively in an n-ary relation style format, and built the
equivalent data for all four datasets. The number of triples common to all formats
was 237.6 million. With respect to representing the quads, Table 2 provides a
breakdown of the number of output tuples for each model.
5 There are 3! possibilities for where i appears in the two triples: ss, pp, oo, sp, so,
po. For the latter three configurations, all four remaining slots are distinguished so
we have 4! ways to slot in the last four terms. For the former three configurations,
both triples are thus far identical, so we only have half the slots, making 4 × 3
permutations. Putting it all together, 3 × 4! + 3 × 4 × 3 = 108.
[Figure 2 plots not reproduced: one panel per engine (4store, BlazeGraph, GraphDB,
Jena, Virtuoso); x-axes: Statements (×10³); y-axes: Index Size (MB); series:
Standard Reification, n-ary Relations, Singleton Properties, Named Graphs.]
Fig. 2: Growth in index sizes for first 400,000 statements
Loading data: We selected five RDF engines for experiments: 4store, Blaze-
Graph, GraphDB, Jena and Virtuoso. The first step was to load the four datasets
for the four models into each engine. We immediately started encountering prob-
lems with some of the engines. To quantify these issues, we created three collec-
tions of 100,000, 200,000, and 400,000 raw statements and converted them into
the four models.6 We then tried to load these twelve files into each engine. The
resulting growth in on-disk index sizes is illustrated in Figure 2 (measured from
the database directory), where we see that: (i) even though different models lead
to different triple counts, index sizes were often nearly identical: we believe that
since the entropy of the data is quite similar, compression manages to factor
out the redundant repetitions in the models; (ii) some of the indexes start with
some space allocated, where in fact for BlazeGraph, the initial allocation of disk
space (200MB) was not affected by the data loads; (iii) 4store and GraphDB
both ran into problems when loading singleton properties, where it seems the
indexing schemes used assume a low number of unique predicates.7 With respect
to point (iii), given that even small samples lead to blow-ups in index sizes, we
decided not to proceed with indexing the singleton properties dataset in 4store
or GraphDB. While later loading the full named graphs dataset, 4store slowed
to loading 24 triples/second; we thus also had to kill that index load.8
6We do not report the times for full index builds since, due to time constraints, we
often ran these in parallel uncontrolled settings.
7In the case of 4store, for example, in the database directory, two new files were
created for each predicate.
8See https://groups.google.com/forum/#!topic/4store-support/uv8yHrb-ng4;
retr. 2015/07/11.
Benchmark queries: From two online lists of test-case SPARQL queries, we
selected a total of 14 benchmark queries.9 These are listed in Table 3; since we
need to create four versions of each query for each reification model, we use
an abstract quad syntax where necessary, which will be expanded in a model-
specific way such that the queries will return the same answers over each model.
An example of a quad in the abstract syntax and its four expansions is provided
in Table 4. In a similar manner, the 14 queries of Table 3 are used to generate
four equivalent query benchmarks for testing, making a total of 56 queries.
In terms of the queries themselves, they exhibit a variety of query features
and number/type of joins; some involve qualifier information while some do not.
Some of the queries are related; for example, Q1 and Q2 both ask for information
about US presidents, and Q4 and Q5 both ask about female mayors. Importantly,
Q4 and Q5 both require use of a SPARQL property path (:P31/:P279*), which
we cannot capture appropriately in the abstract syntax. In fact, these property
paths cannot be expressed in either the singleton properties model or the stan-
dard reification model; they can only be expressed in the n-ary relation model
(:P31s/:P31v/(:P279s/:P279v)*), and the named graph model (:P31/:P279*)
assuming the default graph can be set as the union of all graphs (since one cannot
do property paths across graphs in SPARQL, only within graphs).
Query results: For each engine and each model, we ran the queries sequentially
(Q1–Q14) five times on a cold index. Since the engines had varying degrees of
caching behaviour after the first run – which is not the focus of this paper – we
present times for the first “cold” run of each query.10 Since we aim to run 14 ×
4 × 5 × 5 = 1,400 query executions, to keep the experiment duration manageable,
all engines were configured for a timeout of 60 seconds. Since different engines
interpret timeouts differently (e.g., connection timeouts, overall timeouts, etc.),
we considered any query taking longer than 60 seconds to run as a timeout. We
also briefly inspected results to see that they corresponded with equivalent runs,
ruling queries that returned truncated results as failed executions.11
The query times for all five engines and four models are reported in Figure 3,
where the y-axis is in log scale from 10 ms to 60,000 ms (the timeout) in all
cases for ease of comparison. Query times are not shown in cases where the
query could not be run (for example, the index could not be built as discussed
previously, or property-paths could not be specified for that model, or in the
case of 4store, certain query features were not supported), or where the query
failed (with a timeout, a partial result, or a server exception). We see that in
terms of engines, Virtuoso provides the most reliable/performant results across
all models. Jena failed to return answers for singleton properties, timing-out on
all queries (we expect some aspect of the query processing does not perform well
assuming many unique predicates).
9https://www.mediawiki.org/wiki/Wikibase/Indexing/SPARQL_Query_Examples
and http://wikidata.metaphacts.com/resource/Samples
10 Data for other runs are available from the web-page linked earlier.
11 The results for queries such as Q5, Q7 and Q14 may (validly) return different answers
for different executions due to use of LIMIT without an explicit order.
Table 3: Evaluation queries in abstract syntax
#Q1: US presidents and their wives
SELECT ?up ?w ?l ?wl WHERE { <:Q30 :P6 ?up _:i1> . <?up :P26 ?w ?i2> . OPTIONAL {
?up rs:label ?l . ?w rs:label ?wl . FILTER(lang(?l) = "en" && lang(?wl) = "en") } }
#Q2: US presidents and causes of death
SELECT ?h ?c ?hl ?cl WHERE { <?h :P39 :Q11696 ?i1> . <?h :P509 ?c ?i2> . OPTIONAL {
?h rs:label ?hl . ?c rs:label ?cl . FILTER(lang(?hl) = "en" && lang(?cl) = "en") } }
#Q3: People born before 1880 with no death date
SELECT ?h ?date WHERE { <?h :P31 :Q5 ?i1> . <?h :P569 ?dateS ?i2> . ?dateS :time ?date .
FILTER NOT EXISTS { ?h :P570s [ :P570v ?d ] . }
FILTER (datatype(?date) = xsd:date && ?date < "1880-01-01Z"^^xsd:date) } LIMIT 100
#Q4: Cities with female mayors ordered by population
SELECT DISTINCT ?city ?citylabel ?mayorlabel (MAX(?pop) AS ?max_pop) WHERE {
?city :P31/:P279* :Q515 . <?city :P6 ?mayor ?i1> . FILTER NOT EXISTS { ?i1 :P582q ?x }
<?mayor :P21 :Q6581072 _:i2> . <?city :P1082 ?pop _:i3> . ?pop :numericValue ?pop .
OPTIONAL { ?city rs:label ?citylabel . FILTER ( LANG(?citylabel) = "en" ) }
OPTIONAL { ?mayor rs:label ?mayorlabel . FILTER ( LANG(?mayorlabel) = "en" ) } }
GROUP BY ?city ?citylabel ?mayorlabel ORDER BY DESC(?max_pop) LIMIT 10
#Q5: Countries ordered by number of female city mayors
SELECT ?country ?label (COUNT(*) as ?count) WHERE { ?city :P31/:P279* :Q515 .
<?city :P6 ?mayor ?i1> . FILTER NOT EXISTS { ?i1 :P582q ?x }
<?mayor :P21 :Q6581072 ?i2> . <?city :P17 ?country ?i3> . ?pop :numericValue ?pop .
OPTIONAL { ?country rs:label ?label . FILTER ( LANG(?label) = "en" ) } }
GROUP BY ?country ?label ORDER BY DESC(?count) LIMIT 100
#Q6: US states ordered by number of neighbouring states
SELECT ?state ?label ?borders WHERE { { SELECT ?state (COUNT(?neigh) as ?borders)
WHERE { <?state :P31 :Q35657 ?i1> . <?neigh :P47 ?state _:i2> .
<?neigh :P31 :Q35657 ?i3 > . } GROUP BY ?state }
OPTIONAL { ?state rs:label ?label . FILTER(lang(?label) = "en") } } ORDER BY DESC(?borders)
#Q7: People whose birthday is “today”
SELECT DISTINCT ?entity ?year WHERE { <?entityS :P569 ?value ?i1> . ?value :time ?date .
?entityS rs:label ?entity . FILTER(lang(?entity)="en")
FILTER(date(?date)=month(now()) && date(?date)=day(now())) } LIMIT 10
#Q8: All property–value pairs for Douglas Adams
SELECT ?property ?value WHERE { <:Q42 ?property ?value ?i> }
#Q9: Populations of Berlin, ordered by least recent
SELECT ?pop ?time WHERE { <:Q64 wd:P1082 ?popS ?i> . ?popS :numericValue ?pop .
?i wd:P585q [ :time ?time ] . } ORDER BY (?time)
#Q10: Countries without an end-date
SELECT ?country ?countryName WHERE { <?country wd:P31 wd:Q3624078 ?i> .
FILTER NOT EXISTS { ?g :P582q ?endDate } ?country rdfs:label ?countryName .
FILTER(lang(?countryName)="en") }
#Q11: US Presidents and their terms, ordered by start-date
SELECT ?president ?start ?end WHERE { <:Q30 :P6 ?president ?i > .
?g :P580q [ :time ?start ] ; :P582q [ :time ?end ] . } ORDER BY (?start)
#Q12: All qualifier properties used with "head of government" property
SELECT DISTINCT ?q_p WHERE { <?s :P6 ?o ?i > . ?g ?q_p ?q_v . }
#Q13: Current countries ordered by most neighbours
SELECT ?countryName (COUNT (DISTINCT ?neighbor) AS ?neighbors) WHERE {
<?country :P31 :Q3624078 ?i1> . FILTER NOT EXISTS { ?i1 :P582 ?endDate }
?country rs:label ?countryName FILTER(lang(?countryName)="en") OPTIONAL {
<?country :P47 ?neighbor ?i2 > <?neighbor :P31 :Q3624078 ?i3> .
FILTER NOT EXISTS { ?i3 :P582q ?endDate2 } } }
GROUP BY(?countryName) ORDER BY DESC(?neighbors)
#Q14: People who have Wikidata accounts
SELECT ?person ?name ?uname WHERE { <?person :P553 wd:Q52 ?i > .
?i :P554q ?uname . ?person rs:label ?name . FILTER(LANG(?name) = "en") . } LIMIT 100
Table 4: Expansion of abstract syntax quad for getting US presidents
abstract syntax: <:Q30 :P6 ?up ?i> .
std. reification: ?i r:subject :Q30 ; r:predicate :P6 ; r:object ?up .
n-ary relations: :Q30 :P6s ?i . ?i :P6v ?up .
sing. properties: :Q30 ?i ?up . ?i r:singletonPropertyOf :P6 .
named graphs: GRAPH ?i { :Q30 :P6 ?up . }
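As a further sketch on our part, following the same expansion rules, the abstract
quad of Q8, <:Q42 ?property ?value ?i>, would expand in the named graphs and
n-ary relations models as:
    named graphs:    SELECT ?property ?value WHERE { GRAPH ?i { :Q42 ?property ?value } }
    n-ary relations: SELECT ?property ?value WHERE { :Q42 ?ps ?i . ?i ?pv ?value .
                       ?ps :statementProperty ?property . ?pv :valueProperty ?property . }
Note that a variable predicate in the abstract quad costs two extra joins in the
n-ary relations model to recover the original property name.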
We see that both BlazeGraph and GraphDB
managed to process most of the queries for the indexes we could build, but with
longer runtimes than Virtuoso. In general, 4store struggled with the benchmark
and returned few valid responses in the allotted time.
Returning to our focus in terms of comparing the four reification models,
Table 5 provides a summary of how the models ranked for each engine and
overall. For a given engine and query, we look at which model performed best,
counting the number of firsts, seconds, thirds, fourths, failures (fa) and cases
where the query could not be executed (ne); e.g., referring back to Figure 3,
we see that for Virtuoso, Q1, the fastest models in order were named graphs,
n-ary relations, singleton properties, and standard reification. Thus, in Table 5,
under Virtuoso, we add a one to named graphs in the 1st column, a one to n-ary
relations in the 2nd column, and so forth. Along these lines, for example, the
score of 4 for singleton-properties (sp) in the 1st column of Virtuoso means that
this model successfully returned results faster than all other models in 4 out of
the 14 cases. The total column then adds the positions for all engines; across all
five engines, we see that named graphs is the best supported (fastest in 17/70
cases), with standard reification and n-ary relations not far behind (fastest in
16/70 cases). All engines aside from Virtuoso seem to struggle with singleton
properties; presumably these engines make some (arguably naive) assumptions
that the number of unique predicates in the indexed data is low.
Although the first three RDF-level formats would typically require more joins
to be executed than named graphs, the joins in question are through the state-
ment identifier, which is highly selective; assuming “equivalent” query plans,
the increased joins are unlikely to overtly affect performance, particularly when
forming part of a more complex query. However, additional joins do complicate
query planning, where aspects of different data models may affect established
techniques for query optimisation differently. In general, we speculate that in
cases where a particular model was much slower for a particular query and en-
gine, that the engine in question selected a (comparatively) worse query plan.
6 Conclusions
In this paper, we have looked at four well-known reification models for RDF:
standard reification, n-ary relations, singleton properties and named graphs. We
were particularly interested in the goal of modelling Wikidata as RDF, such that
Table 5: Ranking of reification models per query response times

        4store        BlazeGraph    GraphDB       Jena          Virtuoso      Total
        sr nr sp ng | sr nr sp ng | sr nr sp ng | sr nr sp ng | sr nr sp ng | sr nr sp ng
  1st    3  2  0  0 |  2  6  0  3 |  2  2  0  7 |  7  2  0  3 |  2  4  4  4 | 16 16  4 17
  2nd    1  2  0  0 |  3  1  2  4 |  2  5  0  4 |  0  6  0  5 |  2  5  1  6 |  8 19  3 19
  3rd    –  –  –  – |  3  3  1  2 |  7  4  0  0 |  2  2  0  3 |  3  3  2  3 | 15 12  3  8
  4th    –  –  –  – |  1  0  6  2 |  –  –  –  – |  0  0  0  0 |  4  2  4  1 |  5  2 10  3
  fa     5  5  0  0 |  3  4  3  3 |  1  3  0  3 |  3  4 12  3 |  1  0  1  0 | 13 16 16  9
  ne     5  5 14 14 |  2  0  2  0 |  2  0 14  0 |  2  0  2  0 |  2  0  2  0 | 13  5 34 14
it can be indexed and queried by existing SPARQL technologies. We sketched a
conceptual overview of a Wikidata schema based on quads/triples, thus reducing
the goal of modelling Wikidata to that of modelling quads in RDF (quads where
the fourth element functionally specifies the triple), and introduced the four
reification models in this context. We found that singleton-properties offered the
most concise representation on a triple level, but that n-ary predicates was the
only model with built-in support for SPARQL property paths. With respect to
experiments over five SPARQL engines – 4store, BlazeGraph, GraphDB, Jena
and Virtuoso – we found that the former four engines struggled with the high
number of unique predicates generated by singleton properties, and that 4store
likewise struggled with a high number of named graphs. Otherwise, in terms of
query performance, we found no clear winner between standard reification, n-ary
predicates and named graphs. We hope that these results may be insightful for
Wikidata developers – and other practitioners – who wish to select a practical
scheme for querying reified RDF data.
Acknowledgements This work was partially funded by the DFG in projects DIA-
MOND (Emmy Noether grant KR 4381/1-1) and HAEC (CRC 912), by the Millennium
Nucleus Center for Semantic Web Research under Grant No. NC120004 and by Fonde-
cyt Grant No. 11140900.
References
1. Arenas, M., Bertails, A., Prud’hommeaux, E., Sequeda, J. (eds.): Direct Mapping
of Relational Data to RDF. W3C Recommendation (27 September 2012),
http://www.w3.org/TR/rdb-direct-mapping/
2. Bishop, B., Kiryakov, A., Ognyanoff, D., Peikov, I., Tashev, Z., Velkov, R.:
OWLIM: a family of scalable semantic repositories. Sem. Web J. 2(1), 33–42 (2011)
3. Brickley, D., Guha, R. (eds.): RDF Vocabulary Description Language 1.0: RDF
Schema. W3C Recommendation (10 February 2004),
http://www.w3.org/TR/rdf-schema/
4. Brickley, D., Guha, R. (eds.): RDF Schema 1.1. W3C Recommendation (25 February
2014), http://www.w3.org/TR/rdf-schema/
5. Cyganiak, R., Wood, D., Lanthaler, M., Klyne, G., Carroll, J.J., McBride, B. (eds.):
RDF 1.1 Concepts and Abstract Syntax. W3C Recommendation (25 February
2014), http://www.w3.org/TR/rdf11-concepts/
6. Das, S., Srinivasan, J., Perry, M., Chong, E.I., Banerjee, J.: A tale of two graphs:
Property Graphs as RDF in Oracle. In: EDBT. pp. 762–773 (2014),
http://dx.doi.org/10.5441/002/edbt.2014.82
7. Erling, O.: Virtuoso, a hybrid RDBMS/graph column store. IEEE Data Eng. Bull.
35(1), 3–8 (2012)
8. Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D.: Introducing
Wikidata to the linked data web. In: ISWC. pp. 50–65 (2014)
9. Harris, S., Lamb, N., Shadbolt, N.: 4store: The design and implementation of a
clustered RDF store. In: Workshop on Scalable Semantic Web Systems. CEUR-
WS, vol. 517, pp. 94–109 (2009)
10. Harris, S., Seaborne, A., Prud’hommeaux, E. (eds.): SPARQL 1.1 Query
Language. W3C Recommendation (21 March 2013),
http://www.w3.org/TR/sparql11-query/
11. Hartig, O.: Reconciliation of RDF* and Property Graphs. CoRR abs/1409.3288
(2014), http://arxiv.org/abs/1409.3288
12. Hartig, O., Thompson, B.: Foundations of an alternative approach to reification in
RDF. CoRR abs/1406.3399 (2014), http://arxiv.org/abs/1406.3399
13. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: YAGO2: A spatially and
temporally enhanced knowledge base from Wikipedia. Artif. Intell. 194, 28–61
(2013)
14. Hull, R.: Relative information capacity of simple relational database schemata. In:
PODS. pp. 97–109 (1984)
15. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N.,
Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia - A large-
scale, multilingual knowledge base extracted from Wikipedia. Sem. Web J. 6(2),
167–195 (2015)
16. Manola, F., Miller, E. (eds.): Resource Description Framework (RDF): Primer.
W3C Recommendation (10 February 2004), http://www.w3.org/TR/rdf-primer/
17. Miller, R.J., Ioannidis, Y.E., Ramakrishnan, R.: Schema equivalence in heteroge-
neous systems: bridging theory and practice. Inf. Syst. 19(1), 3–31 (1994)
18. Nguyen, V., Bodenreider, O., Sheth, A.: Don’t like RDF reification? Making state-
ments about statements using singleton property. In: WWW. pp. 759–770. ACM
(2014)
19. Thompson, B.B., Personick, M., Cutcher, M.: The Bigdata® RDF graph database.
In: Linked Data Management, pp. 193–237. Chapman and Hall/CRC (2014)
20. Vrandečić, D., Krötzsch, M.: Wikidata: A free collaborative knowledgebase. Comm.
ACM 57, 78–85 (2014)
21. Wilkinson, K., Sayers, C., Kuno, H.A., Reynolds, D., Ding, L.: Supporting scalable,
persistent Semantic Web applications. IEEE Data Eng. Bull. 26(4), 33–39 (2003)
22. Zimmermann, A., Lopes, N., Polleres, A., Straccia, U.: A general framework for
representing, reasoning and querying with annotated Semantic Web data. J. Web
Sem. 11, 72–95 (2012)
[Figure 3 plots not reproduced: one panel per engine (4store, BlazeGraph, GraphDB,
Jena, Virtuoso); x-axes: queries Q1–Q14; y-axes: Runtime (ms), log scale 10¹–10⁴.]
Fig. 3: Query results for all five engines and four models (log scale)
... The input, structured data, taken from available data sources (DBPedia [2,4,21,23], Wikidata [3,4,7,21,23]), is organized in different formats, from RDF triples [2,3,4,9], tables [5,7,8] (Wikipedia infoboxes, slot-value pairs), to knowledge graphs [22,24] built from RDF triples. In our paper, we refer from the works [1,27,28,29,30] to format Wikidata statements in a set of quads and triples which is possible to transform to RDF triples for creating knowledge graphs. ...
... Statements must be constructed in a clear, concise structure and have connections with other semantic structural types, such as RDF triples. To encode Wikidata as RDF, there are various ways [1,28,32,33], from standard reification, n-ary relations, singleton properties, named graphs, to property graphs. We are inspired by Hernández et. ...
... We are inspired by Hernández et. al [1] to present Wikidata statements in two tables, QUAD and TRIPLE but in a slightly different way. ...
Preprint
Full-text available
Acknowledged as one of the most successful online cooperative projects in human society, Wikipedia has obtained rapid growth in recent years and desires continuously to expand content and disseminate knowledge values for everyone globally. The shortage of volunteers brings to Wikipedia many issues, including developing content for over 300 languages at the present. Therefore, the benefit that machines can automatically generate content to reduce human efforts on Wikipedia language projects could be considerable. In this paper, we propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level. The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia. We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models. The results are helpful not only for the data-to-text generation task but also for other relevant works in the field.
... What if, for example, we want to add data that describe edges themselves, or graphs themselves? While more complex data can be modeled in directed labeled graphs using various forms of reification [14], the result can often be verbose and unintuitive [1]. Hence a wide range of graph-based models have emerged [3]: the (labeled) property graph model [1,26] has gained significant popularity in the Database community, while models such as named graphs [10] and RDF-star (RDF*) [12] have been proposed within the Semantic Web community. ...
... How can we design a graph database engine that can seamlessly ingest, integrate and query data from any such model? One possibility is to take the simplest model -directed labeled graphs -as our base and use reification [14] to represent more complex models, but as mentioned before, reification is too verbose. Another possibility is to take a more complex model -say property graphs -as our base [4,11], but this would add complexity to higher levels when we think of graph queries, analytics, learning, etc. ...
... Representing statements like this in a directed labeled graph requires some form of reification to decompose n-ary relations into binary relations [14]. Figure 2 shows a graph where e 1 and e 2 are nodes representing n-ary relationships. ...
... This process is known as reification. For a more recent comparison of various ways of achieving reification, see Hernández et al. [11]; in particular they compare query performance across four reification techniques and five SPARQL implementations. Hartig and Thompson [5] point out that, in general, reification techniques require a large number of triples and are cumbersome for querying. ...
... Hernández et al. [11] give examples of the other use of reification, where further information is being added to the triple. Here they use the term qualifier to describe such metadata. ...
... Of relevance to this current study, questions with reverse predicates were answered less accurately and more slowly than analogous questions. The American philosopher C.S. Peirce, distinguished between symbolic signs, where representation depends on convention, and iconic signs, where representation depends on structural similarity [24] 11 . In SPARQL, the use of ^ for reverse predicates is symbolic; whereas in Cypher the use of left-to-right and right-to-left arrows, to indicate the direction of the edge, is iconic. ...
Preprint
Full-text available
This study compares participant acceptance of the property graph and edge-labelled graph paradigms, as represented by Cypher and the proposed extensions to the W3C standards, RDF* and SPARQL*. In general, modelling preferences are consistent across the two paradigms. When presented with location information, participants preferred to create nodes to represent cities, rather than use metadata; although the preference was less marked for Cypher. In Cypher, participants showed little difference in preference between representing dates or population size as nodes. In RDF*, this choice was not necessary since both could be represented as literals. However, there was a significant preference for using the date as metadata to describe a triple containing population size, rather than vice versa. There was no significant difference overall in accuracy of interpretation of queries in the two paradigms; although in one specific case, the use of a reverse arrow in Cypher was interpreted significantly more accurately than the ^ symbol in SPARQL. Based on our results and on the comments of participants, we make some recommendations for modellers. Techniques for reifing RDF have attracted a great deal of research. Recently, a hybrid approach, employing some of the features of property graphs, has claimed to offer an improved technique for RDF reification. Query-time reasoning is also a requirement which has prompted a number of proposed extensions to SPARQL and which is only possible to a limited extent in the property graph paradigm. Another recent development, the hypergraph paradigm enables more powerful query-time reasoning. There is a need for more research into the user acceptance of these various more powerful approaches to modelling and querying. Such research should take account of complex modelling situations.
... Property graphs are most prominently used in popular graph databases, such as Neo4j [16,354]. In choosing between graph models, it is important to note that property graphs can be translated to/from directed edge-labelled graphs without loss of information [18,235] (per, e.g., Figure 4). In summary, directed-edge labelled graphs offer a more minimal model, while property graphs offer a more flexible one. ...
... While we could use the pattern of turning the edge into a node -as illustrated in Figure 3 -to directly represent such context, another option is to use reification, which allows for making statements about statements in a generic manner (or in the case of a graph, for defining edges about edges). In Figure 18 we present three forms of reification that can be used for modelling temporal context on the aforementioned edge within a directed edge-labelled graph [235]. We use to denote an arbitrary identifier representing the edge itself to which the contextual information can be associated. ...
... In general, a reified edge does not assert the edge it reifies; for example, we may reify an edge to state that it is no longer valid. We refer to the work of Hernández et al. [235] for further comparison of reification alternatives and their relative strengths and weaknesses. ...
... In [30], the authors confirm that the reification is the only standard way to represent annotation for RDF Triples and still the only one compatible with all the RDF Data Repositories for the fact that it does not require any extensions. There exist other abbreviated variants of standard reification, consisting of representing each triple <s,p,o> with two triples as <:i,rdf:subject,:s>, <:i,:p,:o> with the subject of the statement in one triple and the property-object pair in the other [31]. ...
Thesis
Full-text available
The Semantic Web has evolved to reach different applications. It was designed to enable machines to understand available resources on the World Wide Web and use the extracted information in the decision-making and reasoning processes. Hence, the Web is an open world where people can say whatever they want, and users -in this case, people and machines- can relate to it.One of the main challenges nowadays is to deal with information from multiple and mostly unreliable sources. The Linked Open Data is luckily presented in a machine-readable format. Whether automatically extracted from Web documents, directly introduced from data sources, or inferred by reasoning processes, the Linked Open Data can be outdated, incorrect, incomplete, vague, ambiguous, or more generally, uncertain. Dealing with Uncertain Linked Data faces multiple challenges like uncertainty qualification (or quantification), calculus, representation, and in the Semantic Web case: publishing and reusability. In this thesis, we analyze the existing results treating uncertainty in the Semantic Web. To introduce the proper terminology, we describe in Chapter 2 the preliminary notions related to uncertainty and the Semantic Web. We provide an overview of the technologies used in the Semantic Web stack and the limits of uncertain data. Afterward, we deliver in Chapter 3 a representation for uncertainty on the Semantic Web. We discuss our contribution of the “Uncertainty ontology” and the methods for annotating statements with uncertainty. Chapter 4 discusses uncertainty management and access in a contextualized view and the reading of uncertainty inside contexts. For sources without explicit uncertainty information, we present a framework in Chapter 5 enabling the evaluation of uncertainty based on syntactical and semantic similarities with entities from a reference source and within a specific use case. We conclude with a discussion about the dialogue between data sources and the consensual knowledge in the Semantic Web. We follow that with our perception of the reality and perspectives of this research
... P50, P577, P135) and objects, which can be either a literal like "1899" or another item like Q36180, by a series of statements, each providing one fact or information about the item. Table 2 gives several examples of sentences in natural language, their annotation with Wikidata IDs and finally their encoding as Wikidata statements, using reifying for RDF triples (Hernández, Hogan, and Krötzsch 2015) where the same subject is not repeated. Statements with the same subject are separated by a semicolon ";" and the last one is finished by a dot ".". ...
Chapter
We propose WDBench: a query benchmark for knowledge graphs based on Wikidata, featuring real-world queries extracted from the public query logs of the Wikidata SPARQL endpoint. While a number of benchmarks for graph databases (including SPARQL engines) have been proposed in recent years, few are based on real-world data, even fewer use real-world queries, and fewer still allow for comparing SPARQL engines with (non-SPARQL) graph databases. The raw Wikidata query log contains millions of diverse queries, where it would be prohibitively costly to run all such queries, and difficult to draw conclusions given the mix of features that these queries use. WDBench thus focuses on three main query features that are common to SPARQL and graph databases: (i) basic graph patterns, (ii) optional graph patterns, (iii) path patterns, and (iv) navigational graph patterns. We extract queries from the Wikidata logs specifically to test these patterns, clean them of non-standard features, remove duplicates, classify them into different structural subsets, and present them in two different syntaxes. Using this benchmark, we present and compare performance results for evaluating queries using Blazegraph, Jena/Fuseki, Virtuoso and Neo4j.
Article
Full-text available
The aim of this paper is to study the technological implementation of the emergent bibliographic models (IFLA LRM in particular) taking into account one of the most widespread platforms for the semantic web, namely Wikibase. Different initiatives of implementation of LRM have been taken into account, included: a) a prototype cataloging interface; b)the implementation of the new cataloging system for the Bibliothéque Nationale de France (Bnf); c) a test of use of one of the feature of the Wikibase data model - namely the "qualifier"- to find a sustainable solution for the LRM nomen entity. Wikibase and his data model cannot be considered the magical unicorn that solves all problems. More in-depth analysis and tests are needed, but – as an intermediate result – we can consider Wikibase a promising platform also in the bibliographic field with a low entry barrier.
Technical Report
Both the notion of Property Graphs (PG) and the Resource Description Framework (RDF) are commonly used models for representing graph-shaped data. While there exist some system-specific solutions to convert data from one model to the other, these solutions are not entirely compatible with one another and none of them appears to be based on a formal foundation. In fact, for the PG model, there does not even exist a commonly agreed-upon formal definition. The aim of this document is to reconcile both models formally. To this end, the document proposes a formalization of the PG model and introduces well-defined transformations between PGs and RDF. As a result, the document provides a basis for the following two innovations: On one hand, by implementing the RDF-to-PG transformations defined in this document, PG-based systems can enable their users to load RDF data and make it accessible in a compatible, system-independent manner using, e.g., the graph traversal language Gremlin or the declarative graph query language Cypher. On the other hand, the PG-to-RDF transformation in this document enables RDF data management systems to support compatible, system-independent queries over the content of Property Graphs by using the standard RDF query language SPARQL. Additionally, this document represents a foundation for systematic research on relationships between the two models and between their query languages.
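As a rough sketch of the kind of mapping such transformations define, the Turtle below represents a property-graph edge carrying its own key/value attributes by introducing a dedicated resource for the edge. The vocabulary (ex:source, ex:target, ex:since) is illustrative and does not reproduce the actual transformation defined in that document.

    @prefix ex: <http://example.org/> .

    # Property-graph edge:  (alice) -[:KNOWS {since: 2010}]-> (bob)
    ex:edge1 a ex:KNOWS ;
             ex:source ex:alice ;
             ex:target ex:bob ;
             ex:since  "2010" .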
Conference Paper
Graph Databases are gaining popularity owing to the pervasiveness of graph data in social networks, physical sciences, networking, and web applications. A majority of these databases are based on the property graph model, which is characterized as key/value-based, directed, and multi-relational. In this paper, we consider the problem of supporting property graphs as RDF in Oracle Database. We introduce a property graph to RDF transformation scheme. The main challenge lies in representing the key/value properties of property graph edges in RDF. We propose three models: 1) named graph based, 2) subproperty based, and 3) (extended) reification based, all of which can be supported with RDF capabilities in Oracle Database. These models are evaluated with respect to ease of SPARQL query formulation, join complexities, skewness in generated RDF data, query performance, and storage overhead. An experimental study with a real-life Twitter social network dataset on Oracle Database 12c demonstrates the feasibility of representing property graphs as RDF and presents a quantitative performance comparison of the proposed models.
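To give a flavour of one of these options, the TriG snippet below sketches the named-graph-based model: the triple for an edge is placed in its own named graph, and the edge's key/value properties are attached to the graph name. The vocabulary is illustrative and not the exact scheme used in that paper.

    @prefix ex: <http://example.org/> .

    # Edge triple, isolated in its own named graph:
    ex:g1 {
      ex:alice ex:follows ex:bob .
    }

    # Edge properties, asserted about the graph name in the default graph:
    ex:g1 ex:since "2010" ;
          ex:weight 5 .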
Technical Report
This document defines extensions of the RDF data model and of the SPARQL query language that capture an alternative approach to represent statement-level metadata. While this alternative approach is backwards compatible with RDF reification as defined by the RDF standard, the approach aims to address usability and data management shortcomings of RDF reification. One of the great advantages of the proposed approach is that it clarifies a means to (i) understand sparse matrices, the property graph model, hypergraphs, and other data structures with an emphasis on link attributes, (ii) map such data onto RDF, and (iii) query such data using SPARQL. Further, the proposal greatly expands both the freedom that database designers enjoy when creating physical indexing schemes and query plans for graph data annotated with link attributes and the interoperability of those database solutions.
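Assuming this report describes the approach now commonly known as RDF* (RDF-star) and SPARQL*, the sketch below shows the nested-triple syntax for attaching metadata directly to a statement and querying it; the example data and the :certainty property are made up for illustration.

    @prefix : <http://example.org/> .

    # Statement-level metadata attached directly to the triple (Turtle-star style):
    << :bob :bornIn :Paris >> :certainty 0.9 .

    # Matching the annotation in a query (SPARQL-star style):
    PREFIX : <http://example.org/>
    SELECT ?c WHERE { << :bob :bornIn :Paris >> :certainty ?c . }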
Conference Paper
Statements about RDF statements, or meta triples, provide additional information about individual triples, such as the source, the time or place of occurrence, or the certainty. Integrating such meta triples into semantic knowledge bases would enable the querying and reasoning mechanisms to be aware of provenance, time, location, or certainty of triples. However, an efficient RDF representation for such meta knowledge of triples remains challenging. The existing standard reification approach allows such meta knowledge of RDF triples to be expressed using RDF in two steps. The first step is representing the triple by a Statement instance whose subject, predicate, and object are indicated separately in three different triples. The second step is creating assertions about that instance as if it were the statement itself. While reification is simple and intuitive, this approach does not have formal semantics and, as noted in the RDF Primer, is not commonly used in practice. In this paper, we propose a novel approach called Singleton Property for representing statements about statements and provide a formal semantics for it. We explain how this singleton property approach fits well with the existing syntax and formal semantics of RDF, and the syntax of the SPARQL query language. We also demonstrate the use of singleton properties in the representation and querying of meta knowledge in two examples of Semantic Web knowledge bases: YAGO2 and BKR. Our experiments on the BKR show that the singleton property approach gives decent performance in terms of number of triples, query length and query execution time compared to existing approaches. This approach, which is also simple and intuitive, can be easily adopted for representing and querying statements about statements in other knowledge bases.
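The contrast between the two approaches described above can be sketched in Turtle as follows. The data is made up, and the singletonPropertyOf term is shown here in an example namespace rather than exactly as written in that paper.

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ex:  <http://example.org/> .

    # Standard reification: the statement becomes a resource described by three triples,
    # to which meta triples (here, a source) can then be attached.
    ex:stmt1 a rdf:Statement ;
             rdf:subject   ex:bob ;
             rdf:predicate ex:bornIn ;
             rdf:object    ex:Paris ;
             ex:source     ex:doc42 .

    # Singleton property: a one-off property instance stands in for the statement.
    ex:bob <http://example.org/bornIn#1> ex:Paris .
    <http://example.org/bornIn#1> ex:singletonPropertyOf ex:bornIn ;
                                  ex:source ex:doc42 .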
Article
The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely available on the Web using Semantic Web and Linked Data technologies. The project extracts knowledge from 111 different language editions of Wikipedia. The largest DBpedia knowledge base, which is extracted from the English edition of Wikipedia, consists of over 400 million facts that describe 3.7 million things. The DBpedia knowledge bases that are extracted from the other 110 Wikipedia editions together consist of 1.46 billion facts and describe 10 million additional things. The DBpedia project maps Wikipedia infoboxes from 27 different language editions to a single shared ontology consisting of 320 classes and 1,650 properties. The mappings are created via a world-wide crowd-sourcing effort and enable knowledge from the different Wikipedia editions to be combined. The project publishes releases of all DBpedia knowledge bases for download and provides SPARQL query access to 14 out of the 111 language editions via a global network of local DBpedia chapters. In addition to the regular releases, the project maintains a live knowledge base which is updated whenever a page in Wikipedia changes. DBpedia sets 27 million RDF links pointing into over 30 external data sources and thus enables data from these sources to be used together with DBpedia data. Several hundred data sets on the Web publish RDF links pointing to DBpedia themselves and make DBpedia one of the central interlinking hubs in the Linked Open Data (LOD) cloud. In this system report, we give an overview of the DBpedia community project, including its architecture, technical implementation, maintenance, internationalisation, usage statistics and applications.
Conference Paper
Wikidata is the central data management platform of Wikipedia. By the efforts of thousands of volunteers, the project has produced a large, open knowledge base with many interesting applications. The data is highly interlinked and connected to many other datasets, but it is also very rich, complex, and not available in RDF. To address this issue, we introduce new RDF exports that connect Wikidata to the Linked Data Web. We explain the data model of Wikidata and discuss its encoding in RDF. Moreover, we introduce several partial exports that provide more selective or simplified views on the data. This includes a class hierarchy and several other types of ontological axioms that we extract from the site. All datasets we discuss here are freely available online and updated regularly.
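As an example of the kind of ontological axiom such a partial export can contain, the Turtle below shows one plausible rendering of a Wikidata "subclass of" (P279) claim as an RDFS axiom; the specific items and the exact vocabulary used by the actual exports may differ.

    @prefix wd:   <http://www.wikidata.org/entity/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # Illustrative: a Wikidata "subclass of" (P279) claim rendered as an RDFS axiom
    wd:Q146 rdfs:subClassOf wd:Q39201 .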
Article
Wikidata allows every user to extend and edit the stored information, even without creating an account. A form-based interface makes editing easy. Wikidata's goal is to allow data to be used both in Wikipedia and in external applications. Data is exported through Web services in several formats, including JavaScript Object Notation, or JSON, and Resource Description Framework, or RDF. Data is published under legal terms that allow the widest possible reuse. The value of Wikipedia's data has long been obvious, with many efforts to use it. The Wikidata approach is to crowdsource data acquisition, allowing a global community to edit the data. This extends the traditional wiki approach of allowing users to edit a website. In March 2013, Wikimedia introduced Lua as a scripting language for automatically creating and enriching parts of articles. Lua scripts can access Wikidata, allowing Wikipedia editors to retrieve, process, and display data. Many other features were introduced in 2013, and development is planned to continue for the foreseeable future.