Conference Paper

Querying Distributed RDF Data Sources with SPARQL

Bastian Quilitz and Ulf Leser
Humboldt-Universität zu Berlin
{quilitz,leser}@informatik.hu-berlin.de
Abstract. Integrated access to multiple distributed and autonomous
RDF data sources is a key challenge for many semantic web applications.
As a reaction to this challenge, SPARQL, the W3C Recommendation
for an RDF query language, supports querying of multiple RDF graphs.
However, the current standard does not provide transparent query
federation, which makes query formulation hard and lengthy. Furthermore,
current implementations of SPARQL load all RDF graphs mentioned in
a query to the local machine. This usually incurs a large overhead in
network traffic, and is sometimes simply impossible for technical or
legal reasons. To overcome these problems we present DARQ, an engine
for federated SPARQL queries. DARQ provides transparent query
access to multiple SPARQL services, i.e., it gives the user the impression
of querying a single RDF graph even though the real data is distributed
on the web. A service description language enables the query engine to
decompose a query into sub-queries, each of which can be answered by
an individual service. DARQ also uses query rewriting and cost-based
query optimization to speed up query execution. Experiments show that
these optimizations significantly improve query performance even when
only a very limited amount of statistical information is available. DARQ
is available under the GPL license at http://darq.sf.net/.
1 Introduction
Many semantic web applications require the integration of data from distributed,
autonomous data sources. Until recently it was rather difficult to access and
query data in such a setting because there was no standard query language
or interface. With SPARQL [1], a W3C Recommendation for an RDF query
language and protocol, this situation has changed. It is now possible to make
RDF data available through a standard interface and query it using a standard
query language. The data need not be stored in RDF but can be created on
the fly, e.g., from relational databases or other non-RDF data sources (see D2R
Server¹ and SquirrelRDF²). We expect that more and more content providers
will make their data available via a SPARQL endpoint. Nevertheless, it is still
difficult to integrate data from multiple data sources. RDF data integration is
often done by loading all data into a single repository and querying the merged
data locally. In many cases this will not be feasible for legal or technical reasons.
Often it will not be permitted to create copies of the whole data source due to
copyright issues. Possible technical reasons are that local copies are not up to
date if the data sources change frequently, that data sources are too big, or that
the RDF instances are created on the fly from non-RDF data, such as relational
databases, web services, or even websites. This clearly shows the need for virtual
integration of RDF datasets.
¹ D2R Server: http://www.wiwiss.fu-berlin.de/suhl/bizer/d2r-server/
² SquirrelRDF: http://jena.sf.net/SquirrelRDF/
In this paper, we present DARQ³, a query engine for federated SPARQL
queries. It provides transparent query access to multiple, distributed endpoints as
if querying a single RDF graph. We introduce service descriptions that describe
the capabilities of SPARQL endpoints and a query optimization algorithm that
builds a cost-effective query plan considering limitations on access patterns [2].
Sources with limited access patterns require some variables in a query to
be bound; otherwise they fail to answer the query.
Related work. Data integration has been a research topic in the field
of database systems for a long time. Systems providing a single interface to
many underlying data sources are generally called federated information
systems [3]. Solutions range from multi-database query languages (MDBQL) such
as SchemaSQL [4] to federated databases [5] to mediator-based information
systems (MBIS) [6]. While multi-database query languages require that the user
explicitly specifies the data sources used in the query, MBIS hide the federation
from the user by providing a single, unified schema. In this terminology, SPARQL
can currently be considered an MDBQL for RDF that allows the user to specify
the graphs to be used in the query. In contrast, DARQ offers source transparency
to the user, but unlike MBIS it does not assume an integrated schema.
In [7] Stuckenschmidt et al. theoretically describe how to extend the Sesame
RDF repository to support distributed SeRQL queries over multiple Sesame RDF
repositories. They use a special index structure to determine the relevant sources
for a query. To this end, they restricted themselves to path queries. In [8] the
authors describe a system for SPARQL queries over multiple relational databases.
To the best of our knowledge, there exists no system that supports SPARQL query
federation for multiple regular SPARQL endpoints. Also, none of the described
systems uses service descriptions to declaratively describe the data sources, nor
do they support limitations on access patterns. A special characteristic of DARQ
is that it strongly relies on standards and is compatible with any endpoint that
supports the SPARQL standards. No cooperation is needed from the endpoints
beyond support for the SPARQL protocol.
Research on query optimization for SPARQL includes query rewriting [9] and
basic reordering of triple patterns based on their selectivity [10]. Optimization
for queries on local repositories has also focused on the use of specialized
indices for RDF or efficient storage in relational databases, e.g. [11, 12]. However,
none of the approaches targets SPARQL queries across multiple sources. There
has been a lot of research on query optimization in the context of databases
and federated information systems. An excellent overview of distributed query
processing techniques can be found in [13]. In this paper, we show that existing
techniques from relational systems, such as query rewriting and cost-based
optimization for join ordering, can be adapted to federated SPARQL. We also
propose a way to estimate the result sizes of SPARQL queries with very
little statistical information.
³ Distributed ARQ, as an extension to ARQ (http://jena.sourceforge.net/ARQ/)
SELECT ?name ?mbox WHERE {
  ?x foaf:name ?name .
  ?x foaf:mbox ?mbox .
  FILTER (regex(?name, "^Tim") && regex(?mbox, "w3c"))
}
ORDER BY ?name LIMIT 5
Listing 1.1. Example SPARQL Query
Structure of this paper. The rest of the paper is structured as follows.
Section 2 gives a brief introduction to the SPARQL query language. In Section
3 we show the architecture of DARQ, introduce service descriptions, and
describe the query planning and optimization algorithms used in our
current implementation. We show initial results of the evaluation of the system
in Section 4 and conclude and discuss future directions in Section 5.
2 Preliminaries
Before we describe our work on federated queries we give a short introduction
to the SPARQL query language and the operators of a SPARQL query that
are considered for this report. For a more detailed introduction to RDF and
SPARQL we refer the interested reader to [14, 1, 15]. In the following we use the
definitions from the SPARQL Recommendation in [1].
A SPARQL query $Q$ is defined as a tuple $Q = (E, DS, R)$. The basis of a SPARQL
query is an algebra expression $E$ that is evaluated with respect to an RDF graph
in a dataset $DS$. The results of the matching process are processed according
to the definitions of the result form $R$ (SELECT, CONSTRUCT, DESCRIBE,
ASK). The algebra expression $E$ is built from different graph patterns and can
also include solution modifiers, such as PROJECTION, DISTINCT, LIMIT,
or ORDER BY.
The simplest graph pattern defined for SPARQL is the triple pattern. A triple
pattern $t$ is similar to an RDF triple but allows the usage of variables for subject,
predicate, and object:
$$t \in TP = (RDF\text{-}T \cup V) \times (I \cup V) \times (RDF\text{-}T \cup V)$$
with $RDF\text{-}T$ being the set of RDF terms (IRIs, RDF literals, and blank nodes),
$I$ being the set of all IRIs, and $V$ the set of all variables [1].
A basic graph pattern $BGP$ is defined as a set of triple patterns
$BGP = \{t_1, \ldots, t_n\}$ with $t_1, \ldots, t_n \in TP$. It matches a subgraph if all
contained triple patterns match. Basic graph patterns can be mixed with value
constraints (FILTER) and other graph patterns. The evaluation of basic graph
patterns and value constraints is order independent. This means that a structure
of two basic graph patterns $BGP_1$ and $BGP_2$ separated by a constraint $C$ can be
transformed into one equivalent basic graph pattern followed by the constraint.
We refer to a basic graph pattern followed by one or more constraints as a
filtered basic graph pattern ($FBGP$).
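To make the definition concrete, the following minimal Python sketch evaluates a basic graph pattern against a small in-memory set of triples. The representation (strings starting with `?` as variables, the helper names `match_triple` and `match_bgp`, and the sample data) is an assumption made for illustration; it is not part of SPARQL or of DARQ.

```python
def is_var(term):
    """Illustrative convention: a term is a variable if it starts with '?'."""
    return isinstance(term, str) and term.startswith("?")

def match_triple(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` matches `triple`.
    Returns the extended binding, or None if the match fails."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if result.setdefault(p_term, t_term) != t_term:
                return None  # variable already bound to a different value
        elif p_term != t_term:
            return None      # constant does not match
    return result

def match_bgp(bgp, graph):
    """Return all bindings under which every triple pattern in `bgp` matches."""
    solutions = [{}]
    for pattern in bgp:
        solutions = [ext
                     for b in solutions
                     for triple in graph
                     if (ext := match_triple(pattern, triple, b)) is not None]
    return solutions

# Hypothetical sample data
graph = {
    ("a", "foaf:name", "Tim"),
    ("a", "foaf:mbox", "tim@w3c.example"),
    ("b", "foaf:name", "Tom"),
}
bgp = [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")]
solutions = match_bgp(bgp, graph)
reversed_solutions = match_bgp(bgp[::-1], graph)
```

Only resource `a` has both a name and a mailbox, so a single solution is found, and evaluating the patterns in reverse order gives the same solution, which illustrates the order independence mentioned above.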
Example 1. Listing 1.1 shows a SPARQL query with one filtered basic graph
pattern that retrieves the names and email addresses of persons whose names
start with "Tim" and whose email addresses contain "w3c". The results are
ordered by name, and the number of results is limited to five.
SPARQL furthermore defines other types of graph patterns such as GRAPH,
UNION, or OPTIONAL. We omit these patterns here, because DARQ works on
basic graph patterns, as we will see in Section 3.2. Note, however, that the engine
is able to process all other patterns by distributing the FBGPs contained in
these patterns and doing local post-processing. This means that the FBGPs in
each of these patterns are handled separately, i.e., the scope for distribution and
cost-based optimization is always limited to one FBGP. DARQ correctly handles
the order-dependent OPTIONAL pattern, but may waste resources transferring
unnecessary results when OPTIONAL is used to express negation as failure.
3 DARQ: Federated SPARQL Queries
To provide transparent query access to multiple data sources we adopt the
architecture of mediator-based information systems [6], as shown in Figure 1. The
DARQ query engine has the role of the mediator component. Non-RDF data
sources can be wrapped with tools such as D2R and SquirrelRDF. A DARQ
query engine can itself work as a SPARQL endpoint and may be integrated by
another instance of DARQ. Data sources are described by service descriptions
(see Section 3.1). The query engine uses this information for query planning and
optimization. In contrast to MBIS, the schema is not fixed and does not need to
be specified, but is determined by the underlying data sources.
A query is processed in 4 stages:
1. Parsing. In the first stage the query string is parsed into a tree model of
SPARQL. The DARQ query engine reuses the parser shipped with ARQ.
2. Query Planning. In the second stage the query engine decomposes the
query and builds multiple sub-queries according to the information in the
service descriptions, each of which can be answered by one known data source
(see Section 3.2).
3. Optimization. In the third stage, the query optimizer takes the sub-queries
and builds an optimized query execution plan (see Section 3.3).
4. Query Execution. In the fourth stage, the query execution plan is ex-
ecuted. The sub-queries are sent to the data sources and the results are
integrated.
Fig. 1. DARQ - integration architecture
3.1 Service Descriptions
To find the relevant information sources for the different triples in a query and to
decompose the query into sub-queries the query engine needs information about
the data sources. To this end, we introduce service descriptions which provide
a declarative description of the data available from an endpoint and allow the
definition of limitations on access patterns. Furthermore, service descriptions can
include statistical information used for query optimization. Service Descriptions
are represented in RDF.
Data Description A service description describes the data available from
a data source in the form of capabilities. Capabilities define what kind of triple
patterns can be answered by the data source. The definition of capabilities is
based on predicates. The capabilities of a data source $D$ are a set $C_D$ of tuples
$c = (p, r) \in C_D$, where $p$ is a predicate existing in $D$ and $r$ is a constraint on
subjects and objects. This constraint is a regular SPARQL filter expression that
enables a more precise source selection, e.g., we can express that a data source only
stores data about specific types of resources. We denote the constraint as a
function $r(subject, object)$ with
$r : (RDF\text{-}T \cup V) \times (RDF\text{-}T \cup V) \to \{true, false\}$.
For example, the constraints can be used for horizontal partitioning. It is
possible to define a constraint stating that a service A can only answer queries for
names starting with a letter from A to R, whereas another service can answer
queries for names starting with a letter from S to Z.
Limitations on Access Patterns Some data sources have limitations on
access patterns [2]. For example, a wrapper that transforms results from a web
form into RDF may require some input values that can be entered into the form
to compute the results. Another example is a wrapper for an LDAP server, which
may require that the name of a person or their email address is always included in
the query because the server owner does not allow other queries.
DARQ supports the definition of limitations on access patterns in the service
descriptions in the form of patterns that must be included in a query. Because
predicates must be bound, we use them as the basis for the pattern definition. Let
$L_D$ be a set of limitations on access patterns for data source $D$ and
$(S, O) \in L_D$ be one pattern, with $S$ and $O$ being sets of predicates that must
have bound subjects ($S$) or bound objects ($O$).
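Source $D$ can contribute to the answer of a query with graph pattern
$P$ if $P$ satisfies at least one of the access patterns defined for $D$. Let $bound(x)$ be
a function that returns false if $x$ is a variable and true otherwise. An access
pattern $(S, O)$ is satisfied if
$$
\begin{aligned}
&(\forall p_s \in S \setminus O : \exists (s, p_s, o) \in P : bound(s)) \\
\wedge\; &(\forall p_o \in O \setminus S : \exists (s, p_o, o) \in P : bound(o)) \\
\wedge\; &(\forall p_b \in S \cap O : \exists (s, p_b, o) \in P : bound(s) \wedge bound(o))
\end{aligned}
$$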
Example 2. To come back to the example of the LDAP server, the service
description in this example would contain two access patterns, $(S_1, O_1)$ and
$(S_2, O_2)$, with $S_1 = S_2 = \emptyset$, $O_1 = \{foaf:name\}$, and $O_2 = \{foaf:mbox\}$.
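The satisfaction test for access patterns can be sketched in a few lines of Python. This is an illustrative reading of the definition, not DARQ's code: variables are assumed to be strings starting with `?`, the names `satisfies` and `source_usable` are hypothetical, and treating a source without any declared limitations as always usable is an assumption.

```python
def bound(term):
    """False for variables (strings starting with '?'), True otherwise."""
    return not (isinstance(term, str) and term.startswith("?"))

def satisfies(pattern, access_pattern):
    """Check whether graph pattern `pattern` (a list of (s, p, o) triples)
    satisfies the access pattern (S, O), where S and O are sets of predicates."""
    S, O = access_pattern

    def exists(pred, test):
        # Is there a triple pattern with predicate `pred` that passes `test`?
        return any(p == pred and test(s, o) for s, p, o in pattern)

    return (all(exists(p, lambda s, o: bound(s)) for p in S - O)
            and all(exists(p, lambda s, o: bound(o)) for p in O - S)
            and all(exists(p, lambda s, o: bound(s) and bound(o)) for p in S & O))

def source_usable(pattern, limitations):
    """A source can contribute if at least one access pattern is satisfied.
    Assumption: a source with no declared limitations is always usable."""
    return not limitations or any(satisfies(pattern, ap) for ap in limitations)

# The LDAP example: foaf:name or foaf:mbox must occur with a bound object.
ldap_limitations = [(set(), {"foaf:name"}), (set(), {"foaf:mbox"})]
q_ok = [("?x", "foaf:name", "Tim"), ("?x", "foaf:mbox", "?mbox")]
q_bad = [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")]
ok_usable = source_usable(q_ok, ldap_limitations)
bad_usable = source_usable(q_bad, ldap_limitations)
```

The first query binds the object of `foaf:name` and therefore satisfies the first access pattern; the second query leaves both objects unbound and satisfies neither.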
Statistical Information Defining statistical information about the data
available from a data source helps the query optimizer to find a cost-effective query
execution plan. Service descriptions include the total number of triples $N_s$ in
data source $D$ and optionally information for each capability $(p, r) \in C_D$: (1) the
number of triples $n_D(p)$ with the predicate $p$ in $D$, (2) the selectivity $ssel_D(p)$ of
a triple pattern with predicate $p$ if the subject is bound (default $= 1/n_D(p)$), and
(3) the selectivity $osel_D(p)$ of a triple pattern with predicate $p$ if the object is
bound (default $= 1$). We deliberately use only these simple statistics because we
expect every data source to be able to provide them, or at least rough estimates.
More precise statistics such as histograms would be preferable but will not be
available from many sources. Future work should explore what other statistics
are required for more complex cost models and how they can be estimated. In
this context, aggregate functions, such as count, could be a valuable addition to
future SPARQL versions.
RDF Representation Service descriptions are represented in RDF. Listing
1.2 shows an example service description for a FOAF data source, e.g. an LDAP
server. The data source defined in the example can answer queries for foaf:name,
foaf:mbox and foaf:weblog. Objects for a triple with predicate foaf:name must
always start with a letter from A to R. In total it stores 112 triples. The data
source has limitations on access patterns, i.e. a query must contain at least one
triple pattern with predicate foaf:name or foaf:mbox with a bound object. More
detailed examples of service descriptions can be found at http://darq.sf.net/.
[] a sd:Service ;
   sd:capability [ sd:predicate foaf:name ;
                   sd:objectFilter "REGEX(?object, \"^[A-R]\")" ;
                   sd:triples 51 ] ;
   sd:capability [ sd:predicate foaf:mbox ;
                   sd:triples 51 ] ;
   sd:capability [ sd:predicate foaf:weblog ;
                   sd:triples 10 ] ;
   sd:totalTriples "112" ;
   sd:url "EndpointURL" ;
   sd:requiredBindings [ sd:objectBinding foaf:name ] ;
   sd:requiredBindings [ sd:objectBinding foaf:mbox ] .
Listing 1.2. Example Service Description
3.2 Query Planning
When querying multiple data sources it is necessary to decide which data sources
can contribute to answering a query. The process of finding relevant sources and
feasible sub-queries is referred to as query planning. In this section we
describe the query planning algorithm used by DARQ. Query planning is based
on the information provided in the service descriptions. In the following let
$R = \{(d_1, C_1), \ldots, (d_n, C_n)\}$ be a set of data sources $d_1, \ldots, d_n$ and their
capabilities $C_1, \ldots, C_n$, where $C_i = \{(p_{i,1}, r_{i,1}), \ldots, (p_{i,m}, r_{i,m})\}$.
Source Selection A SPARQL query contains one or more filtered basic graph
patterns, each containing the actual triple patterns. Query planning is performed
separately for each filtered basic graph pattern. The algorithm for finding the
relevant data sources for a query simply matches all triple patterns against the
capabilities of the data sources. The matching compares the predicate in a triple
pattern with the predicate defined for a capability and evaluates the constraint
for subject and object. Because matching is based on predicates, DARQ currently
only supports queries with bound predicates.
Let $BGP$ be the set of triple patterns in a filtered basic graph pattern. The
result of the source selection is a set of data sources $D_j$ for each triple pattern
$t_j = (s_j, p_j, o_j) \in BGP$ with
$$D_j = \{ d \mid (d, C) \in R \wedge \exists (p_j, r) \in C : r(s_j, o_j) = true \}$$
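As a rough illustration of this definition, the sketch below matches triple patterns against a small registry of capabilities. The registry contents, the helper names, and in particular the decision that an unbound object can never rule a source out are assumptions made for this example; the text does not specify how DARQ evaluates a constraint on an unbound variable.

```python
import re

def is_var(term):
    """Illustrative convention: variables start with '?'."""
    return term.startswith("?")

def object_regex(regex):
    """Constraint r(s, o) on the object. Assumption: an unbound object
    cannot exclude the source, so the constraint holds for variables."""
    return lambda s, o: is_var(o) or re.match(regex, o) is not None

def always(s, o):
    """Trivial constraint that accepts every subject and object."""
    return True

def select_sources(bgp, registry):
    """Map each triple pattern t_j to D_j: the sources that have a capability
    (p, r) whose predicate equals p_j and whose constraint r(s_j, o_j) holds."""
    return {
        (s, p, o): {d for d, capabilities in registry.items()
                    if any(cp == p and r(s, o) for cp, r in capabilities)}
        for s, p, o in bgp
    }

# Hypothetical registry with horizontally partitioned names
registry = {
    "A": [("foaf:name", object_regex(r"^[A-R]")), ("foaf:mbox", always)],
    "B": [("foaf:name", object_regex(r"^[S-Z]"))],
}
bgp = [("?x", "foaf:name", "Tim"), ("?x", "foaf:name", "?n"), ("?x", "foaf:mbox", "?m")]
selection = select_sources(bgp, registry)
```

With this registry, the pattern with the constant name "Tim" is routed only to source B, the pattern with an unbound name goes to both sources, and the mailbox pattern goes only to A.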
Building Sub-Queries The results from source selection are used to build
sub-queries that can be answered by the data sources. Sub-queries consist of one
filtered basic graph pattern per data source. We represent a sub-query as a triple
$(T, C, d)$, where $T$ is a set of triple patterns, $C$ is a set of value constraints, and
$d$ is the data source that can answer the sub-query. Algorithm 1 shows how the
sub-queries are generated. If a triple pattern matches exactly one data source
($D_i = \{d\}$), the triple pattern is added to the set of a sub-query for this data source.
All triple patterns in this set can later be sent to the data source in one sub-query.
If a triple pattern matches multiple data sources, it must be sent individually to all
matching data sources in separate sub-queries.
Example 3. Let A and B be two data sources, each with the capabilities
(name, true) and (mbox, true). A stores the triple (a, name, "Tim"), B stores the
triple (a, mbox, "Tim@x.y"). The query shown in Listing 1.1 will return no
results if it is sent to A and B with both triple patterns, but returns the correct
result if the triple patterns are sent in separate sub-queries and the results are
joined afterwards.
Algorithm 1 Sub-query generation
Require: T = {t1, .., tn}, // set of triple patterns
         D = {D1, .., Dn} // sets of data sources matching the triple patterns
queries = ∅, separateQueries = ∅
for each ti ∈ T do
  if Di = {d} then
    q = queries.getQuery(d)
    if q not null then
      q.T = q.T + ti
    else
      queries = queries + ({ti}, {}, d)
    end if
  else
    for each dj ∈ Di do
      separateQueries = separateQueries + ({ti}, {}, dj)
    end for
  end if
end for
return queries ∪ separateQueries // Return all queries
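Algorithm 1 translates almost directly into Python. The following is a sketch under simple assumptions: triple patterns are tuples, data sources are identified by strings, a sub-query is a tuple `(T, C, d)` with lists for `T` and `C`, and the function name `build_subqueries` is hypothetical.

```python
def build_subqueries(triples, sources):
    """triples: list of triple patterns t_i.
    sources: parallel list of sets D_i of data sources matching t_i.
    Returns a list of sub-queries (T, C, d)."""
    queries = {}           # one grouped sub-query per exclusively matching source
    separate_queries = []  # one sub-query per (triple, source) otherwise
    for t, matching in zip(triples, sources):
        if len(matching) == 1:
            (d,) = matching
            # Create the sub-query for d on first use, then add the triple pattern
            queries.setdefault(d, ([], [], d))[0].append(t)
        else:
            # Several candidate sources: ask each one separately
            # (sorted only to make the output order deterministic)
            for d in sorted(matching):
                separate_queries.append(([t], [], d))
    return list(queries.values()) + separate_queries

t1 = ("?x", "foaf:name", "?name")
t2 = ("?x", "foaf:mbox", "?mbox")
t3 = ("?x", "foaf:weblog", "?w")
subqueries = build_subqueries([t1, t2, t3], [{"A"}, {"A", "B"}, {"A"}])
```

Here `t1` and `t3` match only source A and end up in one grouped sub-query, while `t2` matches both sources and is sent to A and B in separate sub-queries.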
3.3 Optimization
After query planning the query plan consists of multiple sub-queries. The task of
the query optimizer is to build a feasible and cost-effective query execution plan
considering limitations on the access patterns. To build the plan we use logical
and physical query optimization.
Logical Optimization Logical query optimization uses equivalences of query
expressions to transform a logical query plan into an equivalent query plan that
is likely to be executed faster or at lower cost. The current implementation of
DARQ uses logical query optimization in two ways. First, we use rules based on
the results in [15] to rewrite the original query before query planning so that
basic graph patterns are merged whenever possible and variables are replaced by
constants from filter expressions.
Example 4. Listing 1.3 shows the original query submitted by the user. There
are two separate Basic Graph Patterns, each with one triple pattern. In the
rewritten query that is shown in Listing 1.4 the two patterns are merged. Also,
variables that occur in filters with an equal operator are substituted. In our
example, ?name is substituted by "Tim".
SELECT ?mbox WHERE {
  { ?x foaf:name ?name . }
  FILTER (?name = "Tim"
          && regex(?mbox, "w3c"))
  { ?x foaf:mbox ?mbox . }
}
Listing 1.3. Query before rewriting
SELECT ?mbox WHERE {
  ?x foaf:name "Tim" .
  ?x foaf:mbox ?mbox .
  FILTER regex(?mbox, "w3c")
}
Listing 1.4. Query after rewriting
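The substitution step of Example 4 can be illustrated with a small Python sketch. It assumes that the basic graph patterns have already been merged into one list of triple patterns and that the filter has been split into conjuncts, where an equality is encoded as the tuple `('=', variable, constant)` and every other conjunct is an opaque string. This encoding and the function name are hypothetical. A real rewriter would also have to respect SPARQL's value-based `=` semantics and keep bindings for variables that appear in the SELECT clause.

```python
def is_equality(conjunct):
    """An equality conjunct is encoded as ('=', variable, constant)."""
    return isinstance(conjunct, tuple) and conjunct[0] == "="

def substitute_equalities(triples, filters):
    """Replace variables that a filter equates with a constant and
    drop those equality conjuncts from the filter."""
    constants = {f[1]: f[2] for f in filters if is_equality(f)}
    rewritten = [tuple(constants.get(term, term) for term in t) for t in triples]
    remaining = [f for f in filters if not is_equality(f)]
    return rewritten, remaining

# Merged triple patterns and filter conjuncts of Listing 1.3
triples = [("?x", "foaf:name", "?name"), ("?x", "foaf:mbox", "?mbox")]
filters = [("=", "?name", '"Tim"'), 'regex(?mbox, "w3c")']
rewritten, remaining = substitute_equalities(triples, filters)
```

The result corresponds to Listing 1.4: `?name` is replaced by the constant and only the regular-expression filter remains.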
Second, we move value constraints into the sub-queries where possible to reduce the
size of intermediate results as early as possible. Let $Q = (T, C, d)$ be a sub-query
and $FBGP = (T', C')$ a filtered basic graph pattern. The value constraint $C'$ can
be moved to the sub-query if all variables in the constraint are also used in the
triple patterns in the sub-query. Filters that contain variables from more than
one sub-query and that cannot be split using a limited set of rules are applied
locally inside the DARQ query engine.
Example 5. Listing 1.1 shows a query with a conjunctive filter on two attributes.
Let us assume that the two triple patterns are split into two sub-queries for
services A and B. In this case, the single filter cannot be moved into the sub-
queries because one of the variables would be unbound. However, to benefit
from filtering at the remote site the conjunction can be split into two filters
FILTER regex(?name, "^Tim") and FILTER regex(?mbox, "w3c") that can
then be moved into the sub-queries. If the optimizer is not able to split a filter
using its limited set of rewriting rules, it will apply the filter locally, inside
DARQ, as soon as all used variables are bound.
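The pushdown rule from Example 5 can be sketched as follows. The sketch assumes the conjunctive filter has already been split at `&&` into conjuncts, each given as a pair of the expression text and the set of variables it uses. The function name `push_filters` and this representation are hypothetical, and DARQ's actual rewriting rules are not shown in the text.

```python
def variables(triple):
    """Variables of a triple pattern (illustrative convention: '?' prefix)."""
    return {term for term in triple if term.startswith("?")}

def push_filters(subqueries, conjuncts):
    """subqueries: list of (T, C, d) with T, C as lists.
    conjuncts: list of (expression, set_of_variables).
    A conjunct is moved into every sub-query whose triple patterns use all of
    its variables; the rest is returned for local evaluation."""
    local = []
    for expr, used in conjuncts:
        pushed = False
        for T, C, d in subqueries:
            sub_vars = set().union(*(variables(t) for t in T))
            if used <= sub_vars:
                C.append(expr)  # filter is evaluated at the remote site
                pushed = True
        if not pushed:
            local.append(expr)  # needs variables from several sub-queries
    return local

sub_a = ([("?x", "foaf:name", "?name")], [], "A")
sub_b = ([("?x", "foaf:mbox", "?mbox")], [], "B")
conjuncts = [
    ('regex(?name, "^Tim")', {"?name"}),
    ('regex(?mbox, "w3c")', {"?mbox"}),
    ("?name != ?mbox", {"?name", "?mbox"}),  # extra conjunct, made up for illustration
]
local_filters = push_filters([sub_a, sub_b], conjuncts)
```

The two regular-expression filters move to services A and B respectively, while the made-up third conjunct spans both sub-queries and stays with the mediator.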
Physical Optimization Physical query optimization has the goal of finding the
'best' query execution plan among all possible plans and uses a cost model to
compare different plans. In the case of federated queries with distributed sources,
network latency and bandwidth have the highest influence on query execution
time. Thus, the main goal in our system is to reduce the amount of transferred
data and to reduce the number of transmissions, which will lead to lower transfer
costs and faster query execution. We use the expected result size as the cost
factor of sub-queries.
We use iterative dynamic programming for optimization considering limitations
on access patterns. Currently, we support two join implementations:
nested-loop join ($\bowtie$) The nested-loop join is the simplest join
implementation. For every binding in the outer relation, we scan the inner relation
and add the bindings that match the join condition to the result set.
bind join ($\bowtie_B$) The bind join was introduced in [16]. Basically it is a nested-
loop join where intermediate results from the outer relation are passed to the
inner to be used as a filter. This means that DARQ sends out the sub-query for
the inner relation multiple times with the join variables bound. We use the
bind join for data sources with limitations on access patterns. Furthermore,
it can help to drastically reduce the transfer costs if the unbound query
would return a large result set.
We calculate the result size of joins with
$$|R(q_1 \bowtie q_2)| = |R(q_1)| \cdot |R(q_2)| \cdot sel_{12}$$
where $q_1$ and $q_2$ are the joined query plan elements, i.e., sub-queries or joins,
$|R(q)|$ is the result size of $q$, and $sel_{12}$ is a selectivity factor for the join attributes.
For DARQ, we currently set $sel_{12} = 0.5$ because the current statistics in the
service descriptions do not provide enough information for a better estimation.
The (transfer) cost of a nested-loop join is estimated as
$$C(q_1 \bowtie q_2) = |R(q_1)| \cdot c_t + |R(q_2)| \cdot c_t + 2 c_r$$
while the cost of a bind join is estimated as
$$C(q_1 \bowtie_B q_2) = |R(q_1)| \cdot c_t + |R(q_1)| \cdot c_r + |R(q_2')| \cdot c_t$$
with $c_t$ and $c_r$ being the transfer costs for one result tuple⁴ and one query,
respectively, and $q_2'$ being the query with variables bound with values of a result
tuple from $q_1$.
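The three formulas can be written down as plain Python functions to see how the two join strategies compare. The numbers in the example (result sizes and the values of the transfer cost per tuple and per query) are invented for illustration, and the estimated size of the bound query is simply passed in as a parameter here.

```python
def join_result_size(r1, r2, sel=0.5):
    """|R(q1 join q2)| = |R(q1)| * |R(q2)| * sel12, with sel12 = 0.5 as in the text."""
    return r1 * r2 * sel

def nested_loop_cost(r1, r2, c_t, c_r):
    """Both result sets are transferred, and two queries are sent."""
    return r1 * c_t + r2 * c_t + 2 * c_r

def bind_join_cost(r1, r2_bound, c_t, c_r):
    """Outer results are transferred, one query is sent per outer result,
    and the results of the bound inner query are transferred."""
    return r1 * c_t + r1 * c_r + r2_bound * c_t

# Made-up figures: a small outer result and a large unbound inner result
r1, r2, r2_bound = 10, 100_000, 20
c_t, c_r = 1, 5
nlj = nested_loop_cost(r1, r2, c_t, c_r)
bj = bind_join_cost(r1, r2_bound, c_t, c_r)
size = join_result_size(r1, r2)
```

With these figures the nested-loop join costs 100020 units and the bind join 80, which matches the remark that a bind join can drastically reduce transfer costs when the unbound inner query would return a large result set.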
Query result size estimation The result size estimation for a sub-query is based
on the statistics provided in the service descriptions. Currently, service
descriptions include for each capability $(p, r) \in C_d$ of service $d$: (1) the number of
triples $n_d(p)$ with the predicate $p$ in data source $d$, (2) the average selectivity
$ssel_d(p)$ if the subject is bound, and (3) the average selectivity $osel_d(p)$ if the
object is bound. With this information we estimate the result size of a query
with a single triple pattern $(s, p, o)$ that is sent to a service $d$ using the function
$costs_d : TP \times V \to \mathbb{N}$ with
$$
costs_d((s, p, o), b) =
\begin{cases}
n_d(p) & \text{if } \neg bound(s, b) \wedge \neg bound(o, b), \\
n_d(p) \cdot osel_d(p) & \text{if } \neg bound(s, b) \wedge bound(o, b), \\
n_d(p) \cdot ssel_d(p) & \text{if } bound(s, b) \wedge \neg bound(o, b), \\
0.5 & \text{if } bound(s, b) \wedge bound(o, b),
\end{cases}
$$
where $b$ is a set of previously bound variables and $bound(x, b)$ is a function
that returns true if $x$ is bound given the bound variables in $b$ and false otherwise.
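A minimal sketch of this case distinction in Python follows. The statistics dictionary, its keys, and the function names are hypothetical; the default selectivities are the ones stated for service descriptions in Section 3.1 (1/n for a bound subject, 1 for a bound object).

```python
def bound(term, b):
    """A term is bound if it is a constant or a variable contained in b."""
    return not term.startswith("?") or term in b

def costs_triple(stats, triple, b=frozenset()):
    """Estimated result size of a single triple pattern for one service.
    stats maps a predicate to {'triples': n, 'ssel': ..., 'osel': ...};
    'ssel' and 'osel' are optional."""
    s, p, o = triple
    n = stats[p]["triples"]
    ssel = stats[p].get("ssel", 1 / n if n else 0.0)  # default 1/n
    osel = stats[p].get("osel", 1)                    # default 1
    if bound(s, b) and bound(o, b):
        return 0.5
    if bound(s, b):
        return n * ssel
    if bound(o, b):
        return n * osel
    return n

# Hypothetical statistics for one service
stats = {
    "foaf:name": {"triples": 51, "osel": 0.02},
    "foaf:mbox": {"triples": 51},
}
unbound_cost = costs_triple(stats, ("?x", "foaf:name", "?n"))
object_bound_cost = costs_triple(stats, ("?x", "foaf:name", "Tim"))
subject_bound_cost = costs_triple(stats, ("?x", "foaf:mbox", "?m"), b={"?x"})
both_bound_cost = costs_triple(stats, ("a", "foaf:name", "Tim"))
```

Note that a variable counts as bound once it appears in `b`, so the same triple pattern becomes cheaper after an earlier part of the plan has bound its subject.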
Estimating the result size of a combination of two or more triple patterns
is more complex. Note that adding a triple pattern to a query can restrict the
result size or introduce new results because of a join. Adding more triple patterns
with the same subject will not introduce new results, but rather reduce the
result size. In contrast, adding a triple pattern with another subject potentially
increases the result size. Thus, we start with estimating the result size for all
triple patterns with the same subject or subject variable. Let $T = \{t_1, \ldots, t_n\}$ be
a set of triple patterns where $t_1, \ldots, t_n$ all have the same subject. Triple patterns
with a bound object restrict the possible solutions. We use the minimum function
over all triple patterns with a bound object to estimate an upper bound for the
number of subjects. Note that this is different from the attribute independence
assumption that is widely used in SQL query optimization [17]. Triple patterns
with an unbound object can introduce new bindings for the used object variable.
The overall result size for the set of triple patterns is the product of the number of
subjects and the result sizes of all triple patterns with an unbound object. Using the
cost function for a single triple pattern we estimate the result size of $T$ as follows:
$$costs_d(T, b) = \min_{v \in T_{bound}} \big(costs_d(v, b)\big) \cdot \prod_{u \in T_{unbound}} costs_d(u, b)$$
with
$$T_{bound} = \{ t \mid t = (s, p, o) \in T \wedge bound(o) \}$$
and
$$T_{unbound} = \{ t \mid t = (s, p, o) \in T \wedge \neg bound(o) \}$$
⁴ For simplicity, we currently disregard the specific tuple size.
Finally, we must combine the groups of triples with one subject to compute
the estimated result size for the complete sub-query. The result sizes of the
single triple groups strongly depend on the already bound variables. Algorithm
2 builds groups of triple patterns with the same subject and then incrementally
selects the group with the minimal result size considering the variables bound
by the previously selected groups. We calculate the overall costs of the query as
the product of the result sizes of all groups.
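The group estimate and the greedy combination of groups can be sketched together in Python. Everything concrete here is an assumption for illustration: the statistics in `STATS` are invented, a group without any bound object is given a subject factor of 1 (the minimum over an empty set is not defined in the text), and a pattern counts towards the bound-object set only if its object is a constant.

```python
import math
from collections import defaultdict

STATS = {  # predicate -> (n, ssel, osel); made-up numbers
    "name": (50, 0.02, 0.1),
    "mbox": (50, 0.02, 0.02),
    "starring": (1000, 0.001, 0.005),
}

def is_var(term):
    return term.startswith("?")

def bound(term, b):
    return not is_var(term) or term in b

def triple_cost(t, b):
    """Single-pattern estimate, following the four cases of the cost function."""
    s, p, o = t
    n, ssel, osel = STATS[p]
    if bound(s, b) and bound(o, b):
        return 0.5
    if bound(s, b):
        return n * ssel
    if bound(o, b):
        return n * osel
    return n

def group_cost(group, b):
    """Minimum over patterns with a constant object times the product
    over patterns with a variable object (all sharing one subject)."""
    with_const = [triple_cost(t, b) for t in group if not is_var(t[2])]
    with_var = [triple_cost(t, b) for t in group if is_var(t[2])]
    subjects = min(with_const) if with_const else 1  # assumption for empty set
    return subjects * math.prod(with_var)

def estimate_result_size(triples):
    """Greedy combination: repeatedly pick the cheapest group given the
    variables bound by the groups selected so far."""
    groups = defaultdict(list)
    for t in triples:
        groups[t[0]].append(t)  # group by subject
    remaining = list(groups.values())
    result, bindings = 1, set()
    while remaining:
        costs = [group_cost(g, bindings) for g in remaining]
        i = costs.index(min(costs))
        g = remaining.pop(i)
        bindings |= {term for t in g for term in t if is_var(term)}
        result *= costs[i]
    return result

pattern = [("?p", "name", "Tim"), ("?p", "mbox", "?m"), ("?f", "starring", "?p")]
estimate = estimate_result_size(pattern)
```

In this example the group with subject `?p` is selected first (estimate 250). Because it binds `?p`, the `starring` pattern is then estimated with a bound object (5 instead of 1000), giving an overall estimate of 1250.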
4 Evaluation
In this section we evaluate the performance of the DARQ query engine. The
prototype was implemented in Java as an extension to ARQ⁵. We used a subset
of DBpedia⁶. DBpedia contains RDF information extracted from Wikipedia.
The dataset is offered in different parts. The names of the parts we used can be
found in the description column of Table 1(a).
The dataset has about 31.5 million triples in total. For our experiments we
split the dataset into multiple parts located at different endpoints as shown in
Table 1(a). To make sure that the endpoints are not a bottleneck in our setup we
split all data over two Sun-Fire-880 machines (8x sparcv9 CPU, 1050 MHz, 16 GB
RAM) running SunOS 5.10. The SPARQL endpoints were provided using
Virtuoso Server 5.0.3⁷ with an allowed memory usage of 8 GB. Note that, although
we use only two physical servers, there were five logical SPARQL endpoints.
DARQ was running on Sun Java 1.6.0 on a Linux system with Intel Core Duo
CPUs, 2.13 GHz and 4 GB RAM. The machines were connected over a standard
100 Mbit network connection.
⁵ http://jena.sf.net/ARQ/
⁶ http://dbpedia.org (Version 2.0)
Algorithm 2 Result size estimation for a general basic graph pattern
Require: T = {t1, .., tn} // basic graph pattern
result = 1, bindings = ∅, groups = {g1, .., gm} = buildGroups(T)
while groups ≠ ∅ do
  g = null, costs = positiveInfinity
  for each gi ∈ groups do
    c = costs(gi, bindings)
    if c < costs then
      g = gi, costs = c
    end if
  end for
  groups = groups − {g}, bindings = bindings ∪ var(g)
  result = result ∗ costs
end while
return result
(a) data sources
No.    Description   #triples
S1     Articles      7.6M
S2     Categories    6.4M
S3     Yago          2M
S4     Infoboxes     14.6M
S5     Persons       0.6M
Total                31.5M

(b) queries
No.    Used sources      #results
Q1     S4, S5            452
Q2     S4, S5            452
Q3     S2, S4, S5        6
Q4     S1, S3, S4, S5    1166
Table 1. Overview of data sources and queries
We ran four example queries and evaluated the runtime with and without
optimization. For queries without optimization we used the ARQ 1.5 default
execution strategies without any changes, i.e. bind joins of all sub-queries in
order of appearance. The queries can be found in Listings 1.5-1.8. The queries
use different numbers of sources and have different result sizes. An overview is
given in Table 1(b). For all queries we had a timeout of 10 minutes. Q1 and
Q2 demonstrate the effect of pushing filters into the sub-queries. Q3 and Q4 are
rather complex queries, involving three to four sources, but have very different
result sizes. The results shown in the following are the average values over four
runs.
⁷ http://virtuoso.openlinksw.com/
# Find all movies of actors born in Paris
SELECT ?p ?m WHERE {
  ?p dbpedia2:birthPlace :Paris .
  ?p foaf:name ?name .
  ?m dbpedia2:starring ?p .
}
Listing 1.5. Q1
# Find all movies of actors born in Paris
SELECT ?p ?m WHERE {
  ?p foaf:name ?name .
  ?p dbpedia2:birthPlace ?paris .
  ?m dbpedia2:starring ?p .
  FILTER (?paris = <http://dbpedia.org/resource/Paris>)
}
Listing 1.6. Q2
# Find name, birthday and image of German musicians born in Berlin
SELECT ?n ?b ?p ?img WHERE {
  ?p foaf:name ?n .
  ?p dbpedia2:birth ?b .
  ?p dbpedia2:birthPlace :Berlin .
  ?p skos:subject cat:German_musicians .
  OPTIONAL { ?p foaf:img ?img }
}
Listing 1.7. Q3
# Find all movies with actors born in London with an image
SELECT * WHERE {
  ?n rdf:type yago:MotionPictureFilm103789400 .
  ?n dbpedia2:starring ?p .
  ?p dbpedia2:birthPlace :London .
  ?p foaf:name ?name .
  ?n rdfs:label ?label .
  ?n foaf:depiction ?img .
  FILTER (LANG(?label) = 'en') .
}
Listing 1.8. Q4
Figure 2(a) shows the query execution times. The experiments show that
our optimizations significantly improve query evaluation performance. For query
Q1 the execution times of optimized and unoptimized execution are almost the
same. This is because the query plans are the same in both cases:
bind joins of all sub-queries in order of appearance are exactly the right strategy.
For queries Q2–Q4 the unoptimized queries took longer than 10 minutes to answer
and timed out, whereas the execution times of the optimized queries are quite
reasonable. The optimized execution of Q1 and Q2 takes almost the same time
because Q2 is rewritten into Q1.
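The rewriting of Q2 into Q1 can be illustrated with a small sketch. The Python code below is not DARQ's implementation. It assumes a hypothetical representation of triple patterns as 3-tuples of strings and a hypothetical helper `push_equality_filter`, and it only covers a filter of the form `?var = constant` whose variable is not needed in the query result.

```python
from typing import List, Tuple

# Hypothetical representation: (subject, predicate, object);
# terms starting with "?" are variables, everything else is a constant.
Triple = Tuple[str, str, str]

PARIS = "<http://dbpedia.org/resource/Paris>"


def push_equality_filter(patterns: List[Triple], var: str, value: str) -> List[Triple]:
    """Substitute `value` for every occurrence of `var`.

    This mirrors FILTER (var = value): after substitution the filter is
    no longer needed, because the constant is part of the triple pattern.
    """
    def subst(term: str) -> str:
        return value if term == var else term

    return [(subst(s), subst(p), subst(o)) for (s, p, o) in patterns]


# Triple patterns of Q2 (Listing 1.6), without the FILTER.
q2 = [
    ("?p", "foaf:name", "?name"),
    ("?p", "dbpedia2:birthPlace", "?paris"),
    ("?m", "dbpedia2:starring", "?p"),
]

rewritten = push_equality_filter(q2, "?paris", PARIS)
for pattern in rewritten:
    print(" ".join(pattern), ".")
```

After the substitution the set of triple patterns is the same as in Q1, assuming that :Paris in Listing 1.5 abbreviates the IRI used in the filter. The constant can then be sent to the endpoint as part of the sub-query instead of being checked after the results have been transferred.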
[Figure 2: two bar charts comparing optimized and unoptimized execution for queries Q1–Q4. (a) Query execution times: execution time (s) per query, with three bars marked "timeout". (b) Transformation time: transform time (ms) per query.]
Fig. 2. Benchmark Results
Figure 2(b) shows the time needed for query planning and optimization
(transformation time). Transformation times for optimized queries
increase with query complexity from around 300 ms to 2800 ms. By comparison,
transformation times for unoptimized queries (query planning only) are
negligible, around 30–40 ms. However, compared to the performance gains for
query execution, the transformation times including optimization still remain
very small.
Our evaluations show that even with a very limited amount of statistical
information it is possible to generate query plans that perform relatively well.
All queries were answered in less than one and a half minutes. Of course
it would be possible for the user to write down triple patterns in exactly the right
order for ARQ, but this conflicts with the declarative nature of SPARQL.
Note that optimized queries in DARQ will perform worse if all the sub-queries
are very unselective, e.g., contain no values for subject and object or
only a very unselective filter. In this case DARQ has few possibilities to improve
performance by optimization.
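To illustrate why the order of sub-queries matters for bind joins, the following Python sketch counts the requests sent for two orderings of the same triple patterns. It is a toy model under stated assumptions, not DARQ or ARQ code: a single in-memory `Endpoint` class (hypothetical) stands in for the remote services, every call to `query` stands for one remote request, and the data set is invented for the example.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # Bind the variable, or check consistency with an earlier binding.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


class Endpoint:
    """Hypothetical stand-in for a remote SPARQL endpoint."""

    def __init__(self, triples: List[Triple]) -> None:
        self.triples = list(triples)
        self.requests = 0

    def query(self, pattern: Triple, binding: Binding) -> List[Binding]:
        self.requests += 1  # one simulated remote request
        matches = []
        for triple in self.triples:
            extended = match(pattern, triple, binding)
            if extended is not None:
                matches.append(extended)
        return matches


def bind_join(patterns: List[Triple], endpoint: Endpoint) -> List[Binding]:
    """Evaluate patterns in order of appearance, one request per binding."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        next_bindings: List[Binding] = []
        for binding in bindings:
            next_bindings.extend(endpoint.query(pattern, binding))
        bindings = next_bindings
    return bindings


# Invented toy data: four persons, one of them born in Paris.
DATA = [
    ("a1", "birthPlace", "Paris"), ("a2", "birthPlace", "Berlin"),
    ("a3", "birthPlace", "Berlin"), ("a4", "birthPlace", "Berlin"),
    ("a1", "name", "Alice"), ("a2", "name", "Bob"),
    ("a3", "name", "Carol"), ("a4", "name", "Dave"),
    ("m1", "starring", "a1"), ("m2", "starring", "a2"), ("m3", "starring", "a3"),
]


def run(patterns: List[Triple]) -> Tuple[List[Binding], int]:
    endpoint = Endpoint(DATA)
    return bind_join(patterns, endpoint), endpoint.requests


selective_first = [("?p", "birthPlace", "Paris"), ("?p", "name", "?name"),
                   ("?m", "starring", "?p")]
unselective_first = [("?p", "name", "?name"), ("?p", "birthPlace", "Paris"),
                     ("?m", "starring", "?p")]

results_a, requests_a = run(selective_first)
results_b, requests_b = run(unselective_first)
print(requests_a, requests_b)
```

Both orderings return the same single result, but starting with the unselective name pattern produces one intermediate binding per person, and each of them triggers its own request for the next pattern. With millions of triples, as in Table 1, this difference grows accordingly. If every pattern is unselective, no ordering avoids the large intermediate results.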
5 Conclusion and Future Work
DARQ offers a single interface for querying multiple, distributed SPARQL
endpoints and makes query federation transparent to the client. One key feature of
DARQ is that it relies solely on the SPARQL standard and is therefore
compatible with any SPARQL endpoint implementing this standard. Using service
descriptions provides a powerful way to dynamically add endpoints to and remove
them from the query engine in a manner that is completely transparent to the user.
To reduce execution costs we introduced basic query optimization for SPARQL
queries. Our experiments show that the optimization algorithm can drastically
improve query performance and allows SPARQL queries over distributed
sources to be answered in reasonable time. Because the algorithm relies only on
a very small amount of statistical information, we expect that further
improvements are possible using techniques as described in [16, 13].
An important issue when dealing with data from multiple data sources is the
difference in the vocabularies used and in the representation of information. In
future work, we plan to work on mapping and translation rules between the
vocabularies used by different SPARQL endpoints. We will also investigate
generalizing the query patterns that can be handled, as well as blank nodes and
identity relationships across graphs.
Acknowledgments: Major parts of this work were done at HP Labs Bristol.
Bastian Quilitz is grateful to Andy Seaborne and the other members of the HP
Semantic Web Team for insightful discussions, their support, and their work on
Jena/ARQ. This research was supported by the German Research Foundation
(DFG) through the Graduiertenkolleg METRIK, grant no. GRK 1324.
References
1. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF. W3C
Recommendation (January 2008) http://www.w3.org/TR/rdf-sparql-query/.
2. Florescu, D., Levy, A., Manolescu, I., Suciu, D.: Query optimization in the presence
of limited access patterns. In: International conference on Management of data
(SIGMOD), New York, NY, USA, ACM (1999) 311–322
3. Busse, S., Kutsche, R.D., Leser, U., Weber, H.: Federated information systems:
Concepts, terminology and architectures. Technical Report Forschungsberichte des
Fachbereichs Informatik 99-9, Technische Universität Berlin (1999)
4. Lakshmanan, L.V.S., Sadri, F., Subramanian, I.N.: SchemaSQL - a language for
interoperability in relational multi-database systems. In Vijayaraman, T.M.,
Buchmann, A.P., Mohan, C., Sarda, N.L., eds.: 22nd International Conference on Very
Large Data Bases (VLDB), Mumbai (Bombay), India (September 1996) 239–250
5. Sheth, A.P., Larson, J.A.: Federated database systems for managing distributed,
heterogeneous, and autonomous databases. ACM Comput. Surv. 22(3) (1990)
183–236
6. Wiederhold, G.: Mediators in the architecture of future information systems. Com-
puter 25(3) (1992) 38–49
7. Stuckenschmidt, H., Vdovjak, R., Houben, G.J., Broekstra, J.: Index structures
and algorithms for querying distributed RDF repositories. In: WWW’04. (2004)
8. Chen, H., Wang, Y., Wang, H., Mao, Y., Tang, J., Zhou, C., Yin, A., Wu, Z.:
Towards a semantic web of relational databases: a practical semantic toolkit and
an in-use case from traditional Chinese medicine. In: 4th International Semantic
Web Conference (ISWC). LNCS, Athens, USA, Springer-Verlag (November 2006)
750–763. Best Paper Award.
9. Hartig, O., Heese, R.: The SPARQL query graph model for query optimization. In:
4th European Semantic Web Conference (ESWC). (2007) 564–578
10. Bernstein, A., Kiefer, C., Stocker, M.: OptARQ: A SPARQL Optimization
Approach based on Triple Pattern Selectivity Estimation. Technical Report
ifi-2007.03, Department of Informatics, University of Zurich (2007)
11. Harth, A., Decker, S.: Optimized index structures for querying RDF from the web.
In: Third Latin American Web Congress (LA-WEB), Washington, DC, USA, IEEE
Computer Society (2005) 71
12. Harris, S., Gibbins, N.: 3store: Efficient bulk RDF storage. In: PSSS - Practical and
Scalable Semantic Systems. (2003)
13. Kossmann, D.: The state of the art in distributed query processing. ACM Comput.
Surv. 32(4) (2000) 422–469
14. Manola, F., Miller, E.: RDF Primer, W3C Recommendation (2004)
http://www.w3.org/TR/rdf-primer/.
15. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and Complexity of SPARQL. In:
4th International Semantic Web Conference (ISWC), Athens, GA, USA (November
2006) 30–43
16. Haas, L.M., Kossmann, D., Wimmers, E.L., Yang, J.: Optimizing queries across
diverse data sources. In: 23rd Int. Conference on Very Large Data Bases (VLDB),
San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. (1997) 276–285
17. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Ac-
cess path selection in a relational database management system. In: International
conference on Management of data (SIGMOD), New York, NY, USA, ACM (1979)
23–34