All content in this area was uploaded by Muhammad Saleem on Oct 28, 2021
A Survey of RDF Stores & SPARQL Engines
for Querying Knowledge Graphs
Waqas Ali · Muhammad Saleem · Bin Yao ·
Aidan Hogan · Axel-Cyrille Ngonga Ngomo
Abstract RDF has seen increased adoption in recent years,
prompting the standardization of the SPARQL query lan-
guage for RDF, and the development of local and distributed
engines for processing SPARQL queries. This survey paper
provides a comprehensive review of techniques and systems
for querying RDF knowledge graphs. While other reviews
on this topic tend to focus on the distributed setting, the
main focus of the work is on providing a comprehensive sur-
vey of state-of-the-art storage, indexing and query process-
ing techniques for efficiently evaluating SPARQL queries in
a local setting (on one machine). To keep the survey self-
contained, we also provide a short discussion on graph par-
titioning techniques used in the distributed setting. We con-
clude by discussing contemporary research challenges for
further improving SPARQL query engines. This extended
version also provides a survey of over one hundred SPARQL
query engines and the techniques they use, along with twelve
benchmarks and their features.
Keywords Knowledge Graph · Storage · Indexing · Query
Processing · SPARQL
W. Ali
SEIEE, Shanghai Jiao Tong University, Shanghai, China.
E-mail: waqasali@sjtu.edu.cn
M. Saleem
AKSW, University of Leipzig, Leipzig, Germany.
E-mail: saleem@informatik.uni-leipzig.de
B. Yao (Hangzhou Qianjiang Distinguished Expert)
SEIEE, Shanghai Jiao Tong University, Shanghai, China.
Hangzhou Institute of Advanced Technology, Hangzhou, China.
E-mail: yaobin@cs.sjtu.edu.cn
A. Hogan
DCC, University of Chile & IMFD, Santiago, Chile.
E-mail: ahogan@dcc.uchile.cl
A.-C. Ngonga Ngomo
DICE, Paderborn University, Paderborn, Germany.
E-mail: axel.ngonga@upb.de
1 Introduction
The Resource Description Framework (RDF) is a graph-based
data model where triples of the form $(s, p, o)$ denote
directed labeled edges $s \xrightarrow{p} o$ in a graph. RDF has gained
significant adoption in the past years, particularly on the
Web. As of 2019, over 5 million websites publish RDF data
embedded in their webpages [34]. RDF has also become
a popular format for publishing knowledge graphs on the
Web, the largest of which – including Bio2RDF, DBpedia,
PubChemRDF, UniProt, and Wikidata – contain billions of
triples. These developments have brought about the need for
optimized techniques and engines for querying large RDF
graphs. We refer to engines that allow for storing, indexing
and processing joins over RDF as RDF stores.
While various query languages have historically been
proposed for RDF, the SPARQL Protocol and RDF Query
Language (SPARQL) has become the standard [92]. The
first version of SPARQL was standardized in 2008, while
SPARQL 1.1 was released in 2013 [92]. SPARQL is an ex-
pressive language that supports not only joins, but also vari-
ants of the broader relational algebra (projection, selection,
union, difference, etc.). Various new features were added in
SPARQL 1.1, such as property paths for matching arbitrary-
length paths in the RDF graph. Hundreds of SPARQL query
services, called endpoints, have emerged on the Web [43],
with the most popular endpoints receiving millions of queries
per day [197,148]. We refer to engines that support storing,
indexing and processing SPARQL (1.1) queries over RDF
as SPARQL engines. Since SPARQL supports joins, we con-
sider any SPARQL engine to also be an RDF store.
Efficient data storage, indexing and join processing are
key to RDF stores (and thus, to SPARQL engines):
–Storage. Different engines store RDF data using differ-
ent structures (tables, graphs, etc.), encodings (integer
IDs, string compression, etc.) and media (main memory,
disk, etc.). Which storage to use may depend on the scale
of the data, the types of query features supported, etc.
–Indexing. Indexes are used in RDF stores for fast lookups
and query execution. Different index types can support
different operations with varying time–space trade-offs.
–Join Processing. At the core of evaluating queries lie ef-
ficient methods for processing joins. Aside from tradi-
tional pairwise joins, recent years have seen the emer-
gence of novel techniques, such as multiway and worst-
case optimal joins, as well as GPU-based join process-
ing. Optimizing the order of evaluation of joins can also
be important to ensure efficient processing.
Beyond processing joins, SPARQL engines must offer
efficient support for more expressive query features:
–Query Processing. SPARQL is an expressive language
containing a variety of query features beyond joins that
need to be supported efficiently, such as filter expres-
sions, optionals, path queries, etc.
RDF stores can further be divided into two categories:
(1) local stores (also called single-node stores) that manage
RDF data on one machine and (2) distributed stores that par-
tition RDF data over multiple machines. While local stores
are more lightweight, the resources of one machine limit
scalability [249,175,104]. Various kinds of distributed RDF
stores have thus been proposed [88,104, 203, 204] that typi-
cally run on clusters of shared-nothing machines.
In this survey, we describe storage, indexing, join pro-
cessing and query processing techniques employed by local
RDF stores, as well as high-level strategies for partitioning
RDF graphs as needed for distributed storage. An appendix
in this extended version further compares 135 local and dis-
tributed RDF engines in terms of the techniques they use,
as well as 12 benchmarks in terms of the types of data and
queries they contain. The goal of this survey is to give a suc-
cinct introduction of the different techniques used by RDF
query engines, and also to help users to choose the appropri-
ate engine or benchmark for a given use case.
The rest of the paper is structured as follows. Section 2
discusses and contrasts this survey with related literature.
Section 3 provides preliminaries for RDF and SPARQL. Sec-
tions 4, 5, 6 and 7 review techniques for storage, index-
ing, join processing and query processing, respectively. Sec-
tion 8 explains different graph partitioning techniques for
distributing storage over multiple machines. Section 9 in-
troduces additional content available in the appendix of this
extended version, which surveys 135 local and distributed
RDF engines, along with 12 SPARQL benchmarks. Sec-
tion 10 concludes the paper with subsections for current
trends and research challenges regarding efficient RDF-based
data management and query processing.
2 Literature Review
We first discuss related studies. More specifically, we sum-
marize peer-reviewed tertiary literature (surveys in journals,
short surveys in proceedings, book chapters, surveys with
empirical comparisons, etc.) from the last 10 years collating
techniques, engines and/or benchmarks for querying RDF.
We summarize the topics covered by these works in Table 1.
We use ✓, ∼ and blank cells to denote detailed, partial or
little/no discussion, respectively, when compared with the
current survey (the bottom row). We also present the number of
engines and benchmarks included in the extended version of
this survey. If the respective publication does not formally
list all systems/benchmarks (e.g., as a table), we may write
n+ as an estimate for the number discussed in the text.
Sakr et al. [196] present three schemes for storing RDF
data in relational databases, surveying works that use the
different schemes. Svoboda et al. [221] provide a brief sur-
vey on indexing schemes for RDF divided into three cate-
gories: local, distributed and global. Faye et al. [70] focus
on both storage and indexing schemes for local RDF en-
gines, divided into native and non-native storage schemes.
Luo et al. [141] also focus on RDF storage and indexing
schemes under the relational-, entity-, and graph-based per-
spectives in local RDF engines. Compared to these works,
we present join processing, query processing and partition-
ing techniques; furthermore, these works predate the stan-
dardization of SPARQL 1.1, and thus our discussion includes
more recent storage and indexing techniques, as well as sup-
port for new features such as property paths.
Later surveys began to focus on distributed RDF stores.
Kaoudi et al. [116] present a survey of RDF stores explicitly
designed for a cloud-based environment. Ma et al. [144] pro-
vide an overview of RDF storage in relational and NoSQL
databases. Özsu [174] presents a survey that focuses on stor-
age techniques for RDF within local and distributed stores,
with a brief overview of query processing techniques in dis-
tributed and decentralized (Linked Data) settings. Abdelaziz
et al. [3] survey 22 distributed RDF stores, and compare 12
experimentally in terms of pre-processing cost, query per-
formance, scalability, and workload adaptability. Elzein et
al. [67] present a survey on the storage and query process-
ing techniques used by RDF stores on the cloud. Janke &
Staab [111] present lecture notes discussing RDF graph par-
titioning, indexing, and query processing techniques, with a
focus on distributed and cloud-based RDF engines. Pan et
al. [175] provide an overview of local and distributed stor-
age schemes for RDF. Yasin et al. [255] discussed SPARQL
(1.1) query processing in the context of distributed RDF
stores. Wylot et al. [249] present a comprehensive survey of
storage and indexing techniques for local (centralized) and
distributed RDF stores, along with a discussion of bench-
marks; most of their survey is dedicated to distributed and
Table 1: Prior tertiary literature on RDF query engines; the abbreviations are: Sto./Storage, Ind./Indexing, J.Pr./Join Processing, Q.Pr./Query Processing, Dis./Distribution, Eng./Engines, Bench./Benchmarks
Study Year Techniques Eng. Bench.
Sto. Ind. J.Pr. Q.Pr. Dis.
Sakr et al. [196] 2010 ✓ ∼ 10+
Svoboda et al. [221] 2011 ∼ ∼ ∼ 14 6
Faye et al. [70] 2012 ✓ ∼ ∼ 13
Luo et al. [141] 2012 ✓ ✓ 20+
Kaoudi et al. [116] 2015 ✓ ∼ ∼ ∼ ✓ 17
Ma et al. [144] 2016 ✓ ∼ ∼ 17 6
Özsu [174] 2016 ✓ ∼ ∼ ∼ 35+
Abdelaziz et al. [3] 2017 ✓ ∼ ∼ ∼ ✓ 21 4
Elzein et al. [67] 2018 ✓ ∼ ∼ ∼ 15+
Janke & Staab [111] 2018 ∼ ∼ ∼ ∼ ✓ 50+ 9
Pan et al. [175] 2018 ✓ ∼ ∼ ∼ 25+ 4
Wylot et al. [249] 2018 ✓ ∼ ∼ ∼ ✓ 24 8
Yasin et al. [255] 2018 ∼ ✓ ∼ 14
Alaoui [9] 2019 ✓ ∼ 30+
Chawla et al. [50] 2020 ✓ ∼ ∼ ∼ 46 9
Zambom et al. [257] 2020 ∼ ∼ ✓ 24
Ali et al. ✓ ✓ ✓ ✓ ✓ 135 12
federated stores. Alaoui [9] proposes a categorization scheme
for RDF engines, including memory-, cloud-, graph- and
binary-based stores. The survey by Chawla et al. [50] reviews
distributed RDF engines in terms of storage, partitioning,
indexing, and retrieval. The short survey by Zambom &
dos Santos [257] discusses mapping RDF data into NoSQL
databases. All of these works focus on techniques for storing
RDF, particularly in distributed settings, whereas our survey
is more detailed in terms of join and query processing
techniques, particularly in local settings.
Local RDF stores are those most commonly found in
practice [43]. To the best of our knowledge, our survey provides
the most comprehensive discussion thus far on storage,
indexing, join processing and query processing techniques
for SPARQL in a local setting, where, for example,
we discuss novel techniques for established features – such
as novel indexing techniques based on compact data struc-
tures, worst-case optimal and matrix-based join processing
techniques, multi-query optimization, etc. – as well as tech-
niques for novel features in SPARQL 1.1 – such as index-
ing and query processing techniques for evaluating property
paths – that are not well-represented in the existing litera-
ture. To keep our survey self-contained, we also present par-
titioning techniques for RDF graphs, and include distributed
stores and benchmarks in our survey. Per Table 1, the sur-
vey of engines and benchmarks found in the online version
is more comprehensive than seen in previous works [10].
Conversely, some of the aforementioned works are more de-
tailed in certain aspects, particularly distributed stores; we
refer to this literature for further details as appropriate.
3 Preliminaries
Before beginning the core of the survey, we first introduce
some preliminaries regarding RDF and SPARQL.
3.1 RDF
The RDF data model [208] uses RDF terms from three pairwise
disjoint sets: the set $\mathbf{I}$ of Internationalized Resource
Identifiers (IRIs) [66] used to identify resources; the set $\mathbf{L}$
of literals used for (language-tagged or plain) strings and
datatype values; and the set $\mathbf{B}$ of blank nodes, interpreted as
existential variables. An RDF triple $(s, p, o) \in \mathbf{IB} \times \mathbf{I} \times \mathbf{IBL}$
contains a subject $s$, a predicate $p$ and an object $o$.¹ A set
of RDF triples is called an RDF graph $G$, where each triple
$(s, p, o) \in G$ represents a directed labeled edge $s \xrightarrow{p} o$. The
sets $s(G)$, $p(G)$ and $o(G)$ stand for the set of subjects,
predicates and objects in $G$, respectively. We further denote the
set of nodes in $G$ by $so(G) := s(G) \cup o(G)$.
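The definitions above can be made concrete with a small sketch. It assumes a toy encoding that is not part of the survey: RDF terms are plain Python strings, a triple is a 3-tuple, and the helper names (`subjects`, `predicates`, `objects`, `nodes`) are hypothetical stand-ins for s(G), p(G), o(G) and so(G).

```python
# Toy encoding (an assumption of this sketch): RDF terms are plain strings and
# an RDF graph is a Python set of (subject, predicate, object) tuples.
G = {
    (":Alice", "rdf:type", "foaf:Person"),
    (":Alice", "foaf:age", '"26"^^xsd:int'),
    (":Alice", "foaf:knows", ":Bob"),
    (":Bob", "rdf:type", "foaf:Person"),
    (":Bob", "foaf:knows", ":Alice"),
}


def subjects(graph):
    """s(G): the set of terms in subject position."""
    return {s for (s, _, _) in graph}


def predicates(graph):
    """p(G): the set of terms in predicate position."""
    return {p for (_, p, _) in graph}


def objects(graph):
    """o(G): the set of terms in object position."""
    return {o for (_, _, o) in graph}


def nodes(graph):
    """so(G) := s(G) ∪ o(G): the nodes of the graph."""
    return subjects(graph) | objects(graph)
```

Note that predicates are edge labels rather than nodes, so `"foaf:knows"` is in `predicates(G)` but not in `nodes(G)` for this graph.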
An example RDF graph, representing information about
two university students, is shown in Figure 1. We include
both a graphical representation and a triple-based representation.
RDF terms such as :DB, foaf:age, etc., denote prefixed IRIs.²
For example, foaf:age stands for the full IRI
http://xmlns.com/foaf/0.1/age if we define the prefix
foaf as http://xmlns.com/foaf/0.1/. Terms such
as "Motor RDF"@es denote strings with (optional) language
tags, and terms such as "21"^^xsd:int denote datatype values.
Finally, we denote blank nodes with the underscore prefix,
where _:p refers to the existence of a project shared by
Alice and Bob. Terms used in the predicate position (e.g.,
foaf:age, skos:broader) are known as properties. RDF
defines the special property rdf:type, which indicates the
class (e.g., foaf:Person, foaf:Project) of a resource.
¹ In this paper, we abbreviate the union of sets $M_1 \cup \ldots \cup M_n$ with
$M_1 \ldots M_n$. Hence, $\mathbf{IBL}$ stands for $\mathbf{I} \cup \mathbf{B} \cup \mathbf{L}$.
² We use the blank prefix (e.g., :DB) as an arbitrary example. Other
prefixes used can be retrieved at http://prefix.cc/.
The semantics of RDF can be defined using RDF Schema
(RDFS) [37], covering class and property hierarchies, prop-
erty domains and ranges, etc. Further semantics can be cap-
tured with the Web Ontology Language (OWL) [97], such as
class and property equivalence; inverse, transitive, symmet-
ric and reflexive properties; set- and restriction-based class
definitions; and more besides. Since our focus is on querying
RDF graphs, we do not discuss these standards in detail.
3.2 SPARQL
Various query languages for RDF have been proposed down
through the years, such as RQL [118], SeRQL [218], etc.
We focus our discussion on SPARQL [92], which is now the
standard language for querying RDF, and refer to the work
by Haase et al. [87] for information on its predecessors.
We define the core of SPARQL in terms of basic graph
patterns that express the core pattern matched against an
RDF graph; navigational graph patterns that match arbitrary-
length paths; complex graph patterns that introduce vari-
ous language features, such as OPTIONAL,UNION,MINUS,
etc. [16]; and query types that specify what result to return.
Basic Graph Patterns (BGPs) At the core of SPARQL lie
triple patterns, which are RDF triples that allow variables
from the set $\mathbf{V}$ (disjoint with $\mathbf{IBL}$) in any position. A basic
graph pattern (BGP) is a set of triple patterns. Since blank
nodes in BGPs act as variables, we assume they have been
replaced with variables. We use $\mathrm{vars}(B)$ to denote the set of
variables in the BGP $B$. Given an RDF graph $G$, the evaluation
of a BGP $B$, denoted $B(G)$, returns a set of solution
mappings. A solution mapping $\mu$ is a partial mapping from
the set $\mathbf{V}$ of variables to the set of RDF terms $\mathbf{IBL}$. We
write $\mathrm{dm}(\mu)$ to denote the set of variables for which $\mu$ is
defined. Given a triple pattern $t$, we use $\mu(t)$ to refer to the
image of $t$ under $\mu$, i.e., the result of replacing any variable
$v \in \mathrm{dm}(\mu)$ appearing in $t$ with $\mu(v)$. $\mu(B)$ stands for the
image of the BGP $B$ under $\mu$; i.e., $\mu(B) := \{\mu(t) \mid t \in B\}$.
The evaluation of a BGP $B$ on an RDF graph $G$ is then given
as $B(G) := \{\mu \mid \mu(B) \subseteq G \text{ and } \mathrm{dm}(\mu) = \mathrm{vars}(B)\}$. In the
case of a singleton BGP $\{t\}$, we may write $\{t\}(G)$ as $t(G)$.
In Figure 2, we provide an example of a BGP along
with its evaluation. Each row of the results refers to a so-
lution mapping. Some solutions map different variables to
the same term; each such solution is thus a homomorphism
from the BGP to the RDF graph.
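As an illustration of the definition of B(G), the following sketch evaluates the BGP of Figure 2 over the relevant triples of Figure 1. It is a naive nested-loop evaluator written for clarity, not a technique taken from any surveyed engine; the encoding (strings for terms, a leading `?` for variables) and the names `is_var`, `extend` and `eval_bgp` are assumptions of the sketch.

```python
# Relevant triples of Figure 1, encoded as string 3-tuples (sketch assumption).
G = {
    (":Alice", "rdf:type", "foaf:Person"),
    (":Alice", "foaf:knows", ":Bob"),
    (":Alice", "foaf:topic_interest", ":DB"),
    (":Alice", "foaf:topic_interest", ":SW"),
    (":Bob", "rdf:type", "foaf:Person"),
    (":Bob", "foaf:knows", ":Alice"),
    (":Bob", "foaf:topic_interest", ":DB"),
    ("_:p", "rdf:type", "foaf:Project"),
}

# The BGP of Figure 2 as a list of triple patterns.
BGP = [
    ("?a", "rdf:type", "foaf:Person"),
    ("?a", "foaf:knows", "?b"),
    ("?a", "foaf:topic_interest", "?ia"),
    ("?b", "rdf:type", "foaf:Person"),
    ("?b", "foaf:knows", "?a"),
    ("?b", "foaf:topic_interest", "?ib"),
]


def is_var(term):
    return term.startswith("?")


def extend(pattern, triple, mu):
    """Extend mapping mu so that mu(pattern) == triple, or return None."""
    mu = dict(mu)
    for pattern_term, graph_term in zip(pattern, triple):
        if is_var(pattern_term):
            # setdefault binds a fresh variable or returns the existing binding.
            if mu.setdefault(pattern_term, graph_term) != graph_term:
                return None
        elif pattern_term != graph_term:
            return None
    return mu


def eval_bgp(bgp, graph):
    """Return all mu with mu(bgp) ⊆ graph and dm(mu) = vars(bgp)."""
    solutions = [{}]
    for pattern in bgp:
        solutions = [
            extended
            for mu in solutions
            for triple in graph
            if (extended := extend(pattern, triple, mu)) is not None
        ]
    return solutions
```

Nothing in `extend` prevents two variables from being bound to the same term (e.g., `?ia` and `?ib` both mapping to `:DB`), which reflects the homomorphism-based semantics described above.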
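Table 2: Evaluation of path expressions
$$
\begin{aligned}
p(G) &:= \{(s, o) \mid (s, p, o) \in G\} \\
\text{\^{}}e(G) &:= \{(s, o) \mid (o, s) \in e(G)\} \\
e{+}(G) &:= \{(y_1, y_n) \mid \text{for } 1 \le i < n : \exists (y_i, y_{i+1}) \in e(G)\} \\
e?(G) &:= e(G) \cup \{(x, x) \mid x \in so(G)\} \\
e{*}(G) &:= (e{+})?(G) \\
e_1/e_2(G) &:= \{(x, z) \mid \exists y : (x, y) \in e_1(G) \wedge (y, z) \in e_2(G)\} \\
e_1|e_2(G) &:= e_1(G) \cup e_2(G) \\
!P(G) &:= \{(s, o) \mid (s, p, o) \in G \wedge p \notin P\} \\
!\text{\^{}}P(G) &:= \{(s, o) \mid (o, p, s) \in G \wedge p \notin P\}
\end{aligned}
$$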
Navigational Graph Patterns (NGPs) A key feature of graph
query languages is the ability to match paths of arbitrary
length [16]. In SPARQL (1.1), this ability is captured by
property paths [92], which are regular expressions $\mathbf{E}$ that
paths should match, defined recursively as follows:
– if $p$ is an IRI, then $p$ is a path expression (property);
– if $e$ is a path expression, then ^e (inverse), e* (zero-or-more,
aka. Kleene star), e+ (one-or-more), and e? (zero-or-one)
are path expressions;
– if $e_1, e_2$ are path expressions, then $e_1/e_2$ (concatenation)
and $e_1|e_2$ (disjunction) are path expressions;
– if $P$ is a set of IRIs, then !P and !^P are path expressions
(negated property set).³
The evaluation of path expressions on an RDF graph $G$
returns pairs of nodes in $G$ connected by paths that match the
expression, as defined in Table 2. These path expressions are
akin to 2-way regular path queries (2RPQs) extended with
negated property sets [128,16].
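The recursive definitions of Table 2 translate fairly directly into code. The sketch below is a naive, fully materializing evaluator meant only to illustrate the semantics; the nested-tuple encoding of path expressions (`"inv"`, `"plus"`, `"opt"`, `"star"`, `"seq"`, `"alt"`, `"neg"`, `"neginv"`) and the function names are assumptions of the sketch, and real engines typically avoid computing full closures this way.

```python
# Triples of Figure 1 that involve interests and topic relations.
G = {
    (":Alice", "foaf:topic_interest", ":DB"),
    (":Alice", "foaf:topic_interest", ":SW"),
    (":Bob", "foaf:topic_interest", ":DB"),
    (":SW", "skos:related", ":DB"),
    (":SW", "skos:broader", ":Web"),
    (":Web", "skos:broader", ":CS"),
    (":DB", "skos:broader", ":CS"),
}


def so(graph):
    """Nodes of the graph: all subjects and objects."""
    return {s for (s, _, _) in graph} | {o for (_, _, o) in graph}


def compose(r1, r2):
    """Relational composition: (x, z) such that (x, y) in r1 and (y, z) in r2."""
    return {(x, z) for (x, y) in r1 for (y2, z) in r2 if y == y2}


def eval_path(e, graph):
    """Evaluate a path expression; a plain string is an IRI (property)."""
    if isinstance(e, str):
        return {(s, o) for (s, p, o) in graph if p == e}
    op = e[0]
    if op == "inv":      # ^e
        return {(o, s) for (s, o) in eval_path(e[1], graph)}
    if op == "plus":     # e+: iterate composition until a fixpoint is reached
        base = eval_path(e[1], graph)
        closure = set(base)
        while True:
            larger = closure | compose(closure, base)
            if larger == closure:
                return closure
            closure = larger
    if op == "opt":      # e?: adds (x, x) for every node x
        return eval_path(e[1], graph) | {(x, x) for x in so(graph)}
    if op == "star":     # e* := (e+)?
        return eval_path(("opt", ("plus", e[1])), graph)
    if op == "seq":      # e1/e2
        return compose(eval_path(e[1], graph), eval_path(e[2], graph))
    if op == "alt":      # e1|e2
        return eval_path(e[1], graph) | eval_path(e[2], graph)
    if op == "neg":      # !P
        return {(s, o) for (s, p, o) in graph if p not in e[1]}
    if op == "neginv":   # !^P
        return {(o, s) for (s, p, o) in graph if p not in e[1]}
    raise ValueError(f"unknown path operator: {op!r}")
```

For instance, `eval_path(("plus", "skos:broader"), G)` adds the pair `(":SW", ":CS")` to the three direct `skos:broader` edges, since `:SW` reaches `:CS` through `:Web`.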
We call a triple pattern $(s, e, o)$ that further allows a path
expression as the predicate (i.e., $e \in \mathbf{EV}$) a path pattern. A
navigational graph pattern (NGP) is then a set of path patterns.
Given a navigational graph pattern $N$, let $\mathrm{paths}(N) :=
p(N) \cap \mathbf{E}$ denote the set of path expressions used in $N$.
Given an RDF graph $G$ and a set of path expressions $E \subseteq \mathbf{E}$,
we denote by $G_E := G \cup (\bigcup_{e \in E} \{(s, e, o) \mid (s, o) \in e(G)\})$
the result of materializing all paths matching $E$ in $G$. The
evaluation of the navigational graph pattern $N$ on $G$ is then
$N(G) := \{\mu \mid \mu(N) \subseteq G_{\mathrm{paths}(N)} \text{ and } \mathrm{dm}(\mu) = \mathrm{vars}(N)\}$.
We provide an example of a navigational graph pattern
and its evaluation in Figure 3.
³ SPARQL uses the syntax !(p1|...|pk|^pk+1|...|^pn), which
can be written as !P|!^P′, where P = {p1, ..., pk} and P′ =
{pk+1, ..., pn} [92,128].
Complex Graph Patterns (CGPs) Complex graph patterns
(CGPs) introduce additional language features that can combine
and transform the results of one or more graph patterns.
More specifically, evaluating BGPs and NGPs returns solution
mappings that can be viewed as relations (i.e., tables),
Subject   Predicate             Object
:Alice    rdf:type              foaf:Person
:Alice    foaf:age              "26"^^xsd:int
:Alice    foaf:topic_interest   :DB
:Alice    foaf:topic_interest   :SW
:Alice    foaf:knows            :Bob
:Alice    foaf:currentProject   _:p
:Bob      rdf:type              foaf:Person
:Bob      foaf:age              "21"^^xsd:int
:Bob      foaf:topic_interest   :DB
:Bob      foaf:knows            :Alice
:Bob      foaf:pastProject      _:p
_:p       rdf:type              foaf:Project
_:p       rdfs:label            "RDF Engine"@en
_:p       rdfs:label            "Motor RDF"@es
:SW       skos:broader          :Web
:SW       skos:related          :DB
:Web      skos:broader          :CS
:DB       skos:broader          :CS
Fig. 1: Graphical (left) and triple-based representation (right) of an example RDF graph
SELECT * WHERE {
  ?a a foaf:Person ; foaf:knows ?b ; foaf:topic_interest ?ia .
  ?b a foaf:Person ; foaf:knows ?a ; foaf:topic_interest ?ib .
}

?a       ?b       ?ia   ?ib
:Alice   :Bob     :DB   :DB
:Alice   :Bob     :SW   :DB
:Bob     :Alice   :DB   :DB
:Bob     :Alice   :DB   :SW
Fig. 2: A BGP in SPARQL syntax and as a graph (above),
with its evaluation over the graph of Figure 1 (below)
where variables are attributes (i.e., column names) and tuples
(i.e., rows) contain the RDF terms bound by each solution
mapping (see Figures 2–4). CGPs support combining
and transforming the results of BGPs/NGPs with language
features that include FILTER (selection: $\sigma$), SELECT (projection:
$\pi$), UNION (union: $\cup$), EXISTS (semi-join: $\ltimes$), MINUS
(anti-join: $\triangleright$)⁴ and OPTIONAL (left-join: $\leftouterjoin$). These language
features correspond to the relational algebra defined in Table 3.
The default operator is a natural inner join ($\bowtie$). Figure 4
provides an example of a CGP combining two BGPs
and an NGP using union, join and projection.
⁴ The definition of MINUS is slightly different from anti-join in that
mappings with no overlapping variables on the right are ignored.
SELECT * WHERE {
  ?a a foaf:Person ; foaf:knows ?b ;
     foaf:topic_interest/skos:related*/^foaf:topic_interest ?b .
  ?b a foaf:Person ; foaf:knows ?a .
}

?a       ?b
:Alice   :Alice
:Alice   :Bob
:Bob     :Alice
:Bob     :Bob
Fig. 3: Example NGP (above) and its evaluation over the
graph of Figure 1 (below)
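Table 3: Core relational algebra of SPARQL
$$
\begin{aligned}
\sigma_R(M) &:= \{\mu \in M \mid R(\mu)\} \\
\pi_V(M) &:= \{\mu' \mid \exists \mu \in M : \mu \sim \mu' \wedge \mathrm{dm}(\mu') = V \cap \mathrm{dm}(\mu)\} \\
M_1 \bowtie M_2 &:= \{\mu_1 \cup \mu_2 \mid \mu_1 \in M_1 \wedge \mu_2 \in M_2 \wedge \mu_1 \sim \mu_2\} \\
M_1 \cup M_2 &:= \{\mu \mid \mu \in M_1 \vee \mu \in M_2\} \\
M_1 \ltimes M_2 &:= \{\mu_1 \in M_1 \mid \exists \mu_2 \in M_2 : \mu_1 \sim \mu_2\} \\
M_1 \triangleright M_2 &:= \{\mu_1 \in M_1 \mid \nexists \mu_2 \in M_2 : \mu_1 \sim \mu_2\} \\
M_1 \leftouterjoin M_2 &:= (M_1 \bowtie M_2) \cup (M_1 \triangleright M_2)
\end{aligned}
$$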
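The operators of Table 3 can be sketched over solution mappings represented as Python dictionaries. This is an illustrative, quadratic-time rendering of the set-based definitions, assuming that the relation ∼ denotes compatibility (two mappings agree on every shared variable); the function names and the `dedup` helper are hypothetical and do not correspond to any engine's API.

```python
def compatible(m1, m2):
    """mu1 ~ mu2: the two mappings agree on every variable they share."""
    return all(m2[v] == t for v, t in m1.items() if v in m2)


def dedup(mappings):
    """Enforce set semantics on a list of dict-based solution mappings."""
    seen, out = set(), []
    for m in mappings:
        key = frozenset(m.items())
        if key not in seen:
            seen.add(key)
            out.append(m)
    return out


def select(cond, M):
    """Selection: keep the mappings satisfying the condition."""
    return [m for m in M if cond(m)]


def project(V, M):
    """Projection: restrict each mapping to the variables in V."""
    return dedup([{v: t for v, t in m.items() if v in V} for m in M])


def join(M1, M2):
    """Natural inner join: merge every compatible pair of mappings."""
    return dedup([{**m1, **m2} for m1 in M1 for m2 in M2 if compatible(m1, m2)])


def union(M1, M2):
    return dedup(M1 + M2)


def semijoin(M1, M2):
    """Keep mappings of M1 that have a compatible partner in M2."""
    return [m1 for m1 in M1 if any(compatible(m1, m2) for m2 in M2)]


def antijoin(M1, M2):
    """Keep mappings of M1 that have no compatible partner in M2."""
    return [m1 for m1 in M1 if not any(compatible(m1, m2) for m2 in M2)]


def leftjoin(M1, M2):
    """Left join: (M1 join M2) union (M1 antijoin M2)."""
    return join(M1, M2) + antijoin(M1, M2)


# People, and the current project of those who have one (cf. Figure 1).
people = [{"?x": ":Alice"}, {"?x": ":Bob"}]
current = [{"?x": ":Alice", "?y": "_:p"}]
optional = leftjoin(people, current)
```

Here `optional` keeps Bob with `?y` unbound, which mirrors how OPTIONAL preserves solutions that have no match on the right-hand side.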
Named graphs SPARQL allows for querying multiple RDF
graphs through the notion of a SPARQL dataset, defined as
$D := \{G, (n_1, G_1), \ldots, (n_k, G_k)\}$ where $G, G_1, \ldots, G_k$
are RDF graphs; $n_1, \ldots, n_k$ are pairwise distinct IRIs; $G$
SELECT ?x ?z WHERE {
  { ?x foaf:currentProject ?y . ?y rdfs:label ?z . }
  UNION { ?x foaf:pastProject ?y . ?y rdfs:label ?z . }
  ?x foaf:topic_interest/skos:broader* :SW .
}

$\pi_{?x,?z}\big((\{(?x, \text{foaf:currentProject}, ?y), (?y, \text{rdfs:label}, ?z)\} \cup \{(?x, \text{foaf:pastProject}, ?y), (?y, \text{rdfs:label}, ?z)\}) \bowtie \{(?x, \text{foaf:topic\_interest/skos:broader*}, \text{:SW})\}\big)$

?x       ?z
:Alice   "Motor RDF"@es
:Alice   "RDF Engine"@en
Fig. 4: Example CGP (above) and its evaluation over the
graph of Figure 1 (below)
is known as the default graph; and each pair $(n_i, G_i)$ (for
$1 \le i \le k$) is known as a named graph. Letting $N', N''$ denote
sets of IRIs, $n', n''$ IRIs and $v$ a variable, SPARQL then
provides a number of features for querying different graphs:
– FROM $N'$ FROM NAMED $N''$: activates a dataset with a default
graph composed of the merge of all graphs $G'$ such
that $(n', G') \in D$ and $n' \in N'$, and the set of all named
graphs $(n'', G'') \in D$ such that $n'' \in N''$;
– GRAPH $n'$: evaluates a graph pattern on the graph $G'$ if
the named graph $(n', G')$ is active;
– GRAPH $v$: takes the union of the evaluation of a graph
pattern over each $G'$ such that $(n', G')$ is active, binding
$v$ to $n'$ for each solution generated from $G'$.
Without FROM or FROM NAMED, the active dataset is the indexed
dataset $D$. Without GRAPH, graph patterns are evaluated
on the active default graph. Quad stores disallow empty
named graphs, such that $D := \{G, (n_1, G_1), \ldots, (n_k, G_k)\}$
is viewed as $D = G \times \{\star\} \cup (\bigcup_{(n_i, G_i) \in D} G_i \times \{n_i\})$, i.e., a
set of quads using $\star \notin \mathbf{IBL}$ as a special symbol for the default
graph. In this case, a quad $(s, p, o, n)$ denotes a triple
$(s, p, o)$ in the default graph if $n = \star$, or a triple in the named
graph $G'$ such that $(n, G') \in D$ if $n \in \mathbf{I}$. We can define
CGPs involving quad patterns analogously.
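The quad-based view of a dataset can be sketched as follows. The constant `DEFAULT` is a hypothetical stand-in for the special default-graph symbol (any value that is not an RDF term would do), and the function names are assumptions of this sketch rather than part of SPARQL or any store.

```python
DEFAULT = "*"  # stand-in for the special default-graph symbol (sketch assumption)


def to_quads(default_graph, named_graphs):
    """Flatten a dataset into a single set of (s, p, o, graph) quads."""
    quads = {(s, p, o, DEFAULT) for (s, p, o) in default_graph}
    for name, graph in named_graphs.items():
        quads |= {(s, p, o, name) for (s, p, o) in graph}
    return quads


def graph_of(quads, name):
    """Recover the triples of one graph (pass DEFAULT for the default graph)."""
    return {(s, p, o) for (s, p, o, n) in quads if n == name}


def graph_names(quads):
    """Names of the named graphs that are visible in the quad set."""
    return {n for (_, _, _, n) in quads if n != DEFAULT}


default_graph = {(":Alice", "foaf:knows", ":Bob")}
named_graphs = {
    ":g1": {(":SW", "skos:broader", ":Web")},
    ":g2": set(),  # an empty named graph
}
quads = to_quads(default_graph, named_graphs)
```

Because `:g2` contributes no quads, it is no longer visible after flattening, which illustrates why a pure quad representation cannot express empty named graphs.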
Other SPARQL features SPARQL supports features beyond
CGPs, which include aggregation (group-by with count, sum,
etc.), solution modifiers (ordering and slicing solutions), bag
semantics (preserving result multiplicity), federation (fetch-
ing solutions from remote services), entailment and more
besides. SPARQL also supports different query types, such
as SELECT, which returns a sequence of solution mappings;
CONSTRUCT, which returns an RDF graph based on the solu-
tion mappings; DESCRIBE, which returns an RDF graph de-
scribing indicated RDF terms; and ASK, which returns true if
some solution mapping is found, or false otherwise.
4 Storage
Data storage refers to how data are represented in mem-
ory. Different storage mechanisms store different elements
of data contiguously in memory, offering trade-offs in terms
of compression and efficient data access. This section re-
views various categories of RDF storage.
4.1 Triple table
A triple table stores an RDF graph $G$ as a single ternary relation.
Figure 1 shows an RDF graph with its triple table on the
right-hand side. One complication when storing triple tables
in relational databases is that such systems assume a column
to have a single type, which may not be true for RDF objects
in particular; a workaround is to store a string encoding of
the terms, though this may complicate their ordering.
Rather than storing full RDF terms in the triple table,
stores may apply dictionary encoding, where RDF terms are
mapped one-to-one with numeric object identifiers (OIDs),
with OIDs being stored in the table and decoded using the
dictionary as needed. Since OIDs consume less memory and
are faster to process than strings, such an approach works
better for queries that involve many intermediate results but
generate few final results; on the other hand, such an ap-
proach suffers when queries are simple and return many re-
sults, or when selective filters are specified that require de-
coding the term before filtering. To find a better trade-off,
some RDF engines (e.g., Jena 2 [241]) only use OIDs for
strings with lengths above a threshold.
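A minimal sketch of dictionary encoding is given below, assuming OIDs are assigned as consecutive integers in order of first appearance; the class name `TermDictionary` and its methods are hypothetical, and real stores typically use more compact or order-preserving schemes.

```python
class TermDictionary:
    """Maps RDF terms one-to-one to integer OIDs and back."""

    def __init__(self):
        self._oid_of = {}    # term -> OID
        self._term_of = []   # OID -> term (the list index is the OID)

    def encode(self, term):
        """Return the OID of a term, assigning a fresh one if needed."""
        oid = self._oid_of.get(term)
        if oid is None:
            oid = len(self._term_of)
            self._oid_of[term] = oid
            self._term_of.append(term)
        return oid

    def decode(self, oid):
        """Return the term for a previously assigned OID."""
        return self._term_of[oid]


triples = [
    (":Alice", "foaf:knows", ":Bob"),
    (":Bob", "foaf:knows", ":Alice"),
]
dictionary = TermDictionary()
# The triple table then stores only integers.
encoded = [tuple(dictionary.encode(term) for term in triple) for triple in triples]
# Decoding is deferred until terms are actually needed, e.g., for final results.
decoded = [tuple(dictionary.decode(oid) for oid in row) for row in encoded]
```

Joins and equality checks can operate on the integers in `encoded` directly, whereas a filter on the string value of a term would first have to call `decode`, which is the trade-off discussed above.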
The most obvious physical storage is to store triples con-
tiguously (row-wise). This allows for quickly retrieving the
full triples that match (e.g.) a given triple pattern. However,
some RDF engines based on relational storage (e.g., Virtu-
oso [69]) rather use (or provide an option for) column-wise
storage, where the values along a column are stored contigu-
ously, often following a particular order. Such column-wise
storage allows for better compression, and for quickly read-
ing many values from a single column.
Triple tables can be straightforwardly extended to quad
tables in order to support SPARQL datasets [69,91].
4.2 Vertical partitioning
The vertical partitioning approach [1] uses a binary relation
for each property $p \in p(G)$ whose tuples encode subject–object
pairs for that property. In Figure 5 we exemplify two
such binary relations. Physical storage can again use OIDs,
row-based or column-based storage, etc.
When compared with triple tables, vertical partitioning
generates relations with fewer rows, and more specific do-
mains for columns (e.g., the object column for foaf:age
rdf:type
Subject   Object
:Alice    foaf:Person
:Bob      foaf:Person
_:p       foaf:Project

foaf:age
Subject   Object
:Alice    26
:Bob      21
Fig. 5: Vertical partitioning for two properties in Figure 1
can be defined as an integer type). However, triple patterns
with variable predicates may require applying a union on all
relations. Also, RDF graphs may have thousands of proper-
ties [233], which may lead to a schema with many relations.
Vertical partitioning can be used to store quads by adding
a Graph column to each table [69,91].
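The following sketch illustrates vertical partitioning on the triples behind Figure 5, along with the union over all relations needed for a triple pattern with a variable predicate. The in-memory dictionary of sets is an assumption made for illustration (actual systems use relational or columnar tables), as are the function names.

```python
from collections import defaultdict

G = {
    (":Alice", "rdf:type", "foaf:Person"),
    (":Bob", "rdf:type", "foaf:Person"),
    ("_:p", "rdf:type", "foaf:Project"),
    (":Alice", "foaf:age", '"26"^^xsd:int'),
    (":Bob", "foaf:age", '"21"^^xsd:int'),
}


def vertical_partition(graph):
    """Build one binary (subject, object) relation per property."""
    relations = defaultdict(set)
    for s, p, o in graph:
        relations[p].add((s, o))
    return dict(relations)


def match_subject(relations, subject):
    """Evaluate the pattern (subject, ?p, ?o): since the predicate is a
    variable, every relation must be scanned and the results unioned."""
    return {
        (p, o)
        for p, pairs in relations.items()
        for (s, o) in pairs
        if s == subject
    }


vp = vertical_partition(G)
```

A pattern with a constant predicate touches only `vp[predicate]`, whereas `match_subject` has to visit one relation per property, which becomes costly for graphs with thousands of properties.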
4.3 Extended vertical partitioning
S2RDF [204] uses extended vertical partitioning based on
semi-join reductions (we recall from Table 3 that a semi-join
$M_1 \ltimes M_2$, aka. FILTER EXISTS, returns the tuples in $M_1$ that
are “joinable” with $M_2$). Letting $x$, $y$, $z$ denote variables and
$p$, $q$ denote RDF terms, then for each property pair $(p, q) \in
p(G) \times p(G)$ such that $p \neq q$, extended vertical partitioning
stores three semi-join reductions:
1. $(x, p, y)(G) \ltimes (y, q, z)(G)$ (O–S),
2. $(x, p, y)(G) \ltimes (x, q, z)(G)$ (S–S),
3. $(x, p, y)(G) \ltimes (z, q, x)(G)$ (S–O).
The semi-join $(x, p, y)(G) \ltimes (z, q, y)(G)$ (O–O) is not stored
as most O–O joins have the same predicate, and thus would
occur in the same relation. In Figure 6 we give an example
of a semi-join reduction for two predicates from the running
example; empty semi-joins are omitted.
In comparison with vertical partitioning, observing that
$(M_1 \ltimes M_2) \bowtie (M_2 \ltimes M_1) \equiv M_1 \bowtie M_2$, we can apply joins
over the corresponding semi-join reductions knowing that
each tuple read from each side will contribute to the join,
thus reducing I/O. The cost involves storing (and updating)
each tuple in up to $3(|p(G)| - 1)$ additional relations; omitting
empty semi-joins can help to mitigate this issue [204].
Extended vertical partitioning also presents complications
for variable predicates, graphs with many properties, etc.
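The three stored semi-join reductions can be sketched on top of a vertically partitioned graph. This is an in-memory illustration of the definitions above, not S2RDF's actual implementation; the function name `semijoin_reductions` and the `(p, q, kind)` keys are assumptions of the sketch.

```python
# Vertical partitioning of the skos: triples of Figure 1.
vp = {
    "skos:broader": {(":SW", ":Web"), (":Web", ":CS"), (":DB", ":CS")},
    "skos:related": {(":SW", ":DB")},
}


def semijoin_reductions(relations):
    """Return the non-empty O-S, S-S and S-O reductions for each pair p != q."""
    reductions = {}
    for p, rel_p in relations.items():
        for q, rel_q in relations.items():
            if p == q:
                continue
            q_subjects = {s for (s, _) in rel_q}
            q_objects = {o for (_, o) in rel_q}
            candidates = {
                # (x, p, y) semi-join (y, q, z): object of p is a subject of q
                "O-S": {(x, y) for (x, y) in rel_p if y in q_subjects},
                # (x, p, y) semi-join (x, q, z): subject of p is a subject of q
                "S-S": {(x, y) for (x, y) in rel_p if x in q_subjects},
                # (x, p, y) semi-join (z, q, x): subject of p is an object of q
                "S-O": {(x, y) for (x, y) in rel_p if x in q_objects},
            }
            for kind, pairs in candidates.items():
                if pairs:  # empty reductions are omitted
                    reductions[(p, q, kind)] = pairs
    return reductions


evp = semijoin_reductions(vp)
```

On this input, `evp` contains exactly the four non-empty reductions shown in Figure 6; for example, the key `("skos:broader", "skos:related", "S-O")` holds the single pair `(":DB", ":CS")`.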
4.4 Property table
Property tables aim to emulate the n-ary relations typical of
relational databases. A property table usually contains one
subject column, and nfurther columns to store objects for
the corresponding properties of the given subject. The sub-
ject column then forms a primary key for the table. The ta-
bles to define can be based on classes, clustering [184], col-
oring [36], etc., to group subjects with common properties.
skos:broader $\ltimes_{\text{S–S}}$ skos:related
Subject   Object
:SW       :Web

skos:broader $\ltimes_{\text{S–O}}$ skos:related
Subject   Object
:DB       :CS

skos:related $\ltimes_{\text{O–S}}$ skos:broader
Subject   Object
:SW       :DB

skos:related $\ltimes_{\text{S–S}}$ skos:broader
Subject   Object
:SW       :DB
Fig. 6: Example semi-join reduction for two properties
foaf:Person
Subject   age   topic   knows    cProj   pProj
:Alice    26    :DB     :Bob     _:p     NULL
:Bob      21    :DB     :Alice   NULL    _:p
Fig. 7: Example property table for people
We provide an example of a property table based on the class
foaf:Person in Figure 7 for the RDF graph of Figure 1.
Property tables can store and retrieve multiple triples
with a given subject as one tuple (e.g., to find people with
age < 30 and interest = :SW) without needing joins. Property
tables often store terms of the same type in the same column,
enabling better compression. Complications arise for
multi-valued (...-to-many) or optional (zero-to-...) properties.
In the example of Figure 1, Alice is also interested in
SW, which does not fit in the cell. Furthermore, Alice has no
past project, and Bob has no current project, leading to nulls.
Changes to the graph may also require re-normalization;
for example, even though each person currently has only
one value for knows, adding that Alice knows another per-
son would require renormalizing the tables. Complications
also arise when considering variable predicates, RDF graphs
with many properties or classes, quads, etc.
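The sketch below builds a class-based property table like that of Figure 7 and makes the two complications visible. Spilling extra values of multi-valued properties into a separate `overflow` list is a hypothetical choice made for this illustration, not a technique attributed to any surveyed system; `None` plays the role of NULL.

```python
G = {
    (":Alice", "rdf:type", "foaf:Person"),
    (":Alice", "foaf:age", '"26"^^xsd:int'),
    (":Alice", "foaf:topic_interest", ":DB"),
    (":Alice", "foaf:topic_interest", ":SW"),
    (":Alice", "foaf:knows", ":Bob"),
    (":Alice", "foaf:currentProject", "_:p"),
    (":Bob", "rdf:type", "foaf:Person"),
    (":Bob", "foaf:age", '"21"^^xsd:int'),
    (":Bob", "foaf:topic_interest", ":DB"),
    (":Bob", "foaf:knows", ":Alice"),
    (":Bob", "foaf:pastProject", "_:p"),
    ("_:p", "rdf:type", "foaf:Project"),
}

COLUMNS = [
    "foaf:age", "foaf:topic_interest", "foaf:knows",
    "foaf:currentProject", "foaf:pastProject",
]


def property_table(graph, cls, columns):
    """One row per instance of cls, one column per property.

    Returns (table, overflow): table maps each subject to its row, and
    overflow lists the (subject, property, value) triples that did not
    fit because the cell already held a value.
    """
    instances = sorted(
        s for (s, p, o) in graph if p == "rdf:type" and o == cls
    )
    table, overflow = {}, []
    for subject in instances:
        row = {}
        for column in columns:
            values = sorted(
                o for (s, p, o) in graph if s == subject and p == column
            )
            row[column] = values[0] if values else None  # None acts as NULL
            overflow += [(subject, column, v) for v in values[1:]]
        table[subject] = row
    return table, overflow


table, overflow = property_table(G, "foaf:Person", COLUMNS)
```

On this input, Alice's interest in `:SW` ends up in `overflow` rather than in her row, while `foaf:pastProject` for Alice and `foaf:currentProject` for Bob are stored as `None`.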
4.5 Graph-based storage
While the previous storage mechanisms rely on relational
storage, graph-based storage is adapted specifically
for the graph-based model of RDF. Key characteristics of
such models that can be exploited for storage include the
adjacency of nodes, the fixed arity of graphs, etc.
Graphs have bounded arity (3 for triples, 4 for quads),
which can be exploited for specialized storage. Engines like
4store [91] and YARS2 [94] build native triple/quad tables,
which differ from relational triple/quad tables in that they
have fixed arity, fixed attributes (S, P, O (, G)), and more general
domains (e.g., the O column can contain any RDF term).
Graphs often feature local repetitions that are compress-
ible with adjacency lists (e.g., Hexastore [238], gStore [263],
s        (s)p                  (sp)o
:Alice   foaf:age              "26"^^xsd:int
         foaf:currentProject   _:p
         foaf:knows            :Bob
         foaf:topic_interest   :DB, :SW
         rdf:type              foaf:Person
:Bob     foaf:age              "21"^^xsd:int
         foaf:knows            :Alice
         foaf:pastProject      _:p
         foaf:topic_interest   :DB
         rdf:type              foaf:Person
...
Fig. 8: Example adjacency list for two subjects with dashed
links indicating index-free adjacency pointers
LE:
Edge Label
(:Alice,:Bob){foaf:knows}
(:Alice,:DB){foaf:topic_interest}
(:Bob,:Alice){foaf:knows}
... ...
LV:
Node Attributes
:Alice {(foaf:age,"26"^^xsd:int)}
:Bob {(foaf:age,"21"^^xsd:int)}
:DB {}
... ...
Fig. 9: Example of the multi-graph representation
SpiderStore [32], Trinity.RDF [258], GRaSS [142]). These
lists are akin to tries, where subject or subject–predicate pre-
fixes are followed by the rest of the triple. Such tries can be
stored row-wise in blocks of triples; or column-wise, where
blocks elements from one column point to blocks of ele-
ments from the next column. Index-free adjacency can en-
able efficient navigation, where terms in the suffix directly
point to the location on disk of their associated prefix. We
refer to Figure 8 for an example. Such structures can also in-
clude inverse edges (e.g., Trinity.RDF [258], GRaSS [142]).
An alternative is to decompose an RDF graph into its constituent components for storage. AMBER [105] uses a multigraph representation where an RDF graph $G$ is decomposed into a set of (non-literal) nodes $V := \mathrm{so}(G) \cap \mathbf{IB}$, a set of edges $E := \{(s, o) \in V \times V \mid \exists p : (s, p, o) \in G\}$, an edge-labeling function of the form $L_E : E \to 2^{\mathbf{I}}$ such that $L_E(s, o) := \{p \mid (s, p, o) \in G\}$, and an attribute-labeling function of the form $L_V : \mathbf{IB} \to 2^{\mathbf{I} \times \mathbf{L}}$ such that $L_V(s) := \{(p, o) \mid (s, p, o) \in G \wedge o \in \mathbf{L}\}$, as seen in Figure 9 (in practice, AMBER uses dictionary-encoding).
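The decomposition above can be sketched in a few lines of Python. This is only an illustration of the definitions of V, E, LE and LV, not AMBER's implementation: the function names are hypothetical, terms are plain strings rather than dictionary-encoded ids, and literals are assumed to be recognizable by a leading double quote.

```python
from collections import defaultdict


def is_literal(term):
    # Assumption of this sketch: literals are written with a leading double quote.
    return term.startswith('"')


def decompose(triples):
    """Split triples into nodes V, edges E, edge labels L_E and node attributes L_V."""
    nodes = set()
    edge_labels = defaultdict(set)  # (s, o) -> set of predicates
    node_attrs = defaultdict(set)   # s -> set of (p, literal) pairs
    for s, p, o in triples:
        nodes.add(s)
        if is_literal(o):
            node_attrs[s].add((p, o))
        else:
            nodes.add(o)
            edge_labels[(s, o)].add(p)
    edges = set(edge_labels)
    return nodes, edges, dict(edge_labels), dict(node_attrs)


triples = [
    (":Alice", "foaf:knows", ":Bob"),
    (":Alice", "foaf:topic_interest", ":DB"),
    (":Alice", "foaf:age", '"26"^^xsd:int'),
    (":Bob", "foaf:knows", ":Alice"),
    (":Bob", "foaf:age", '"21"^^xsd:int'),
]
V, E, L_E, L_V = decompose(triples)
```

Nodes without literal-valued properties (such as :DB here) simply have no entry in `L_V`, which corresponds to the empty attribute set shown for :DB in Figure 9.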
skos:broader
:CS :DB :SW :Web
:CS 0 0 0 0
:DB 1 0 0 0
:SW 0 0 0 1
:Web 1 0 0 0
Fig. 10: Example bit matrix for skos:broader
4.6 Tensor-based storage
Another type of native graph storage uses tensors, viewing a dictionary-encoded RDF graph $G$ with $m = |\mathrm{so}(G)|$ nodes and $n = |\mathrm{p}(G)|$ predicates as an $m \times n \times m$ order-3 tensor $T$ of bits such that $T_{i,j,k} = 1$ if the $i$th node links to the $k$th node with the $j$th property, or $T_{i,j,k} = 0$ otherwise. A
popular variant uses an adjacency matrix per property (e.g.,
BitMat [21], BMatrix [38], QDags [163]), akin to vertical
partitioning, as seen in Figure 10. A third option (consid-
ered, e.g., by MAGiQ [109]) is to encode the full graph as
an adjacency matrix where each cell indicates the property
id connecting the two nodes; this matrix cannot directly rep-
resent pairs of nodes connected by more than one property.
While abstract tensor-based representations may lead to
highly-sparse matrices or tensors, compact data structures
offer compressed representations that support efficient op-
erations [21,109,38,163]. Often such matrices/tensors are
stored in memory, or loaded into memory when needed. Such
representations may also enable query processing techniques
that leverage hardware acceleration, e.g., for processing joins
on GPUs (as we will discuss in Section 6.4).
4.7 Miscellaneous storage
Aside from relational-based and graph-based storage, other
engines have proposed to leverage other forms of storage
as implemented by existing systems. A common example is
the use of NoSQL key-value, tabular or document stores for
distributed storage (see [111,257,249] for more details).
4.8 Discussion
Early works on storing RDF tended to rely on relational stor-
age, which had been subject to decades of developments
and optimizations before the advent of RDF (e.g., [241,1,
69]). Though such an approach still has broad adoption [69],
more recent storage techniques aim to exploit the graph-
based characteristics of RDF – and SPARQL – in order to
develop dedicated storage techniques (e.g., [21,238, 263]),
including those based on tensors/matrices [21,109,38,163].
A recent trend is to leverage NoSQL storage (e.g., [131,177,
25]) in order to distribute the management of RDF data.
5 Indexing
Indexing enables efficient lookup operations on RDF graphs (i.e., $O(1)$ or $O(\log |G|)$ time to return the first result or an empty result). The most common such operation is to find
triples that match a given triple pattern. However, indexes
can also be used to match non-singleton BGPs (with more
than one triple pattern), to match path expressions, etc. We
now discuss indexing techniques proposed for RDF graphs.
5.1 Triple indexes
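The goal of triple indexes is to efficiently find triples matching a triple pattern. Letting $s, p, o$ denote RDF terms and $\mathbf{s}, \mathbf{p}, \mathbf{o}$ variables, there are $2^3 = 8$ abstract patterns: $(\mathbf{s}, \mathbf{p}, \mathbf{o})$, $(\mathbf{s}, \mathbf{p}, o)$, $(\mathbf{s}, p, \mathbf{o})$, $(s, \mathbf{p}, \mathbf{o})$, $(\mathbf{s}, p, o)$, $(s, \mathbf{p}, o)$, $(s, p, \mathbf{o})$ and $(s, p, o)$. Unlike relational databases, where often only the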
primary key of a relation will be indexed by default and fur-
ther indexes must be manually specified, most RDF stores
aim to have a complete index by default, covering all eight
possible triple patterns. However, depending on the type of
storage chosen, this might not always be feasible.
When a storage scheme such as vertical partitioning is used [1], only the four patterns where the predicate is constant can be efficiently supported (by indexing the subject and object columns). If the RDF graph is stored as a (binary) adjacency matrix for each property [21,163], again only constant-predicate patterns can be efficiently supported.
Specialized indexes can be used to quickly evaluate such
patterns, where QDags [163] uses quadtrees: a hierarchi-
cal index structure that recursively divides the matrix into
four sub-matrices; we provide an example quadtree in Figure 11. A similar structure, namely a $k^2$-tree, is used by BMatrix [38].
Otherwise, in triple tables, or similar forms of graph-based storage, all triple patterns can be efficiently supported with triple permutations. Figure 8 illustrates a single SPO permutation. A total of $3! = 6$ permutations are possible and suffice to cover all eight abstract triple patterns if the index structure permits prefix lookups; for example, in an SPO permutation we can efficiently support four abstract triple patterns $(\mathbf{s}, \mathbf{p}, \mathbf{o})$, $(s, \mathbf{p}, \mathbf{o})$, $(s, p, \mathbf{o})$ and $(s, p, o)$ as we require the leftmost terms of the permutation to be filled. In fact, with only $\binom{3}{\lfloor 3/2 \rfloor} = 3$ permutations – e.g., SPO, POS and OSP – we can cover all eight abstract triple patterns. Such index permutations can be implemented using standard data structures such as ISAM files [94], B(+)Trees [168], AVL trees [243], as well as compact data structures, such as adjacency lists [238] (see Figure 8) and tries [185], etc.
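To make the prefix-lookup argument concrete, the following minimal sketch keeps three sorted permutations (SPO, POS, OSP) and answers any of the eight abstract patterns by picking the permutation in which the constants form a prefix. The class and method names are hypothetical, and sorted Python lists with binary search stand in for the B+trees or tries a real engine would use.

```python
from bisect import bisect_left

# Each permutation lists the positions of (s, p, o) in the order they are stored.
ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}


class TripleIndex:
    def __init__(self, triples):
        # One sorted list of re-ordered triples per permutation.
        self.perms = {
            name: sorted(tuple(t[i] for i in order) for t in triples)
            for name, order in ORDERS.items()
        }

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None marks a variable."""
        pattern = (s, p, o)
        for name, order in ORDERS.items():
            key = [pattern[i] for i in order]
            n = sum(1 for k in key if k is not None)
            if any(k is None for k in key[:n]):
                continue  # the constants are not a prefix of this permutation
            prefix = tuple(key[:n])
            rows = self.perms[name]
            results = []
            # Binary search for the first row with the prefix, then scan forward.
            for idx in range(bisect_left(rows, prefix), len(rows)):
                row = rows[idx]
                if row[:n] != prefix:
                    break
                triple = [None, None, None]
                for pos, i in enumerate(order):
                    triple[i] = row[pos]
                results.append(tuple(triple))
            return results
        raise AssertionError("SPO, POS and OSP cover every pattern")


index = TripleIndex([
    (":Alice", "foaf:knows", ":Bob"),
    (":Bob", "foaf:knows", ":Alice"),
    (":Alice", "foaf:topic_interest", ":DB"),
])
```

For example, `index.match(s=":Alice", o=":DB")` cannot use SPO (the predicate is unbound in the middle), but the constants form the prefix (O, S) of OSP.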
Recent works use compact data structures to reduce redundancy for index permutations, and thus the space required for triple indexing. Perego et al. [185] use tries to index multiple permutations, over which they apply cross-compression, whereby the order of the triples given by one permutation is used to compress another permutation. Other approaches remove the need for multiple permutations. RDFCSA [40] and Ring [19] use a compact suffix-array (CSA) such that one permutation suffices to efficiently support all triple patterns. Intuitively speaking, triples can be indexed cyclically in a CSA, such that in an SPO permutation, one can continue from O back to S, thus covering SPO, POS and OSP permutations in one CSA index [40]. The Ring indexing scheme is also bidirectional, where in an SPO permutation, one can move from O forwards to S or backwards to P.

Fig. 11: A quadtree index based on the bit matrix of Figure 10; the root represents the full matrix, while children denote four sub-matrices of the parent; a node is colored black if it contains only 1's, white if it contains only 0's, and gray if it contains both; only gray nodes require children
5.2 Entity-based indexes
Entity-based indexes optimize graph patterns that “center on” a particular entity. BGPs can be reduced to joins over their triple patterns; for example, $\{(\mathbf{x}, p, \mathbf{y}), (\mathbf{y}, q, \mathbf{z})\}(G) = \{(\mathbf{x}, p, \mathbf{y})\}(G) \bowtie \{(\mathbf{y}, q, \mathbf{z})\}(G)$. Star joins are frequently found in BGPs, defined to be a join on a common subject, e.g., $\{(\mathbf{w}, p, \mathbf{x}), (\mathbf{w}, q, \mathbf{y}), (\mathbf{w}, r, \mathbf{z})\}$. Star joins may sometimes also include S–O joins on the common variable, e.g., $\{(\mathbf{w}, p, \mathbf{x}), (\mathbf{w}, q, \mathbf{y}), (\mathbf{z}, r, \mathbf{w})\}$ [142]. Star joins retrieve data surrounding a particular entity (in this case $\mathbf{w}$). Entity-based indexes permit efficient evaluation of such joins.
Property tables can enable efficient star joins so long as
the relevant tables can be found efficiently and there are indexes on the relevant columns (e.g., for $p$, $q$ and/or $r$).
The EAGRE system [261] uses an index for property tables where entities with $n$ properties are encoded in $n$-dimensional space. A space-filling curve (e.g., a Z-order or Hilbert curve) is then used for indexing. Figure 12 illustrates the idea, where four entities are indexed (abbreviating :Alice, :Bob, :Carol, :Dave) with respect to two dimensions (say foaf:age for $x$ and integer-encoded values of foaf:knows for $y$). We show the first-, second- and third-order Hilbert curves from left to right. Letting $d$ denote the number of dimensions, the $n$th-order Hilbert curve assigns an ordinal to $2^{dn}$ regions of the space based on the order in which it visits the region; e.g., starting with region 1 on the bottom left and following the curve, :A is in the region of ordinal 2, 7 and 26, respectively. The space-filling curve thus “flattens” multidimensional data into one dimension (the ordinal), which can be indexed sequentially.

Fig. 12: Space-filling indexing with a Hilbert curve
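As a rough illustration of how a space-filling curve flattens coordinates into one ordinal, the sketch below computes a Z-order (Morton) code by bit interleaving. Note that this is the simpler Z-order alternative mentioned above rather than the Hilbert curve of Figure 12, that ordinals start from 0 here, and that the function name is hypothetical.

```python
def z_order(coords, order):
    """Interleave the low `order` bits of each coordinate into a single ordinal.

    With d = len(coords) dimensions, the result ranges over 2**(d * order) regions.
    """
    d = len(coords)
    code = 0
    for bit in range(order):
        for dim, c in enumerate(coords):
            code |= ((c >> bit) & 1) << (bit * d + dim)
    return code


# A third-order curve over two dimensions covers an 8 x 8 grid of regions.
ordinals = {(x, y): z_order((x, y), 3) for x in range(8) for y in range(8)}
```

Since each region receives a distinct ordinal, the ordinals can be stored in any sequential index (e.g., a sorted array or B+tree), and nearby regions tend to receive nearby ordinals.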
Property tables are complicated by multi-valued properties, missing values, etc. A more flexible approach is to index signatures of entities, which are bit-vectors encoding the property–value pairs of the entity. One such example is the vertex signature tree of gStore [263], which encodes all outgoing $(p, o)$ pairs for a given entity $s$ into a bit vector akin to a Bloom filter, and indexes these bit vectors hierarchically allowing for fast, approximate containment checks that quickly find candidate entities for a subset of such pairs. GRaSS [142] further optimizes for star subgraphs that include both outgoing and incoming edges on entities, where a custom FDD-index allows for efficient retrieval of the subgraphs containing a triple that matches a triple pattern.
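The containment check on entity signatures can be sketched as follows. This is a simplified, Bloom-filter-like encoding with one hash per pair and an arbitrary 64-bit width, both assumptions of this sketch; it does not reproduce gStore's actual encoding or its hierarchical signature tree.

```python
import zlib

BITS = 64  # signature width; an arbitrary choice for this sketch


def pair_bit(p, o):
    # Hash a (property, value) pair to one bit position; crc32 is stable across runs.
    return 1 << (zlib.crc32(f"{p}\x00{o}".encode()) % BITS)


def signature(pairs):
    """OR together the bits of all (property, value) pairs of an entity or query."""
    sig = 0
    for p, o in pairs:
        sig |= pair_bit(p, o)
    return sig


def may_contain(entity_sig, query_sig):
    # True if every bit set in the query is also set for the entity.
    # False positives are possible (hash collisions); false negatives are not.
    return (entity_sig & query_sig) == query_sig


alice = signature([
    ("foaf:age", '"26"^^xsd:int'),
    ("foaf:knows", ":Bob"),
    ("rdf:type", "foaf:Person"),
])
query = signature([("foaf:knows", ":Bob"), ("rdf:type", "foaf:Person")])
```

Entities passing `may_contain` are only candidates and must still be verified against the data, which is why such checks are described as approximate.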
5.3 Property-based indexes
Returning to the star join $\{(\mathbf{w}, p, \mathbf{x}), (\mathbf{w}, q, \mathbf{y}), (\mathbf{w}, r, \mathbf{z})\}$, another way to quickly return candidate bindings for the variable $\mathbf{w}$ is to index nodes according to their adjacent properties; then we can find nodes that have at least the adjacent properties $p, q, r$. Such an approach is used by RDFBroker [212], which defines the signature of a node $s$ as $\Sigma(s) = \{p \mid \exists o : (s, p, o) \in G\}$; for example, the signature of :SW in Figure 1 is $\Sigma(\text{:SW}) = \{\text{skos:broader}, \text{skos:related}\}$ (analogous to characteristic sets proposed later [165]). A property table is then created for each signature. At query time, property tables whose signatures subsume $\{p, q, r\}$ are found using a lattice of signatures. We provide an example in Figure 13 with respect to the RDF graph of Figure 1, where children subsume the signatures of their parent.
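The idea of grouping nodes by signature and retrieving candidates for a star join can be sketched as below. The function names are hypothetical, and a linear scan over the signatures stands in for the lattice traversal described above; the four triples are the skos:broader and skos:related edges mentioned in the running example.

```python
from collections import defaultdict


def node_signatures(triples):
    """Map each subject s to its signature: the set of properties on its outgoing edges."""
    sig = defaultdict(set)
    for s, p, _ in triples:
        sig[s].add(p)
    return {s: frozenset(ps) for s, ps in sig.items()}


def group_by_signature(triples):
    """Group subjects sharing the same signature (one group per property table)."""
    groups = defaultdict(set)
    for s, ps in node_signatures(triples).items():
        groups[ps].add(s)
    return dict(groups)


def candidates(groups, required):
    """Nodes whose signature subsumes the required properties of a star join."""
    required = frozenset(required)
    found = set()
    for sig, nodes in groups.items():
        if required <= sig:
            found |= nodes
    return found


triples = [
    (":SW", "skos:broader", ":Web"),
    (":SW", "skos:related", ":DB"),
    (":DB", "skos:broader", ":CS"),
    (":Web", "skos:broader", ":CS"),
]
groups = group_by_signature(triples)
```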
AxonDB [155] uses extended characteristic sets where each triple $(s, p, o)$ in the RDF graph is indexed with the signatures (i.e., characteristic sets) of its subject and object; i.e., $(\Sigma(s), \Sigma(o))$. Thus the triple (:SW, skos:related, :DB) of Figure 1 would be indexed with the extended characteristic set ({skos:broader, skos:related}, {skos:broader}). The index then allows for efficiently identifying two star joins that are connected by a given property $p$.
{}
{r:t,f:a,f:t,f:k,f:c}   {r:t,f:a,f:t,f:k,f:p}   {r:t,r:l}   {s:b}
{s:b,s:r}
Fig. 13: Lattice of node signatures with abbreviated terms (e.g., s:b denotes skos:broader)
5.4 Path indexes
A path join involves successive S–O joins between triple patterns; e.g., {(w, p, x), (x, q, y), (y, r, z)}, where the start and end nodes (w, z) may be variables or constants. While
path joins have fixed length, navigational graph patterns may
further match arbitrary length paths. A number of indexing
approaches have been proposed to speed up querying paths.
A path can be seen as a string of arbitrary length; e.g., a path {(w, p, x), (x, q, y), (y, r, z)} can be seen as a string wpxqyrz$, where $ indicates the end of the string; alternatively, if intermediate nodes are not of importance, the path could be represented as the string wpqrz$. The Yaanii system [47] builds an index of paths of the form wpxqyrz$ that are clustered according to their template of the form wpqrz$. Paths are then indexed in B+trees, which are partitioned by template. Fletcher et al. [72] also index paths in B+trees, but rather than partition paths, they apply a maximum length k for the paths included. Text indexing techniques can also be applied for paths (viewed as strings). Maharjan et al. [147] and the HPRD system [139] both leverage suffix arrays – a common indexing technique for text – to index paths. The downside of path indexing approaches is that they may index an exponential number of paths; in the case of HPRD, for example, users are thus expected to specify which paths to index [139].
Other path indexes are inspired by prior works for path queries over trees (e.g., for XPath). Bartoň [26] proposes a tree-based index based on preorder and postorder traversal. A preorder traversal starts at the root and traverses children in a depth-first manner from left to right. A postorder traversal starts at the leftmost leaf and traverses all children, from left to right, before moving to the parent. We provide an example preorder and postorder traversal in Figure 14. Given two nodes $m$ and $n$ in the tree, a key property is that $m$ is a descendant of $n$ if and only if $m$ is greater than $n$ for preorder and less than $n$ for postorder. Bartoň [26] uses this property to generate an index on ascending preorder so as to linearize the tree and quickly find descendants based on postorder. To support graphs, Bartoň uses a decomposition of the graph into a forest of trees that are then indexed [26].
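The descendant test based on preorder and postorder numbers can be sketched as follows. The helper names are hypothetical, and the child lists are an assumption reconstructed from the (preorder, postorder) pairs shown in Figure 14.

```python
def number_tree(children, root):
    """Assign 1-based preorder and postorder numbers by depth-first traversal."""
    pre, post = {}, {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        counters["pre"] += 1
        pre[node] = counters["pre"]          # numbered when first reached
        for child in children.get(node, []):
            visit(child)
        counters["post"] += 1
        post[node] = counters["post"]        # numbered after all children

    visit(root)
    return pre, post


def is_descendant(m, n, pre, post):
    # m is below n iff it is visited later in preorder and earlier in postorder.
    return pre[m] > pre[n] and post[m] < post[n]


# skos:narrower tree; children listed from left to right (assumed from Figure 14).
children = {":CS": [":AI", ":DB", ":Web"], ":AI": [":ML", ":KR"], ":Web": [":SW"]}
pre, post = number_tree(children, ":CS")
```

Here `is_descendant(":ML", ":AI", pre, post)` holds since :ML has numbers (3,1) and :AI has (2,3), while :SW with (7,5) is not a descendant of :AI.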
(1,7) :CS
    (2,3) :AI
        (3,1) :ML
        (4,2) :KR
    (5,4) :DB
    (6,6) :Web
        (7,5) :SW
Fig. 14: Preorder and postorder on a skos:narrower tree; e.g., :CS has preorder 1 and postorder 7

Another type of path index, called PLSD, is used in System Π [245] for indexing the transitivity of a single property, optimizing for path queries of the form $(s, p^*, \mathbf{o})$, or $(\mathbf{s}, p^*, o)$, etc. For a given property $p$, each incident (subject or object) node $x$ is assigned a triple of numbers $(i, j, k) \in \mathbb{N}^3$, where $i$ is a unique prime number that identifies the node $x$, $j$ is the least common multiple of the $i$-values of $x$'s parents (i.e., nodes $y$ such that $(y, p, x) \in G$), and $k$ is the least common multiple of the $k$-values of $x$'s parents and the $i$-value of $x$. We provide an example in Figure 15. PLSD can further handle cycles by multiplying the $k$-value of all nodes by the $i$-value of all nodes in its strongly-connected component. Given the $i$-value of a node, the $i$-values of its parents and ancestors can be retrieved by factorizing $j$ and $k/i$ respectively. However, multiplication may give rise to large numbers, where no polynomial time algorithm is known for the factorization of binary numbers.
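The PLSD labeling for an acyclic hierarchy can be sketched as below. This is a sketch under several assumptions: the function names are hypothetical, cycles are not handled, the parent lists are inferred from the j-values shown in Figure 15 (e.g., 15 = 3 × 5), and ancestry is tested by divisibility rather than by the factorization described above. It requires Python 3.9+ for `math.lcm`.

```python
from math import lcm  # available from Python 3.9


def plsd_labels(parents, primes, order):
    """Compute (i, j, k) per node; `order` must list parents before their children."""
    labels = {}
    for x in order:
        i = primes[x]
        ps = parents.get(x, [])
        j = lcm(*(primes[y] for y in ps))         # lcm() of no arguments is 1
        k = lcm(i, *(labels[y][2] for y in ps))   # accumulates all ancestors' primes
        labels[x] = (i, j, k)
    return labels


def is_parent(a, x, labels):
    return labels[x][1] % labels[a][0] == 0


def is_ancestor(a, x, labels):
    return a != x and labels[x][2] % labels[a][0] == 0


primes = {":CS": 2, ":AI": 3, ":DB": 5, ":Web": 7, ":ML": 11, ":DM": 13, ":SW": 17}
parents = {
    ":AI": [":CS"], ":DB": [":CS"], ":Web": [":CS"],
    ":ML": [":AI"], ":DM": [":AI", ":DB"], ":SW": [":DB", ":Web"],
}
labels = plsd_labels(parents, primes, [":CS", ":AI", ":DB", ":Web", ":ML", ":DM", ":SW"])
```

The k-values grow as products of primes, which illustrates why large hierarchies can lead to very large numbers.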
Gubichev et al. [80] use a path index of directed graphs, called FERRARI [210], for each property in an RDF graph. First, a condensed graph is computed by merging nodes of strongly connected components into one “supernode”; adding an artificial root node (if one does not exist), the result is a directed acyclic graph (DAG) that preserves reachability. A spanning tree – a subgraph that includes all nodes and is a tree – of the DAG is computed and labeled with its postorder. All subtrees thus have contiguous identifiers, where the maximum identifies the root; e.g., in Figure 14, the subtree at :AI has the interval [1,3], where 3 identifies the root. Then there exists a (directed) path from $x$ to $y$ if and only if $y$ is in the subtree interval for $x$. Nodes in a DAG may, however, be reachable through paths not in the spanning tree. Hence each node is assigned a set of intervals for nodes that can be reached from it, where overlapping and adjacent intervals are merged; we must now check that $y$ is in one of the intervals of $x$. To improve time and space at the cost of precision, approximate intervals are proposed that merge non-overlapping intervals; e.g., [4,6],[8,9] is merged to [4,9], which can reject reachability for nodes with id less than 4 or greater than 9, but has a 1/6 chance of a false positive for nodes in [4,9], which must be verified separately.
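The interval checks described above can be sketched as follows. The function names are hypothetical, and the approximation shown simply collapses all intervals of a node into one enclosing interval; how FERRARI actually chooses which intervals to merge is not reproduced here.

```python
def merge_exact(intervals):
    """Merge overlapping or adjacent closed integer intervals without losing precision."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def approximate(intervals):
    """Collapse all intervals into a single enclosing interval (may add false positives)."""
    return [(min(lo for lo, _ in intervals), max(hi for _, hi in intervals))]


def covers(intervals, ident):
    """Check whether a postorder identifier falls inside one of the intervals."""
    return any(lo <= ident <= hi for lo, hi in intervals)


exact = merge_exact([(4, 5), (6, 6), (8, 9)])
approx = approximate(exact)
```

With these inputs, identifier 7 is rejected by the exact intervals but accepted by the approximate interval [4,9], so a positive answer from the approximation has to be verified separately, whereas a negative answer (e.g., for identifier 3) is definitive.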
5.5 Join indexes
The results of joins can also be indexed. Groppe et al. [78] proposed to construct $6 \times 2^4 = 96$ indexes for 6 types of non-symmetric joins between two triple patterns (S–S, S–P, S–O, P–P, P–O, O–O). Hash maps are used to cover the $2^4$ permutations of the remaining elements (not considering the join variable). Given the high space cost, some approaches index only frequently encountered joins [55,152].

(2,1,2) :CS
(5,2,10) :DB   (3,2,6) :AI   (7,2,14) :Web
(11,3,66) :ML   (13,15,390) :DM   (17,35,1190) :SW
Fig. 15: PLSD index on an example skos:narrower hierarchy; terms (e.g., :CS) are indexed externally

(:Alice,2)
(_:p,2)   (:CS,1)
...   ...   ...   ...
Fig. 16: Distance-based indexing (GRIN)
5.6 Structural indexes
Another family of indexes – known as structural indexes [141]
– rely on a high-level summary of the RDF graph.
Some structural indexes are based on distance measures.
GRIN [228] divides the graph hierarchically into regions
based on the distance of its nodes to selected centroids. These
regions form a tree, where the non-leaf elements indicate a node $x$ and a distance $d$ referring to all nodes at most $d$ steps from $x$. The root element chooses a node and distance such that all nodes of the graph are covered. Each non-leaf element has two children that capture all nodes of their parent. Each leaf node contains a set of nodes $N$, which induces a subgraph of triples between the nodes of $N$; the leaves can
then be seen as partitioning the RDF graph. We provide an
example in Figure 16 for the RDF graph of Figure 1, where
all nodes are within distance two of :Alice, which are then
divided into two regions: one of distance at most two from
_:p, and another of distance at most one from :CS. The in-
dex can continue dividing the graph into regions, and can
then be used to find subgraphs within a particular distance
from a given node (e.g., a node given in a BGP).
Another type of structural index relies on some notion of a quotient graph [48], where the nodes of a graph $\mathrm{so}(G)$ are partitioned into $\{X_1, \ldots, X_n\}$ pairwise-disjoint sets such that $\bigcup_{i=1}^{n} X_i = \mathrm{so}(G)$. Then edges of the form $(X_i, p, X_j)$ are added if and only if there exists $(x_i, p, x_j) \in G$ such that $x_i \in X_i$ and $x_j \in X_j$. Intuitively, a quotient graph merges nodes from the input graph into “supernodes” while maintaining the input (labeled) edges between the supernodes. We provide an example of a quotient graph in Figure 17 featuring six supernodes. Any partitioning of nodes can form a quotient graph, ranging from a single supernode with all nodes $\mathrm{so}(G)$ and loops for all properties in $\mathrm{p}(G)$, to the graph itself replacing each node $x \in \mathrm{so}(G)$ with the singleton $\{x\}$. If the input graph yields solutions for a BGP, then the quotient graph will also yield solutions (with variables now matching supernodes). For example, taking the BGP of Figure 2, matching foaf:Person to the supernode containing foaf:Person in Figure 17, then the variables ?a and ?b will match the supernode containing :Alice and :Bob, while ?ia and ?ib will match to the supernode containing :CS, :DB, :SW and :Web; while we do not know the exact solutions for the input graph, we know they must correspond to elements of the supernodes matched in the quotient graph.

Supernodes: {foaf:Person, foaf:Project}; {:Alice, :Bob}; {"21"^^xsd:int, "26"^^xsd:int}; {_:p}; {"Motor RDF"@es, "RDF Engine"@en}; {:CS, :DB, :SW, :Web}
Edge labels: foaf:knows, rdf:type, foaf:age, foaf:currentProject, foaf:pastProject, rdfs:label, foaf:topic_interest, skos:broader, skos:related
Fig. 17: Quotient graph with six supernodes
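The quotient construction described above amounts to replacing each node by its block of the partition. A minimal sketch follows, where the function name is hypothetical, supernodes are represented as frozensets, and the partition is assumed to cover every node occurring in the triples.

```python
def quotient_graph(triples, partition):
    """Map every node to its supernode and keep the labeled edges between supernodes."""
    block_of = {}
    for block in partition:
        supernode = frozenset(block)
        for node in block:
            block_of[node] = supernode
    return {(block_of[s], p, block_of[o]) for s, p, o in triples}


triples = [
    (":Alice", "foaf:knows", ":Bob"),
    (":Bob", "foaf:knows", ":Alice"),
    (":Alice", "foaf:topic_interest", ":DB"),
    (":SW", "skos:broader", ":Web"),
]
people = frozenset({":Alice", ":Bob"})
topics = frozenset({":CS", ":DB", ":SW", ":Web"})
Q = quotient_graph(triples, [people, topics])
```

The two foaf:knows triples collapse into a single loop on the supernode of people, which shows how the quotient graph can be much smaller than the input graph.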
DOGMA [41] partitions an RDF graph into subgraphs,
from which a balanced binary tree is computed, where each
parent node contains a quotient-like graph of both its chil-
dren. The (O)SQP approach [225] creates an in-memory in-
dex graph, which is a quotient graph whose partition is de-
fined according to various notions of bisimulation.
SAINT-DB [186] adopts a similar approach, where su-
pernodes are defined directly as a partition of the triples of
the RDF graph, and edges between supernodes are labeled
with the type of join (S–S,P–O, etc.) between them.
5.7 Quad indexes
Most quad indexes follow the triple index scheme [243,94,91,69], extending it to add another element. This yields $2^4 = 16$ abstract index patterns, $4! = 24$ potential permutations, and $\binom{4}{\lfloor 4/2 \rfloor} = 6$ flat (ISAM/B+Tree/AVL tree/trie) permutations or 2 circular (CSA) permutations to efficiently support all abstract quad
patterns. A practical compromise is to maintain a selection
of permutations that cover the most common patterns [69];
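for example, a pattern $(s, p, o, g)$ with a constant graph name $g$ may be uncommon in practice, and could be supported reasonably well by evaluating (e.g.) the corresponding pattern with a variable $\mathbf{g}$ in place of $g$ and filtering on $\mathbf{g} = g$.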
The RIQ system [120] proposes a custom index for quads
called a PV-index for finding (named) graphs that match a
BGP. Each graph is indexed by hashing all seven abstract
patterns on triples with some constant, generating seven pattern vectors for each graph. For example, a triple $(s, p, o)$ in a graph named $g$ will be hashed as $(s, p, o)$, $(s, p, ?)$, $(s, ?, o)$, $(?, p, o)$, $(s, ?, ?)$, $(?, p, ?)$, $(?, ?, o)$, where ? is an arbitrary fixed token, and each result will be added to one of seven pattern vectors for $g$ for that abstract pattern. Basic graph patterns can be encoded likewise, where locality
sensitive hashing is then used to group and retrieve similar
pattern vectors for a given basic graph pattern.
5.8 Miscellaneous Indexing
RDF stores may use legacy systems, such as NoSQL stores,
for indexing. Since such approaches are not tailored to RDF,
and often correspond conceptually to one of the indexing
schemes already discussed, we refer to more dedicated sur-
veys of such topics for further details [111,257, 249]. Other
stores provide specialized indexes for particular types of val-
ues such as spatial or temporal data [232,130]; we do not
discuss such specialized indexes in detail.
5.9 Discussion
While indexing triples or quads is conceptually the most
straightforward approach, a number of systems have shown
positive results with entity- and property-based indexes that
optimize the evaluation of star joins, path indexes that opti-
mize the evaluation of path joins, or structural indexes that
allow for identifying query-relevant regions of the graph.
Different indexing schemes often have different time–space
trade-offs: more comprehensive indexes enable faster queries
at the cost of space and more costly updates.
6 Join Processing
RDF stores employ diverse query processing strategies, but
all require translating logical operators that represent the
query, into “physical operators” that implement algorithms
for efficient evaluation of the operation. The most important
such operators – as we now discuss – are natural joins.
6.1 Pairwise join algorithms
We recall that the evaluation of a BGP $\{t_1, \ldots, t_n\}(G)$ can be rewritten as $t_1(G) \bowtie \ldots \bowtie t_n(G)$, where the evaluation of each triple pattern $t_i$ ($1 \leq i \leq n$) produces a relation of arity $|\mathrm{vars}(t_i)|$. Thus the evaluation of a BGP $B$ produces a relation of arity $|\mathrm{vars}(B)|$. The relational algebra – including joins – can then be used to combine or transform the results of one or more BGPs, giving rise to CGPs. The core of evaluating graph patterns is thus analogous to processing relational joins. The simplest and most well-known such algorithms perform pairwise joins; for example, a pairwise strategy for computing $\{t_1, \ldots, t_n\}(G)$ may evaluate $((t_1(G) \bowtie t_2(G)) \bowtie \ldots) \bowtie t_n(G)$.
Without loss of generality, we assume a join of two graph patterns $P_1(G) \bowtie P_2(G)$, where the join variables are denoted by $V = \{v_1, \ldots, v_n\} = \mathrm{vars}(P_1) \cap \mathrm{vars}(P_2)$. Well-known algorithms for performing pairwise joins include (index) nested-loop joins, where $P_1(G) \bowtie P_2(G)$ is reduced to evaluating $\bigcup_{\mu \in P_1(G)} \{\mu\} \bowtie \mu(P_2)(G)$; hash joins, where each solution $\mu \in P_1(G)$ is indexed by hashing on the key $(\mu(v_1), \ldots, \mu(v_n))$ and thereafter a key is computed likewise for each solution in $P_2(G)$ to probe the index with; and (sort-)merge joins, where $P_1(G)$ and $P_2(G)$ are (sorted if necessary and) read in the same order with respect to $V$, allowing the join to be reduced to a merge step, as in merge sort. Index nested-loop joins tend to perform well when $|P_1(G)| \ll |P_2(G)|$ (assuming that $\mu(P_2)(G)$ can use indexes) since they do not require reading all of $P_2(G)$. Otherwise hash or merge joins can perform well [168]. Pairwise join algorithms are used in many RDF stores (e.g., [93,69,168]).
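As an illustration of the hash join described above, the following sketch joins two lists of solution mappings represented as Python dictionaries. The function name is hypothetical, and the sketch assumes that all mappings on one side bind the same variables (as holds for the results of a triple pattern or BGP).

```python
from collections import defaultdict


def hash_join(left, right):
    """Natural join of two lists of solution mappings (dicts from variable to term)."""
    if not left or not right:
        return []
    # Join variables V: those bound on both sides.
    shared = sorted(set(left[0]) & set(right[0]))
    # Build phase: index the left solutions by their values for the join variables.
    table = defaultdict(list)
    for mu in left:
        table[tuple(mu[v] for v in shared)].append(mu)
    # Probe phase: look up each right solution by the same key.
    results = []
    for nu in right:
        for mu in table.get(tuple(nu[v] for v in shared), []):
            results.append({**mu, **nu})
    return results


knows = [{"?x": ":Alice", "?y": ":Bob"}, {"?x": ":Bob", "?y": ":Alice"}]
interest = [{"?y": ":Bob", "?z": ":DB"}]
joined = hash_join(knows, interest)
```

If no variables are shared, every key is the empty tuple and the function returns the Cartesian product, which matches the semantics of a natural join.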
Techniques to optimize pairwise join algorithms include
sideways information passing [29], which passes data across
different parts of the query, often to filter intermediate re-
sults. Neumann and Weikum [167] propose ubiquitous side-
ways information passing (U-SIP) for computing joins over
RDF, which shares global ranges of values for a given query
variable. U-SIP is implemented differently for different join
types. For merge joins, where data are read in order, a max-
imum value for a variable can be shared across pairwise
joins, allowing individual operators to skip ahead to the cur-
rent maximum. For hash joins, a global domain filter is em-
ployed – consisting of a maximum value, a minimum value,
and Bloom filters – for filtering the results of each variable.
6.2 Multiway joins
Multiway join algorithms exploit the commutativity and associativity of joins to evaluate two or more operands at once. For example, in order to compute $\{t_1, \ldots, t_n\}(G)$, a multiway join algorithm may evaluate $(t_1(G) \bowtie \ldots \bowtie t_k(G)) \bowtie (t_{k+1}(G) \bowtie \ldots \bowtie t_n(G))$ where $k \geq 2$, or it may even simply evaluate everything at once as $(t_1(G) \bowtie \ldots \bowtie t_n(G))$.
Some of the previous storage and indexing schemes we
have seen lend themselves naturally to processing certain
types of multiway joins in an efficient manner. Entity-based
indexes allow for processing star joins efficiently, while path
indexes allow for processing path joins efficiently (see Sec-
tion 5). A BGP can be decomposed into sub-BGPs that can
be evaluated per the corresponding multiway join, with pair-
wise joins being applied across the sub-BGPs; for example: {(w, p, x), (w, q, y), (w, r, z), (x, q, y), (x, r, z)} may be divided into the sub-BGPs {(w, p, x), (w, q, y), (w, r, z)} and {(x, q, y), (x, r, z)}, which are evaluated separately as multiway joins before being themselves joined. Even in the
case of (sorted) triple/quad tables, multiway joins can be ap-
plied taking advantage of the locality of processing, where,
for example, in an SPO index permutation, triples with the
same subject will be grouped together. Similar locality can
be exploited in distributed settings (see, e.g., SMJoin [74]).
6.3 Worst case optimal joins
A new family of join algorithms has arisen due to the AGM bound [22], which puts an upper bound on the number of solutions that can be returned from relational join queries. The result can be adapted straightforwardly to the case of BGPs. Let $B = \{t_1, \ldots, t_n\}$ denote a BGP with $\mathrm{vars}(B) = V$. Now define a fractional edge cover as a mapping $\lambda : B \to \mathbb{R}_{[0,1]}$ that assigns a real value in the interval $[0,1]$ to each triple pattern of $B$ such that for all $v \in V$, it holds that $\sum_{t \in B_v} \lambda(t) \geq 1$, where $B_v$ denotes the set of triple patterns in $B$ that mention $v$. The AGM bound tells us that if $B$ has the fractional edge cover $\lambda$, then for any RDF graph $G$ it holds that $|B(G)| \leq \prod_{i=1}^{n} |t_i(G)|^{\lambda(t_i)}$; this bound is “tight”.
To illustrate the AGM bound, consider the BGP $B = \{t_1, t_2, t_3\}$ from Figure 18. There exists a fractional edge cover $\lambda$ of $B$ such that $\lambda(t_1) = \lambda(t_2) = \lambda(t_3) = \frac{1}{2}$; taking ?a, we have that $B_{?a} = \{t_1, t_3\}$, $\lambda(t_1) + \lambda(t_3) = 1$, and thus ?a is “covered”, and we can verify the same for ?b and ?c. Then the AGM bound is given as the inequality $|B(G)| \leq \prod_{i=1}^{n} |t_i(G)|^{\lambda(t_i)}$. For $G$ the graph in Figure 18, $|t_1(G)| = |t_2(G)| = |t_3(G)| = 5$, and hence $|B(G)| \leq 5^{3/2} \approx 11.18$. In reality, for this graph, $|B(G)| = 5$, thus satisfying the inequality, but there exists a graph where $|B(G)| = \prod_{i=1}^{n} |t_i(G)|^{\lambda(t_i)}$.
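The arithmetic of the example can be checked with a short sketch. The function names and the dictionary-based encoding of the BGP are assumptions made for illustration; only the variables of each triple pattern and the sizes of their results are needed to evaluate the bound for a given cover.

```python
from math import prod


def is_fractional_edge_cover(pattern_vars, weights):
    """Check that the weights of the patterns mentioning each variable sum to at least 1."""
    variables = set().union(*pattern_vars.values())
    return all(
        sum(weights[t] for t, vs in pattern_vars.items() if v in vs) >= 1
        for v in variables
    )


def agm_bound(sizes, weights):
    """Product of |t_i(G)| raised to the weight assigned to t_i."""
    return prod(sizes[t] ** weights[t] for t in sizes)


pattern_vars = {"t1": {"?a", "?b"}, "t2": {"?b", "?c"}, "t3": {"?a", "?c"}}
sizes = {"t1": 5, "t2": 5, "t3": 5}
half = {"t1": 0.5, "t2": 0.5, "t3": 0.5}

bound = agm_bound(sizes, half)                                   # 5 ** 1.5
pairwise = agm_bound({"t1": 5, "t2": 5}, {"t1": 1, "t2": 1})     # sub-BGP {t1, t2}
```

The bound for the full BGP is about 11.18, whereas the sub-BGP {t1, t2} on its own admits up to 25 results.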
Recently, join algorithms have been proposed that can enumerate the results for a BGP $B$ over a graph $G$ in time $O(\mathrm{agm}(B, G))$, where $\mathrm{agm}(B, G)$ denotes the AGM bound of $B$ over $G$. Since such an algorithm must at least spend $O(\mathrm{agm}(B, G))$ time writing the results in the worst case, such algorithms are deemed worst-case optimal (wco) [169]. Though such algorithms were initially proposed in a relational setting [169,230], they have recently been adapted for processing joins over RDF graphs [115,99,163,19]. Note that traditional pairwise join algorithms are not wco. If we try to evaluate $\{t_1, t_2\}(G)$ by pairwise join, for example, in order to later join it with $t_3(G)$, the AGM bound becomes quadratic as $\lambda(t_1) = \lambda(t_2) = 1$, and thus we have the bound $|t_1(G)| \cdot |t_2(G)|$, which exceeds the AGM bound for $B$. This holds for any pairwise join in $B$. Note that $\{t_1, t_2\}(G)$ will indeed produce (25) quadratic results, mapping ?b to :CS and ?a and ?c to $\{\text{:AI}, \text{:DB}, \text{:IR}, \text{:SW}, \text{:Web}\}^2$.

B(G):
?a     ?b    ?c
:DB    :CS   :AI
:DB    :CS   :SW
:IR    :CS   :Web
:SW    :CS   :Web
:Web   :CS   :SW
Fig. 18: Example RDF graph G, BGP B and its evaluation B(G); the IRIs s:b, s:n and s:r abbreviate skos:broader, skos:narrower and skos:related, resp.
Wco join algorithms – including Leapfrog Triejoin (LTJ) [230] – perform a multiway join that resolves a BGP $B$ variable-by-variable rather than pattern-by-pattern. First an ordered sequence of variables is selected; say (?a, ?b, ?c). Then the set of partial solutions $M_{\{?a\}} = \{\mu \mid \mathrm{dm}(\mu) = \{?a\} \text{ and } \mu(B_{?a})(G) \neq \emptyset\}$ is computed for the first variable ?a such that each image of $B_{?a}$ under $\mu \in M_{\{?a\}}$ has some solutions for $G$; e.g., $M_{\{?a\}} = \{\{?a/\text{:DB}\}, \{?a/\text{:IR}\}, \{?a/\text{:SW}\}, \{?a/\text{:Web}\}\}$ in Figure 18, since replacing ?a in $B_{?a}$ with :DB, :IR, :SW or :Web yields a BGP with solutions over $G$. Next we compute $M_{\{?a,?b\}} = \{\mu \cup \mu' \mid \mu \in M_{\{?a\}}, \mathrm{dm}(\mu') = \{?b\} \text{ and } \mu'(\mu(B_{?b}))(G) \neq \emptyset\}$, “eliminating” the next variable ?b. In the example of Figure 18, $M_{\{?a,?b\}} = \{\{?a/\text{:DB}, ?b/\text{:CS}\}, \ldots, \{?a/\text{:Web}, ?b/\text{:CS}\}\}$, where each solution $\mu \in M_{\{?a\}}$ is extended with $\{?b/\text{:CS}\}$. Finally, $M_{\{?a,?b,?c\}}$ is computed analogously, eliminating the last variable, and yielding the five results seen in Figure 18.
To be wco-compliant, the algorithm must always be able to efficiently compute $M_{\{v\}}$, i.e., solutions $\mu$ with $\mathrm{dm}(\mu) = \{v\}$, such that $\mu(B_v)(G) \neq \emptyset$. To compute $M_{\{?a\}}$ in the running example, we need to efficiently intersect all nodes with an outgoing s:b edge and an incoming s:r edge. This is typically addressed by being able to read the results of a triple pattern, in sorted order, for any variable, which enables efficient intersection by allowing the algorithm to seek ahead to the maximum current value of all triple patterns involving a given variable. Jena-LTJ [99], which implements an LTJ-style join
algorithm for SPARQL, enables this by maintaining all six
index permutations over triples, while Ring [19] requires
only one permutation. Wco algorithms often outperform tra-
ditional join algorithms for complex BGPs [115,99].
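The variable-by-variable evaluation order can be sketched as follows. This sketch only illustrates the order of evaluation: it scans the triples linearly instead of seeking in sorted indexes, so it does not offer the worst-case optimal guarantee of LTJ. The function names are hypothetical, each variable is assumed to occur in at least one pattern, and the s:r edges of the example graph are an assumption chosen to be consistent with the results listed for Figure 18, with t3 taken as (?c, s:r, ?a).

```python
def is_var(term):
    return term.startswith("?")


def candidates(pattern, var, mu, triples):
    """Values v for var such that the pattern, under mu and var=v, matches some triple."""
    values = set()
    for triple in triples:
        value, ok = None, True
        for term, actual in zip(pattern, triple):
            if term == var:
                if value is None:
                    value = actual
                elif value != actual:
                    ok = False
            elif is_var(term):
                if term in mu and mu[term] != actual:
                    ok = False  # conflicts with an already-bound variable
            elif term != actual:
                ok = False      # constant does not match
        if ok:
            values.add(value)
    return values


def variable_at_a_time(patterns, triples, var_order):
    """Extend partial solutions one variable at a time by intersecting candidates."""
    def extend(mu, remaining):
        if not remaining:
            yield dict(mu)
            return
        var, rest = remaining[0], remaining[1:]
        relevant = [p for p in patterns if var in p]  # the patterns mentioning var
        options = set.intersection(*(candidates(p, var, mu, triples) for p in relevant))
        for value in sorted(options):
            yield from extend({**mu, var: value}, rest)

    return list(extend({}, list(var_order)))


topics = [":AI", ":DB", ":IR", ":SW", ":Web"]
G = [(t, "s:b", ":CS") for t in topics] + [(":CS", "s:n", t) for t in topics] + [
    (":AI", "s:r", ":DB"), (":SW", "s:r", ":DB"), (":Web", "s:r", ":IR"),
    (":Web", "s:r", ":SW"), (":SW", "s:r", ":Web"),
]
B = [("?a", "s:b", "?b"), ("?b", "s:n", "?c"), ("?c", "s:r", "?a")]
solutions = variable_at_a_time(B, G, ["?a", "?b", "?c"])
```

Unlike the pairwise plan, which first materializes 25 intermediate results for the first two patterns, this evaluation never extends a partial solution for ?a that has no incoming s:r edge.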
6.4 Translations to linear algebra
Per Section 4.6, dictionary-encoded RDF graphs are some-
times represented as a bit tensor, or as a bit matrix for each
property (see Figure 10), etc. Viewed in this light, some
query algebra can then be reduced to linear algebra [156];
for example, joins become matrix/tensor multiplication. To
illustrate, we can multiply the bit (adjacency) matrix from
Figure 10 for skos:broader by itself:
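\[
\begin{bmatrix}
0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix}
0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0
\end{bmatrix}
=
\begin{bmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}
\]
The result indicates the analogous bit matrix for an O–S join
on skos:broader, with :SW (on row 3) connected to :CS
(on column 1), which we would expect per Figure 1.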
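
The product above can be checked with a short sketch of Boolean matrix multiplication. The helper name `bool_matmul` and the nested-list representation are assumptions for illustration; engines in this category would use optimized (often GPU-based) tensor libraries instead.

```python
def bool_matmul(a, b):
    """Boolean matrix product: entry (i, j) is 1 iff a[i][k] and b[k][j] for some k."""
    inner, cols = len(b), len(b[0])
    return [
        [int(any(a[i][k] and b[k][j] for k in range(inner))) for j in range(cols)]
        for i in range(len(a))
    ]

# Bit (adjacency) matrix for skos:broader as given in the text.
BROADER = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]

# Pairs of nodes connected by a path of two skos:broader edges (an O-S join).
TWO_HOPS = bool_matmul(BROADER, BROADER)
```

The only non-zero entry of `TWO_HOPS` is at row 3, column 1 (1-indexed), matching the matrix shown in the text.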
Translating joins into linear algebra enables hardware
acceleration, particularly involving GPUs and HPC architectures,
which can process tensors with high levels of parallelism.
Such an approach is followed by MAGiQ [109],
which represents an RDF graph as a single $n \times n$ matrix $M$,
where $n$ is the number of nodes ($n = |\mathrm{so}(G)|$) and $M_{i,j}$
encodes the id of the property connecting the $i$th node to the
$j$th node (or 0 if no such property exists). One issue with this
representation is that it does not support two nodes being
connected by multiple edges with different labels, and thus a
coordinate list representation can be used instead. Basic graph
patterns with projection are translated into matrix multipli-
cation, scalar multiplication, transposition, etc., which can
be executed on a variety of hardware, including GPUs.
Other engines that translate SPARQL query features into
linear algebra (or other operations within GPUs) include
Wukong(+G) [211,235], TripleID-Q [49], and gSmart [52].
Wukong+G [235] proposes a number of caching, pipelin-
ing, swapping and prefetching techniques in order to reduce
the GPU memory required when processing large graphs
while maintaining efficiency, and also proposes a partition-
ing technique to distribute computation over multiple CPUs
and GPUs. TripleID-Q [49] represents an RDF graph as a
dictionary-encoded triple table that can be loaded into the
GPU in order to search for solutions to individual triple pat-
terns without indexing, but with high degrees of parallelism.
On top of this GPU-based search, join and union operators
are implemented using GPU libraries. gSmart [52] proposes
a variety of optimizations for evaluating basic graph patterns
in such settings, including a multi-way join optimization for
computing star-like joins more efficiently on GPUs, com-
pact representations for sparse matrices, data partitioning to
enable higher degrees of parallelism, and more besides.
6.5 Join reordering
The order of join processing can have a dramatic effect on
computational costs. For Figure 18, if we apply pairwise
joins in the order $(t_1(G) \bowtie t_2(G)) \bowtie t_3(G)$, the first join
$(t_1(G) \bowtie t_2(G))$ yields 25 intermediate results, with 5 final
results produced with the second join. If we instead evaluate
$(t_2(G) \bowtie t_3(G)) \bowtie t_1(G)$, the first join $(t_2(G) \bowtie t_3(G))$
produces only 5 intermediate results, before the second join
produces the 5 final results. The second plan should thus be
more efficient than the first; if considering a graph at larger
scale, the differences may reach orders of magnitude.
A good plan depends not only on the query, but also the
graph. Selecting a good plan thus typically requires some
assumptions or statistics over the graph. As in relational
settings, the most important information relates to cardi-
nalities: how many (distinct) solutions a given pattern re-
turns; and/or selectivity: what percentage of solutions are
kept when restricting variables with constants or filters. Stat-
istics can be used not only to select an ordering for joins,
but also to decide which join algorithm to apply. For example,
given an arbitrary (sub-)BGP $\{t_1, t_2\}$, if we estimate
that $|t_2(G)| \ll |t_1(G)|$, we may prefer to evaluate
$t_2(G) \bowtie t_1(G)$ as an index nested-loop join, rather than
a hash or merge join, to avoid reading $t_1(G)$ in full.
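
To illustrate why an index nested-loop join avoids reading the larger side in full, here is a minimal sketch. The index layout (predicate and subject mapped to objects), the function names, and the toy triples are assumptions for the example and are not taken from any particular engine.

```python
from collections import defaultdict

def build_index(triples):
    """Index triples by (predicate, subject) -> list of objects."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[(p, s)].append(o)
    return index

def index_nested_loop_join(outer_solutions, index, predicate, join_var, new_var):
    """For each (few) outer solution, probe the index instead of scanning all triples."""
    results = []
    for mu in outer_solutions:
        for o in index.get((predicate, mu[join_var]), []):
            results.append({**mu, new_var: o})
    return results

# Hypothetical data: the outer side plays the role of the small t2(G),
# while the index covers the triples matched by the large t1(G).
TRIPLES = [("a", "s:b", "x"), ("b", "s:b", "x"), ("c", "s:b", "y"), ("a", "s:b", "z")]
INDEX = build_index(TRIPLES)
small_side = [{"?s": "a"}]
joined = index_nested_loop_join(small_side, INDEX, "s:b", "?s", "?o")
```

Only the index entries for the single outer solution are touched; a hash or merge join would have to read every triple matched by the larger pattern.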
While cardinality and selectivity estimates can be man-
aged in a similar way to relational database optimizers, a
number of approaches have proposed custom statistics for
RDF. Stocker et al. [215] collect statistics relating to the
number of triples, the number of unique subjects, and for
each predicate, the number of triples and a histogram of as-
sociated objects. RDF-3X [168] uses a set of aggregated in-
dexes, which store the cardinality of all triple patterns with
one or two constants. RDF-3X [168] further stores the exact
cardinality of frequently encountered joins, while character-
istic sets [165] and extended characteristic sets [155] (dis-
cussed in Section 5.3) capture the cardinality of star joins.
Computing and maintaining such statistics incur costs in
terms of space and updates. An alternative is to apply sam-
pling while evaluating the query. Vidal et al. [231] estimate
the cardinality of star joins by evaluating all solutions for
the first pattern of the join, thereafter computing the full so-
lutions of the star pattern for a sample of the initial solutions;
the full cardinality of the star pattern is then estimated from
the samples. Another alternative is to use syntactic heuris-
tics for reordering. Stocker et al. [215] propose heuristics
such as assuming that triple patterns with fewer variables
have lower cardinality, that subject constants are more selec-
tive than objects and predicates, etc. Tsialiamanis et al. [227]
further propose to prioritize rarer joins (such as P–S and P–O
joins), and to consider literals as more selective than IRIs.
Taking into account such heuristics and statistics, the
simplest strategy to try to find a good join ordering is to apply
a greedy metaheuristic [215,155], starting with the triple
pattern $t_1$ estimated to have the lowest cardinality, and joining
it with the triple pattern $t_2$ with the next lowest cardinality;
typically a constraint is added such that $t_n$ ($n > 1$)
should have a variable in common with some triple pattern
in $\{t_1, \ldots, t_{n-1}\}$ to avoid costly Cartesian products. Aside
from considering the cardinality of triple patterns, Meimaris
and Papastefanatos [154] propose distance-based planning,
where pairs of triple patterns with more overlapping nodes
and more similar cardinality estimates have a smaller distance
between them; the query planner then tries to group and join
triple patterns with the smallest distances first in a greedy
manner. Greedy strategies will not, however, always provide
the best ordering corresponding to an optimal plan.
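
The greedy strategy with the shared-variable constraint might be sketched as below. The function names, the tuple encoding of triple patterns, and the cardinality estimates are hypothetical; real planners obtain estimates from the statistics or heuristics discussed above.

```python
def variables(pattern):
    """Variables of a triple pattern encoded as a tuple of strings."""
    return {term for term in pattern if term.startswith("?")}

def greedy_order(patterns, cardinality):
    """Order patterns by estimated cardinality, preferring connected patterns."""
    remaining = sorted(patterns, key=lambda t: cardinality[t])
    ordered = [remaining.pop(0)]
    bound = variables(ordered[0])
    while remaining:
        # Only consider patterns sharing a variable with those already chosen;
        # fall back to a Cartesian product if no such pattern exists.
        connected = [t for t in remaining if variables(t) & bound]
        nxt = (connected or remaining)[0]
        remaining.remove(nxt)
        ordered.append(nxt)
        bound |= variables(nxt)
    return ordered

# Hypothetical patterns and cardinality estimates.
T1 = ("?x", ":p", "?y")
T2 = ("?z", ":r", ":c")
T3 = ("?y", ":q", "?z")
ESTIMATES = {T1: 1, T2: 2, T3: 50}
plan = greedy_order([T1, T2, T3], ESTIMATES)
```

Here `T2` has the second-lowest estimate but shares no variable with `T1`, so the sketch picks `T3` second to avoid a Cartesian product.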
More generally, reordering joins is an optimization prob-
lem, where classical methods from the relational literature
can be leveraged likewise for BGPs, including dynamic pro-
gramming [209] (used, e.g., by [94,168,82]) and simulated
annealing [106] (used, e.g., by [231]). Other metaheuristics
that have been applied for join reordering in BGPs include
genetic algorithms [102] and ant colony systems [101,114].
6.6 Caching
Another possible route for optimization – based on the ob-
servation that queries in practice may feature overlapping or
similar patterns – is to reuse work done previously for other
queries. Specifically, we can consider caching the results of
queries. In order to increase cache hit rates, we can further
try to reuse the results of subqueries, possibly generalizing
them to increase usability. Ideally the cache should store so-
lutions for subqueries that (a) have a high potential to reduce
the cost of future queries; (b) can reduce costs for many fu-
ture queries; (c) do not have a high space overhead; and (d)
will remain valid for a long time. Some of these aims can be
antagonistic; for example, caching solutions for triple pat-
terns satisfies (b) and (c) but not (a), while caching solutions
for complex BGPs satisfies (a) but not (b), (c) or (d).
Lampo et al. [132] propose caching of solutions for star
joins, which may strike a good balance in terms of reduc-
ing costs, being reusable, and not having a high space over-
head (as they share a common variable). Other caching tech-
niques try to increase cache hit rates by detecting similar
(sub)queries. Stuckenschmidt [217] uses a similarity mea-
sure for caching – based on the edit distance between BGPs
– that estimates the amount of computational effort needed
to compute the solutions for one query given the solutions to
the other. Lorey and Naumann [140] propose a technique for
grouping similar queries, which enables a pre-fetching strat-
egy based on predicting what a user might be interested in
based on their initial queries. Another direction is to normal-
ize (sub)queries to increase cache hit rates. Wu et al. [246]
propose various algebraic normalizations in order to iden-
tify common subqueries [140], while Papailiou et al. [179]
generalize subqueries by replacing selective constants with
variables and thereafter canonically labeling variables (mod-
ulo isomorphism) to increase cache hit rates. Addressing dy-
namic data, Martin et al. [150] propose a cache where re-
sults for queries are stored in a relational database but are
invalidated when a triple matching a query pattern changes.
Williams and Weaver [242] add last-updated times to their
RDF index to help invalidate cached data.
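
A pattern-based invalidation policy of the kind described for dynamic data can be sketched as follows. This is only an in-memory illustration under stated assumptions: the class `QueryCache`, its methods, and the dictionary storage are hypothetical (the cited approach stores results in a relational database), and pattern matching here ignores repeated variables, so it may invalidate more entries than strictly necessary.

```python
def matches(pattern, triple):
    """True if each constant of the pattern equals the corresponding triple term."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))

class QueryCache:
    def __init__(self):
        self._entries = {}  # frozenset of triple patterns -> cached solutions

    def put(self, bgp, solutions):
        self._entries[frozenset(bgp)] = solutions

    def get(self, bgp):
        return self._entries.get(frozenset(bgp))

    def on_triple_changed(self, triple):
        """Drop every cached result whose BGP has a pattern matching the triple."""
        stale = [key for key in self._entries
                 if any(matches(pattern, triple) for pattern in key)]
        for key in stale:
            del self._entries[key]

# Hypothetical usage: an update to a :broader triple invalidates only one entry.
cache = QueryCache()
BGP_BROADER = [("?x", ":broader", ":CS")]
BGP_RELATED = [("?x", ":related", "?y")]
cache.put(BGP_BROADER, [{"?x": ":DB"}])
cache.put(BGP_RELATED, [{"?x": ":SW", "?y": ":Web"}])
cache.on_triple_changed((":SW", ":broader", ":CS"))
```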
Given that an arbitrary BGP can produce an exponential
number of results, Zhang et al. [260] propose to cache fre-
quently accessed “hot triples” from the RDF graph in mem-
ory, rather than caching (sub-)query results. This approach
limits the space overhead at the cost of recomputing joins.
6.7 Discussion
Techniques for processing BGPs are often based on tech-
niques for processing relational joins. Beyond standard pair-
wise joins, multiway joins can help to emulate some of the
benefits of property table storage by evaluating star joins
more efficiently. Another recent and promising approach is
to apply wco join algorithms whose runtime is bounded the-
oretically by the number of results that the BGP could gen-
erate. More and more attention has also been dedicated to
computing joins in GPUs by translating relational algebra
(e.g., joins) into linear algebra (e.g., matrix multiplication).
Aside from specific algorithms, the order in which joins are
processed can have a dramatic effect on runtimes. Statis-
tics about the RDF graph help to find a good ordering at
the cost of computing and maintaining those statistics; more
lightweight alternatives include runtime sampling, or syn-
tactic heuristics that consider only the query. To decide the
ordering, options range from simple greedy strategies to com-
plex metaheuristics; while simpler strategies have lower plan-
ning times, more complex strategies may find more efficient
plans. Another optimization is to cache results across BGPs,
for which a time–space trade-off must be considered.
7 Query Processing
While we have defined RDF stores as engines capable of
storing, indexing and processing joins over RDF graphs, SP-
ARQL engines support various features beyond joins. We
describe techniques for efficiently evaluating such features,
including the relational algebra (beyond joins) and property
paths. We further include some general extensions proposed
for SPARQL to support recursion and analytics.
7.1 Relational algebra (beyond joins)
Complex (navigational) graph patterns (CGPs) introduce
additional relational operators beyond joins.
Like in relational databases, algebraic rewriting rules
can be applied over CGPs in SPARQL to derive equivalent
but more efficient plans. Schmidt et al. [207] present a set of
such rules for SPARQL under set semantics, such as:
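\[
\begin{aligned}
\sigma_{R_1 \vee R_2}(M) &\equiv \sigma_{R_1}(M) \cup \sigma_{R_2}(M) \\
\sigma_{R_1 \wedge R_2}(M) &\equiv \sigma_{R_1}(\sigma_{R_2}(M)) \\
\sigma_{R_1}(\sigma_{R_2}(M)) &\equiv \sigma_{R_2}(\sigma_{R_1}(M)) \\
\sigma_{R}(M_1 \cup M_2) &\equiv \sigma_{R}(M_1) \cup \sigma_{R}(M_2) \\
\sigma_{R}(M_1^* \bowtie M_2) &\equiv \sigma_{R}(M_1^*) \bowtie M_2 \\
\sigma_{R}(M_1^* \mathbin{⟕} M_2) &\equiv \sigma_{R}(M_1^*) \mathbin{⟕} M_2 \\
\sigma_{R}(M_1^* \triangleright M_2) &\equiv \sigma_{R}(M_1^*) \triangleright M_2
\end{aligned}
\]
where for each $\mu \in M_1^*$, it holds that $\mathrm{vars}(R) \subseteq \mathrm{dm}(\mu)$.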
The first two rules split filters, meaning that they can be
pushed further down in a query in order to reduce intermedi-
ary results. The third rule allows the order in which filters are
applied to be swapped. Finally the latter four rules describe
how filters can be pushed “down” inside various operators.
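
As a small sanity check of pushing a filter inside a join, the following sketch evaluates a filter before and after a join over toy solution mappings. The helper names, the dictionary encoding of mappings, and the data are assumptions for illustration; the filter only mentions a variable bound in every mapping on the left, as the side condition on rewriting requires.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on their shared variables."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)

def join(left, right):
    return [{**m1, **m2} for m1 in left for m2 in right if compatible(m1, m2)]

def select(condition, mappings):
    return [m for m in mappings if condition(m)]

# Hypothetical solution mappings.
M1 = [{"?x": 1, "?y": 10}, {"?x": 2, "?y": 20}]
M2 = [{"?y": 10, "?z": "a"}, {"?y": 20, "?z": "b"}]

def cond(m):
    return m["?x"] > 1  # uses only ?x, which every mapping in M1 binds

filter_late = select(cond, join(M1, M2))   # filter applied after the join
filter_early = join(select(cond, M1), M2)  # filter pushed into the left operand
```

Both plans yield the same solutions, but the second joins only one mapping from `M1` rather than two.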
Another important feature for querying RDF graphs
is the optional ($⟕$), which facilitates returning partial solutions
over incomplete data. Given that an optional can be
used to emulate a form of negation (in Table 3 it is de-
fined using an anti-join), it can lead to jumps in computa-
tional complexity [183]. Works have thus studied a fragment
called well-designed patterns, which forbid using a variable
on the right of an optional that does not appear on the left
but does appear elsewhere in the query; for example,
the CGP ({(x, p, y)} OPTIONAL {(x, q, z)}) . {(x, r, z)}
is not well designed as the variable z appears on the right of
an OPTIONAL and not on the left, but does appear elsewhere
in the query. Such variables may or may not be left unbound
after the left outer join is evaluated, which leads to com-
plications if they are used outside the optional clause. Most
SPARQL queries using optionals in practice are indeed well-designed,
and rewriting rules have been proposed specifically
to optimize such queries [183,137].
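
The well-designedness condition can be tested mechanically, as in the following sketch. The nested-tuple encoding of patterns (`"BGP"`, `"AND"`, `"OPT"`), the function names, and the `?` prefix on variables are assumptions for the example, and only conjunction and OPTIONAL are handled.

```python
def vars_of(pattern):
    """All variables occurring in a pattern."""
    if pattern[0] == "BGP":
        return {t for triple in pattern[1] for t in triple if t.startswith("?")}
    return vars_of(pattern[1]) | vars_of(pattern[2])

def is_well_designed(pattern, outside=frozenset()):
    """Check the condition for every OPTIONAL; `outside` holds the variables
    that occur elsewhere in the query, outside the current sub-pattern."""
    if pattern[0] == "BGP":
        return True
    left, right = pattern[1], pattern[2]
    if pattern[0] == "OPT":
        # Variables on the right that occur elsewhere but not on the left.
        if (vars_of(right) & outside) - vars_of(left):
            return False
    return (is_well_designed(left, outside | vars_of(right))
            and is_well_designed(right, outside | vars_of(left)))

# The example from the text: z is on the right of the OPTIONAL, not on the left,
# and also appears in the final triple pattern.
OPT_PART = ("OPT", ("BGP", [("?x", "p", "?y")]), ("BGP", [("?x", "q", "?z")]))
NOT_WD = ("AND", OPT_PART, ("BGP", [("?x", "r", "?z")]))
# A variant that only reuses variables from the left of the OPTIONAL.
WD = ("AND", OPT_PART, ("BGP", [("?x", "r", "?y")]))
```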
7.2 Property paths
Navigational graph patterns (NGPs) extend BGPs with prop-
erty paths, which are extensions of (2)RPQs that allow for
matching paths of arbitrary length in the graph.
Some approaches evaluate property paths using graph
search algorithms. Though not part of SPARQL, Gubichev
and Neumann [81] implement single-source shortest paths
by applying Dijkstra’s search algorithm over B-Trees. Baier
et al. [23] propose to use the A* search algorithm, where
search is guided by a heuristic that measures the minimum
distance from the current node to completing a path.
Extending RDF-3X, Gubichev et al. [80] build a FER-
RARI index [210] (see Section 5.4) for each property :p in
the graph that forms a directed path of length at least 2. The
indexes are used to evaluate paths :p* or :p+. Paths of the
form (:p/:q)*, (:p|:q)*, etc., are not directly supported.
Koschmieder and Leser [127], and Nguyen and Kim [170]
optimize property paths by splitting them according to “rare
labels”: given a property path :p*/:q/:r*, if :q has few
triples in the graph, the path can be split into :p*/:q (evaluated
right-to-left) and :q/:r* (evaluated left-to-right), subsequently
joining the results. Splitting paths can enable parallelism:
Miura et al. [158] evaluate such splits on field-programmable
gate arrays (FPGAs), enabling hardware acceleration.
Wadhwa et al. [234] instead use bidirectional random
walks from candidate endpoints on both sides of the path,
returning solutions when walks from each side coincide.
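
The rare-label idea for a path of the form :p*/:q/:r* can be sketched by starting from the (few) :q triples and expanding backwards over :p and forwards over :r. This is an illustrative simplification under stated assumptions, not the algorithm of the cited works: the function names and toy triples are hypothetical, and the two halves are combined directly per :q triple rather than via a separate join.

```python
from collections import defaultdict

def reachable(adjacency, start):
    """Nodes reachable from start via zero or more edges (includes start)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def eval_split_path(triples, p, q, r):
    """Evaluate p*/q/r* by anchoring on the triples of the rare label q."""
    backward_p, forward_r, rare = defaultdict(set), defaultdict(set), []
    for s, pred, o in triples:
        if pred == p:
            backward_p[o].add(s)  # traverse p edges right-to-left
        if pred == r:
            forward_r[s].add(o)   # traverse r edges left-to-right
        if pred == q:
            rare.append((s, o))
    results = set()
    for s, o in rare:
        for x in reachable(backward_p, s):
            for z in reachable(forward_r, o):
                results.add((x, z))
    return results

# Hypothetical graph with a single :q triple.
TOY = [("a", ":p", "b"), ("b", ":p", "c"), ("c", ":q", "d"),
       ("d", ":r", "e"), ("x", ":p", "y")]
pairs = eval_split_path(TOY, ":p", ":q", ":r")
```

The unrelated :p edge from `x` to `y` is never explored, since the search only starts from endpoints of :q triples.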
Another way to support property paths is to use recursive
queries. Stuckenschmidt et al. [218] evaluate property paths
such as :p+ using recursive nested-loop and hash joins. Dey
et al. [63], Yakovets et al. [252] and Jachiet et al. [108] pro-
pose translations of more general property paths (or RPQs)
to extensions of the relational algebra with recursive or tran-
sitive operators. Paths can be evaluated by SQL engines us-
ing WITH RECURSIVE; however Yakovets et al. [252] note
that highly nested SQL queries may result, and that popular
relational database engines cannot (efficiently) detect cycles.
Dey et al. [63] alternatively explore the evaluation of
RPQs via translations to recursive Datalog.
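
A recursive evaluation of a path such as :p+ can be sketched as a fixpoint computation in the style of semi-naive evaluation. The function name and the pair-based edge encoding are assumptions for illustration; tracking already-derived pairs in a set is what keeps the loop from running forever on cyclic data.

```python
def transitive_closure(edges):
    """All pairs (s, o) connected by a path of one or more edges."""
    successors = {}
    for s, o in edges:
        successors.setdefault(s, set()).add(o)
    closure = set(edges)
    delta = set(edges)  # pairs derived in the previous round
    while delta:
        # Extend only the newly derived pairs by one more edge.
        derived = {(s, o2) for s, o in delta for o2 in successors.get(o, ())}
        delta = derived - closure
        closure |= delta
    return closure

# Hypothetical :p edges, including a cycle a -> b -> c -> a.
chain = transitive_closure({(1, 2), (2, 3)})
cycle = transitive_closure({("a", "b"), ("b", "c"), ("c", "a")})
```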
In later work, Yakovets et al. [253] propose Waveguide,
which first converts the property path into a parse tree, from
which plans can be built based on finite automata (FA), or
relational algebra with transitive closure (α-RA, where α
denotes transitive closure). Figure 19 gives an example of a
parse tree and both types of plans. Although there is overlap,
FA can express physical plans that α-RA cannot, and vice
versa. For example, in FA we can express non-deterministic
transitions (see $q_0$ in Figure 19), while in α-RA we can materialize
(cache) a particular relation in order to apply transitive
closure over it. Waveguide then uses hybrid waveplans,
where breadth-first search is guided in a similar manner to
FA, but where the results of an FA can be memoized (cached)
and reused multiple times like in α-RA.
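
An automaton-guided breadth-first search for a property path can be sketched as below. This is not Waveguide itself: the function name, the graph encoding, and in particular the two-state automaton assumed here for s:n/(s:r|s:n)* (q0 moves to q1 on s:n; q1 loops on s:r and s:n and is accepting) are illustrative assumptions, and no memoization of sub-results is attempted.

```python
from collections import deque

def eval_path(graph, start, transitions, initial, accepting):
    """BFS over (node, state) pairs; returns nodes reached in an accepting state."""
    seen = {(start, initial)}
    queue = deque(seen)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            answers.add(node)
        for label, target in graph.get(node, ()):
            # Follow every automaton transition enabled by this edge label.
            for next_state in transitions.get((state, label), ()):
                if (target, next_state) not in seen:
                    seen.add((target, next_state))
                    queue.append((target, next_state))
    return answers

# Assumed automaton for s:n/(s:r|s:n)*.
TRANSITIONS = {("q0", "s:n"): {"q1"}, ("q1", "s:r"): {"q1"}, ("q1", "s:n"): {"q1"}}

# Hypothetical graph: node -> list of (label, target) edges.
GRAPH = {"A": [("s:n", "B")], "B": [("s:r", "C"), ("s:x", "D")], "C": [("s:n", "A")]}
matched = eval_path(GRAPH, "A", TRANSITIONS, "q0", {"q1"})
```

Visited (node, state) pairs are recorded so that cycles in the graph, such as the one through `A`, `B` and `C`, are traversed only once.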
Evaluating complex property paths can be costly, but
property paths in practice are often quite simple. Martens
and Trautner [149] propose a class of RPQs called simple
transitive expressions (STEs) that are found to cover 99.99%
of the queries found in Wikidata SPARQL logs, and have desirable
theoretical properties. Specifically, they define atomic
expressions of the form $p_1 | \ldots | p_n$, where $p_1, \ldots, p_n$ are
IRIs and $n \geq 0$; and also bounded expressions of the form
$a_1 / \ldots / a_k$ or $a_1? / \ldots / a_k?$, where $a_1, \ldots, a_k$ are atomic
expressions and $k \geq 0$. Then an expression of the form
$b_1 / a^* / b_2$ is a simple transitive expression (STE), where $b_1$
and $b_2$ are bounded expressions, and $a$ is an atomic expression.
They then show that simple paths for STEs can be enumerated
more efficiently than arbitrary RPQs.
7.3 Recursion
Property paths offer a limited form of recursion. While ex-
tended forms of property paths have been proposed to in-
clude (for example) path intersection and difference [71],
more general extensions of SPARQL have also been pro-
posed to support graph-based and relation-based recursion.
Reutter et al. [194] propose to extend SPARQL with
graph-based recursion, where a temporary RDF graph is built
by recursively adding triples produced through CONSTRUCT
queries over the base graph and the temporary graph up
to a fixpoint; a SELECT query can then be evaluated over
both graphs. The authors discuss how key features (includ-
ing property paths) can then be supported through linear
recursion, meaning that each new triple only needs to be
joined with the base graph, not the temporary graph, to pro-
duce further triples, leading to better performance. Corby et
al. [59] propose LD-Script: a SPARQL-based scripting lan-
guage supporting various features, including for-loops that
can iterate over the triples returned by a CONSTRUCT query.
Hogan et al. [98] propose SPARQAL: a lightweight lan-
guage that supports relation-based (i.e., SELECT-based) re-
cursion over SPARQL. The results of a SELECT query can
be stored as a variable, and injected into a future query. Do–
until loops can be called until a particular condition is met,
thus enabling recursion over SELECT queries.
7.4 Analytics
SPARQL engines often focus on transactional (OLTP) work-
loads involving selective queries that are efficiently solved
through lookups on indexes. Recently, however, a number
of approaches have looked at addressing analytical (OLAP)
workloads for computing slices, aggregations, etc. [45].
One approach is to rewrite SPARQL queries to languages
executable in processing environments suitable for analyti-
cal workloads, including PigLatin (e.g., PigSPARQL [202],
RAPID+ [193]), Hadoop (e.g., Sempala [203]), Spark (e.g.,
S2RDF [204]), etc. Such frameworks are better able to han-
dle analytical (OLAP) workloads, but not all SPARQL fea-
tures are easily supported on existing distributed frameworks.
Conversely, one can also translate from analytical lan-
guages to SPARQL queries, allowing for in-database ana-
lytics, where analytical workloads are translated into queries
run by the SPARQL engine/database. Papadaki et al. [176]
propose the high-level functional query language HIFUN for
[Figure 19 panels: Property Path (?x, s:n/(s:r|s:n)*, ?z); PT; FA; α-RA; WP′; WP]
Fig. 19: An example property path with its parse tree (PT) and three plans based on finite automata (F