Efficient Linked-List RDF Indexing in
Parliament
Dave Kolas, Ian Emmons, and Mike Dean
BBN Technologies, Arlington, VA 22209, USA
{dkolas,iemmons,mdean}@bbn.com
Abstract. As the number and scale of Semantic Web applications in
use increases, so does the need to efficiently store and retrieve RDF
data. Current published schemes for RDF data management either fail
to embrace the schema flexibility inherent in RDF or make restrictive as-
sumptions about application usage models. This paper describes a stor-
age and indexing scheme based on linked lists and memory-mapped files,
and presents theoretical and empirical analysis of its strengths and weak-
nesses versus other techniques. This scheme is currently used in Parlia-
ment (formerly DAML DB), a triple store with rule support that has
recently been released as open source.
1 Introduction
As the number and scale of Semantic Web applications in use increases, so does
the need to efficiently store and retrieve RDF [1] data. A wide variety of RDF and
OWL [2] applications are currently being developed, and each application’s sce-
nario may demand prioritization of one performance metric or another. Current
published schemes for RDF data management either fail to embrace the schema
flexibility inherent in RDF or make restrictive assumptions about application
usage models.
Despite the fact that RDF’s graph-based data model is inherently different
than relational data models, many published schemes for RDF data storage
involve reductions to a traditional RDBMS [3–7]. This results in the deficiencies
of RDBMS’s (inflexible schemas, inability to efficiently query variable predicates)
being propagated to RDF storage; arguably, avoiding these deficiencies is one of
the major reasons for adopting an RDF data model. Other published approaches
eschew the mapping to an RDBMS, but suffer either inadequate load or query
performance for many applications. In this paper, we argue that the storage
approach in Parliament provides excellent load and query performance with low
space consumption and avoids the pitfalls of many other specialized RDF storage
systems.
Parliament [8] (formerly DAML-DB [9]) is a triple store developed by BBN
that has been in use since 2001. During that time, Parliament has been used for
a number of applications from basic research to production. We have found that
it offers an excellent tradeoff between load and query performance, and compares
favorably to commercial RDF data management systems [10].
Recently, BBN has decided to release Parliament as an open source project.
Parliament provides the underlying storage mechanism, while using Jena [11] or
Sesame [12] as an external API. This paper explains in detail the underlying
index structure of Parliament, and compares it to other published approaches.
Our hope is that open-sourced Parliament will provide a fast storage alternative
for RDF applications, create a platform upon which storage mechanism and
query optimizer research can be built, and generally advance the state of the art
in RDF data management.
The remainder of this paper is structured as follows. Section 2 addresses
related work. Section 3 describes the index structure within Parliament. Sec-
tion 4 explains how the operations on the structure are performed. Section 5
provides both worst case and average case analysis of the indexing mechanism,
and Section 6 provides a small empirical comparison to supplement [10].
2 Related Work
The related work on RDF data management systems falls into two major cate-
gories: solutions that involve a mapping to a relational database, and those that
do not.
2.1 RDBMS Based Approaches
A large proportion of the previously published approaches involve a mapping of
the RDF data model into some form of relational storage. These include triples-
table approaches, property tables, and vertical partitioning. There is a strong
temptation to use relational systems to store RDF data since such a great amount
of research has been done on making relational systems efficient. Moreover,
existing RDBMSs are extremely scalable and robust. Unfortunately, each
of the proposed ways of doing this mapping has deficiencies.
The triples-table approach has been employed in 3store [3], and is perhaps
the most straightforward mapping of RDF into a relational database system.
Each triple given by (s, p, o) is added to one large table of triples with a column
for the subject, predicate, and object respectively. Indexes are then added for
each of the columns. While this approach is straightforward to implement, it
is not particularly efficient, as noted in later work [4, 5, 13, 14, 10]. The primary
problem is that queries with multiple triple patterns require self-joins on this
one large table, which are expensive.
Property tables were introduced later, and allowed multiple triple patterns
referencing the same subject to be retrieved without an expensive join. This
approach has been used in Jena 2 [4]. A similar approach is used in [6]. In this
approach, each database table includes a column for a subject and several fixed
properties. The intent is that these properties often appear together on the same
subject. While this approach does eliminate many of the expensive self-joins in
a triples table, it still has deficiencies leading to limited scalability. Queries with
triple patterns that span multiple property tables are still expensive. Depending
on the level of correlation between the properties chosen for a particular property
table, the table may be very sparse and thus be less space-efficient than other
approaches. Also, it may be complex to determine which sets of properties are
best joined within the same property table. Multi-valued properties are problem-
atic in this approach as well. Furthermore, queries with unbound variables in the
property position are very inefficient and may require dynamic table creation.
In a data model without a fixed schema, it is common to ask for all present
properties for a particular subject. In the property table approach, this type of
query requires scanning all tables. With property tables, adding new properties
also requires adding new tables, a consideration for applications dealing with
arbitrary RDF content. It is the flexibility in schema that differentiates RDF
from relational approaches, and thus this approach limits the benefit of using
RDF.
The vertical partitioning approach suggested in [5] may be viewed as a spe-
cialization of the property table approach, where each property table supports
exactly one property. This approach has several advantages over the general
property table approach. It better supports multi-valued properties, which are
common in Semantic Web data, and does not waste the space taken by NULL's
in a sparsely populated property table. It also does not require the property-
clustering algorithms for the general property tables. However, like the property
table approach, it fails to efficiently handle queries with variables in the property
position.
2.2 Other Indexing Approaches
The other primary approaches to RDF data storage eliminate the need for a
standard RDBMS and focus instead on indexing specific to the RDF data model.
This set of approaches tends to better address the query models of the Semantic
Web, but each suffers its own set of weaknesses.
The RDF store YARS [15] uses six B+ tree indices to store RDF quads
of a subject, predicate, object, and a “context”. In each B+ tree, the key is
a concatenation of the subject, predicate, object, and context, each dictionary
encoded. This allows fast lookup of all possible triple access patterns. Unlike the
RDBMS approaches discussed above, this method does not place any particu-
lar preference on the subject, predicate, or object, meaning that queries with
variable predicates are no different than those with variable subjects or objects.
This structure sacrifices space for query performance, repeating each dictionary
encoded triple six times. The design also favors query performance over insertion
speed, a tradeoff not necessarily appropriate for all Semantic Web applications.
Our approach is more efficient both in insertion time and space usage, as will
be demonstrated. Other commercial applications use this method as well [16].
Kowari [14] is designed similarly, but uses a hybrid of AVL and B trees instead
of B+ trees for indexing.
The commercial quad store Virtuoso [7] adds a graph element g to each triple,
and conceptually stores the quads in a triples table expanded by one column.
While technically rooted in an RDBMS, it closely follows the model of YARS [15],
but with fewer indices. The quads are stored in two covering indices, g, s, p, o
and o, g, p, s, where the IRI’s are dictionary encoded. Several further optimiza-
tions are added, including bitmap indexing and inlining of short literal values.
Thus this approach, like YARS, avoids the pitfalls of other RDBMS-based work,
and in particular supports efficient variable-predicate queries. Using fewer
indices tips the balance slightly away from query performance and towards
insertion performance, though the design still favors queries.
Hexastore [13], one of the most recently published approaches, takes a similar
approach to YARS. While it also uses the dictionary encoding of resources, it
uses a series of sorted pointer lists instead of B+ trees of concatenated keys.
Again, this better supports the usage pattern of Semantic Web applications and
does not force them into an RDBMS query model. Hexastore not only provides
efficient single triple pattern lookups as in YARS, but also allows fast merge-
joins for any pair of two triple patterns. Again, however, it suffers a five-fold
increase in space for storing statements over a dictionary encoded triples table,
and favors query performance over insertion times. It is our experience that
applications often do require efficient statement insertion, and thus our approach
seeks to balance query performance and insertion time. Since this approach was
published most recently and compares favorably to previous approaches, we will
focus our empirical comparison evaluation on Hexastore.
Other commercial triple stores such as OWLIM [17] have been empirically
shown to perform well, but their indexing structure is proprietary and thus no
theoretical comparison can be made.
3 Index Structure
This section explains the three parts of the storage structure of Parliament: the
resource table, the statement table, and the resource dictionary. This description
is simplified for the sake of clarity; it does not discuss using quads instead of
triples, optimizations for blank nodes, the rule engine, or some small
implementation details. Parliament can be compiled in either 32- or 64-bit
mode, and the width of the fields described varies accordingly.
3.1 Resource Table
The Resource Table is a single file of fixed-length records, each of which rep-
resents a single resource or literal. The records are sequentially numbered, and
this number serves as the ID of the corresponding resource. This allows direct
access to a record given its ID via simple array indexing. Each record has eight
components:
- Three statement ID fields representing the first statements that contain this
  resource as a subject, predicate, and object, respectively
- Three count fields containing the number of statements using this resource
  as a subject, predicate, and object, respectively
- An offset into the string representations file described below, used to retrieve
  the string representation of the resource
- Bit-field flags encoding various attributes of the resource
The first subject, first predicate, and first object statement identifiers pro-
vide pointers into the statement table, which is described below. The subject,
predicate, and object counts benefit the find operations and query optimization,
discussed in Section 4. For the remainder of the paper, these counts will be
referred to as count(resource, pos) for the count of a resource in the given posi-
tion. The usage of the offset into the string representations file will be explained
below.
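A record layout along these lines can be sketched with fixed-width fields. The widths below are illustrative assumptions (the paper states only that they vary between the 32- and 64-bit builds), so this is a sketch of the idea, not Parliament's actual on-disk format:

```python
import struct

# Hypothetical fixed-length resource record: three "first statement" IDs,
# three usage counts, a string-file offset, and a flags word.
# "<7QI" = little-endian, seven 64-bit unsigned fields + one 32-bit flags field.
RESOURCE_RECORD = struct.Struct("<7QI")

NULL_STMT = 0xFFFFFFFFFFFFFFFF  # assumed sentinel for the null statement ID

def pack_resource(first_sub, first_pred, first_obj,
                  n_sub, n_pred, n_obj, str_offset, flags=0):
    return RESOURCE_RECORD.pack(first_sub, first_pred, first_obj,
                                n_sub, n_pred, n_obj, str_offset, flags)

def unpack_resource(buf):
    f = RESOURCE_RECORD.unpack(buf)
    return {"first": f[0:3], "count": f[3:6], "str_offset": f[6], "flags": f[7]}

def record_offset(resource_id):
    # Direct access by ID is simple array indexing into the file.
    return resource_id * RESOURCE_RECORD.size
```

A freshly created resource record has zero counts and null first-statement IDs, matching step 3(b) of the insertion algorithm in Section 4.2.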
3.2 Statement Table
The Statement Table is the most important part of Parliament’s storage ap-
proach. It is similar to the resource table in that it is a single file of fixed-length
records, each of which represents a single statement. The records are sequentially
numbered, and this number serves as the ID of the corresponding statement.
Each record has seven components:
- Three resource ID fields representing the subject, predicate, and object of
  the statement, respectively
- Three statement ID fields representing the next statements that use the same
  resource as a subject, predicate, and object, respectively
- Bit-field flags encoding various attributes of the statement
The three resource ID fields allow a statement ID to be translated into the
triple of resources that represent that statement. The three next statement point-
ers allow fast traversal of the statements that share either a subject, predicate,
or object, while still storing each statement only once.
Figure 1 shows an example knowledge base consisting of five triples. Each
triple row in the statement list table shows its resource identifiers as numbers and
the pointers to the next statements as arrows. Omitted arrows indicate pointers
to a special statement identifier, the null statement identifier, which indicates
the end of the linked list.
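The per-position chains can be sketched as follows; plain Python lists and dicts stand in here for the memory-mapped, fixed-length records, so this is an illustrative model rather than Parliament's implementation:

```python
# Hypothetical in-memory statement table: each record holds the (s, p, o)
# resource IDs plus three "next statement" links, one per position.
NULL_STMT = -1            # assumed null-statement sentinel
SUB, PRED, OBJ = 0, 1, 2  # triple positions

statements = []   # statement ID -> record
first = {}        # (resource ID, position) -> first statement ID

def add_statement(s, p, o):
    """Prepend the new statement to the three per-resource chains."""
    sid = len(statements)
    nxt = [first.get((r, pos), NULL_STMT)
           for pos, r in enumerate((s, p, o))]
    statements.append({"ids": (s, p, o), "next": nxt})
    for pos, r in enumerate((s, p, o)):
        first[(r, pos)] = sid
    return sid

def chain(resource, pos):
    """Yield every statement using `resource` in position `pos`."""
    sid = first.get((resource, pos), NULL_STMT)
    while sid != NULL_STMT:
        yield statements[sid]["ids"]
        sid = statements[sid]["next"][pos]
```

Because new statements are pushed onto the head of each list, a chain yields statements in reverse insertion order, while each statement is still stored only once.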
3.3 Resource Dictionary
Like many other triple stores [13, 5, 6, 12], Parliament uses a dictionary encoding
for its resources. This dictionary provides a one-to-one, bidirectional mapping
between a resource and its resource ID. The first component of this dictionary
is the mapping from a resource to its associated identifier. This portion of the
dictionary uses Berkeley DB [18] to implement a B-tree whose keys are the
resources’ string representations and whose values are the corresponding resource
ID’s. This means that inserts and lookups require logarithmic time.
Fig. 1. Example Statement List and Resource Table
The second half of the dictionary is the reverse lookup from a resource ID to
a string representation. This is implemented in a memory-mapped file contain-
ing sequential, variable-length, and null-terminated string representations of re-
sources. A resource ID is translated into the corresponding string representation
by using the resource ID to index into the resource table, retrieving the string
representation offset, and using this to index into the string representations file
to retrieve the associated string. Thus, looking up a string representation from
a resource identifier is a constant time operation.
The current approach stores each string representation twice. Future imple-
mentations may eliminate this redundancy.
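A sketch of the two-way mapping, with a plain dict standing in for the Berkeley DB B-tree and a bytearray standing in for the memory-mapped string representations file:

```python
# Forward map (stand-in for the Berkeley DB B-tree): string -> resource ID.
btree = {}

# Reverse map: the string representations file holds sequential,
# null-terminated strings; each resource record stores a byte offset into it.
strings_file = bytearray()
resource_offsets = []   # resource ID -> offset into strings_file

def intern(resource_str):
    """Return the ID for a resource, inserting it if unseen."""
    rid = btree.get(resource_str)
    if rid is not None:
        return rid
    rid = len(resource_offsets)
    resource_offsets.append(len(strings_file))          # record the offset
    strings_file.extend(resource_str.encode() + b"\0")  # append representation
    btree[resource_str] = rid
    return rid

def lookup(rid):
    """Constant-time reverse lookup: offset -> null-terminated string."""
    start = resource_offsets[rid]
    end = strings_file.index(0, start)
    return strings_file[start:end].decode()
```

The forward direction costs a logarithmic B-tree probe in the real system; the reverse direction is a single array index plus a string read, as the text describes.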
3.4 Memory-Mapped Files
Three of the four files that comprise a Parliament triple store (the resource
table, the statement table, and the string representations file) are stored and
accessed via a common operating system facility called “memory mapping”.
This is independent of the index structure of the store, but is worth mentioning
because it confers a significant performance advantage. Most modern operating
systems use memory mapping to implement their own demand-paged virtual
memory subsystem, and so this mechanism for accessing files tends to be highly
optimized and keeps frequently accessed pages in memory.
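As an illustration of the facility itself (not Parliament's code, which uses the native OS APIs), a fixed-length record table accessed through a memory mapping looks like this in Python:

```python
import mmap
import os
import struct
import tempfile

# A toy fixed-length record: three signed 64-bit fields (24 bytes).
RECORD = struct.Struct("<3q")

# Create a file with room for four records, then map it into memory.
path = os.path.join(tempfile.mkdtemp(), "table.dat")
with open(path, "wb") as f:
    f.write(b"\0" * (RECORD.size * 4))

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)   # map the whole file
    # Reads and writes go through the OS page cache: record access is
    # plain array indexing, and hot pages stay resident automatically.
    RECORD.pack_into(mm, 2 * RECORD.size, 10, 20, 30)  # write record #2
    rec = RECORD.unpack_from(mm, 2 * RECORD.size)      # read it back
    mm.flush()                      # push dirty pages to disk
    mm.close()
```

On POSIX systems the same mechanism is exposed as mmap(2); Windows provides it via CreateFileMapping/MapViewOfFile.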
4 Triple Store Operations
The three fundamental operations that a triple store can perform are query,
insertion (assertion), and deletion (retraction). These are discussed below.
4.1 Query
Parliament performs a lookup of a single triple pattern according to the following
algorithm:
1. If any of the triple pattern elements are bound, Parliament uses the B-tree
to translate the string representations of these resources into resource ID’s.
2. If any bound elements are not found in the B-tree, then the query result is
the empty set, and the query algorithm terminates.
3. If none of the elements are bound, then the query result is the entire state-
ment list. Parliament enumerates this by iterating across all of the records in
the statement table and retrieving the string representations of the elements.
4. If exactly one element is bound, then Parliament looks in the resource table
for the ID of the first statement using that resource in the position in which
the resource appears in the given triple pattern.
5. If two or three elements are bound, then Parliament looks in the resource
table for those resource ID’s to retrieve count(resource, pos) for each. Par-
liament selects the resource whose count is smallest, and retrieves from the
resource table the ID of the first statement using that resource in the position
the resource appears in the given triple pattern.
6. Starting with that statement ID, Parliament traverses the linked list of state-
ment records corresponding to the position of the minimal count resource.
7. If the triple pattern contains exactly one bound resource, then this list of
statements is exactly the answer to the query, and again Parliament retrieves
the string representations of the elements as it enumerates the list to form
the query result.
8. If two or three elements are bound, then as Parliament enumerates the linked
list of statements, it checks whether the resources in the positions of the
non-minimal count resources are the same as the bindings in the given triple
pattern. Whenever a match is found, Parliament retrieves the string repre-
sentations of the elements and adds that triple to the query result.
Whenever Parliament is enumerating statements, it skips over statements
whose “deleted” flag has been set. See Section 4.3 below for details.
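The lookup steps above can be sketched as a single generator. The string-to-ID translation of steps 1-2 is omitted for brevity (the pattern already holds resource IDs, with None in unbound positions), and plain dicts stand in for the memory-mapped resource and statement tables:

```python
NULL_STMT = -1            # assumed null-statement sentinel
SUB, PRED, OBJ = 0, 1, 2  # triple positions

def find(pattern, first, counts, statements):
    """Yield (s, p, o) ID triples matching `pattern`.

    first[(rid, pos)]  -- ID of the first statement using rid in pos
    counts[(rid, pos)] -- count(resource, pos) from the resource table
    statements[sid]    -- (ids, next_ids, deleted) record
    """
    bound = [(pos, rid) for pos, rid in enumerate(pattern) if rid is not None]
    if not bound:
        # Step 3: no bound elements -- scan the whole statement table.
        for ids, _nxt, deleted in statements:
            if not deleted:
                yield ids
        return
    # Steps 4-5: pick the bound resource with the smallest count and fetch
    # the head of its chain for the position in which it appears.
    pos, rid = min(bound, key=lambda pr: counts.get((pr[1], pr[0]), 0))
    sid = first.get((rid, pos), NULL_STMT)
    while sid != NULL_STMT:
        ids, nxt, deleted = statements[sid]
        # Steps 6-8: walk the chain, filtering on any remaining bound
        # positions and skipping records whose "deleted" flag is set.
        if not deleted and all(ids[p] == v for p, v in bound):
            yield ids
        sid = nxt[pos]
```

With one bound element the filter is trivially true, so the chain is exactly the answer; with two or three bound elements only the non-minimal positions need checking, as in step 8.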
Parliament is designed as an embedded triple store and does not include a
SPARQL or other query language processor. Such queries are supported by ac-
cessing Parliament as a storage model from higher-level frameworks such as Jena
or Sesame. Single find operations (as discussed above) are combined together by
the higher-level framework, with Parliament-specific extensions for optimization.
In particular, when using Parliament with Jena’s query processors [11], we have
used several different algorithms for query planning and execution, which will
be detailed in subsequent publications. The basis of these optimizations is the
ability to quickly access the counts of the resources in the given positions.
4.2 Insertion
To insert a triple (s, p, o), Parliament executes the following algorithm:
1. Parliament uses the B-tree to translate the string representations of the three
resources into resource ID’s.
2. If all three elements are found in the B-tree, then Parliament performs a
query for the triple pattern (s, p, o). Note that this is necessarily a fully
bound query pattern. If the triple is found, then no insertion is required,
and the algorithm terminates.
3. If any elements are not found in the B-tree, then Parliament creates new
resources for each of them as follows:
(a) Parliament appends the string representation of the resource to the end
of the string representations file. If the file is not large enough to contain
the string, then the file is enlarged first. The offset of the beginning of
the string is noted for use in the next step.
(b) Parliament appends a new record to the end of the resource table. If
the file is not large enough to contain the new record, then the file is
enlarged first. The number of the record is saved as the new resource ID
for use in the steps below, and the offset from the string representations
file is written to the appropriate field in this record. The record’s counts
are initialized to zero, and the first statement ID’s are set to null.
(c) Parliament inserts a new entry into the B-tree. The entry contains the
resource’s string representation as its key and the new resource ID as its
value.
4. Parliament now has three valid resource ID’s representing the triple, and
knows that the triple is not present in the statement table.
5. Parliament appends a new record to the end of the statement table. If the
file is not large enough to contain the new record, then the file is enlarged
first. The number of the record is saved as the new statement ID for use in
the steps below, and the three resource ID’s obtained above are written to
the appropriate fields in this record. The record’s next statement ID’s are
all set to null.
6. For each of the three resources, Parliament inserts the new statement record
at the head of that resource’s linked list for the corresponding triple position
as follows:
(a) The resource record’s first statement ID for the resource’s position is
written into the corresponding next statement ID field in the new state-
ment record. Note that if this resource was newly inserted for this state-
ment, then this step will write a null into the next statement ID field.
(b) The ID of the new statement is written into the resource record’s first
statement ID for the resource’s position.
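The insertion steps can be sketched end to end. As before, plain Python structures stand in for the B-tree and the memory-mapped tables, and the fully bound duplicate check walks the shortest of the three chains as in Section 4.1:

```python
NULL_STMT = -1  # assumed null-statement sentinel

btree = {}      # string -> resource ID (stand-in for the B-tree)
resources = []  # resource ID -> {"first": [...], "count": [...]}
statements = [] # statement ID -> {"ids": ..., "next": ...}

def ensure_resource(s):
    """Steps 1 and 3: translate a string, creating the resource if unseen."""
    rid = btree.get(s)
    if rid is None:
        rid = len(resources)
        resources.append({"first": [NULL_STMT] * 3, "count": [0, 0, 0]})
        btree[s] = rid
    return rid

def insert(s, p, o):
    """Insert the triple unless already present; return its statement ID."""
    known = all(x in btree for x in (s, p, o))
    ids = tuple(ensure_resource(x) for x in (s, p, o))
    if known:
        # Step 2: fully bound find -- walk the shortest of the three chains.
        pos = min(range(3), key=lambda i: resources[ids[i]]["count"][i])
        sid = resources[ids[pos]]["first"][pos]
        while sid != NULL_STMT:
            if statements[sid]["ids"] == ids:
                return sid            # duplicate: nothing to insert
            sid = statements[sid]["next"][pos]
    # Steps 5-6: append the record and splice it onto the three list heads.
    sid = len(statements)
    nxt = [resources[r]["first"][i] for i, r in enumerate(ids)]
    statements.append({"ids": ids, "next": nxt})
    for i, r in enumerate(ids):
        resources[r]["first"][i] = sid
        resources[r]["count"][i] += 1
    return sid
```

Note that when any element of the triple is new, the duplicate check can be skipped entirely: a statement using a never-seen resource cannot already be in the table.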
4.3 Deletion
The index structure of Parliament’s statement table is not conducive to the
efficient removal of a statement record from the three linked lists of which it is
a member. These linked lists are singly linked, and so there is no way to remove
a record except to traverse all three lists from the beginning.
Due to these difficulties, Parliament “deletes” statements by marking them
with a flag in the bit field portion of the statement record. Thus, the algorithm
consists of a find (as in the case of an insertion, this is a fully bound query
pattern) followed by setting the flag on the found record. In the future, we may
utilize doubly linked lists so that the space occupied by deleted statements can be
reclaimed efficiently. However, in our work to date deletion has been infrequent
enough that this has been deemed a lower priority enhancement.
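The deletion mark itself is a single bit in the statement record's flag field; a minimal sketch, assuming a hypothetical DELETED bit position:

```python
DELETED = 0x1  # assumed bit position within the statement's flag field

def mark_deleted(flags):
    """Set the deleted bit; the record stays in place in all three lists."""
    return flags | DELETED

def is_deleted(flags):
    """Enumeration code checks this bit and skips flagged statements."""
    return bool(flags & DELETED)
```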
5 Theoretical Analysis
As is readily apparent, the presented approach suffers from unfortunate worst-case
performance, but its average-case performance is quite good. This is consistent
with empirical results presented in [10] and this paper. We will address
both find operations on a single triple pattern and triple insertions.
5.1 Worst Case Analysis
The worst-case performance for a single triple pattern lookup is dependent on
how many of the elements in the pattern (s, p, o) are bound. If zero elements are
bound, the triple pattern results in a total scan of the statement list, resulting
in O(n). Since all triples are the expected result, this is the best possible worst
case performance. If one element is bound, the chain for that particular element
will be traversed with time O(count(bound, pos)). Again exactly the triples that
answer the pattern are traversed.
Things change slightly for the cases where two or three of the (s, p, o) ele-
ments are bound. If two elements are bound, the shorter of the two lists will be
traversed. This triple pattern can thus be answered in time
O(min(count(bound1, pos1), count(bound2, pos2))).
However, this could be O(n) if all triples use the two bound elements. If all
three elements are bound, the shortest of the three lists will be traversed. This
shortest list will be longest when the set of statements is exactly the three-way
cross product of the set of resources. In this case, if the number of resources is
m, then the number of statements is m^3 and every list is of length m^2. Thus the
list length is n^(2/3), and a find operation for three bound elements is O(n^(2/3)).
Since an insertion first requires a find on the triple to be inserted, it incurs
the worst-case cost of a find with three bound elements, O(n^(2/3)). It also incurs
the cost of inserting into the dictionary any nodes of the triple that were not
previously known, but this logarithmic O(log m) time is overshadowed by the worst-case
find time. After that, adding the triple to the head of the lists takes O(1)
constant time. Thus the worst case of the insertion operation is O(n^(2/3)).
Here we note that this worst-case performance is indeed worse than other
previously published approaches, which are logarithmic. However, the scenarios
that produce these worst-case results are quite rare in practice, as will be shown
in the following section.
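The cross-product worst case is easy to check numerically: with m resources and the full three-way cross product as the statement set, n = m^3 and every chain has length m^2 = n^(2/3).

```python
m = 50                    # number of distinct resources
resources = range(m)

# Build the full three-way cross product: every (s, p, o) combination.
n = 0
chain_len = {r: 0 for r in resources}   # subject-chain lengths
for s in resources:
    for p in resources:
        for o in resources:
            n += 1
            chain_len[s] += 1

assert n == m ** 3                               # n = m^3 statements
assert all(c == m ** 2 for c in chain_len.values())  # every chain has m^2 entries
assert m ** 2 == round(n ** (2 / 3))             # chain length is n^(2/3)
```

By symmetry the predicate and object chains have the same length m^2, so even the shortest chain, traversed by a fully bound find, is n^(2/3) long in this pathological case.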
5.2 Average Case Analysis
While the worst-case performance is worse than other approaches, analyzing the
relevant qualities of several example data sets leads us to believe that the average
case performance is actually quite good.
The most relevant feature of a data set with respect to its performance within
this indexing scheme is the length of the various statement lists for a particular
subject, predicate, or object. For instance, the worst-case time of the insert
operation and the find operation with three bound elements is O(n^(2/3)), but this
is associated with the case that the set of triples is the cross-product of the
subjects, predicates, and objects, which is a highly unlikely real world situation.
Since these bounds are derived from the shortest statement list, analysis of the
average list lengths in a data set is a key measure to how this scheme will perform
in the real world.
Table 1. List Length Means (Standard Deviations)
Data Set Size Subject Predicate Non-Lit Object Lit Object
Webscope 83M 3.96 (9.77) 87,900 (722,575) 3.43 (2,170) 4.33 (659)
Falcon 33M 4.22 (13) 983 (31,773) 2.56 (328) 2.31 (217)
Swoogle 175M 5.65 (36) 4,464 (188,023) 3.27 (1,793) 3.38 (569)
Watson 60M 5.58 (56) 3,040 (98,288) 2.87 (918) 2.91 (407)
SWSE-1 30M 5.25 (15) 25,404 (289,000) 2.46 (1,138) 2.29 (187)
SWSE-2 61M 5.37 (15) 83,773 (739,736) 2.89 (1,741) 2.87 (300)
DBpedia 110M 15 (39) 300,855 (3,560,666) 3.84 (148) 1.17 (22)
Geonames 70M 10.4 (1.66) 4,096,150 (3,167,048) 2.81 (1,623) 1.67 (15)
SwetoDBLP 15M 5.63 (3.82) 103,009 (325,380) 2.93 (629) 2.36 (168)
Wordnet 2M 4.18 (2.04) 47,387 (100,907) 2.53 (295) 2.39 (271)
Freebase 63M 4.45 (15) 12,329 (316,363) 2.79 (1,286) 1.83 (116)
US Census 446M 5.39 (9.18) 265,005 (1,921,537) 5.29 (15,916) 227 (115,616)
Table 1 shows the mean and standard deviations of the subject, predicate,
and object list lengths for several large data sets [19]. There are a few nice prop-
erties of this data that are worth noting. First, the average number of statements
using a particular subject is quite small in all data sets. The average number
of statements using a particular object is generally even smaller, though with a
much higher standard deviation. Finally, only the predicate list length generally
seems to scale with the size of the data set.
These observations have important implications with respect to the average
time of find and insert operations. For find operations, we now know several
things to be generally true:
- Either the object or subject list will likely be used for find operations when
  all three elements are bound. Thus these operations will often touch fewer
  than 10 triples.
- Consequently, insert operations should generally be quite fast.
- The predicate list is only likely to be used for find operations when only
  the predicate is bound, and thus only when all statements with the given
  predicate must be touched to be returned anyway.
- Find operations with two bound elements, which have the most troubling
  theoretical worst-case performance, necessarily include either a bound
  subject or a bound object. As a result, these too should generally be quite fast.
These conclusions collectively suggest real world performance that is much
more impressive than the worst-case analysis would imply, and this is shown
empirically in the following section.
6 Empirical Analysis
Since Hexastore [13] is the most recently published work in this area, and its
indexing structure outperformed several of the other approaches, we have focused
our empirical evaluation on Parliament as compared to Hexastore. At the time of
our evaluation, only the prototype Python version of Hexastore was available for
comparison. Future work will compare against the newly released version. This
limitation resulted in the relatively small size of this empirical evaluation; we
could not go beyond the size of main memory without the comparison becoming
unfair to Hexastore. Parliament was tested with 850 million triples in [10].
Evaluation was performed on a MacBook Pro laptop with a 2.6 GHz dual
core CPU, 4 GB of RAM, and a 7200 RPM SATA HD, running Mac OS X
10.5.7. This platform was most convenient for execution of both systems. While
Hexastore’s evaluation focused on only query performance, we feel it is important
to include insertion performance and memory utilization as well, as there are
many Semantic Web applications for which these factors are significant. We have
focused on the Lehigh University Benchmark [20], as it was used in the Hexastore
evaluation and contains insertion time metrics as well. We have evaluated LUBM
queries 1, 2, 3, 4, and 9. Since the version of Hexastore used does not perform
inference, we modified queries 4 and 9 so that no inference was required.
The insertion performance graph is shown in Figure 2. The throughput of
Parliament stays fairly stable at approximately 35k statements per second. This
throughput is 3 to 7 times larger than that of Hexastore, which starts at approx-
imately 9k statements per second, and declines to less than 5k statements per
second as the total number of triples increases. Parliament’s throughput results
include both persisting the data to disk (the Python version of Hexastore is en-
tirely memory-based) and forward chaining for its RDFS inference capabilities.
Figures 3, 4, 5, 6, and 7 show the relative query performance of Parliament
and Hexastore on LUBM queries 1, 2, 3, 4, and 9 respectively.
Queries 1, 3, and 4 produce results where both systems appear to be following
the same growth pattern, though Hexastore performs slightly better on queries 1
and 3 and Parliament performs better on query 4. Parliament also demonstrates
more variability in the query execution times, which is likely a result of the
dependency on the operating system’s memory mapping functionality.
[Chart omitted: Throughput (thousands of statements per second) vs. Millions of Statements Loaded]
Fig. 2. Insertion Performance
[Chart omitted: Response Time (milliseconds) vs. Millions of Statements Loaded]
Fig. 3. LUBM Query 1 Response Time
[Chart omitted: Response Time (milliseconds) vs. Millions of Statements Loaded]
Fig. 4. LUBM Query 2 Response Time
[Chart omitted: Response Time (milliseconds) vs. Millions of Statements Loaded]
Fig. 5. LUBM Query 3 Response Time
[Chart omitted: Response Time (milliseconds) vs. Millions of Statements Loaded]
Fig. 6. LUBM Query 4 (modified) Response Time
Fig. 7. LUBM Query 9 (modified) Response Time: response time (seconds) vs. millions of statements loaded, for Parliament and Hexastore.
Queries 2 and 9 show Parliament and Hexastore following different growth
curves, with Parliament performing better in query 9 and Hexastore performing
better in query 2. This is more likely the result of differing query plans within the
two systems than a strength or deficiency of the storage structure, but without
insight into the query planner of Hexastore we cannot verify this claim.
Finally, Table 2 shows an estimate of memory used by Hexastore and Parlia-
ment with all 4.3M statements loaded. These numbers are as reported by Mac OS X, but as is often the case with virtual memory management, the memory metrics are only useful as coarse estimates. However, they show what was expected: Parliament’s storage scheme requires significantly less storage space.
Table 2. Space Utilization for 4.3M Triples (in GB)

                   Hexastore   Parliament
  Real Memory        2.02        0.50
  Virtual Memory     2.59        1.38
  Disk Space         N/A         0.36
Overall, we conclude that Parliament’s query performance is comparable to Hexastore’s, while Parliament significantly outperforms Hexastore in insertion throughput and required space.
7 Conclusions
In this paper, we have presented the storage and indexing scheme based on linked lists and memory mapping used in Parliament. The scheme is designed to balance insertion performance, query performance, and space usage. We found that while its worst-case performance does not compare favorably with other approaches, average-case analysis indicates good performance. Experiments demonstrate that Parliament maintains excellent query performance while significantly increasing insertion throughput and decreasing space requirements compared to Hexastore. Future work will include experiments focusing on different query optimization strategies for Parliament, explanation and analysis of Parliament’s internal rule engine, and further optimizations to the storage structure.
References
1. Klyne, G., Carroll, J., eds.: Resource Description Framework (RDF):
Concepts and Abstract Syntax. W3C Recommendation (February 2004)
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/.
2. Dean, M., Schreiber, G., eds.: OWL Web Ontology Language Reference.
W3C Recommendation (February 2004) http://www.w3.org/TR/2004/REC-owl-ref-20040210/.
3. Harris, S., Shadbolt, N.: SPARQL query processing with conventional relational
database systems. In: Lecture Notes in Computer Science. Springer (2005) 235–
244
4. Wilkinson, K., Sayers, C., Kuno, H., Reynolds, D.: Efficient RDF storage and
retrieval in Jena2. In: First International Workshop on Semantic Web and
Databases (SWDB 2003). (2003) 35–43
5. Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable semantic web
data management using vertical partitioning. In: VLDB ’07: Proceedings of the
33rd international conference on Very large data bases, VLDB Endowment (2007)
411–422
6. Chong, E.I., Das, S., Eadon, G., Srinivasan, J.: An efficient SQL-based RDF querying
scheme. In: VLDB ’05: Proceedings of the 31st international conference on Very
large data bases, VLDB Endowment (2005) 1216–1227
7. Erling, O., Mikhailov, I.: RDF support in the Virtuoso DBMS. In Auer, S., Bizer,
C., Müller, C., Zhdanova, A.V., eds.: The Social Semantic Web 2007, Proceedings
of the 1st Conference on Social Semantic Web (CSSW), September 26-28, 2007,
Leipzig, Germany. Volume 113 of LNI., GI (2007) 59–68
8. BBN Technologies: Parliament http://parliament.semwebcentral.org/.
9. Dean, M., Neves, P.: DAML DB http://www.daml.org/2001/09/damldb/.
10. Rohloff, K., Dean, M., Emmons, I., Ryder, D., Sumner, J.: An evaluation of triple-
store technologies for large data stores. In: On the Move to Meaningful Internet
Systems 2007: OTM 2007 Workshops, Vilamoura, Portugal, Springer (2007) 1105–
1114 LNCS 4806.
11. Carroll, J.J., Dickinson, I., Dollin, C., Reynolds, D., Seaborne, A., Wilkinson, K.:
Jena: implementing the semantic web recommendations. In: WWW Alt. ’04: Pro-
ceedings of the 13th international World Wide Web conference on Alternate track
papers & posters, New York, NY, USA, ACM (2004) 74–83
12. Broekstra, J., Kampman, A., van Harmelen, F.: Sesame: A generic architecture for
storing and querying RDF and RDF Schema. In: Lecture Notes in Computer Science.
Volume 2342., Springer (2002) 54–68
13. Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic
web data management. Proc. VLDB Endow. 1(1) (2008) 1008–1019
14. Wood, D., Gearon, P., Adams, T.: Kowari: A platform for semantic web storage
and analysis. In: XTech2005: XML, the Web and beyond, Amsterdam (2005)
15. Harth, A., Decker, S.: Optimized index structures for querying RDF from the Web.
In: Third Latin American Web Congress (LA-WEB 2005). (2005) 71–80
16. Franz, Inc.: AllegroGraph http://www.franz.com/products/allegrograph/.
17. Kiryakov, A., Ognyanov, D., Manov, D.: OWLIM: A pragmatic semantic repository
for OWL. In: Lecture Notes in Computer Science. Volume 3807/2005. Springer (2005)
182–192
18. Olson, M.A., Bostic, K., Seltzer, M.: Berkeley DB. In: ATEC ’99: Proceedings of
the annual conference on USENIX Annual Technical Conference, Berkeley, CA,
USA, USENIX Association (1999) 43–43
19. Dean, M.: Toward a science of knowledge base performance analysis. In:
Invited Talk, 4th International Workshop on Scalable Semantic Web Knowl-
edge Base Systems (SSWS2008), Karlsruhe, Germany (October 2008) slide 20
http://asio.bbn.com/2008/10/iswc2008/mdean-ssws-2008-10-27.ppt.
20. Guo, Y., Qasem, A., Pan, Z., Heflin, J.: A requirements driven framework for
benchmarking semantic web knowledge base systems. IEEE Transactions on
Knowledge and Data Engineering 19(2) (2007) 297–309
32
... TripleBit [87] and BitMat [11] subdivide the data by predicate thus creating a bit-matrix for each predicate with rows representing subjects and columns representing objects. Parliament [45] uses a sorted file with offset pointers from the triple to the next one with the same subject to allow traversal. Effectively, this divides the triples into variable length blocks of same-value parts. ...
... For redundancy, while 1-hop s→o queries are the most common and involve a single triple, specialized indexes have been proposed to speed up reachability and multi-hop path expressions (e.g., Sparqling kleene [31]). Apart from Parliament [45], one can rarely find data representations supporting traversal patterns as required by k-hop and *-hop access patterns. Furthermore, solutions tend to rely on standard B+-tree and hash map implementations, avoiding the use of hybrid structures (e.g., hybrid indexes [65]). ...
Article
Full-text available
RDF triplestores’ ability to store and query knowledge bases augmented with semantic annotations has attracted the attention of both research and industry. A multitude of systems offer varying data representation and indexing schemes. However, as recently shown for designing data structures, many design choices are biased by outdated considerations and may not result in the most efficient data representation for a given query workload. To overcome this limitation, we identify a novel three-dimensional design space. Within this design space, we map the trade-offs between different RDF data representations employed as part of an RDF triplestore and identify unexplored solutions. We complement the review with an empirical evaluation of ten standard SPARQL benchmarks to examine the prevalence of these access patterns in synthetic and real query workloads. We find some access patterns, to be both prevalent in the workloads and under-supported by existing triplestores. This shows the capabilities of our model to be used by RDF store designers to reason about different design choices and allow a (possibly artificially intelligent) designer to evaluate the fit between a given system design and a query workload.
... This benchmark is based on the DBpedia3.5.1 dataset. 16 The dataset contains a total of 232M (English version) triples, 18,425k distinct subjects, 39,672 predicates, and 65,184k objects. The benchmark includes a total of 50 real queries selected from the DBpedia3.5.1 SPARQL endpoint log. ...
... 2. Jena TDB 20 Version 3.13.1 with Fuseki as HTTP interface with Java heap size set to 8g. Ontotext GraphDB 23 with Java heap=8g. 5. Parliament[16] with MIN MEM=1g and MAX MEM=8g of Java heap and jetty as HTTP interface. ...
Chapter
Full-text available
With significant growth in RDF datasets, application developers demand online availability of these datasets to meet the end users’ expectations. Various interfaces are available for querying RDF data using SPARQL query language. Studies show that SPARQL end-points may provide high query runtime performance at the cost of low availability. For example, it has been observed that only 32.2% of public endpoints have a monthly uptime of 99–100%. One possible reason for this low availability is the high workload experienced by these SPARQL endpoints. As complete query execution is performed at server side (i.e., SPARQL endpoint), this high query processing workload may result in performance degradation or even a service shutdown. We performed extensive experiments to show the query processing capabilities of well-known triple stores by using their SPARQL endpoints. In particular, we stressed these triple stores with multiple parallel requests from different querying agents. Our experiments revealed the maximum query processing capabilities of these triple stores after which point they lead to service shutdowns. We hope this analysis will help triple store developers to design workload-aware RDF engines to improve the availability of their public endpoints with high throughput.
... The persistent disk-based storage is a way to store RDF data permanently on file system by using the most influential indexing techniques, such as B + tree [48], AVL [54] and B − tree [31]. Among the existing solutions we can mention [1,10,12,32,36,51]. It is important to notice that, reading from and writing to disks slow the search process to an unacceptable level and induce an important performance bottleneck [15]. ...
... In order to improve the efficiency of queries, the indexes techniques are then added for each of the column for reducing the cost of self-join query [35]. However, the storage of RDF triples in a single table make the queries very slow to execute and may overtake the size of the main memory as indicated in [36]. Many early RDF stores use statement table approach, such as [7,20,27,29,40]. ...
Article
Full-text available
In this paper, we introduce three new implementations of non-native methods for storing RDF data. These methods named RDFSPO, RDFPC and RDFVP, are based respectively on the statement table, property table and vertical partitioning approaches. As important, we consider the issue of how to select the most relevant strategy for storing the RDF data depending on the dataset characteristics. For this, we investigate the balancing between two performance metrics, including load time and query response time. In this context, we provide an empirical comparative study between on one hand the three proposed methods, and on the other hand the proposed methods versus the existing ones by using various publicly available datasets. Finally, in order to further assess where the statistically significant differences appear between studied methods, we have performed a statistical analysis, based on the non-parametric Friedman test followed by a Nemenyi post-hoc test. The obtained results clearly show that the proposed RDFVP method achieves highly competitive computational performance against other state-of-the-art methods in terms of load time and query response time.
... AllegroGraph 9 (2003) is a general purpose store for semi-structured data that can be used to query documents (e.g., JSON) and graph data (e.g., RDF). RDF data is stored and indexed in six permutations as quads, Table, Q = Quad Table, RedLand [22] 2001 ✓ ✓ Jena [128] 2002 ✓ ✓ ✓ ✓ RDF4J [34] 2002 ✓ ✓ ✓ ✓ ✓ RSSDB [97,98] 2002 ✓ ✓ ✓ ✓ 3store [74] 2003 [203] 2003 ✓ ✓ ✓ ✓ ✓ ✓ CORESE [45,46] 2004 [121] 2004 ✓ ✓ ✓ ✓ BRAHMS [90] 2005 ✓ ✓ ✓ GraphDB [105,26] 2005 ✓ ✓ ✓ ✓ ✓ ✓ ✓ Kowari [205] 2005 [77] 2005 ✓ ✓ ✓ ✓ RDFBroker [179] 2006 ✓ ✓ ✓ ✓ ✓ Virtuoso [54] 2006 ✓ ✓ ✓ ✓ ✓ ✓ GRIN [194] 2007 ✓ ✓ ✓ SW-Store [1] 2007 ✓ Blazegraph [190] 2008 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ Hexastore [201] 2008 ✓ ✓ ✓ ✓ RDF-3X [142] 2008 ✓ ✓ ✓ ✓ ✓ BitMat [15] 2009 ✓ ✓ ✓ ✓ DOGMA [33] 2009 ✓ ✓ ✓ ✓ LuposDate [61] 2009 ✓ ✓ ✓ Parliament [106] 2009 [212] 2011 ✓ ✓ ✓ ✓ ✓ gStore [223] 2011 ✓ ✓ ✓ ✓ SpiderStore [138] 2011 ✓ ✓ SAINT-DB [154] 2012 ✓ ✓ ✓ Strabon [109] 2012 ✓ ✓ ✓ ✓ ✓ ✓ DB2RDF [29] 2013 ✓ ✓ ✓ ✓ ✓ ✓ OntoQuad [155] 2013 ✓ ✓ ✓ ✓ ✓ OSQP [191] 2013 ✓ ✓ ✓ ✓ Triplebit [217] 2013 ✓ ✓ ✓ ✓ ✓ R3F [103,104] 2014 ✓ ✓ ✓ ✓ ✓ ✓ RQ-RDF-3X [113] 2014 ✓ ✓ ✓ ✓ SQBC [222] 2014 ✓ ✓ ✓ WaterFowl [48] 2014 ✓ ✓ ✓ ✓ GraSS [120] 2015 ✓ ✓ ✓ ✓ k 2 -triples [11] 2015 ✓ ✓ ✓ RDFCSA [32,31] 2015 ✓ ✓ ✓ RDFox [138] 2015 ✓ ✓ ✓ ✓ ✓ Turbo HOM++ [102] 2015 ✓ ✓ ✓ RIQ [100] 2016 ✓ ✓ ✓ ✓ ✓ axonDB [131] 2017 ✓ ✓ ✓ ✓ ✓ HTStore [116] 2017 ✓ ✓ AMBER [87] 2018 ✓ Jena-LTJ [81] 2019 ✓ ✓ ✓ BMatrix [30] 2020 ✓ ✓ Tentris [24] 2020 ✓ ✓ ✓ ✓ which are additionally associated with a triple identifier. SPARQL queries are supported, where the most recent version provides an option for two query engines: SBQE, which is optimized for SPARQL 1.0-style queries, and MJQE, which features merge joins and caching techniques optimized for property paths. ...
... RDFox [138] (2015) is an in-memory RDF engine that supports materialization-based Datalog reasoning. The RDF graph is stored as a triple table implemented as a linked list, which stores identifiers for subject, predicate and object, as well as three pointers in the list to the next triple with the same subject, predicate and object (similar to Parliament [106]). Four indexes are built: a hash table for three constants, and three for individual constants; the indexes for individual constants offer pointers to the first triple in the list with that constant, where patterns with two constants can be implemented by filtering over this list, or (optionally) by using various orderings of the triple list to avoid filtering (e.g., a triple list ordered by spo can be used to evaluate patterns with constant subject and predicate without filtering). ...
Preprint
Full-text available
Recent years have seen the growing adoption of non-relational data models for representing diverse, incomplete data. Among these, the RDF graph-based data model has seen ever-broadening adoption, particularly on the Web. This adoption has prompted the standardization of the SPARQL query language for RDF, as well as the development of a variety of local and distributed engines for processing queries over RDF graphs. These engines implement a diverse range of specialized techniques for storage, indexing, and query processing. A number of benchmarks, based on both synthetic and real-world data, have also emerged to allow for contrasting the performance of different query engines, often at large scale. This survey paper draws together these developments, providing a comprehensive review of the techniques, engines and benchmarks for querying RDF knowledge graphs.
... AllegroGraph 9 (2003) is a general purpose store for semi-structured data that can be used to query documents (e.g., JSON) and graph data (e.g., RDF). RDF data is stored and indexed in six permutations as quads, Table, Q = Quad Table, RedLand [22] 2001 ✓ ✓ Jena [128] 2002 ✓ ✓ ✓ ✓ RDF4J [34] 2002 ✓ ✓ ✓ ✓ ✓ RSSDB [97,98] 2002 ✓ ✓ ✓ ✓ 3store [74] 2003 [203] 2003 ✓ ✓ ✓ ✓ ✓ ✓ CORESE [45,46] 2004 [121] 2004 ✓ ✓ ✓ ✓ BRAHMS [90] 2005 ✓ ✓ ✓ GraphDB [105,26] 2005 ✓ ✓ ✓ ✓ ✓ ✓ ✓ Kowari [205] 2005 [77] 2005 ✓ ✓ ✓ ✓ RDFBroker [179] 2006 ✓ ✓ ✓ ✓ ✓ Virtuoso [54] 2006 ✓ ✓ ✓ ✓ ✓ ✓ GRIN [194] 2007 ✓ ✓ ✓ SW-Store [1] 2007 ✓ Blazegraph [190] 2008 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ Hexastore [201] 2008 ✓ ✓ ✓ ✓ RDF-3X [142] 2008 ✓ ✓ ✓ ✓ ✓ BitMat [15] 2009 ✓ ✓ ✓ ✓ DOGMA [33] 2009 ✓ ✓ ✓ ✓ LuposDate [61] 2009 ✓ ✓ ✓ Parliament [106] 2009 [212] 2011 ✓ ✓ ✓ ✓ ✓ gStore [223] 2011 ✓ ✓ ✓ ✓ SpiderStore [138] 2011 ✓ ✓ SAINT-DB [154] 2012 ✓ ✓ ✓ Strabon [109] 2012 ✓ ✓ ✓ ✓ ✓ ✓ DB2RDF [29] 2013 ✓ ✓ ✓ ✓ ✓ ✓ OntoQuad [155] 2013 ✓ ✓ ✓ ✓ ✓ OSQP [191] 2013 ✓ ✓ ✓ ✓ Triplebit [217] 2013 ✓ ✓ ✓ ✓ ✓ R3F [103,104] 2014 ✓ ✓ ✓ ✓ ✓ ✓ RQ-RDF-3X [113] 2014 ✓ ✓ ✓ ✓ SQBC [222] 2014 ✓ ✓ ✓ WaterFowl [48] 2014 ✓ ✓ ✓ ✓ GraSS [120] 2015 ✓ ✓ ✓ ✓ k 2 -triples [11] 2015 ✓ ✓ ✓ RDFCSA [32,31] 2015 ✓ ✓ ✓ RDFox [138] 2015 ✓ ✓ ✓ ✓ ✓ Turbo HOM++ [102] 2015 ✓ ✓ ✓ RIQ [100] 2016 ✓ ✓ ✓ ✓ ✓ axonDB [131] 2017 ✓ ✓ ✓ ✓ ✓ HTStore [116] 2017 ✓ ✓ AMBER [87] 2018 ✓ Jena-LTJ [81] 2019 ✓ ✓ ✓ BMatrix [30] 2020 ✓ ✓ Tentris [24] 2020 ✓ ✓ ✓ ✓ which are additionally associated with a triple identifier. SPARQL queries are supported, where the most recent version provides an option for two query engines: SBQE, which is optimized for SPARQL 1.0-style queries, and MJQE, which features merge joins and caching techniques optimized for property paths. ...
... RDFox [138] (2015) is an in-memory RDF engine that supports materialization-based Datalog reasoning. The RDF graph is stored as a triple table implemented as a linked list, which stores identifiers for subject, predicate and object, as well as three pointers in the list to the next triple with the same subject, predicate and object (similar to Parliament [106]). Four indexes are built: a hash table for three constants, and three for individual constants; the indexes for individual constants offer pointers to the first triple in the list with that constant, where patterns with two constants can be implemented by filtering over this list, or (optionally) by using various orderings of the triple list to avoid filtering (e.g., a triple list ordered by spo can be used to evaluate patterns with constant subject and predicate without filtering). ...
Preprint
Full-text available
Recent years have seen the growing adoption of non-relational data models for representing diverse, incomplete data. Among these, the RDF graph-based data model has seen ever-broadening adoption, particularly on the Web. This adoption has prompted the standardization of the SPARQL query language for RDF, as well as the development of a variety of local and distributed engines for processing queries over RDF graphs. These engines implement a diverse range of specialized techniques for storage, indexing, and query processing. A number of benchmarks, based on both synthetic and real-world data, have also emerged to allow for contrasting the performance of different query engines, often at large scale. This survey paper draws together these developments, providing a comprehensive review of the techniques, engines and benchmarks for querying RDF knowledge graphs.
... [44], Ontopia Knowledge Suite [45], Hexastore [36], Jena [37,38], 3store [46], 4store [47], Kowari [48], Oracle [49], RDF-3X [50], RDFSuite [42], Virtuoso [51], and Sesame [40,41]);  File systems, usually indexed by key B-trees and/or non-relational databases, such as Oracle Berkeley DB (Sesame [40,41], rdfDB [52], RDF Store [53], Redland [54], Jena [55], Parliament [56], and RDFCube [57]). ...
... The approach pays equal attention to all elements of RDF [1]. This approach has been used by tools such as the BitMat [35], Hexastore [36], Kowari system [48], RDF-3X [50], Virtuoso [51], Parliament [56], RDFCube [57], TripleT [67], BRAHMS [68], RDFJoin [69], RDFKB [70], and iStore [71]. ...
Article
Full-text available
This paper introduces a novel approach for storing Resource Description Framework (RDF) data based on the possibilities of Natural Language Addressing (NLA) and on a special NLA basic structure for storing Big Data, called “NLA-bit”, which is aimed to support middle-size or large distributed RDF triple or quadruple stores with time complexity O(1). The main idea of NLA is to use letter codes as coordinates (addresses) for data storing. This avoids indexing and provides high-speed direct access to the data with time complexity O(1). NLA-bit is a structured set of all RDF instances with the same “Subject”. An example based on a document system, where every document is stored as NLA-bit, which contains all data connected to it by metadata links, is discussed. The NLA-bits open up a wide field for research and practical implementations in the field of large databases with dynamic semi-structured data (Big Data). Important advantages of the approach are as follow: (1) The reduction of the amount of occupied memory due to the complete absence of additional indexes, absolute addresses, pointers, and additional files; (2) reduction of processing time due to the complete lack of demand—the data are stored/extracted to/from a direct address.
... RDF data storing methods are [17], [18], [19]: Structures in memory (TRIPLE [20], BitMat [21], Hexastore [22], Jena [23], [24], YARS [25] and Sesame [26], [27]); Popular relational databases [ICS-FORTH RDF Suite [28], [29], Semantics Platform [30], Ontopia Knowledge Suite [31], Hexastore [22], Jena [23], [24], 3store [32], 4store [33], Kowari [34], Oracle [35], RDF-3X [36], RDFSuite [28], Virtuoso [37] and Sesame [26], [27]); The file systems indexed by key B-trees and or non-relational databases -like: Oracle Berkeley DB (Sesame [26], [27], rdfDB [38], RDF Store [39], Redland [40], Jena [41], Parliament [42] and RDFCube [43]). The classification of RDF data storage is: ...
Conference Paper
An important feature of Big Data is the dynamics with which the data enters the registration and processing system. In the classical version of a relational database, the time for resetting the indexes of the database is usually longer than the time of receipt of the dynamic (streaming) data. This leads at least to the loss, sometimes - of essential, data. In this article is proposed a model for synchronous processing of large volumes of semi-structured dynamic data. A new data structure called NLA-Layer is proposed, which makes it possible to store a practically unlimited amount of dynamic (streaming) data coming in continuously, with the data ready for access and processing immediately after storage. Experiments show that NLA-Layer allows the construction of Big Databases with constant time complexity O (1).
... RDFox [164] (2015) is an in-memory RDF engine that supports Datalog reasoning. The RDF graph is stored as a triple table implemented as a linked list, which stores identifiers for subject, predicate and object, as well as three pointers in the list to the next triple with the same subject, predicate and object (similar to Parliament [126]). Four indexes are built: a hash table for three constants, and three for individual constants; the indexes for individual constants offer pointers to the first triple in the list with that constant, where patterns with two constants can be implemented by filtering over this list, or (optionally) by using various orderings of the triple list to avoid filtering (e.g., a triple list ordered by SPO can be used to evaluate patterns with constant subject and predicate without filtering). ...
Article
Full-text available
RDF has seen increased adoption in recent years, prompting the standardization of the SPARQL query language for RDF, and the development of local and distributed engines for processing SPARQL queries. This survey paper provides a comprehensive review of techniques and systems for querying RDF knowledge graphs. While other reviews on this topic tend to focus on the distributed setting, the main focus of the work is on providing a comprehensive survey of state-of-the-art storage, indexing and query processing techniques for efficiently evaluating SPARQL queries in a local setting (on one machine). To keep the survey self-contained, we also provide a short discussion on graph partitioning techniques used in the distributed setting. We conclude by discussing contemporary research challenges for further improving SPARQL query engines. This extended version also provides a survey of over one hundred SPARQL query engines and the techniques they use, along with twelve benchmarks and their features.
... RDFox [138] (2015) is an in-memory RDF engine that supports materialization-based Datalog reasoning. The RDF graph is stored as a triple table implemented as a linked list, which stores identifiers for subject, predicate and object, as well as three pointers in the list to the next triple with the same subject, predicate and object (similar to Parliament [106]). Four indexes are built: a hash table for three constants, and three for individual constants; the indexes for individual constants offer pointers to the first triple in the list with that constant, where patterns with two constants can be implemented by filtering over this list, or (optionally) by using various orderings of the triple list to avoid filtering (e.g., a triple list ordered by spo can be used to evaluate patterns with constant subject and predicate without filtering). ...
Preprint
Full-text available
Recent years have seen the growing adoption of non-relational data models for representing diverse , incomplete data. Among these, the RDF graph-based data model has seen ever-broadening adoption, particularly on the Web. This adoption has prompted the standardization of the SPARQL query language for RDF, as well as the development of a variety of local and distributed engines for processing queries over RDF graphs. These engines implement a diverse range of specialized techniques for storage, indexing, and query processing. A number of benchmarks, based on both synthetic and real-world data, have also emerged to allow for contrasting the performance of different query engines , often at large scale. This survey paper draws together these developments, providing a comprehensive review of the techniques, engines and benchmarks for querying RDF knowledge graphs.
Thesis
Le Big Data représente un défi non seulement pour le monde socio-économique mais aussi pour la recherchescientifique. En effet, comme il a été souligné dans plusieurs articles scientifiques et rapports stratégiques, lesapplications informatiques modernes sont confrontées à de nouveaux problèmes qui sont liés essentiellement austockage et à l’exploitation de données générées par les instruments d’observation et de simulation. La gestion de tellesdonnées représente un véritable goulot d’étranglement qui a pour effet de ralentir la valorisation des différentesdonnées collectées non seulement dans le cadre de programmes scientifiques internationaux mais aussi par desentreprises, ces dernières s'appuyant de plus en plus sur l’analyse de données massives. Une bonne partie de cesdonnées sont publié aujourd’hui sur le WEB. Nous assistons en effet à une évolution du Web classique permettant degérer les documents vers un Web de données qui permet d’offrir des mécanismes d’interrogation des informationssémantiques. Plusieurs modèles de données ont été proposés pour représenter ces informations sur le Web. Le plusimportant est le Resource Description Framework (RDF) qui fournit une représentation des connaissances simple etabstraite pour les ressources sur le Web. Chaque fait du Web sémantique peut être codé avec un triplet RDF. Afin depouvoir explorer et interroger les informations structurées exprimées en RDF, plusieurs langages de requête ont étéproposés au fil des années. En 2008, SPARQL est devenu le langage de recommandation officiel du W3C pourl'interrogation des données RDF. La nécessité de gérer et interroger efficacement les données RDF a conduit audéveloppement de nouveaux systèmes conçus spécialement pour traiter ce format de données. Ces approches peuventêtre catégorisées en étant centralisées qui s’appuient sur une seule machine pour gérer les données RDF et distribuéesqui peuvent combiner plusieurs machines connectées avec un réseau informatique. 
Certaines de ces approchess’appuient sur un système de gestion de données existant tels que Virtuoso et Jena, d’autres approches sont basées surune approche spécialement conçue pour la gestion des triplets RDF comme GRIN, RDF3X et gStore. Avec l’évolutiondes jeux de données RDF (e.g. DBPedia) et du langage Sparql, la plupart des systèmes sont devenus obsolètes et/ouinefficaces. A titre d’exemple, aucun système centralisé existant n’est en mesure de gérer 1 Milliard de triplets fourniesdans le cadre du benchmark WatDiv. Les systèmes distribués permettraient sous certaines conditions d’améliorer cepoint mais une perte de performances conséquente est induite.Dans cette thèse, nous proposons le système centralisé "RDF_QDAG" qui permet de trouver un bon compromisentre passage à l’échelle et performances. Nous proposons de combiner la fragmentation physique de données etl’exploration du graphe de données. "RDF_QDAG" permet de support plusieurs types de requêtes basées nonseulement sur les motifs basiques de graphes mais aussi qui intègrent des filtres à base d’expressions régulières et aussides fonctions d’agrégation et de tri. "RDF_QDAG" se base sur le modèle d’exécution Volcano, ce qui permet decontrôler la mémoire principale, en évitant tout débordement pour garantir les performances même si la configurationmatérielle est limitée. A notre connaissance, "RDF_QDAG" est le seul système centralisé capable de gérer plusieursmilliards de triplets tout en garantissant de bonnes performances. Nous avons comparé ce système avec d’autressystèmes qui représentent l’état de l’art en matière de gestion de données RDF : une approche relationnelle (Virtuoso),une approche à base de graphes (g-Store), une approche d'indexation intensive (RDF-3X) et une approche MPP(CliqueSquare). "RDF_QDAG" surpasse les systèmes existants lorsqu’il s’agit de garantir à la fois le passage à l’échelleet les performances.
Article
Full-text available
RDF and RDF Schema are two W3C standards aimed at enriching the Web with machine-processable semantic data. We have developed Sesame, an architecture for ecient storage and expressive querying of large quantities of metadata in RDF and RDF Schema. Sesame's design and implementation are independent from any specic
Article
Full-text available
Large-scale Semantic Web applications require large-scale storage of Resource Description Framework (RDF) information and a means to analyze that information via the Web Ontology Language (OWL) in near real time. The Kowari Metastore was designed as a purpose-built RDF database to fulfill this requirement. Kowari provides a scalable, transaction-safe storage infrastructure for RDF statements and an expressive query language for their analysis, with or without the use of a subset of the RDF Schema and/or OWL languages. OWL Lite plus the full cardinality constraints from OWL Full are currently supported via the interactive Tucana Query Language (iTQL) or the Simple Ontology Framework API (SOFA). Kowari's native quad-store indexing scheme has been shown to scale to hundreds of millions of RDF statements on a single machine. Kowari is an Open Source project sponsored by Tucana Technologies and is licensed under the Mozilla Public License, version 1.1.
Conference Paper
Full-text available
RDF and RDF Schema are two W3C standards aimed at enriching the Web with machine-processable semantic data. We have developed Sesame, an architecture for efficient storage and expressive querying of large quantities of metadata in RDF and RDF Schema. Sesame’s design and implementation are independent from any specific storage device. Thus, Sesame can be deployed on top of a variety of storage devices, such as relational databases, triple stores, or object-oriented databases, without having to change the query engine or other functional modules. Sesame offers support for concurrency control, independent export of RDF and RDFS information and a query engine for RQL, a query language for RDF that offers native support for RDF Schema semantics. We present an overview of Sesame as a generic architecture, as well as its implementation and our first experiences with this implementation.
Conference Paper
OWLIM is a high-performance Storage and Inference Layer (SAIL) for Sesame, which performs OWL DLP reasoning based on forward-chaining of entailment rules. The reasoning and query evaluation are performed in-memory, while at the same time OWLIM provides reliable persistence based on N-Triples files. This paper presents OWLIM, together with an evaluation of its scalability over a synthetic, but realistic, dataset encoded with respect to the PROTON ontology. The experiment demonstrates that OWLIM can scale to millions of statements even on commodity desktop hardware. On an almost-entry-level server, OWLIM can manage a knowledge base of 10 million explicit statements, which are extended to about 19 million after forward chaining. The upload and storage speed is about 3,000 statements/sec. at the maximal size of the repository, but it starts at more than 18,000 (for a small repository) and slows down smoothly. As can be expected for such an inference strategy, delete operations are expensive, taking as much as a few minutes. At the same time, a variety of queries can be evaluated within milliseconds. The experiment shows that such reasoners can be efficient for very big knowledge bases in scenarios where delete operations need not be handled in real time.
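Forward-chaining of entailment rules, as described for OWLIM, means applying rules to the stored triples until no new triples can be derived (a fixpoint); every entailed statement is then materialized and queries need no reasoning at evaluation time. A toy sketch using two RDFS rules (this is not OWLIM's actual rule set or implementation):

```python
RDFS_SUBCLASS = "rdfs:subClassOf"
RDF_TYPE = "rdf:type"

def forward_chain(triples):
    """Naively apply RDFS rules rdfs9 and rdfs11 to a fixpoint."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in closed:
            if p == RDFS_SUBCLASS:
                for (s2, p2, o2) in closed:
                    # rdfs11: subClassOf is transitive.
                    if p2 == RDFS_SUBCLASS and s2 == o:
                        new.add((s, RDFS_SUBCLASS, o2))
                    # rdfs9: an instance of a subclass is an
                    # instance of the superclass.
                    if p2 == RDF_TYPE and o2 == s:
                        new.add((s2, RDF_TYPE, o))
        fresh = new - closed
        if fresh:
            closed |= fresh
            changed = True
    return closed
```

The sketch also shows why deletes are expensive under this strategy, as the abstract notes: removing one explicit statement may invalidate derived statements, forcing the closure to be recomputed.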
Conference Paper
Efficient management of RDF data is an important factor in realizing the Semantic Web vision. Performance and scalability issues are becoming increasingly pressing as Semantic Web technology is applied to real-world applications. In this paper, we examine the reasons why current data management solutions for RDF data scale poorly, and explore the fundamental scalability limitations of these approaches. We review the state of the art for improving performance for RDF databases and consider a recent suggestion, "property tables." We then discuss practically and empirically why this solution has undesirable features. As an improvement, we propose an alternative solution: vertically partitioning the RDF data. We compare the performance of vertical partitioning with prior art on queries generated by a Web-based RDF browser over a large-scale (more than 50 million triples) catalog of library data. Our results show that a vertically partitioned schema achieves similar performance to the property table technique while being much simpler to design. Further, if a column-oriented DBMS (a database architected specifically for the vertically partitioned case) is used instead of a row-oriented DBMS, another order of magnitude performance improvement is observed, with query times dropping from minutes to several seconds.
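The vertical partitioning idea can be sketched with SQLite (the table and property names here are illustrative, not from the paper's dataset): instead of one wide triples table, each property gets its own narrow (subject, object) table, so a query touching two properties becomes a join of two small tables rather than two scans of the whole store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One two-column table per property.
conn.execute("CREATE TABLE title (subject TEXT, object TEXT)")
conn.execute("CREATE TABLE author (subject TEXT, object TEXT)")
conn.executemany("INSERT INTO title VALUES (?, ?)",
                 [("book1", "Semantic Web Primer"), ("book2", "RDF in Depth")])
conn.executemany("INSERT INTO author VALUES (?, ?)",
                 [("book1", "Smith"), ("book2", "Jones")])

# Graph pattern  ?book title ?t . ?book author "Smith"
# becomes a join of the two property tables on subject.
rows = conn.execute(
    "SELECT t.subject, t.object FROM title t "
    "JOIN author a ON a.subject = t.subject "
    "WHERE a.object = 'Smith'"
).fetchall()
```

A column-oriented DBMS benefits further because each such table is effectively a pair of columns that compresses and scans well.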
Conference Paper
Devising a scheme for efficient and scalable querying of Resource Description Framework (RDF) data has been an active area of current research. However, most approaches define new languages for querying RDF data, which has the following shortcomings: 1) They are difficult to integrate with SQL queries used in database applications, and 2) They incur inefficiency as data has to be transformed from SQL to the corresponding language data format. This paper proposes a SQL based scheme that avoids these problems. Specifically, it introduces a SQL table function RDF_MATCH to query RDF data. The results of RDF_MATCH table function can be further processed by SQL's rich querying capabilities and seamlessly combined with queries on traditional relational data. Furthermore, the RDF_MATCH table function invocation is rewritten as a SQL query, thereby avoiding run-time table function procedural overheads. It also enables optimization of rewritten query in conjunction with the rest of the query. The resulting query is executed efficiently by making use of B-tree indexes as well as specialized subject-property materialized views. This paper describes the functionality of the RDF_MATCH table function for querying RDF data, which can optionally include user-defined rulebases, and discusses its implementation in Oracle RDBMS. It also presents an experimental study characterizing the overhead eliminated by avoiding procedural code at runtime, characterizing performance under various input conditions, and demonstrating scalability using 80 million RDF triples from UniProt protein and annotation data.
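The rewriting idea behind a table-function approach like RDF_MATCH can be illustrated in miniature with SQLite (this is a sketch of the general technique, not Oracle's implementation or syntax): a graph pattern of several triple patterns is rewritten as a self-join of a triples table, with one alias per pattern and shared variables turned into join conditions, after which the ordinary relational optimizer takes over.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("alice", "worksAt", "BBN"),
    ("alice", "knows", "bob"),
    ("bob", "worksAt", "BBN"),
])

# Pattern  ?x knows ?y . ?y worksAt "BBN"
# -> self-join: variable ?y shared between the two patterns
#    becomes the join condition t2.s = t1.o.
rows = conn.execute(
    "SELECT t1.s, t1.o FROM triples t1 "
    "JOIN triples t2 ON t2.s = t1.o "
    "WHERE t1.p = 'knows' AND t2.p = 'worksAt' AND t2.o = 'BBN'"
).fetchall()
```

Because the rewritten form is plain SQL, it composes freely with other relational queries, which is the integration benefit the paper emphasizes.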
Conference Paper
This paper describes an evolution of the 3store RDF storage system, extended to provide a SPARQL query interface and informed by lessons learned in the area of scalable RDF storage.
Conference Paper
Berkeley DB is an Open Source embedded database system with a number of key advantages over comparable systems. It is simple to use, supports concurrent access by multiple users, and provides industrial-strength transaction support, including surviving system and disk crashes. This paper describes the design and technical features of Berkeley DB, the distribution, and its license.
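Being embedded means Berkeley DB runs in the application's process and is accessed through a key/value API rather than SQL. As a stand-in sketch of that style of access, the following uses Python's standard-library `dbm` module (not Berkeley DB itself, though Berkeley DB's bindings expose a similar bytes-keyed interface); the key encoding is application-defined and purely illustrative.

```python
import dbm
import os
import tempfile

# Open (creating if missing) a file-backed key/value store in the
# application's own process -- no server, no query language.
path = os.path.join(tempfile.mkdtemp(), "store")
db = dbm.open(path, "c")

# Keys and values are bytes; the key layout is up to the application.
db[b"spo:alice|knows"] = b"bob"
value = db[b"spo:alice|knows"]
db.close()
```

This is the niche a triple store's low-level storage layer occupies: the RDF-specific logic lives above a simple embedded store of keyed byte strings.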
Conference Paper
Storing and querying Resource Description Framework (RDF) data is one of the basic tasks within any Semantic Web application. A number of storage systems provide assistance for this task. However, current RDF database systems do not use optimized indexes, which results in poor performance when querying RDF. In this paper we describe optimized index structures for RDF, show how to process and evaluate queries based on the index structure, describe a lightweight adaptable implementation in Java, and provide a performance comparison with existing RDF databases.
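A common building block of such optimized RDF index structures is dictionary encoding (shown here as an illustrative sketch, not the exact scheme of the paper): long URIs and literals are mapped once to compact integer IDs, and all indexes are then built over the small fixed-size IDs instead of the strings, shrinking the indexes and making comparisons cheap.

```python
class Dictionary:
    """Bidirectional mapping between RDF terms and integer IDs."""

    def __init__(self):
        self._to_id = {}
        self._to_term = []

    def encode(self, term):
        # Assign the next free ID on first sight; reuse it afterwards.
        if term not in self._to_id:
            self._to_id[term] = len(self._to_term)
            self._to_term.append(term)
        return self._to_id[term]

    def decode(self, term_id):
        return self._to_term[term_id]

def encode_triple(dictionary, s, p, o):
    """Replace each term of a triple with its integer ID."""
    return tuple(dictionary.encode(t) for t in (s, p, o))
```

With triples reduced to ID tuples, the index structures themselves (B-trees, sorted permutations, etc.) operate on uniform integers, and results are translated back to terms only when presented to the user.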