MARVIN: A platform for large-scale analysis of
Semantic Web data
Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, and Frank van Harmelen
Department of Computer Science, Vrije Universiteit Amsterdam, the Netherlands
Abstract—Web Science requires efficient techniques for
analysing large datasets. Many Semantic Web problems are
difficult to solve through common divide-and-conquer strategies,
since they are hard to partition. We present MARVIN, a parallel
and distributed platform for processing large amounts of RDF
data, on a network of loosely-coupled peers. We present our
divide-conquer-swap strategy and show that this model converges
towards completeness. We evaluate performance, scalability, load
balancing and efficiency of our system.
I. ANALYSING WEB DATA
Web Science involves, amongst others, the analysis and
interpretation of data and phenomena on the Web [9]. Since the
datasets involved are typically very large, efficient techniques
are needed for scalable execution of analysis jobs over these
datasets.
Traditionally, scaling computation through a divide-and-
conquer strategy has been successful in a wide range of data
analysis settings. Dedicated techniques have been developed
for analysis of Web-scale data through a divide-and-conquer
strategy, such as MapReduce [5].
In recent years, large volumes of Semantic Web
data have become available, to the extent that the data is
quickly outgrowing the capacity of storage systems and rea-
soning engines. Through the “linking open data” initiative, and
through crawling and indexing infrastructures [13], datasets
with millions or billions of triples are now readily available.
These datasets contain RDF triples and many RDFS and OWL
statements with implicit semantics [6].
From a Web Science viewpoint, these datasets are often
more interesting than the Web graph [9] of page hyperlinks.
First, since these datasets contain typed relations with partic-
ular meaning, they can be subjected to more detailed analysis.
Secondly, most of these datasets are not annotated Web pages
but rather interlinked exports of the “deep Web”, which has
traditionally been hard to obtain and analyse [14].
However, to process, analyse, and interpret such datasets
collected from the Web, infrastructure is needed that can scale
to these sizes, and can exploit the semantics in these datasets.
In contrast to other analysis tasks concerning Web data, it is
not clear how to solve many Semantic Web problems through
divide-and-conquer, since it is hard to split the problem into
independent partitions.
To illustrate this problem we will focus on a common and
typical problem: computing the deductive closure of these
datasets through logical reasoning. Recent benchmarks [2, 8]
show that current RDF stores can barely scale to the current
volumes of data, even without this kind of logical reasoning.
II. SCALABLE RDF REASONING
To deal with massive volumes of Semantic Web data, we
aim at building RDF engines that offer massively scalable
reasoning. In our opinion, such scalability can be achieved
by combining the following approaches:
• using parallel hardware which runs distributed algorithms that exploit such hardware regardless of the scale, varying from tens of processors to many hundreds (as in our experiments) or even many thousands;
• designing anytime algorithms that produce sound results where the degree of completeness increases over time. Such algorithms can trade the speed with which the inference process converges to completeness against the size of the dataset, while still guaranteeing eventual completeness;
• our novel divide-conquer-swap strategy, which extends the traditional approach of divide-and-conquer with an iterative procedure whose results converge towards completeness over time.
We have implemented our approach in MARVIN¹, a parallel and distributed platform for processing large amounts of RDF data. MARVIN consists of a network of loosely-coupled machines using a peer-to-peer model and does not require splitting the problem into independent subparts. MARVIN is based on the approach of divide-conquer-swap: peers autonomously partition the problem in some manner, each operates on some subproblem to find partial solutions, and then re-partitions its part and swaps it with another peer; all peers keep re-partitioning, solving, and swapping to find all solutions. We show that this model is sound, converges, and reaches completeness eventually.
III. RELATED WORK
Several techniques for distributed reasoning are based on
distributed hashtables (DHTs) [11]. Cai and Frank [4] intro-
duce a basic schema for indexing RDF in DHTs. This layout
leads to uneven load distribution between nodes since term
popularity in RDF exhibits a power-law distribution [13]. Fang
et al. [7] have an iterative forward-chaining procedure similar
to ours but do not address load-balancing issues. Kaoudi et al.
[10] propose a backward-chaining algorithm which seems
promising, but no conclusions can be drawn given the small
¹Named after Marvin, the paranoid android from the Hitchhiker's Guide to the Galaxy. Marvin has “a brain the size of a planet”, which he can seldom use: the true horror of Marvin's existence is that no task would occupy even the tiniest fraction of his vast intellect.
dataset (10⁴ triples) and atypical evaluation queries. Battré et al. [1] perform limited reasoning over the locally stored triples and introduce a policy to deal with load-balancing issues, but only compute a fraction of the complete closure.
Serafini and Tamilin [16] perform distributed description
logics reasoning; the system relies on manually created on-
tology mappings, which is quite a limiting assumption, and
its performance is not evaluated. Schlicht and Stuckenschmidt
[15] distribute the reasoning rules instead of the data: each
node is only responsible for performing a specific part of the
reasoning process. Although efficient by preventing duplicate
work, the weakest node in this setup becomes an immediate
bottleneck and a single-point-of-failure, since all data has to
pass all nodes for the system to function properly.
IV. OUR APPROACH: DIVIDE-CONQUER-SWAP
MARVIN operates using the following “main loop”, run on
a grid of compute nodes; in this loop, steps 3–5 are repeated
infinitely, and operational statistics are continuously gathered
from all compute nodes:
Algorithm 1 Divide-conquer-swap
1) The input data is divided into smaller chunks, which are
stored on a shared location.
2) A large number of “data processors” is started on the nodes of the grid (the nature of these processors depends on the task at hand; these could for example be reasoners or social graph analysers).
3) Each node reads some input chunks and computes the
corresponding output of this input data at its own speed.
4) On completion, each node selects some parts of the
computed data, and sends it to some other node(s) for
further processing. Asynchronous queues are used to
avoid blocking communication.
5) Each node copies (parts of) the computed data to some
external storage where the data can be queried on behalf
of end-users. These results grow gradually over time,
producing anytime behaviour.
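To make the control flow concrete, the following is a minimal sketch of one node's main loop in Java, the implementation language of MARVIN. It is illustrative only: the interfaces and names (TripleStore, Reasoner, Network, OutputStore, and so on) are assumptions made for this sketch and do not correspond to MARVIN's actual code.

    // Illustrative sketch of one node's divide-conquer-swap loop (hypothetical API).
    import java.util.Set;

    interface TripleStore { Set<String> readChunk(); }              // step 3: read some input chunk
    interface Reasoner    { Set<String> process(Set<String> in); }  // "conquer": task-specific processor
    interface Network     { int randomPeer(); void sendTo(int peer, Set<String> triples); }
    interface OutputStore { void copy(Set<String> triples); }       // step 5: anytime results for end-users

    final class MarvinNode implements Runnable {
        private final TripleStore input;
        private final Reasoner reasoner;
        private final Network network;
        private final OutputStore output;

        MarvinNode(TripleStore i, Reasoner r, Network n, OutputStore o) {
            input = i; reasoner = r; network = n; output = o;
        }

        @Override public void run() {
            while (true) {                                      // steps 3-5 repeat indefinitely
                Set<String> chunk = input.readChunk();          // divide: take some partition
                Set<String> derived = reasoner.process(chunk);  // conquer: compute local consequences
                network.sendTo(network.randomPeer(), derived);  // swap: hand (part of) the output to a peer
                output.copy(derived);                           // publish partial, gradually growing results
            }
        }
    }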
When performing divide-conquer-swap, we have to address two key trade-offs. First, we want to solve the problem as efficiently as possible, while at the same time minimising communication overhead and ensuring that the processing load is shared equally over all nodes. Secondly, to maximise efficiency with minimal communication overhead, we might let nodes process partially overlapping partitions; however, we then need an efficient method to detect when nodes produce duplicate data, to prevent unnecessary computations. We will discuss our approach to each of these trade-offs in turn.
A. Load balancing and efficient computation
All communication in MARVIN is pull-based: peers explic-
itly request data from other peers, which prevents overloading
peers with too much data. Nodes also protect themselves
against high loads by ignoring incoming requests for data
when they have more than some threshold of their commu-
nication channels in use. Above some other threshold, nodes
will stop all reasoning and reject incoming messages until
they empty their communication queues. All communication
is sanity-checked using timeouts: messages are dropped if they
cannot be delivered within a certain timeframe.
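A minimal sketch of this flow-control behaviour is shown below; the thresholds and method names are assumptions for illustration, not MARVIN's actual interfaces.

    // Sketch of pull-based flow control on a node (hypothetical thresholds and names).
    final class FlowControl {
        private final int busyThreshold;     // above this many busy channels, ignore incoming data requests
        private final int overloadThreshold; // above this, also pause reasoning until queues are drained
        private final long messageTimeoutMs; // drop messages not delivered within this timeframe

        FlowControl(int busyThreshold, int overloadThreshold, long messageTimeoutMs) {
            this.busyThreshold = busyThreshold;
            this.overloadThreshold = overloadThreshold;
            this.messageTimeoutMs = messageTimeoutMs;
        }

        /** Should this node answer an incoming request for data? */
        boolean acceptDataRequest(int channelsInUse) {
            return channelsInUse <= busyThreshold;
        }

        /** Should this node keep reasoning, or first empty its communication queues? */
        boolean keepReasoning(int channelsInUse) {
            return channelsInUse <= overloadThreshold;
        }

        /** Sanity check: drop a message that could not be delivered in time. */
        boolean shouldDrop(long queuedAtMillis, long nowMillis) {
            return nowMillis - queuedAtMillis > messageTimeoutMs;
        }
    }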
Reasoning in a distributed system involves a trade-off
between overall efficiency and individual load-balancing: an
efficient distribution would route all triples involving some
given term to one node, where all inferences based on these
triples can be drawn. However, since terms are distributed very unevenly (some terms are much more popular than others; see [13]), the nodes responsible for popular terms will have a much higher load.
As discussed in section III, existing approaches that use
distributed hash-tables for RDF reasoning suffer from such
load-balancing problems. A uniform random distribution of triples over nodes solves such load-balancing problems (since all triples are distributed evenly) but is not very efficient for drawing inferences.
We have implemented a hybrid approach called pull-DHT,
with two characteristic features:
• in contrast to common DHT-based approaches, nodes pull data instead of getting triples pushed to them, and
• when asking peers for data, nodes still get a random distribution of triples, but instead of being uniform, the distribution is biased towards triples that fall into their address space (based on the hash value of the triple and the node's rank in the network).
The pull-DHT strikes a balance between the perfectly
balanced but inefficient random routing and the efficient but
unbalanced DHT routing.
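The sketch below illustrates one way such a biased pull could be realised; the class, the bias parameter, and the use of the triple's hash against the requester's rank are assumptions for illustration, not MARVIN's exact routing code. Setting bias to 0.5 recovers uniform random routing, while a bias close to 1 approaches a strict DHT assignment.

    // Sketch of a biased pull: answering a data request from a peer (hypothetical API).
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    final class PullDht {
        private final Random random = new Random();

        /**
         * Select up to 'max' triples from the local pool to send to the requesting node.
         * Triples in the requester's address space (hash(t) mod N == requesterRank) are
         * favoured with probability 'bias'; other triples are still sent occasionally,
         * so the distribution stays random but biased.
         */
        List<String> selectForPeer(List<String> pool, int requesterRank, int totalNodes,
                                   int max, double bias) {
            List<String> selected = new ArrayList<>();
            for (String triple : pool) {
                if (selected.size() >= max) break;
                boolean inAddressSpace =
                    Math.floorMod(triple.hashCode(), totalNodes) == requesterRank;
                double acceptProbability = inAddressSpace ? bias : 1.0 - bias;
                if (random.nextDouble() < acceptProbability) {
                    selected.add(triple);
                }
            }
            return selected;
        }
    }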
B. Duplicate detection and removal
Since our aim is to minimise the time spent for deduction of
the closure, we should spend most time computing new facts
instead of re-computing known facts. Duplicate triples can be generated for several reasons, such as redundancy in the initial dataset, sending identical triples to several peers, or deriving the same conclusions from different premises.
In reasonable quantities duplicate triples may be useful:
they may participate, in parallel, in different deductions. In
excess, however, they pose a major overhead: they cost time to
produce and process and they occupy memory and bandwidth.
Therefore, we typically want to limit the amount of duplicate
triples in the system.
To remove duplicates from the system, they need to be detected. However, given the size of the data, peers cannot keep a list of all previously seen triples in memory: even using an optimal data structure such as a Bloom filter [3] with only 99% confidence (i.e. a 1% false-positive rate), storing the existence of 8 billion triples would occupy some 9.5 GB of memory on each peer.
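As a rough check on this figure (a back-of-the-envelope calculation under the standard Bloom filter sizing formula, not taken from the original text): the optimal number of bits for n elements at false-positive rate p is m = -n · ln(p) / (ln 2)². With n = 8 × 10⁹ and p = 0.01 this gives m ≈ 8 × 10⁹ × 9.6 ≈ 7.7 × 10¹⁰ bits, i.e. roughly 9.6 × 10⁹ bytes, consistent with the 9.5 GB estimate above.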
We tackle this issue by distributing the duplicate detection effort, implementing a one-exit-door policy: we assign the responsibility to detect each triple's uniqueness to a single peer, using a uniform hash function: exit_door(t) = hash(t) mod N, where t is a triple and N is the number of nodes. The exit door uses a Bloom filter to detect previously encountered triples: it marks the first copy of each triple as the master copy, and removes all subsequent copies.
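The following sketch illustrates the one-exit-door check; the names are hypothetical, and a HashSet stands in for the Bloom filter to keep the example self-contained, whereas MARVIN uses a space-efficient Bloom filter as described above.

    // Sketch of the one-exit-door policy (hypothetical API; HashSet stands in for a Bloom filter).
    import java.util.HashSet;
    import java.util.Set;

    final class ExitDoor {
        private final int nodeRank;     // this node's rank in the network
        private final int totalNodes;   // N
        private final Set<String> seen = new HashSet<>(); // a Bloom filter in the real system

        ExitDoor(int nodeRank, int totalNodes) {
            this.nodeRank = nodeRank;
            this.totalNodes = totalNodes;
        }

        /** exit_door(t) = hash(t) mod N: is this node the exit door for triple t? */
        boolean isResponsibleFor(String triple) {
            return Math.floorMod(triple.hashCode(), totalNodes) == nodeRank;
        }

        /**
         * Returns true if the triple should be kept: either it is not ours to check,
         * or it is the first (master) copy seen at its exit door. Later copies return
         * false and can be dropped by the caller.
         */
        boolean admit(String triple) {
            if (!isResponsibleFor(triple)) return true; // not our triple: pass it along unchanged
            return seen.add(triple);                    // first copy -> master copy; duplicates -> drop
        }
    }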
For a large number of nodes, however, the one-exit-door policy becomes less efficient, since the probability of a triple randomly appearing at its exit door is 1/N for N nodes. Therefore, we have an additional and configurable sub-exit-door policy, where some k peers are responsible for explicitly routing some triples to an exit door, instead of waiting until the triples arrive at the designated exit door randomly.
A final optimisation that we call the dynamic sub-exit-door policy makes k dependent on the number of triples in each local output buffer, raising k when the system is loaded and lowering it when the system is underutilised. This mechanism effectively works as a pressure valve, relieving the system when pressure gets too high. The policy is implemented with two thresholds: if the number of triples in the output pool exceeds t_upper then we set k = N; if it is below t_lower then we set k = 0.
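A minimal sketch of this pressure-valve behaviour follows; the field names are assumptions for illustration, not MARVIN's actual code.

    // Sketch of the dynamic sub-exit-door policy: k acts as a pressure valve (hypothetical API).
    final class SubExitDoorPolicy {
        private final int totalNodes;   // N
        private final long tLower;      // below this pool size, stop explicit exit-door routing
        private final long tUpper;      // above this pool size, route everything to the exit doors

        private int k = 0;              // number of peers currently routing triples to exit doors explicitly

        SubExitDoorPolicy(int totalNodes, long tLower, long tUpper) {
            this.totalNodes = totalNodes;
            this.tLower = tLower;
            this.tUpper = tUpper;
        }

        /** Re-evaluate k from the current size of the local output pool. */
        void update(long outputPoolSize) {
            if (outputPoolSize > tUpper) {
                k = totalNodes;         // system loaded: every peer routes to the exit doors
            } else if (outputPoolSize < tLower) {
                k = 0;                  // system underutilised: rely on triples reaching exit doors by chance
            }
            // between the two thresholds, k keeps its previous value
        }

        int currentK() { return k; }
    }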
V. EVENTUAL COMPLETENESS
In this section we will provide a qualitative model to study
the completeness of MARVIN. Assuming a sound external
procedure in the “conquer” step, overall soundness is evident
through inspection of the basic loop, and we will not discuss
it further.
The interesting question is not only whether MARVIN is
complete: we want to know to which extent it is complete, and
how this completeness evolves over time. For such questions,
tools from logic do not suffice since they treat completeness as
a binary property, do not analyse the degree of completeness
and do not provide any progressive notion of the inference
process. Instead, an elementary statistical approach yields
more insight.
Let C* denote the deductive closure of the input data: all triples that can be derived from the input data. Given MARVIN's soundness, we can consider each inference as a “draw” from this closure C*. Since MARVIN derives its conclusions gradually over time, we can regard MARVIN as performing a series of repeated draws from C* over time. The repeated draws from C* may yield triples that have been drawn before: peers could re-derive duplicate conclusions that had been previously derived by others. Still, by drawing at each timepoint t a subset C(t) from C*, we gradually obtain more and more elements from C*.
In this light, our completeness question can be rephrased as follows: how does the union of all sets C(t) grow with t? Will ∪_t C(t) = C* for some value of t?
At which rate will this convergence happen? Elementary statistics tells us that if we draw t times a set of k elements from a set of size N, the number of distinct drawn elements is expected to be N × (1 - (1 - k/N)^t). Of course, this is the expected number of distinct drawn elements after t iterations, since the number of drawn duplicates is governed by chance, but the “most likely” (expected) number of distinct elements after t iterations is N × (1 - (1 - k/N)^t), and in fact the variance of this expectation is very low when k is small compared to N.
Fig. 1. Predicted rate of unique triples produced
In our case, N = |C*|, the size of the full closure, and k = |C(t)|, the number of triples jointly derived by all nodes at time t, so that the expected completeness γ(t) after t iterations is:
γ(t) = 1 - (1 - |C(t)| / |C*|)^t
Notice that the boundary conditions on γ(t) are reasonable: at t = 0, when no inference has been done, we have maximal incompleteness (γ(0) = 0); for trivial problems where the peers can compute the full closure in a single step (i.e. |C(1)| = |C*|), we have immediate full completeness (γ(1) = 1); and in general if the peers are more efficient (i.e. they compute a larger slice of the closure at each iteration), then |C(t)|/|C*| is closer to 1, and γ(t) converges faster to 1, as expected. The graph of unique triples produced over time, as predicted by this model, is shown in figure 1. The predicted completeness rate fits the curves that we find in experimental settings, shown in the next section.
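For instance (an illustrative calculation, not taken from the paper): if at each iteration the peers jointly derive 5% of the closure, i.e. |C(t)| / |C*| = 0.05, then γ(10) = 1 - 0.95¹⁰ ≈ 0.40 and γ(50) = 1 - 0.95⁵⁰ ≈ 0.92, reflecting the fast initial growth and gradually flattening tail predicted by the model.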
This completeness result is quite robust. In many realistic situations, at each timepoint the joint nodes will only compute a small fraction of the full closure (|C(t)| ≪ |C*|), so that γ(t) is a reliable expectation with only small variance. Furthermore, completeness still holds when |C(t)| decreases over t, which would correspond to the peers becoming less efficient over time, through for example network congestion or increased redundancy between repeated computations.
Our analytical evaluation shows that reasoning in MARVIN
converges and reaches completeness eventually. Still, conver-
gence time depends on system parameters such as the size of
internal buffers, the routing policy, and the exit policy. In the
next section, we report on empirical evaluations to understand
the influence of these parameters.
VI. EVALUATION
We have implemented MARVIN in Java, on top of Ibis, a
high-performance communication middleware [12]. Ibis offers
an integrated solution that transparently deals with many
complexities in distributed programming such as network con-
nectivity, hardware heterogeneity, and application deployment.
We have experimented with many internal parameters such
as data distribution or routing policy; MARVIN is equipped
with tools for logging key performance indicators, to facilitate
experimentation until optimal settings for the task at hand are
found.
Experiments were run on the Distributed ASCI Supercomputer 3 (DAS-3), a five-cluster grid system, consisting in total of 271 machines with 791 cores at 2.4 GHz and 4 GB of RAM per machine. All experiments used the Sesame in-memory store with a forward-chaining RDFS reasoner. All experiments were limited to a maximum runtime of one hour, and were run on smaller parts of the DAS-3, as detailed in each experiment. The datasets used were RDF Wordnet² and SwetoDBLP³.
Wordnet contains around 1.9M triples, with 41 distinct predicates and 22 distinct classes; the DBLP dataset contains around 14.9M triples, with 145 distinct predicates and 11 distinct classes. Although the schemas used are quite small, we did not exploit this fact in our algorithm (e.g. by distributing the schemas to all nodes a priori) because such an optimisation would not be possible for larger or initially unknown schemas.
A. Baseline: null reasoner
To validate the behaviour of the baseline system components such as buffers and routing algorithms, we created a “null reasoner” which simply outputs all its input data. We thus measure the throughput of the communication substrate and the overhead of the platform.
In this setup, the system reached a sustained throughput
of 72.9 Ktps (thousand triples per second), with a sustained
transfer rate of 26.5 MB/s per node. Typically, just indexing
RDF data is slower (some 20–40 Ktps) [2], and reasoning
is even more computationally expensive. Therefore, we can
expect the inter-node communication (in the network used
during our experiments) not to be a performance bottleneck.
B. Scalability
We have designed the system to scale to a large number of nodes; the Ibis middleware, based on solid grid technology, supports this scaling. Figure 2 shows the speedup gained by additional computational resources (using random routing, on the SwetoDBLP dataset), plotting the number of unique triples produced for systems of 1–64 nodes. As we can see, the system scales gracefully.
The sharp bends in the growth curves (especially with a small number of nodes) are attributed to the dynamic exit doors opening: having reached the t_upper threshold, the nodes start sending their triples to the exit door, where they are counted and copied to the storage bin.
nodes   time (min)   speedup   scaled speedup
1       44           -         -
2       30           1.47      0.73
4       26           1.69      0.42
8       20           2.20      0.28
16      9.5          4.63      0.29
32      6.2          7.10      0.22
64      3.4          12.94     0.20
TABLE I
ABSOLUTE AND SCALED SPEEDUP FOR SWETODBLP DATASET
²http://larkc.eu/marvin/experiments/wordnet.nt.gz
³http://larkc.eu/marvin/experiments/swetodblp.nt.gz
Fig. 2. Triples derived using an increasing number of nodes (total unique triples produced over time in minutes, for N = 1, 2, 4, 8, 16, 32, and 64 nodes)
Fig. 3. Triples derived using different routing strategies (total unique triples produced over time in minutes, comparing random routing and the pull-DHT against the input data)
Table I shows the time needed to produce a fixed number of triples (namely, 20M) from the SwetoDBLP dataset, for different numbers of nodes, together with the corresponding speedup (time on a single node divided by the time on N nodes) and the scaled speedup (speedup divided by the number of nodes). For example, with 64 nodes the speedup is 44/3.4 ≈ 12.9 and the scaled speedup 12.9/64 ≈ 0.20. A perfect linear speedup would equal the number of nodes and result in a scaled speedup of 1. To the best of our knowledge no relevant literature is available in the field to compare these results against, but a sublinear speedup is to be expected in general.
C. Load balancing and efficiency
Figure 3 shows a comparison of the random triple routing
with the pull-DHT. The graphs show the time needed for
producing the number of unique triples shown. The pull-
DHT outperforms the random routing, converging faster and
producing more unique triples in total. In all experiments
presented in this section, the load was balanced evenly over
all nodes.
Fig. 4. Triples derived using the dynamic exit-door policy (total unique triples produced over time in minutes, for low, medium, and high tolerance for copies, against the input data)
D. Duplicate detection and removal
We have experimented with three different settings of the dynamic sub-exit door: “low”, where t_lower = α and t_upper = 2α; “medium”, where t_lower = 2α and t_upper = 4α; and “high”, where t_lower = 4α and t_upper = 8α, where α is the number of input triples divided by N.
These different settings were tested on the Wordnet dataset,
using 16 nodes with the random routing policy. The results
are shown in figure 4. As we can see, in the “low” setting,
the system benefits from having low tolerance to duplicates:
they are removed immediately, leaving bandwidth and com-
putational resources to produce useful unique new triples. On
the other hand, the duplicate detection comes at the cost of
additional communication needed to send triples to the exit
doors (not shown in the figure).
VII. CONCLUSION
We have presented a platform for analysing Web data, with
a focus on the Semantic Web. To process and interpret these
datasets, we need an infrastructure that can scale to Web size
and exploit the available semantics. In this paper, we have
focused on one particular problem: computing the deductive
closure of a dataset through logical reasoning.
We have introduced MARVIN, a platform for massive dis-
tributed RDF inference. MARVIN uses a peer-to-peer architec-
ture to achieve massive scalability by adding computational
resources through our novel divide-conquer-swap approach.
MARVIN guarantees eventual completeness of the inference
process and produces its results gradually (anytime behaviour).
Through its modular design and its built-in instrumentation,
MARVIN provides a versatile experimentation platform with
many configurations.
We have experimented with various reasoning strategies
using MARVIN. The experiments presented show that MARVIN
scales gracefully with the number of nodes, that the commu-
nication overhead is not the bottleneck during computation,
and that duplicate detection and removal is crucial for perfor-
mance. Furthermore, we have introduced an initial pull-DHT
routing policy, improving performance without disturbing load
balancing.
Acknowledgements: This work is supported by the Euro-
pean Commission under the LarKC project (FP7-215535).
REFERENCES
[1] D. Battré, A. Höing, F. Heine, and O. Kao. On triple dissemination, forward-chaining, and load balancing in DHT based RDF stores. In Proceedings of the VLDB Workshop on Databases, Information Systems and Peer-to-Peer Computing (DBISP2P). 2006.
[2] C. Bizer and A. Schultz. Benchmarking the performance of
storage systems that expose SPARQL endpoints. In Proceedings
of the ISWC Workshop on Scalable Semantic Web Knowledge-
base systems. 2008.
[3] B. H. Bloom. Space/time trade-offs in hash coding with
allowable errors. Communications of the ACM, 13(7):422–426,
1970.
[4] M. Cai and M. Frank. RDFPeers: A scalable distributed RDF
repository based on a structured peer-to-peer network. In
Proceedings of the International World-Wide Web Conference.
2004.
[5] J. Dean and S. Ghemawat. Mapreduce: Simplified data process-
ing on large clusters. In Proceedings of the USENIX Symposium
on Operating Systems Design & Implementation (OSDI), pp.
137–147. 2004.
[6] M. Dean. Towards a science of knowledge base performance
analysis. In Proceedings of the ISWC Workshop on Scalable
Semantic Web Knowledge-base systems. 2008.
[7] Q. Fang, Y. Zhao, G. Yang, and W. Zheng. Scalable distributed
ontology reasoning using DHT-based partitioning. In Proceed-
ings of the Asian Semantic Web Conference (ASWC). 2008.
[8] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL
knowledge base systems. Journal of Web Semantics, 3:158–182,
2005.
[9] J. Hendler, N. Shadbolt, W. Hall, T. Berners-Lee, et al. Web
science: An interdisciplinary approach to understanding the
web. Communications of the ACM, 51, 2008.
[10] Z. Kaoudi, I. Miliaraki, and M. Koubarakis. RDFS reasoning
and query answering on top of DHTs. In Proceedings of the
International Semantic Web Conference (ISWC). 2008.
[11] K. Lua, J. Crowcroft, M. Pias, R. Sharma, et al. A survey and
comparison of peer-to-peer overlay network schemes. IEEE
Communications Surveys & Tutorials, pp. 72–93, 2004.
[12] R. V. van Nieuwpoort, J. Maassen, G. Wrzesinska, R. Hofman,
et al. Ibis: a flexible and efficient Java based grid programming
environment. Concurrency and Computation: Practice and
Experience, 17(7-8):1079–1107, 2005.
[13] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, et al. Sindice.com:
A document-oriented lookup index for open linked data. In-
ternational Journal of Metadata, Semantics and Ontologies,
3(1):37–52, 2008.
[14] S. Raghavan and H. Garcia-Molina. Crawling the hidden web.
In Proceedings of the International Conference on Very Large
Data Bases (VLDB), pp. 129–138. 2001.
[15] A. Schlicht and H. Stuckenschmidt. Distributed resolution
for ALC. In Proceedings of the International Workshop on
Description Logics. 2008.
[16] L. Serafini and A. Tamilin. DRAGO: Distributed Reasoning
Architecture for the Semantic Web. In Proceedings of the
European Semantic Web Conference (ESWC), pp. 361–376.
2005.