SPARQL-to-SQL on Internet of Things Databases
and Streams
Eugene Siow, Thanassis Tiropanis, and Wendy Hall
Electronics & Computer Science, University of Southampton
Abstract. To realise a semantic Web of Things, the challenge of achiev-
ing efficient Resource Description Format (RDF) storage and SPARQL
query performance on Internet of Things (IoT) devices with limited re-
sources has to be addressed. State-of-the-art SPARQL-to-SQL engines
have been shown to outperform RDF stores on some benchmarks. In
this paper, we describe an optimisation to the SPARQL-to-SQL ap-
proach, based on a study of time-series IoT data structures, that em-
ploys metadata abstraction and efficient translation by reusing existing
SPARQL engines to produce Linked Data ‘just-in-time’. We evaluate our
approach against RDF stores, state-of-the-art SPARQL-to-SQL engines
and streaming SPARQL engines, in the context of IoT data and scenar-
ios. We show that storage efficiency, with succinct row storage, and query
performance can be improved from 2 times to 3 orders of magnitude.
Keywords: SPARQL, SQL, Query Translation, Analytics, Internet of
Things, Web of Things
1 Introduction
The Internet of Things (IoT) envisions a world-wide, interconnected network
of smart physical entities with the aim of providing technological and societal
benefits [9]. However, as the W3C Web of Things (WoT) Interest Group charter
states, the IoT is currently beset by product silos and, to unlock its potential,
an open ecosystem based upon open standards for identification, discovery and
interoperation of services is required.
We see a semantic Web of Things as such an information space, with rich
descriptions, shared data models and constructs for interoperability that utilises
but is not limited to semantic & web technologies to provide an application layer
for IoT applications. As Barnaghi et al. [3] have proposed, semantic technologies
can serve to facilitate interoperability, data abstraction, access and integration
with other cyber, social or physical world data.
The semantic WoT does present a set of unique challenges: handling and
storing time-series data as RDF, querying with SPARQL on limited IoT devices
and distributed usage scenarios. Buil-Aranda et al. [5] have examined traditional
SPARQL endpoints on the web and shown that performance for generic queries
can vary by up to 3-4 orders of magnitude. Endpoints generally limit or have
worsened reliability when issued with a series of non-trivial queries. IoT devices
have added resource constraints; however, we argue that time-series IoT data
and distribution also present the opportunity for specific optimisation.
The contribution of this paper is to present an optimisation of SPARQL-to-
SQL query translation for the particular case of time-series data, both historical
and streaming, with a novel approach that uses existing SPARQL engines to
resolve Basic Graph Patterns and mappings that allow intermediate nodes of
observations to be ‘collapsed’. This is informed by our study of IoT schemata,
which exhibit a flat and wide structure. Our approach compares favourably
to native RDF storage, SPARQL-to-SQL engines and RDF stream processing
engines deployed on compact, resource-constrained devices, showing 2 times to 3
orders of magnitude performance and storage improvements on published sensor
benchmarks and IoT use cases like smart homes.
In Section 2, we first study the structure of time-series IoT data which leads
us, in Section 3, to study related work. We then describe the design and im-
plementation of our approach, that employs metadata abstraction through map-
pings and SPARQL-to-SQL translation for performance, reusing, at the core, any
existing SPARQL engine in Section 4. Finally, we evaluate our approach against
traditional RDF stores, SPARQL-to-SQL engines and streaming engines using
an established benchmark and a common IoT scenario in Section 5. Results are
presented and discussed in Section 6 with the conclusion in Section 7.
2 Structure of Internet of Things Data
To investigate the structure of data produced by sensors in the Internet of
Things, we collected the schemata of 19,914 unique IoT devices from public
data streams on Dweet.io over a one-month period in January 2016. Dweet.io
is a cloud platform that supports the publishing of time-series data
from IoT devices in JavaScript Object Notation (JSON). The schema represented
in JSON can be flat (row-like, with a single level of data) or complex (tree-
like/hierarchical, with multiple nested levels of data). After removing the 1,542
(7.7%) schemata that were empty, we observed that 18,280 (99.5%) of
the non-empty schemata were flat while only 92 (0.5%) were complex.
We also analysed the schemata to investigate how wide the IoT data was.
Wideness is defined as the number of properties beside the timestamp and a
schema is considered wide if there are 2 or more such properties. We found
that 92.2% of the devices sampled had a schema that was wide. The majority
(53.2%) had 4 properties related to each timestamp. We also obtained a smaller
alternative sample of 614 unique devices (over the same period) from Sparkfun,
which only supports flat schemata; this confirmed that most (76.3%) of the IoT
devices sampled have wide time-series schemata.
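The flat/wide classification described above can be sketched in a few lines; the schema fields here are hypothetical, not drawn from the actual Dweet.io or Sparkfun crawls.

```python
# Sketch of the schema analysis: classify a device's JSON schema as flat
# vs. complex (nested) and as wide (2 or more non-timestamp properties).
# Field names below are illustrative examples only.

def is_flat(schema: dict) -> bool:
    """A schema is flat if no value is itself a nested object or list."""
    return not any(isinstance(v, (dict, list)) for v in schema.values())

def is_wide(schema: dict, ts_key: str = "timestamp") -> bool:
    """A schema is wide if it has 2 or more properties beside the timestamp."""
    return len([k for k in schema if k != ts_key]) >= 2

flat_wide = {"timestamp": "2016-01-05T10:00:00", "temp": 21.5,
             "humidity": 40.1, "lux": 300}
nested = {"timestamp": "2016-01-05T10:00:00",
          "readings": {"temp": 21.5, "humidity": 40.1}}

print(is_flat(flat_wide), is_wide(flat_wide))  # True True
print(is_flat(nested))                         # False
```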
We concluded that our sample of over 20,000 unique IoT devices from Dweet.io
and Sparkfun contained 1) flat and 2) wide IoT time-series data. It follows that
a possible succinct representation of such data is as rows in a relational database
with column headings corresponding to properties. SPARQL-to-SQL translation
is then a possibility for querying. Investigating column stores was out of the scope
of this study but provides interesting future work and comparison, as the
tension between inserts/updates and optimising data structures for reads [15] is
reduced for time-series data, which is already sorted by time in entry-sequence
order. The IoT schemata we collected are available on Github.
3 Related Work
The fact that we are dealing with time-series sensor data, represented as Linked
Data with ontologies like the Semantic Sensor Network (SSN) ontology and
Linked Sensor Data [12] for interoperability, motivates the study of: i) RDF
stores, ii) R2RML and SPARQL-to-SQL translation with relational databases
to improve performance and storage efficiency for time-series data as rows, and
iii) streaming engines for efficient processing of real-time streams.
RDF Stores Virtuoso [8] is based on an Object Relational DBMS optimised for
RDF storage while Jena Tuple Database (TDB) is a native Java RDF store using
a single table to store triples/quads. Indexes, like the 6 SPO (Subject-Predicate-
Object) permutations that Neumann et al. [11] propose often improve query
performance on tables by reducing scans. TDB creates 3 triple indexes (OSP,
POS, SPO) and 6 quad indexes while Virtuoso creates 5 quad indexes (PSOG,
POGS, SP, OP, GS; G is graph). Commercial stores like GraphDB, formerly
OWLIM, have also been shown to perform well on benchmarks [4] with 6 indexes
(PSO, POS, entities, classes, predicates, literals). Indexing, however, increases
the storage size and memory required to load them.
Relational Databases (SPARQL-to-SQL) Efficient SPARQL-to-SQL trans-
lation that improves performance and builds on previous literature has been
investigated by Rodriguez-Muro et al. [14] and Priyatna et al. [13] with state-of-
the-art engines ontop and morph respectively. Both engines support R2RML,
a W3C recommendation based on the concept of mapping logical tables in re-
lational databases to RDF via Triples Maps (the subject, predicate and object
in a triple can be mapped to columns in a table). They also optimise query
translation to remove redundant self-joins. Ontop, which translates mappings
and queries to a set of Datalog rules, applies query containment and semantic
query optimisation to create efficient SQL queries. However, 1) R2RML is de-
signed for generality rather than abstracting and ‘collapsing’ (reducing self-joins
on identifier columns in tables mapping to IRI templates) intermediate nodes
(Section 4), 2) time-series data can differ from relational data (e.g. it does
not have primary keys), and 3) the round trip to retrieve database metadata (ontop)
can be significant on devices with slower disk/memory access.
Streaming Engines The C-SPARQL [1] engine supports continuous pull-based
SPARQL queries over RDF data streams by using Esper, a complex event pro-
cessing engine, to form windows in which SPARQL queries can be executed on
an in-memory RDF model. CQELS [10] is a native RDF stream engine, sup-
porting push and pull queries, that takes a ‘white-box’ approach for full control
over query optimisation and execution. morph-streams, from SPARQLstream
[6], supports query rewriting with R2RML mappings and execution with Esper.
4 Designing a SPARQL-to-SQL engine for the IoT
Based on the ontologies for integrating time-series sensor data, the SSN ontol-
ogy, Semantic Sensor Web and Linked Sensor Data (LSD) [12] mentioned in
the previous section, we observe that semantic sensor data is modelled as 1) IoT
device metadata like the location and specifications of sensors, 2) IoT observation
metadata like the units of measure and types of observation, and 3) IoT observation
data like timestamps and actual readings. Listing 1.1 shows an example division
into the 3 categories from the Linked Sensor Data dataset in RDF Turtle.
Listing 1.1. LSD example, rainfall from Station 4UT01 (abbreviated)
@prefix ssw: <http://knoesis.wright.edu/ssw/ont/sensor-observation.owl#> .
@prefix weather: <http://knoesis.wright.edu/ssw/ont/weather.owl#> .
@prefix wgs: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix sen: <http://knoesis.wright.edu/ssw/> .

sen:System_4UT01 ssw:processLocation              # Device Metadata
  [ wgs:lat "40.82944" ; wgs:long "-111.88222" ] .
_:obs a weather:RainfallObservation ;             # Observation Metadata
  ssw:observedProperty weather:_Rainfall ;
  ssw:procedure sen:System_4UT01 ;
  ssw:result _:data ; ssw:samplingTime _:time .
_:data a ssw:MeasureData ;
  ssw:uom weather:degrees .
_:time a time:Instant ;
  time:inXSDDateTime "2003-03-31T12:35:00" .      # Observation Data
_:data ssw:floatValue "0.1" .
Table 1. LSD example, abbreviated row from the table 4UT01
Time                 Rainfall  RelativeHumidity  ...
2003-03-31T12:35:00  0.1       37.0              ...
Although Linked Data as implemented in RDF is flexible and expressive
enough to represent both data and metadata as triples, as seen in Listing 1.1,
the resource constraints of IoT devices lead us to make these hypotheses:
1. Storing flat and wide IoT observation data as rows is more efficient than
storage as RDF as each field value in a row, under a column header, does
not require additional subject and predicate terms (Table 1).
2. Queries that retrieve more fields from a row (e.g. Rainfall & RelativeHumid-
ity) will require fewer joins than on RDF stores and so perform better.
3. Most device and observation metadata can be abstracted and stored in-
memory, with a mapping language that can express this. Metadata triples
can be produced ‘just-in-time’ and intermediate nodes (e.g. ssw:MeasureData
in Listing 1.1), if not projected in queries, can be ‘collapsed’ (reduces joins
in RDF stores and self joins on identifier columns in tables that map to
intermediate nodes, e.g. _:obs, _:data and _:time for SPARQL-to-SQL).
4. Efficient queries can be produced without relying on primary keys within
time-series data and retrieving database schema from IoT devices.
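Hypothesis 1 can be illustrated with a back-of-envelope term count; the formula below is our simplification, not the paper's storage model, with the figure of 3 intermediate nodes following Listing 1.1 (_:obs, _:data, _:time).

```python
# Back-of-envelope illustration of hypothesis 1: an observation row with n
# field values needs n cells (column headers are stored once per table),
# whereas the same data as RDF needs a subject and predicate term per
# value, plus triples linking the intermediate observation-metadata nodes.

def rdf_terms(n_fields: int, intermediate_nodes: int = 3) -> int:
    # each field: subject + predicate + object, plus a triple
    # (3 terms) per intermediate node linking the graph together
    return 3 * n_fields + 3 * intermediate_nodes

def row_cells(n_fields: int) -> int:
    return n_fields  # headers are per-table, not per-row

n = 4  # the most common width observed in Section 2
print(rdf_terms(n), row_cells(n))  # prints: 21 4
```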
4.1 sparql2sql and sparql2stream
We present, based on our hypotheses, sparql2sql (translates SPARQL-to-SQL)
and sparql2stream (translates SPARQL to Event Processing Language (EPL)
for streams) engines. They utilise the same core to provide a holistic approach
to SPARQL translation for both historical and streaming IoT datasets.
Firstly, to support SPARQL-to-SQL translation, a mapping for IoT data
stored as rows is required. R2RML (as in Section 3) is designed for generality
rather than for specific IoT time-series data. As such, we propose S2SML in
Section 4.2, an R2RML-compatible mapping language designed for metadata
abstraction, collapsing intermediate nodes and in-memory storage.
Next, in Section 4.3, we explain how S2SML mappings can be used to trans-
late SPARQL to SQL, reusing any existing SPARQL engine. Finally, in Section
4.4, we show how this applies for SPARQL on streams.
4.2 S2SML Mapping
Sparql2Sql Mapping Language (S2SML) mappings serve the dual purpose of pro-
viding bindings from rows and abstracting sensor & observation metadata from
observation data stored as rows. Mappings are pure RDF and compatible with
R2RML (can be translated to and from). Furthermore, S2SML is also designed
to support ‘collapsing’ intermediate nodes of observation metadata through the
use of blank nodes or faux nodes, nodes containing identifiers only created on
projection. Listings 1.2 & 1.3 show a comparison of S2SML and R2RML from
Listing 1.1. R2RML is more verbose and uses the {time} column for IRI tem-
plates, which might not be unique and cannot be ‘collapsed’ (Section 6.2).
Listing 1.2. S2SML
_:b a weather:RainfallObservation ;
  ssw:result _:c .
_:c a ssw:MeasureData ;
  ssw:floatValue
    "4UT01.Rainfall"^^<s2s:literalMap> .
Listing 1.3. R2RML
:t1 a rr:TriplesMap ; rr:logicalTable :4UT01 ;
  rr:subjectMap [ rr:template "http://...o/{time}" ;
    rr:class weather:RainfallObservation ] ;
  rr:predicateObjectMap [ rr:predicate ssw:result ;
    rr:objectMap [ rr:parentTriplesMap :t2 ] ] .
:t2 a rr:TriplesMap ; rr:logicalTable :4UT01 ;
  rr:subjectMap [ rr:template "http://..m/{time}" ;
    rr:class ssw:MeasureData ] ;
  rr:predicateObjectMap [ rr:predicate ssw:floatValue ;
    rr:objectMap [ rr:column "Rainfall" ] ] .
To define S2SML, we adopt the notation introduced by Chebotko et al. [7],
where I, B, L denote pairwise disjoint infinite sets of IRIs, blank nodes and lit-
erals, while Imap, Lmap, F are IRI Map, Literal Map and Faux Node respectively.
Table 2. Examples of elements in (s, p, o) sets
Symbol  Name         Example
Imap    IRI Map      <{sensors.sensorName}>
B       Blank Node   _:bNodeId
L       Literal      "-111.88222"^^<xsd:float>
Lmap    Literal Map  "readings.temperature"^^<s2s:literalMap>
F       Faux Node    <{readings.uuid}>
Examples can be found in Table 2. Combinations of these terms (e.g. ImapIBF)
denote the union of their component sets (e.g. Imap ∪ I ∪ B ∪ F).
Definition 1 (S2SML Mapping, m). Given the set of all possible S2SML map-
pings, M, an S2SML mapping, m ∈ M, is a set of triple tuples, (s, p, o) ∈
(Imap ∪ I ∪ B ∪ F) × I × (Imap ∪ I ∪ B ∪ Lmap ∪ L ∪ F), where s, p and o
are subject, predicate and object respectively.
As shown in Table 2, Imap are IRI templates that consist of the union of
IRI string parts (e.g. <sen:system_>) and reference bindings to
table columns (e.g. {tableName.colName}). Lmap are RDF literals whose value
contains reference bindings to table columns (e.g. "tableName.colName") with
a datatype of <s2s:literalMap>.
Definition 2 (Faux Node, F). F is defined as an IRI template that consists
of the union of a set of IRI string parts, Ip, and a set of placeholders, Uid,
referencing a table, so that F = Ip ∪ Uid and |Uid| ≥ 1, |Ip| ≥ 1.
The example F in Table 2 shows how a placeholder is defined in the format
of {tableName.uuid}, with the keyword ‘.uuid’ identifying this as a Faux Node.
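A Faux Node can be recognised syntactically; a minimal sketch assuming the ‘.uuid’ placeholder convention above (the regex and function names are ours, not part of the S2SML specification):

```python
# Sketch of Definition 2: a Faux Node is an IRI template whose placeholders
# all use the '.uuid' keyword, e.g. <{readings.uuid}>. Identifiers for such
# nodes are only generated if the node is actually projected in a query.
import re

PLACEHOLDER = re.compile(r"\{([^{}]+)\}")

def is_faux_node(iri_template: str) -> bool:
    placeholders = PLACEHOLDER.findall(iri_template)
    return len(placeholders) >= 1 and all(
        p.endswith(".uuid") for p in placeholders)

print(is_faux_node("<{readings.uuid}>"))       # True
print(is_faux_node("<{sensors.sensorName}>"))  # False (ordinary IRI Map)
```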
Listing 1.2 shows an S2SML mapping of an LSD weather station 4UT01 in
Salt Lake City. Observation data is referenced from table columns with Literal
Maps, Lmap (e.g. "4UT01.Rainfall"). Observation metadata which serves to con-
nect nodes (e.g. _:c) is ‘collapsed’ through the use of blank nodes, B, which in
R2RML (Listing 1.3) is mapped to {time} columns. The R2RML specification
does support blank nodes but none of the other engines support their use yet.
Faux nodes in S2SML are used if there is a possibility that the identifier/interme-
diate node will be projected in queries (described in Section 4.3). Finally, device
metadata also contains constant Literals, L (e.g. the latitude of the sensor).
Mapping Closures IoT devices might also have multiple sensors, each produc-
ing a time-series with a corresponding S2SML mapping. In Fig. 1, there might
be multiple observation mappings, each in a different readings table, and a single
sensors mapping and sensors table, all forming a mapping closure.
Definition 3 (Mapping Closure, Mc). Given the set of all mappings on a
device, Md = {md | md ∈ M}, where M is the set of all possible S2SML mappings,
a mapping closure is the union of all elements in Md, so Mc = ⋃m∈Md m.
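Definition 3 amounts to a plain set union; a minimal sketch with tuples standing in for RDF triples (the mapping contents below are illustrative, not taken from an actual deployment):

```python
# Sketch of Definition 3: a mapping closure is the union of all S2SML
# mappings (sets of triples) on a device. Triple terms are plain strings
# here; names are illustrative.

sensors_mapping = {
    ("<sen:system_{sensors.sensorName}>", "wgs:lat", '"sensors.lat"'),
}
observations_mapping = {
    ("<sen:system_{readings.sensor}>", "ssw:result", "_:data"),
    ("_:data", "ssw:floatValue", '"readings.temperature"'),
}

def mapping_closure(*mappings):
    closure = set()
    for m in mappings:
        closure |= m  # set union over all mappings on the device
    return closure

Mc = mapping_closure(sensors_mapping, observations_mapping)
print(len(Mc))  # 3
```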
Implicit Join Conditions Observation data that is represented across mul-
tiple tables within a mapping closure might need to be joined if matched by a
SPARQL query. In R2RML, one or more join conditions (rr:joinCondition) may
be specified between triple maps of different logical tables.
In S2SML, these join conditions are automatically discovered as they are
implicit within mapping closures from IRI template matching involving two or
more tables. We define IRI template matching as follows.
Definition 4 (IRI Template Matching). Let Ip be the set of IRI string parts
in an element of Imap. Imap1 and Imap2 are matching if ⋃i1∈Ip1 i1 = ⋃i2∈Ip2 i2
and ∀i1 ∈ Ip1, i2 ∈ Ip2 : pos(i1) = pos(i2), where pos(x) is a function that
returns the position of x within its Imap.
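A sketch of Definition 4 and the implicit-join inference it enables, assuming IRI templates written with {table.column} bindings; the helper names are ours, not sparql2sql's API:

```python
# Two IRI templates match if, after removing the column bindings, their
# fixed string parts are identical part-for-part (same strings, same
# positions). A match across two tables implies a join condition
# between the bound columns.
import re

BINDING = re.compile(r"\{([^{}]+)\}")

def parts_and_bindings(template: str):
    # re.split with a capturing group interleaves parts and bindings;
    # [::2] keeps the fixed IRI string parts only
    return tuple(BINDING.split(template)[::2]), BINDING.findall(template)

def infer_join(t1: str, t2: str):
    """Return inferred join conditions [(col1, col2), ...] or None."""
    p1, b1 = parts_and_bindings(t1)
    p2, b2 = parts_and_bindings(t2)
    if p1 == p2 and b1 and b2:
        return list(zip(b1, b2))
    return None

print(infer_join("<sen:system_{sensors.sensorName}>",
                 "<sen:system_{readings.sensor}>"))
# [('sensors.sensorName', 'readings.sensor')]
```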
Fig. 1. Graph representation of an Implicit Join within a Mapping Closure
Given matching Imap, join conditions can be inferred. Fig. 1 shows a mapping
closure consisting of a sensor and an observation mapping. An IRI map in each of the
mappings, sen:system{sensors.sensorName} in Imap1 and sen:system{readings.sensor}
in Imap2, fulfils the template matching. A join condition is inferred between the
columns sensors.sensorName and readings.sensor as a result.
Compatibility with R2RML S2SML is compatible with R2RML as they can
be mutually translated without losing expressiveness. Triple Maps are translated
to triples based on the elements in Table 2. Table 3 defines additional R2RML
predicates and the corresponding S2SML constructs. rr:inverseExpression, for
example, is encoded within a literal, Liv, with a datatype of <s2s:inverse>,
and the rr:column is denoted with double braces {{COL2}}. rr:sqlQuery is en-
coded by generating a context/named graph to group triples produced from that
Triples Map, and the query is stored in a literal object with the context as the subject
Table 3. Other R2RML predicates and the corresponding S2SML construct
R2RML predicate       S2SML example
rr:language           "literal"@en
rr:datatype           "literal"^^<xsd:float>
rr:inverseExpression  "{COL1} = SUBSTRING({{COL2}}, 3)"^^<s2s:inverse>
rr:class              ?s a <ont:class> .
rr:sqlQuery           <context1> {<sen:sys_{table.col}> ?p ?o.}
                      <context1> s2s:sqlQuery "query".
and <s2s:sqlQuery> as predicate. Faux nodes are translated as IRI templates.
A specification of S2SML is available on the sparql2sql wiki on Github.
4.3 Translation
Building a Mapping Closure Following from Definition 3 of a Mapping
Closure, Mc, a translation engine needs to perform ⋃m∈Md m, a union of all
mappings on a device, Md. To support template matching with any in-memory
RDF store and SPARQL engine, as described in Definition 4, we replace all Imap
within each mapping m with Ip, the union of IRI string parts, and extract C, the
set of table column binding strings. C is then stored within a map, mjoin, with Ip
as key and C as value. For example, in Fig. 1, <sen:system_> will replace both
<sen:system_{sensors.sensorName}> and <sen:system_{readings.sensor}>, while mjoin
will store (<sen:system_>, {sensors.sensorName, readings.sensor}).
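The rewriting step above might be sketched as follows: the reduced IRI (its string parts only) becomes the key of mjoin and the extracted column bindings its value. Identifiers are illustrative, not sparql2sql's internals.

```python
# Every IRI template in the closure is reduced to its fixed string parts
# (so a stock SPARQL engine can match it as an ordinary IRI), while the
# extracted column bindings are kept in mjoin keyed by the reduced IRI.
import re
from collections import defaultdict

BINDING = re.compile(r"\{([^{}]+)\}")

def reduce_templates(templates):
    mjoin = defaultdict(set)
    for t in templates:
        reduced = BINDING.sub("", t)               # Ip: IRI string parts only
        mjoin[reduced] |= set(BINDING.findall(t))  # C: column bindings
    return dict(mjoin)

mjoin = reduce_templates(["<sen:system_{sensors.sensorName}>",
                          "<sen:system_{readings.sensor}>"])
print(mjoin)  # {'<sen:system_>': {...both column bindings...}}
```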
SPARQL Algebra and BGP Resolution A SPARQL query, sparql, can
be translated by the function trans(Mc, sparql). The first step within trans is
algebra(sparql) → ∝, where ∝ is a SPARQL algebra expression. For example,
SRBench [17] query 6, which looks for weather stations that have observed
low visibility within an hour of time by projecting stations that have either (by
union) low visibility, high rainfall or snowfall observations, has an algebra
expression ∝ combining BGPvisibility, BGPrain and BGPsnow under union,
filter and project operators.
Basic graph patterns (BGPs) are sets of triple patterns within the query.
trans walks through ∝ from the leaf nodes, executing the function σ(Mc, BGP)
on each BGP. As the Mc is pure RDF and represents the graph as it is, it can
be loaded into an RDF store, ideally in-memory. A SPARQL select * query
containing the BGP within its where clause can then be executed on the Mc
within the store. Literal datatypes are removed from the query and stored in
a map. In the above example, BGPsnow and BGPvisibility return no results for
4UT01 (Listing 1.1) but BGPrain returns a result from σ. Each result from σ is a
map of (vk, vv) ∈ V × (Imap ∪ I ∪ B ∪ Lmap ∪ L ∪ F), where V is a variable in a
triple pattern. The (vk, vv) maps are passed to the operator, op, above in ∝.
Eventually, an SQL union is performed at the project operator π for all |σ| > 1.
We have implemented a pluggable BGP resolution interface to show that various
in-memory RDF stores can be supported, with Jena and Sesame as reference examples.
Table 4. Operators op and corresponding SQL Clauses
op           SQL Clause             Remarks
Project π    Select, From           Restricts relation to subset using (vk, vv)
Extend ρ     Select                 Renames an attribute in (vk, vv)
Filter ς     Where, Having, From    Restriction translated using (vk, vv)
Union ∪      From                   Add unrestricted select of SQLi in FROM
Group γ      Group By, From         Aggregation translated using (vk, vv)
Slice ςS     Limit                  Add a LIMIT clause
Distinct ςD  Select                 Add a DISTINCT to SELECT clause
Left Join ⋈  Left Join..On, Select  If I, add to (vk, vv); else LEFT JOIN
Syntax Translation trans continues its walk from the BGP leaf nodes through
∝ to the root. At each node, op, a syntax translation syn(SQLi, (vk, vv)i, op) →
(SQLo, (vk, vv)o) is performed, producing an updated SQL query, SQLo. In the
example, at the Filter>30,time op, SQLi, which consists of a blank SQL where
clause, is updated using (vk, vv)i to translate restrictions on ?time and ?value
to those with bindings 4UT01.time<...T17:00:00 and 4UT01.Rainfall>30. The
SQL from clause is also updated with the table 4UT01. An unchanged (vk, vv)o
and the updated SQLo are output from syn and passed upwards.
Table 4 shows a list of common operators op and their corresponding SQL
clauses and syn descriptions. If an operator uses (vk, vv) for mapping a V and
retrieves an Imap, Lmap or F, it adds the table binding to the FROM clause. If
there are tables in the FROM without join conditions, a cartesian product (cross
join) of the tables is taken. Finally, if faux nodes, F, are encountered in π, an
SQL update (UPDATE table SET col=RANDOM_UUID()) is run to generate
identifiers, and vv in (vk, vv) is updated from {table.uuid} to {table.col}.
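A toy illustration of the syn step at a Filter op, assuming the (vk, vv) bindings from BGP resolution are available as a dictionary; the data structures, names and the date literal are ours, not sparql2sql's API:

```python
# Variable restrictions from a SPARQL filter are rewritten into a WHERE
# clause using the (vk, vv) bindings obtained from BGP resolution, and the
# tables of bound columns are added to FROM. A real implementation walks
# the whole algebra tree, threading the SQL state upwards.

def syn_filter(sql, bindings, restrictions):
    for var, cmp_op, const in restrictions:
        col = bindings[var]                 # e.g. ?value -> 4UT01.Rainfall
        sql["where"].append(f"{col}{cmp_op}{const}")
        sql["from"].add(col.split(".")[0])  # table of the bound column
    return sql

sql = {"from": set(), "where": []}
bindings = {"?value": "4UT01.Rainfall", "?time": "4UT01.time"}
syn_filter(sql, bindings,
           [("?value", ">", "30"),
            ("?time", "<", "'2004-08-08T17:00:00'")])  # hypothetical date
print(sql["from"], sql["where"])
```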
4.4 Streaming
The mapping and translation design can be used to translate SPARQL to Event
Processing Language (EPL) for streams. Listing 1.4 shows the additional syntax
in the SPARQL from clause specified in Extended Backus Naur Form.
Listing 1.4. SPARQL FROM Clause Definition for sparql2stream
FromClause = FROM NAMED STREAM <StreamIRI> [ RANGE Time TimeUnit WindowType ]
TimeUnit   = ms | s | m | h | d
WindowType = TUMBLING | STEP
A TUMBLING window is a pull-based buffer that re-evaluates at the specified
time interval, while the STEP window is a push-based sliding window extending
for the specified time interval into the past. The syn function is modified to
support EPL as an SQL dialect. Streaming for the IoT is useful for 1) scenarios
with high sampling rates (e.g. accelerometers) or insertion rates (e.g. many sensors
connected to a device/hub) and 2) applications that perform real-time analytics
requiring push-based results from queries rather than results at pull intervals.
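A hedged sketch of how the extended FROM clause could map onto an Esper EPL data window, assuming Esper's win:time_batch (batch/tumbling) and win:time (sliding) views; the stream and property names are illustrative, and this is not sparql2stream's actual generator:

```python
# Map the RANGE window of Listing 1.4 onto an Esper data-window view:
# TUMBLING -> win:time_batch (re-evaluates per interval),
# STEP     -> win:time (sliding window over the past interval).

def window_clause(stream: str, amount: int, unit: str, wtype: str) -> str:
    units = {"ms": "milliseconds", "s": "seconds", "m": "minutes",
             "h": "hours", "d": "days"}
    view = "win:time_batch" if wtype == "TUMBLING" else "win:time"
    return f"{stream}.{view}({amount} {units[unit]})"

# e.g. a SPARQL query with [RANGE 1 h TUMBLING] over a stream Obs4UT01
epl = f"select avg(Rainfall) from {window_clause('Obs4UT01', 1, 'h', 'TUMBLING')}"
print(epl)  # select avg(Rainfall) from Obs4UT01.win:time_batch(1 hours)
```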
5 Experiment
To evaluate our approach against RDF stores, SPARQL-to-SQL engines and
streaming engines in a WoT context, we selected two unique IoT scenarios
using published datasets. Code and experiments can be found on Github.
Distributed Meteorological System The first scenario uses Linked Sensor
Data with sensor metadata and observation data from about 20,000 weather sta-
tions across the United States. In particular, we used the period of the Nevada
Blizzard (100k triples) for storage and performance tests and the largest Hurri-
cane Ike period (300k triples) for storage tests. SRBench [17] is an accompanying
analytics benchmark for streaming SPARQL queries but can be applied, with
similar effect, to SPARQL queries constrained by time. Queries 1 to 10 were
used as they involve time-series sensor data, while the remaining queries involve
integration or federation with DBpedia or Geonames, which was not within the
scope of the experiment. Queries are available on Github. The experiment sim-
ulates a distributed setup as each station’s data is stored on an IoT device as
RDF or rows with S2SML or R2RML mappings. Queries are broadcast to all
stations; total query time was the maximum station time, as the slowest station
was the limiting factor. Due to resource constraints, we assumed broadcast and
individual connection times to be similar over a gigabit switch; hence, distributed
tests for the 4700+ stations were run in series, recording individual times, averaging
over 3 runs and taking the maximum amongst stations for each query.
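The timing methodology above can be sketched as follows (the station ids and numbers are made up, not measured results):

```python
# Per-station times are averaged over the runs, and the distributed query
# time is the maximum of these averages, since the slowest station limits
# a broadcast query.

def distributed_query_time(runs_per_station: dict) -> float:
    """runs_per_station maps station id -> list of run times (s)."""
    averages = {s: sum(ts) / len(ts) for s, ts in runs_per_station.items()}
    return max(averages.values())

times = {"4UT01": [0.12, 0.10, 0.11], "BLSC2": [0.35, 0.33, 0.37]}
print(round(distributed_query_time(times), 2))  # 0.35
```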
Smart Home Analytics Benchmark This scenario uses smart home IoT
data collected by Barker et al. [2] over 3 months in 2012. 4 queries requiring
space-time aggregations with a variety of data for descriptive and diagnostic an-
alytics were devised: 1) hourly aggregation of temperature, 2) daily aggregation
of temperature, 3) hourly and room-based aggregation of energy usage, and 4) di-
agnosis of unattended devices through energy usage and motion, aggregating by
hour and room. Time taken for queries was averaged over 3 runs.
Environment and Stores The IoT devices used were Raspberry Pi 2 Model
B+ boards with 1GB RAM, a 900MHz quad-core ARM Cortex-A7 CPU and Class 10 SD
cards, as they are widely available and relatively powerful. 512MB was assigned
to the Java Virtual Machine on Raspbian 4.1. Ethernet connections were used
between the querying client (i5 3.2GHz, 8GB RAM, hybrid drive) and the Pis.
RDF stores compared were TDB (open source) and GraphDB (commer-
cial). Virtuoso 7 was not supported on the 32-bit Raspbian and Virtuoso 6
did not support SPARQL 1.1 time functions like hours. H2 (disk mode) was
used as the relational store for all SPARQL-to-SQL tests. ontop and morph
were tested within the limits of query compatibility, and a quantitative evalua-
tion of SQL queries and translation time was done. The native SPARQL streaming
engine CQELS was compared for push-based performance. As CQELS has already
been benchmarked against C-SPARQL, and push results for real-time analytics
helped differentiate streams, we did not compare against C-SPARQL.
6 Results & Discussion
6.1 Storage Efficiency
Table 5 shows the store sizes of different datasets for the H2, TDB and GraphDB
setups. As time-series sensor data benefits from succinct storage as rows, H2
outperformed the RDF stores, which also suffered from greater overheads for
multiple stores and indexing [16], by about one to three orders of magnitude.
Table 5. Store Size By Dataset (in MB)
Dataset #Store(s) H2 TDB GraphDB Ratio
Nevada Blizzard 4701 90 6162 121694 1:68:1352
Hurricane Ike 12381 761 85274 345004 1:112:453
Smart Home 1 135 2103 1221 1:15:9
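The Ratio column normalises store sizes to H2; a quick check with truncating integer division (which the table's figures appear to use) reproduces them:

```python
# Reproduce the Ratio column of Table 5 from the store sizes in MB,
# normalising to H2 and truncating to whole numbers.

def ratios(h2, tdb, graphdb):
    return 1, int(tdb / h2), int(graphdb / h2)

print(ratios(90, 6162, 121694))    # (1, 68, 1352)  Nevada Blizzard
print(ratios(761, 85274, 345004))  # (1, 112, 453)  Hurricane Ike
print(ratios(135, 2103, 1221))     # (1, 15, 9)     Smart Home
```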
6.2 Query Performance
Fig. 2 shows the performance of SRBench queries on the various stores with
the Nevada Blizzard dataset. We see that our sparql2sql approach consistently
performs better on all queries, with stable average execution times. We argue
that this was the result of the SQL queries produced not having joins, as each
station was a single (wide) time-series and intermediate nodes were not projected
(could be ‘collapsed’). GraphDB generally performed better than the TDB store,
especially on query 9, due to TDB doing a time-consuming join operation in
the low-resource environment between two subtrees, WindSpeedObservation and
WindDirectionObservation. If queries were executed to retrieve subgraphs indi-
vidually with TDB, each query cost 100 times less. Query 4 was similar but
with TemperatureObservation and WindSpeedObservation subgraphs instead.
Fig. 2. Max Time Taken (s) for Distributed SRBench Queries 1-10 (sparql2sql, TDB, GraphDB, Ontop, Morph)
Both ontop (v1.6.1) and morph (v3.5.16), at the time of writing, will only
support the aggregation operators required for queries 3 to 9 (except query 6) in
future versions. morph was also unable to translate queries 6 and 10 as yet,
while ontop’s SQL query 10 did not return from the H2 store on some stations
(e.g. BLSC2). ontop performs better than the RDF stores on queries 2 and 6.
Although queries 1 and 2 are similar in purpose, query 2 has an OPTIONAL on
the unit of measure term; hence, as shown in Table 6, ontop generates differently
structured queries, explaining the discrepancy in time taken.
We did an additional comparison between SPARQL-to-SQL engines in terms
of the structure of queries generated and translation time. Table 6 shows the
average translation time, ttrans, of the 3 engines on the client. The pluggable BGP
resolution engine used for sparql2sql (s2s) was Jena. Both ontop and morph have
additional inference/reasoning features, and ontop makes an extra round trip to
the Pi to obtain database metadata, explaining the longer translation times.
In R2RML, as shown in Listing 1.3, in the absence of row identifiers in time-series data, time has to be used in IRI templates for intermediate observation metadata nodes. As timestamps are not unique in LSD (observed from the data), they are not suited as a primary key and hence cannot be used to chase equality-generating dependencies in the semantic query optimisation that ontop performs [14]. The resulting queries from ontop and morph both have redundant inner joins on the time column (used to model intermediate IRIs in R2RML).
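The cost of those redundant joins can be sketched as follows (hypothetical wide table, not the LSD schema): templating intermediate IRIs on a non-unique time column forces a self-join on time, which both adds work and multiplies rows where timestamps repeat, whereas a 'collapsed' translation reads each wide row once.

```python
import sqlite3

# Hypothetical wide time-series table; timestamps are not unique.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (time INTEGER, windspeed REAL, winddir REAL)")
con.executemany("INSERT INTO obs VALUES (?, ?, ?)",
                [(1, 4.2, 90.0), (1, 4.5, 95.0), (2, 3.9, 88.0)])

# Naive translation: intermediate IRIs templated on `time` yield a
# self-join on the time column, duplicating rows where time repeats.
naive = con.execute("""
    SELECT a.time, a.windspeed, b.winddir
    FROM obs a JOIN obs b ON a.time = b.time
""").fetchall()

# Collapsed translation: both properties come from the same wide row.
collapsed = con.execute(
    "SELECT time, windspeed, winddir FROM obs").fetchall()

print(len(naive), len(collapsed))  # 5 3
```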
SPARQL-to-SQL on IoT Databases and Streams 13
Table 6. SPARQL-to-SQL Translation Time and Query Structure

      t_trans (ms)       Joins              Join Type & Structure
Q     s2s  ontop  morph  s2s  ontop  morph  ontop (qview)           morph
1     16   702    146    0    6      4      implicit                4 inner
2     17   703    144    0    6      4      5 nested, 1 left outer  4 inner
6     19   703    -      0    5      -      5 implicit              -
10    32   846    -      0    6      -      UNION(2x3 implicit)     -
In the smarthome scenario, sparql2sql query performance on aggregation queries, as shown in Fig. 3, is still ahead of the RDF stores. GraphDB also has all-round better performance than TDB. All the queries performed SPARQL 1.1 space-time aggregations, which excluded the other SPARQL-to-SQL engines from this comparison.
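As a sketch of what such an aggregation translation looks like over row storage (hypothetical meter table and bucket size, not the benchmark schema), a SPARQL 1.1 GROUP BY/AVG over time buckets maps to a single SQL GROUP BY:

```python
import sqlite3

# Hypothetical smart-home meter readings (timestamp in seconds, power in W).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE meter (ts INTEGER, watts REAL)")
con.executemany("INSERT INTO meter VALUES (?, ?)",
                [(10, 100.0), (1800, 300.0), (3700, 50.0), (5400, 150.0)])

# A SPARQL 1.1 aggregation such as
#   SELECT (AVG(?w) AS ?avg) WHERE { ... } GROUP BY ?hour
# can translate to one GROUP BY over an hourly time bucket.
rows = con.execute("""
    SELECT ts / 3600 AS hour, AVG(watts)
    FROM meter GROUP BY hour ORDER BY hour
""").fetchall()
print(rows)  # [(0, 200.0), (1, 100.0)]
```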
Through the experiments, we observe that although other SPARQL-to-SQL engines have reported significant performance improvements over RDF stores on various benchmarks and deployments, there is still room for optimisation for IoT devices and scenarios: they either perform below RDF stores on Pis or do not yet support queries relevant to IoT scenarios such as aggregations. sparql2sql, with S2SML, utilises the strengths of SPARQL-to-SQL in IoT scenarios and on time-series data and performed better than both the RDF stores and the other SPARQL-to-SQL engines. Table 7 summarises the average query times for all the tests.
6.3 Push-based Streaming Query Performance
Table 7 shows the average time taken to evaluate a query from the insertion of an event to the return of a push-based result from sparql2stream, t_s2r, and CQELS, t_CQELS, with 1s delays in between. This was averaged over 100 results.
For sparql2stream, the one-off translation time at the start (ranging from 16ms
to 32ms) was added to the sum during the average calculation. Query 6 of
SRBench was omitted due to EPL and CQELS not supporting the UNION
operator. The sparql2stream engine (using Esper to execute EPL) showed over
two orders of magnitude performance improvements over CQELS. Queries 4,
5 and 9 that involved joining subgraphs (e.g. WindSpeed and WindDirection
in 9) and aggregations showed larger differences. It was noted that although CQELS returned valid results for these queries, they contained an increasing number of duplicates (perhaps from issues in the adaptive implementation) which caused a significant slowdown over time and when averaged over 100 pushes. The
experiments are available on Github14,15 .
This ability to answer queries in sub-millisecond average times in a push-
based fashion makes sparql2stream a viable option for real-time analytics on
IoT devices like medical devices that require reacting instantaneously.
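The push-based evaluation model can be sketched in a few lines (an illustrative toy, not Esper's EPL semantics): each insertion is evaluated immediately, and a result is pushed to a subscriber only when a tumbling window fills.

```python
class TumblingAvg:
    """Toy push-based tumbling-window average (illustrative sketch only;
    Esper/EPL provides far richer window and pattern semantics)."""

    def __init__(self, size, on_result):
        self.size = size            # window length in events
        self.on_result = on_result  # subscriber callback for pushed results
        self.buf = []

    def insert(self, value):
        # Evaluate on every insertion; push only when the window fills,
        # then discard it (the window tumbles rather than slides).
        self.buf.append(value)
        if len(self.buf) == self.size:
            self.on_result(sum(self.buf) / self.size)
            self.buf = []

results = []
win = TumblingAvg(3, results.append)
for v in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]:
    win.insert(v)
print(results)  # [2.0, 5.0]
```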
To verify that sparql2stream was able to answer SRBench queries close to the
rate they are sent, even at high velocity, we reduced the delay between insertions
[Fig. 3. Average Time Taken for Smarthome Analytical Queries — bar chart; x-axis: queries 1 to 4, y-axis: time taken (s)]
Table 7. Average Query Run Times (in ms)

SRBench Q  t_s2s   t_TDB    t_GDB   t_ot  t_morph  Ratio        t_s2r  t_CQELS  Ratio
1          365     1679     1223    4589  1747702  1:5:3:13:4k  0.47   138      1:294
2          415     1651     1627    945   2097159  1:4:4:2:5k   0.46   119      1:261
3          375     1258     2251    -     -        1:3:6        0.66   202      1:306
4          533     47084    3004    -     -        1:88:6       0.67   186k     1:277k
5          415     1119     1404    -     -        1:3:3        0.63   1476k    1:3243k
6          457     2751     2181    987   -        1:6:5:2      -      -        -
7          455     6563     1082    -     -        1:14:2       0.66   2885     1:5245
8          320     1785     1162    -     -        1:6:4        0.67   282      1:426
9          436     1328197  1175    -     -        1:3k:3       0.67   188k     1:280k
10         354     2514     685     -     -        1:7:2        0.73   72       1:98

Smarthome  t_s2s   t_TDB    t_GDB   Ratio    t_s2r  t_CQELS  Ratio
1          466     13709    3132    1:29:7   0.64   125      1:196
2          2457    21898    6914    1:9:3    0.77   129      1:167
3          4685    322357   59803   1:69:13  0.81   -        -
4          147649  527184   147275  1:4:1    3.78   -        -
from 1000ms to 1ms and 0.1ms. Table 8 shows a summary of the average latency (the time from insertion to when query results are returned) of each query (in ms). We observe that the average latency is slightly higher than the inverse of the rate.
rate. The underlying stream engine, Esper, maintains context partition states
consisting of aggregation values, partial pattern matches and data windows. At
high rates, the engine introduces blocking to lock and protect context partition
states. However, Figure 4 shows the effect of this blocking is minimal, as the percentage of high-latency events is less than 0.3% (note that the x-axis spans 99% to 100%) across various rates. This comparison, which groups messages by latency ranges, is also used in the Esper benchmark and by Calbimonte et al. [6].
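The latency-band comparison can be sketched as follows (synthetic latencies; the real measurements come from the benchmark harness): events are bucketed by latency range and the fraction in the high band is reported.

```python
# Synthetic latencies in ms: 997 fast events plus 3 slow, blocked ones.
latencies = [0.13] * 997 + [5.0, 8.0, 12.0]

def band_fraction(values, lo, hi=float("inf")):
    """Fraction of events whose latency falls in the band [lo, hi)."""
    return sum(lo <= v < hi for v in values) / len(values)

high = band_fraction(latencies, 1.0)  # events slower than 1 ms
print(f"{high:.1%}")  # 0.3%
```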
We also tested the size of data that can fit in-memory for sparql2stream with SRBench Query 8, which uses a long TUMBLING window. The engine ran out of memory after 33.5 million insertions. Given a ratio of 1 row to 75 triples within
Table 8. Average Latency (in ms) at Different Rates

R    Q1     Q2     Q3     Q4     Q5     Q7     Q8     Q9     Q10
1    1.300  1.374  1.279  1.303  1.256  1.268  1.267  1.295  1.255
10   0.155  0.159  0.143  0.161  0.129  0.137  0.141  0.155  0.129

R = rate (rows/ms), Q = query number
[Fig. 4. Percentage Latency at Various Rates for Q1 — x-axis: percentage of events at each latency band (99% to 100%); series by rate (rows/ms)]
the SSN mapping (each observation type with 10+ triples), by projection, an RDF dataset size of 2.5 billion triples was 'fit' in an IoT device's memory.
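The projection above is simple arithmetic on the stated ratio:

```python
# 33.5 million row insertions, each row mapping to roughly 75 triples
# under the SSN mapping, projects to ~2.5 billion triples.
rows = 33_500_000
triples_per_row = 75
projected = rows * triples_per_row
print(f"{projected / 1e9:.2f} billion triples")  # 2.51 billion triples
```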
Queries 1 & 2 of the smart home scenario also corroborated the two orders of magnitude performance advantage of sparql2stream over CQELS. Queries 3 and 4 were not run on CQELS due to issues with the FILTER operator in the version tested. Query 4, which involved joins on motion and meter streams and an aggregation, saw the average latency of sparql2stream increase, though it still stayed under 4ms. The latency for this query was measured from the insertion time of the last event involved (that trips the push) to that of the push result.
7 Conclusion
A Web of Things based on open standards and the innovations introduced in the Semantic Web and Linked Data can encourage greater interoperability and bridge product silos. This paper shows how time-series Internet of Things data that is flat and wide can be stored efficiently as rows on devices with limited resources. By optimising SPARQL-to-SQL translation and 'collapsing' intermediate nodes, experiments on smart home monitoring and a distributed meteorological system show storage and query performance improvements that range from 2 times to 3 orders of magnitude. The independence from primary keys and database metadata also resulted in fewer joins in the resultant SQL queries and faster query translation times respectively. Future work will expand experimentation to consider additional datasets, data sizes and queries, and include a greater variety of stores and stream processing use cases for time-series data, e.g. column stores, stream analytics and compression/approximation.
The limitations of this approach lie in the assumption that the bulk of IoT time-series data is flat and read-only, which might change in the future. Exploiting the wideness of time-series data for row access performance is also query dependent. Current state-of-the-art Ontology-Based Data Access (OBDA) systems, which do query translation, support general use cases (web/enterprise relational database mapping) and support reasoning, which our approach does not seek to address at the moment.
References

1. Barbieri, D.F., Braga, D., Ceri, S., Valle, E.D., Grossniklaus, M.: Querying RDF streams with C-SPARQL. ACM SIGMOD Record 39(1), 20 (2010)
2. Barker, S., Mishra, A., Irwin, D., Cecchet, E.: Smart*: An open data set and tools
for enabling research in sustainable homes. In: Proceedings of the Workshop on
Data Mining Applications in Sustainability (2012)
3. Barnaghi, P., Wang, W.: Semantics for the Internet of Things: early progress and
back to the future. International Journal on Semantic Web and Information Sys-
tems 8(1), 1–21 (2012)
4. Bishop, B., Kiryakov, A., Ognyanoff, D.: OWLIM: A family of scalable semantic
repositories. Semantic Web 2(1), 33–42 (2011)
5. Buil-Aranda, C., Hogan, A.: SPARQL Web-Querying Infrastructure: Ready for
Action? In: Proceedings of the International Semantic Web Conference (2013)
6. Calbimonte, J.P., Jeung, H., Corcho, O., Aberer, K.: Enabling Query Technolo-
gies for the Semantic Sensor Web. International Journal on Semantic Web and
Information Systems 8(1), 43–63 (2012)
7. Chebotko, A., Lu, S., Fotouhi, F.: Semantics preserving SPARQL-to-SQL transla-
tion. Data and Knowledge Engineering 68(10), 973–1000 (2009)
8. Erling, O.: Implementing a SPARQL compliant RDF triple store using a SQL-ORDBMS. Tech. rep., OpenLink Software (2001)
9. International Telecommunication Union: Overview of the Internet of things. Tech.
rep. (2012)
10. Le-Phuoc, D., Dao-Tran, M., Xavier Parreira, J., Hauswirth, M.: A native and
adaptive approach for unified processing of linked streams and linked data. Pro-
ceedings of the International Semantic Web Conference (2011)
11. Neumann, T., Weikum, G.: x-RDF-3X. In: Proceedings of the VLDB Endowment.
vol. 3, pp. 256–263 (2010)
12. Patni, H., Henson, C., Sheth, A.: Linked Sensor Data. In: Proceedings of the In-
ternational Symposium on Collaborative Technologies and Systems (2010)
13. Priyatna, F., Corcho, O., Sequeda, J.: Formalisation and Experiences of R2RML-
based SPARQL to SQL Query Translation using Morph. In: Proceedings of the
23rd International Conference on World Wide Web. pp. 479–489 (2014)
14. Rodriguez-Muro, M., Rezk, M.: Efficient SPARQL-to-SQL with R2RML mappings.
Web Semantics: Science, Services and Agents on the WWW 33, 141–169 (2014)
15. Stonebraker, M., Abadi, D., Batkin, A.: C-store: a column-oriented DBMS. Pro-
ceedings of VLDB pp. 553 – 564 (2005)
16. Weiss, C., Karras, P., Bernstein, A.: Hexastore: sextuple indexing for semantic web
data management. In: Proceedings of the VLDB Endowment (2008)
17. Zhang, Y., Duc, P.M., Corcho, O., Calbimonte, J.P.: SRBench: A streaming RDF/SPARQL benchmark. In: Proceedings of the International Semantic Web Conference. Lecture Notes in Computer Science (2012)
... For the LSD Blizzard dataset with 108,830,518 triples and the LSD Hurricane Ike dataset with 534,469,024 triples, only 12.5% is observation data, 0.17% is device metadata, while 87.3% is observa- tion metadata. In the Smart Home Analytics dataset [53] based on a diierent ontology, a similarly large 81.7% of 11,157,281 triples are observation metadata. ...
... As such, publishers of RDF observation metadata ooen generate long 128-bit universally unique identiiers (UUIDs) to serve as observation, time and data identiiers. In the 17 queries proposed for the streaming RDF/SPARQL benchmark, SRBench [61], and the 4 queries in the Smart Home Analytics Benchmark [53], none of the queries project any these identiiers from observation metadata. ...
... Hence, although both models are rich and promote interoperability, they also repetitively encode sensor and observation metadata which deviates from the eecient time-series storage structures we benchmarked in Section 4. erefore, we present a novel abstraction of a query translation algorithm titled map-match-operate that allows us to query rich data models while preserving the eecient underlying time-series storage that exploits the characteristics of IoT data. We use examples of RDF graphs (as a rich data model) and corresponding SPARQL [22] queries building on previous SPARQL-to-SQL work [53]. e abstraction can also be applied on other graph models or tree-based models like JSON documents with JSON Schema, which are restricted forms of a graph, but is not the focus of the paper. ...
Full-text available
The efficient management of data is an important prerequisite for realising the potential of the Internet of Things (IoT). Two issues given the large volume of structured time-series IoT data are, addressing the difficulties of data integration between heterogeneous Things and improving ingestion and query performance across databases on both resource-constrained Things and in the cloud. In this paper, we examine the structure of public IoT data and discover that the majority exhibit unique flat, wide and numerical characteristics with a mix of evenly and unevenly-spaced time-series. We investigate the advances in time-series databases for telemetry data and combine these findings with microbenchmarks to determine the best compression techniques and storage data structures to inform the design of a novel solution optimised for IoT data. A query translation method with low overhead even on resource-constrained Things allows us to utilise rich data models like the Resource Description Framework (RDF) for interoperability and data integration on top of the optimised storage. Our solution, TritanDB, shows an order of magnitude performance improvement across both Things and cloud hardware on many state-of-the-art databases within IoT scenarios. Finally, we describe how TritanDB supports various analyses of IoT time-series data like forecasting.
... [43] proposes a streaming Ontology-Based Data Access (OBDA) layer to access data in a mixed approach, where stream semantic queries are processed through R2RML mappings to be transformed into queries over federated sensor networks, and to transform the response in order to answer the original SPARQL stream query. [48] proposes a similar approach, with transformation from SPARQL to SQL via RML, with and without the streaming extension to SPARQL. In other papers, where querying is not the core contribution, direct querying in SPARQL is also used after storing RDF content in a knowledge base, as it is proposed in [12] for instance. ...
... [47] [43], [47], [48] M U [11], [6], [52], [63], [51] Upstream Unspecified [50], [38], [49] [38], [54] , [33] U M M L Downstream Unspecified U [31], [20] , [33] [31] ...
... The authors extend the role of the OBDA layer in their system, by adding content aggregation functions to the content enrichment. [48] details why IoT content is adapted to the OBDA approach: the observations produced by the sensors are strongly structured, and storing only one field header for all the observations is more efficient than storing the metadata for each observation. The mapping from the metadata to the schema can be easily stored in memory, and queries on traditional relational databases are more efficient than queries on RDF stores. ...
Full-text available
The Internet of Things (IoT) is a technological topic with a very important societal impact. IoT application domains are various and include: smart cities, precision farming, smart factories, and smart buildings. The diversity of these application domains is the source of the very high technological heterogeneity in the IoT, leading to interoperability issues. The semantic web principles and technologies are more and more adopted as a solution to these interoperability issues, leading to the emergence of a new domain, the Semantic Web Of Things (SWoT). Scientific contributions to the SWoT are many, and the diversity of architectures in which they are expressed complicates comparison. To unify the presented architectures, we propose an architectural pattern, LMU-N. LMU-N provides a reading grid used to classify processes to which the SWoT community contributes, and to describe how the semantic web impacts the IoT. Then, the evolutions of the semantic web to adapt to the IoT constraints are described as well, in order to give a twofold view of the convergence between the IoT and the semantic web toward the SWoT.
... We have previously surveyed about 20,000 unique IoT schema from pub- lic IoT data streams and discovered that a large majority of the sampled de- vices had flat schemata and wide schemata (99.5% and 76.3% respectively) [17]. Flat schemata have no nested layers (table-like rather than tree-like) and wide schemata have more than one property besides the timestamp. ...
... There are a few benchmarks on streaming Linked Data including: SRBench [19], which in our previous work we have evaluated on [17], CityBench [1], which we compare against and LSBench [10] comprising social network stream data. ...
Conference Paper
Fog computing is an emerging technology for the Internet of Things (IoT) that aims to support processing on resource-constrained distributed nodes in between the sensors and actuators on the ground and compute clusters in the cloud. Fog Computing benefits from low latency, location awareness, mobility, wide-spread deployment and geographical distribution at the edge of the network. However, there is a need to investigate, optimise for and measure the performance, scalability and interoperability of resource-constrained Fog nodes running real-time applications and queries on streaming IoT data before we can realise these benefits. With Eywa, a novel Fog Computing infrastructure, we (1) formally define and implement a means of distribution and control of query workload with an inverse publish-subscribe and push mechanism, (2) show how data can be integrated and made interoperable through organising data as Linked Data in the Resource Description Format (RDF), (3) test if we can improve RDF Stream Processing query performance and scalability over state-of-the-art engines with our approach to query translation and distribution for a published IoT benchmark on resource-constrained nodes and (4) position Fog Computing within the Internet of the Future.<br/
... Besides, it allows for the enforcement of access control policies implemented in the RDBMS [4]. Existing methods for ondemand mapping are relational data to RDF Mapping Language (RML) [5] and SPARQL-SQL Mapping Language (S2SML) [6]. These existing methods have some limitations, i.e., a SPARQL query is transformed to Xquery; then, it is converted to SQL. ...
Full-text available
Modern web wants the data to be in Resource Description Framework (RDF) format, a machine-readable form that is easy to share and reuse data without human intervention. However, most of the information is still available in relational form. The existing conventional methods transform the data from RDB to RDF using instance-level mapping, which has not yielded the expected results because of poor mapping. Hence, in this paper, a novel schema-based RDB-RDF mapping method (relational database to Resource Description Framework) is proposed, which is an improvised version for transforming the relational database into the Resource Description Framework. It provides both data materialization and on-demand mapping. RDB-RDF reduces the data retrieval time for nonprimary key search by using schema-level mapping. The resultant mapped RDF graph presents the relational database in a conceptual schema and maintains the instance triples as data graph. This mechanism is known as data materialization, which suits well for the static dataset. To get the data in a dynamic environment, query translation (on-demand mapping) is best instead of whole data conversion. The proposed approach directly converts the SPARQL query into SQL query using the mapping descriptions available in the proposed system. The mapping description is the key component of this proposed system which is responsible for quick data retrieval and query translation. Join expression introduced in the proposed RDB-RDF mapping method efficiently handles all complex operations with primary and foreign keys. Experimental evaluation is done on the graphics designer database. It is observed from the result that the proposed schema-based RDB-RDF mapping method accomplishes more comprehensible mapping than conventional methods by dissolving structural and operational differences.
... Semantic Technologies, an in particular semantic models also known as ontologies [11], [12]offer an excellent modeling solution that proved to be effective in a number of contexts including the Internet of Things [38], [2], [3], [39], industrial diagnostics [25], data integration and access [21], [19], [26]. An ontology is a semantically rich conceptual model of the problem domain that captures the domain in terms of classes and binary properties that relate entities that populate classes and assign data values to these entities. ...
Conference Paper
Digital twins (DTs) are a powerful mechanism for representing complex industrial assets such as oil platforms as digital models. These models can facilitate temporal analyses and computer simulations of assets. In order to enable this, DTs should be able to capture characteristics of an asset as specified by the manufacturer, its state during the run time, as well as how the asset interacts with other assets in a complex system. We argue that semantic technologies and in particular semantic models or ontologies is promising modelling paradigm for DTs. Semantic models allow to capture complex systems in an intuitive fashion, can be written in standardised ontology languages, and come with a wide range of off-the-shelf systems to design, maintain, query, and navigate semantic models. In this work we report our preliminary results on developing a system that would support semantic-based DTs. In particular, we plan to augment the PI System developed by OSIsoft with ontologies and show how the resulting solution can help in simplifying analytical and machine learning routines for DTs.
... where I have quoted from the work of others, the source is always given. parts of this work have been published as: Siow et al. (2016a), Siow et al. (2016b), Siow et al. (2016c) and Siow et al. (2017). Lights in the hallway brighten as a man enters the house. ...
Full-text available
This thesis is concerned with the development of efficient methods for managing contextualised time-series data and event streams produced by the Internet of Things (IoT) so that both historical and real-time information can be utilised to generate value within analytical applications. From a database systems perspective, two conflicting challenges motivate this research, interoperability and performance. IoT applications integrating streams of time-series data from heterogeneous IoT agents require a level of semantic interoperability. This semantic interoperability can be achieved with a common flexible data model that represents both metadata and data. However, applications might also have time constraints or require processing to be performed on large volumes of historical and streaming time-series data, possibly on resource-constrained platforms, without significant delay. Obtaining good performance is complicated by the complexity of the data model. In the first part of the thesis, a graph data model is shown to support the representation of metadata and data that various research and standard bodies are working towards, while the ‘volume’ of IoT data is shown to exhibit flat, wide and numerical characteristics. A three step abstraction is defined to reconcile queries on the graph model with efficient underlying storage by query translation. This storage is iteratively improved to exploit the character of time-series IoT data, achieving orders of magnitude performance improvement over state-of-the-art commercial, open-source and research databases. The second part of the thesis extends this abstraction to efficiently process real-time IoT streams continuously and proposes an infrastructure for fog computing that shows how resource-constrained platforms close to source IoT agents can co-operatively orchestrate stream processing. 
The main contributions of this thesis are therefore, i) a novel interoperable and performant abstraction for querying IoT graph representations, ii) high performance historical, streaming and fog computing time-series database implementations and iii) analytical applications and platforms built on this abstraction that act as practical models for the socio-technical development of the IoT.
Knowledge graphs are crucial assets for tasks like query answering or data integration. These tasks can be viewed as reasoning problems, which in turn require efficient reasoning systems to be implemented. To this end, we present VLog, a rule-based reasoner designed to satisfy the requirements of modern use cases, with a focus on performance and adaptability to different scenarios. We address the former with a novel vertical storage layout, and the latter by abstracting the access to data sources and providing a platform-independent Java API. Features of VLog include fast Datalog materialisation, support for reasoning with existential rules, stratified negation, and data integration from a variety of sources, such as high-performance RDF stores, relational databases, CSV files, OWL ontologies, and remote SPARQL endpoints.
An Industrial Internet of Things (IoT) is a network of intelligent industrial equipment such as trains and power generating turbines that collect and share large amounts of data. These data are either generated by various sensors deployed in the equipment or captures equipment specific information such as configurations, history of use, and manufacturer. Diagnostics of the industrial IoT is critical to minimise the maintenance cost and downtime of its equipment. It is common that industry today employs rule-based diagnostic systems for this purpose. Rules are typically used to process signals from sensors installed in equipment by filtering, aggregating, and combining sequences of time-stamped measurements recorded by the sensors. Such rules are often data-dependent in the sense that they rely on specific characteristics of individual sensors and equipment. This dependence poses significant challenges in rule authoring, reuse, and maintenance by engineers especially when the rules are applied in industrial IoT scenarios. In this work we propose an approach to address these problems by relying on the well-known Ontology-Based Data Access approach: we propose to use ontologies to mediate the sensor signals and the rules. To this end, we propose a semantic rule language, SDRL, where signals are first class citizens. Our language offers a balance of expressive power, usability, and efficiency: it captures most of Siemens data-driven diagnostic rules, significantly simplifies authoring of diagnostic tasks, and allows to efficiently rewrite semantic rules from ontologies to data and execute over data. We implemented our approach in a semantic diagnostic system and evaluated it. For evaluation, we developed a use case of rail systems as well as power generating turbines at Siemens and conducted experiments to demonstrate both usability and efficiency of our solution.
Conference Paper
Full-text available
Fog computing is an emerging technology for the Internet of Things (IoT) that aims to support processing on resource-constrained distributed nodes in between the sensors and actuators on the ground and compute clusters in the cloud. Fog Computing benefits from low latency, location awareness, mobility, wide-spread deployment and geographical distribution at the edge of the network. However, there is a need to investigate, optimise for and measure the performance, scalability and interoperability of resource-constrained Fog nodes running real-time applications and queries on streaming IoT data before we can realise these benefits. With Eywa, a novel Fog Computing infrastructure, we (1) formally define and implement a means of distribution and control of query workload with an inverse publish-subscribe and push mechanism, (2) show how data can be integrated and made interoperable through organising data as Linked Data in the Resource Description Format (RDF), (3) test if we can improve RDF Stream Processing query performance and scalability over state-of-the-art engines with our approach to query translation and distribution for a published IoT benchmark on resource-constrained nodes and (4) position Fog Computing within the Internet of the Future.
Conference Paper
Real-time processing of data streams emanating from sensors is becoming a common task in Internet of Things scenarios. The key implementation goal consists in efficiently handling massive incoming data streams and supporting advanced data analytics services like anomaly detection. In an on-going, industrial project, a 24 / 7 available stream processing engine usually faces dynamically changing data and workload characteristics. These changes impact the engine’s performance and reliability. We propose Strider, a hybrid adaptive distributed RDF Stream Processing engine that optimizes logical query plan according to the state of data streams. Strider has been designed to guarantee important industrial properties such as scalability, high availability, fault tolerance, high throughput and acceptable latency. These guarantees are obtained by designing the engine’s architecture with state-of-the-art Apache components such as Spark and Kafka. We highlight the efficiency (e.g., on a single machine machine, up to 60x gain on throughput compared to state-of-the-art systems, a throughput of 3.1 million triples/second on a 9 machines cluster, a major breakthrough in this system’s category) of Strider on real-world and synthetic data sets.
Full-text available
The goal of the Smart* project is to optimize home energy con-sumption. As part of the project, we have designed and deployed a "live" system that continuously gathers a wide variety of envi-ronmental and operational data in three real homes. In contrast to prior work, our focus has been on sensing depth, i.e., collecting as much data as possible from each home, rather than breadth, i.e., collecting data from as many homes as possible. Our data captures many important aspects of the home environment, including aver-age household electricity usage every second, as well as usage at every circuit and nearly every plug load, electricity generation data from on-site solar panels and wind turbines, outdoor weather data, temperature and humidity data in indoor rooms, and, finally, data for a range of important binary events, e.g., at wall switches, the HVAC system, doors, and from motion sensors. We also have elec-tricity usage data every minute from 400 anonymous homes. This data corpus has served as the foundation for much of our recent research. In this paper, we describe our data sets as well as basic software tools we have developed to facilitate their collection. We are releasing both the data and tools publicly to the research com-munity to foster future research on designing sustainable homes.
Conference Paper
Full-text available
R2RML is used to specify transformations of data available in relational databases into materialised or virtual RDF datasets. SPARQL queries evaluated against virtual datasets are translated into SQL queries according to the R2RML mappings, so that they can be evaluated over the underlying relational database engines. In this paper we describe an extension of a well-known algorithm for SPARQL to SQL translation, originally formalised for RDBMS-backed triple stores, that takes into account R2RML mappings. We present the result of our implementation using queries from a synthetic benchmark and from three real use cases, and show that SPARQL queries can be in general evaluated as fast as the SQL queries that would have been generated by SQL experts if no R2RML mappings had been used.
Full-text available
The Internet of Things IoT has recently received considerable interest from both academia and industry that are working on technologies to develop the future Internet. It is a joint and complex discipline that requires synergetic efforts from several communities such as telecommunication industry, device manufacturers, semantic Web, and informatics and engineering. Much of the IoT initiative is supported by the capabilities of manufacturing low-cost and energy-efficient hardware for devices with communication capacities, the maturity of wireless sensor network technologies, and the interests in integrating the physical and cyber worlds. However, the heterogeneity of the "Things" makes interoperability among them a challenging problem, which prevents generic solutions from being adopted on a global scale. Furthermore, the volume, velocity and volatility of the IoT data impose significant challenges to existing information systems. Semantic technologies based on machine-interpretable representation formalism have shown promise for describing objects, sharing and integrating information, and inferring new knowledge together with other intelligent processing techniques. However, the dynamic and resource-constrained nature of the IoT requires special design considerations to be taken into account to effectively apply the semantic technologies on the real world data. In this article the authors review some of the recent developments on applying the semantic technologies to IoT.
We introduce SRBench, a general-purpose benchmark primarily designed for streaming RDF/SPARQL engines, based entirely on real-world data sets from the Linked Open Data cloud. With the growing problem of too much streaming data but not enough tools to gain knowledge from it, researchers have set out for solutions in which Semantic Web technologies are adapted and extended for publishing, sharing, analysing and understanding streaming data. To help researchers and users compare streaming RDF/SPARQL (strRS) engines in a standardised application scenario, we have designed SRBench, with which one can assess the ability of a strRS engine to cope with a broad range of use cases typically encountered in real-world scenarios. The data sets used in the benchmark have been carefully chosen, such that they represent a realistic and relevant usage of streaming data. The benchmark defines a concise yet comprehensive set of queries that cover the major aspects of strRS processing. Finally, our work is complemented with a functional evaluation of three representative strRS engines: SPARQLStream, C-SPARQL and CQELS. The presented results are meant to give a first baseline and illustrate the state of the art.
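The windowed evaluation at the heart of strRS processing can be illustrated with a toy sliding window over timestamped triples. The engines named above (C-SPARQL, CQELS, SPARQLStream) differ in their precise window semantics; the stream, predicate and window width below are illustrative only.

```python
from collections import deque

# Toy time-based sliding window over a stream of timestamped triples,
# yielding the matching objects for a predicate at each arrival.
def window_matches(stream, predicate, width):
    """For each arriving triple, yield the objects of `predicate`
    observed within the last `width` time units."""
    window = deque()  # holds (timestamp, subject, predicate, object)
    for ts, s, p, o in stream:
        window.append((ts, s, p, o))
        # Evict triples that have fallen out of the window.
        while window and window[0][0] <= ts - width:
            window.popleft()
        yield [obj for _t, _s, pred, obj in window if pred == predicate]

stream = [(1, ":s1", ":temp", 20), (2, ":s2", ":temp", 21),
          (5, ":s3", ":temp", 19)]
print(list(window_matches(stream, ":temp", 3)))
# [[20], [20, 21], [19]]
```

Real engines layer full SPARQL algebra (joins, aggregates) over such windows and must also choose when to re-evaluate; the benchmark's queries exercise exactly those aspects.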
Sensor networks are increasingly being deployed in the environment for many different purposes. The observations that they produce are made available with heterogeneous schemas, vocabularies and data formats, making it difficult to share and reuse this data, for other purposes than those for which they were originally set up. The authors propose an ontology-based approach for providing data access and query capabilities to streaming data sources, allowing users to express their needs at a conceptual level, independent of implementation and language-specific details. In this article, the authors describe the theoretical foundations and technologies that enable exposing semantically enriched sensor metadata, and querying sensor observations through SPARQL extensions, using query rewriting and data translation techniques according to mapping languages, and managing both pull and push delivery modes.
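The pull and push delivery modes mentioned above can be contrasted in a small sketch: a pull client polls the source for its latest observation, while a push client registers a callback that fires on arrival. The class, field and observation names are illustrative, not the authors' API.

```python
# Illustrative sensor source supporting both delivery modes.
class SensorSource:
    def __init__(self):
        self.latest = None
        self.subscribers = []

    def publish(self, observation):
        self.latest = observation
        for callback in self.subscribers:   # push: deliver immediately
            callback(observation)

    def pull(self):                         # pull: client asks on demand
        return self.latest

source = SensorSource()
pushed = []
source.subscribers.append(pushed.append)    # subscribe a push consumer
source.publish({"sensor": "s1", "value": 20.5})
print(source.pull())                        # pull consumer sees the same value
print(pushed)                               # push consumer was notified once
```

In an ontology-based setting, the query rewriter sits in front of such a source: a pull query translates to an on-demand lookup, while a push (continuous) query keeps the subscription alive.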
Despite the intense interest towards realizing the Semantic Web vision, most existing RDF data management schemes are constrained in terms of efficiency and scalability. Still, the growing popularity of the RDF format arguably calls for an effort to offset these drawbacks. Viewed from a relational-database perspective, these constraints are derived from the very nature of the RDF data model, which is based on a triple format. Recent research has attempted to address these constraints using a vertical-partitioning approach, in which separate two-column tables are constructed for each property. However, as we show, this approach suffers from similar scalability drawbacks on queries that are not bound by RDF property value. In this paper, we propose an RDF storage scheme that uses the triple nature of RDF as an asset. This scheme enhances the vertical partitioning idea and takes it to its logical conclusion. RDF data is indexed in six possible ways, one for each possible ordering of the three RDF elements. Each instance of an RDF element is associated with two vectors; each such vector gathers elements of one of the other types, along with lists of the third-type resources attached to each vector element. Hence, a sextuple-indexing scheme emerges. This format allows for quick and scalable general-purpose query processing; it confers significant advantages (up to five orders of magnitude) compared to previous approaches for RDF data management, at the price of a worst-case five-fold increase in index space. We experimentally document the advantages of our approach on real-world and synthetic data sets with practical queries.
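The sextuple-indexing scheme can be sketched directly: one nested index per ordering of (subject, predicate, object), so any triple pattern with one or two bound components resolves by direct lookup. The triples below are illustrative; a real store would dictionary-encode terms and use sorted vectors rather than hash sets.

```python
from collections import defaultdict

# One index per ordering of the three RDF elements.
ORDERS = ["spo", "sop", "pso", "pos", "osp", "ops"]

def build_indexes(triples):
    indexes = {order: defaultdict(lambda: defaultdict(set))
               for order in ORDERS}
    for s, p, o in triples:
        term = {"s": s, "p": p, "o": o}
        for order in ORDERS:
            a, b, c = (term[k] for k in order)
            indexes[order][a][b].add(c)   # two-level index, set of thirds
    return indexes

triples = [(":s1", ":temp", "20"), (":s1", ":unit", "C"),
           (":s2", ":temp", "21")]
idx = build_indexes(triples)
# Which subjects have :temp = 21?  Answered by the "pos" index.
print(idx["pos"][":temp"]["21"])
# {':s2'}
```

The six-fold replication is the worst-case five-fold space overhead the abstract mentions, traded for a direct index for every access pattern.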
Existing SPARQL-to-SQL translation techniques have limitations that reduce their robustness, efficiency and dependability. These limitations include the generation of inefficient or even incorrect SQL queries, lack of formal background, and poor implementations. Moreover, some of these techniques cannot be used over arbitrary DB schemas due to the lack of support for RDB to RDF mapping languages, such as R2RML. In this paper we present a technique (implemented in the -ontop- system) that tackles all these issues. We propose a formal approach for SPARQL-to-SQL translation that (i) generates efficient SQL by combining optimization techniques from the logic programming and SQL optimization fields; (ii) provides a well-defined specification of the SPARQL semantics used in the translation; and (iii) supports R2RML mappings over general relational schemas. We provide extensive benchmarks using the -ontop- system for Ontology Based Data Access (OBDA) and show that by using these techniques -ontop- is able to outperform well known SPARQL-to-SQL systems, as well as commercial triple stores, by several orders of magnitude.
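One family of optimisations such translators rely on, self-join elimination, can be shown in miniature: naive translation produces one table scan per triple pattern joined on a shared key, and patterns mapped to the same table and key collapse into a single scan. The scan representation here is an illustrative simplification, not -ontop-'s internal form.

```python
# Illustrative self-join elimination over a list of
# (table, join_key, selected_column) scans produced by naive translation.
def eliminate_self_joins(scans):
    """Merge scans of the same table that share a join key into one
    scan selecting all of their columns."""
    merged = {}
    for table, key, column in scans:
        merged.setdefault((table, key), []).append(column)
    return [(table, key, cols) for (table, key), cols in merged.items()]

naive = [("readings", "id", "temp"),   # one scan per triple pattern
         ("readings", "id", "ts"),
         ("sensors",  "sid", "label")]
print(eliminate_self_joins(naive))
# [('readings', 'id', ['temp', 'ts']), ('sensors', 'sid', ['label'])]
```

Three scans and two joins become two scans and one join; on wide IoT observation tables this class of rewriting is what lets translated SQL approach hand-written SQL.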
Hundreds of public SPARQL endpoints have been deployed on the Web, forming a novel decentralised infrastructure for querying billions of structured facts from a variety of sources on a plethora of topics. But is this infrastructure mature enough to support applications? For 427 public SPARQL endpoints registered on the DataHub, we conduct various experiments to test their maturity. Regarding discoverability, we find that only one-third of endpoints make descriptive meta-data available, making it difficult to locate or learn about their content and capabilities. Regarding interoperability, we find patchy support for established SPARQL features like ORDER BY as well as (understandably) for new SPARQL 1.1 features. Regarding efficiency, we show that the performance of endpoints for generic queries can vary by up to 3–4 orders of magnitude. Regarding availability, based on a 27-month long monitoring experiment, we show that only 32.2% of public endpoints can be expected to have (monthly) “two-nines” uptimes of 99–100%.