Squerall: Virtual Ontology-Based Access to
Heterogeneous and Large Data Sources
Mohamed Nadjib Mami¹,², Damien Graux²,³, Simon Scerri¹,², Hajira Jabeen¹, Sören Auer⁴, and Jens Lehmann¹,²
¹ Smart Data Analytics (SDA) Group, Bonn University, Germany
² Enterprise Information Systems, Fraunhofer IAIS, Germany
³ ADAPT Centre, Trinity College of Dublin, Ireland
⁴ TIB & Hannover University, Germany
{mami,scerri,jabeen,jens.lehmann}@cs.uni-bonn.de
damien.graux@iais.fraunhofer.de, auer@l3s.de
Abstract. The last two decades witnessed a remarkable evolution in
terms of data formats, modalities, and storage capabilities. Instead of
having to adapt one's application needs to the previously limited storage
options, today there is a wide array of options to choose from to
best meet an application's needs. This has resulted in vast amounts of
data available in a variety of forms and formats which, if interlinked and
jointly queried, can generate valuable knowledge and insights. In this
article, we describe Squerall: a framework that builds on the principles
of Ontology-Based Data Access (OBDA) to enable the querying of dis-
parate heterogeneous sources using a single query language, SPARQL.
In Squerall, original data is queried on-the-fly without prior data materi-
alization or transformation. In particular, Squerall allows the aggregation
and joining of large data in a distributed manner. Squerall supports five
data sources out-of-the-box and, moreover, can be programmatically
extended to cover more sources and incorporate new query engines. The
framework provides user interfaces for the creation of the necessary inputs,
as well as for guiding non-SPARQL experts to write SPARQL queries. Squerall
is integrated into the popular SANSA stack and available as open-source
software via GitHub and as a Docker image.
Software Framework: https://eis-bonn.github.io/Squerall
1 Introduction
For over four decades, relational data management has been the dominant paradigm
for storing and managing structured data. However, the advent of extremely
large-scale applications revealed the weakness of relational data management
at dynamically and horizontally scaling the storage and querying of massive
amounts of data. This prompted a paradigm shift, calling for a new breed of
databases capable of managing large data volumes without jeopardising query
performance, achieved by reducing query expressivity and relaxing consistency
requirements. From 2008 to date, a wide array of so-called non-relational or NoSQL (Not only SQL)
databases emerged (e.g., Cassandra, MongoDB, Couchbase, Neo4j). This het-
erogeneity contributed to one of the main Big Data challenges: variety. The
integration of heterogeneous data is the key rationale for the development of se-
mantic technologies over the past two decades. Local data schemata are mapped
to global ontology terms, using mapping languages that have been standard-
ized for a number of popular data representations, e.g., relational data, JSON,
CSV or XML. Heterogeneous data can then be accessed in a uniform manner
by means of queries in a standardized query language, SPARQL [15], employing
terms from the ontology. Such data access is commonly referred to as Ontology-
Based Data Access (OBDA) [23]. The term Data Lake [11] refers to the schema-
less pool of heterogeneous and large data residing in its original formats on a
horizontally-scalable cluster infrastructure. It comprises databases (e.g., NoSQL
stores) or scale-out file/block storage infrastructure (e.g., Hadoop Distributed
File System), and requires dealing with the original data without prior physical
transformation or pre-processing. After emerging in industry, the concept has
increasingly been discussed in the literature [24,21,32]. The integration of se-
mantic technologies into Data Lakes led to the Semantic Data Lake concept,
briefly introduced in our earlier work [2]. By applying the OBDA paradigm to
the NoSQL and Data Lake technology space, we realize the Semantic Data Lake
concept and present in this article a comprehensive implementation.
Implementing an OBDA architecture atop Big Data raises three challenges:
1. Query translation. SPARQL queries must be translated into the query lan-
guage of each of the respective data sources. A generic and dynamic transla-
tion between data models is challenging (even impossible in some cases e.g.,
join operations are unsupported in Cassandra and MongoDB [20]).
2. Federated Query Execution. In Big Data scenarios it is common to have non-
selective queries with large intermediate results, so joining or aggregation
cannot be performed on a single node, but only distributed across a cluster.
3. Data silos. Data coming from various sources can be connected to generate
new insights, but it may not be readily ‘joinable’ (cf. definition below).
To target the aforementioned challenges we build Squerall [19], an extensible
framework for querying Data Lakes.
– It allows ad hoc querying of large and heterogeneous data sources virtually
without any data transformation or materialization.
– It allows distributed query execution, in particular the joining of disparate
heterogeneous sources.
– It enables users to declare query-time transformations for altering join keys
and thus making data joinable.
– It integrates the state-of-the-art Big Data engines Apache Spark and
Presto with the semantic technologies RML and FnO.
The article is structured as follows. Squerall's architecture is presented in Sec-
tion 2 and its implementation in Section 3. The performance is evaluated in Sec-
tion 4 and its sustainability, availability and extensibility aspects are discussed
in Section 5. Related Work is presented in Section 6 and Section 7 concludes
with an outlook on possible future work.
[Figure 1 depicts the architecture: a SPARQL Query is decomposed by the Query Decomposer (1); the Relevant Entity Extractor (2) uses the Mappings to identify relevant data sources in the Data Lake; the Data Wrapper (3) loads them as ParSets (PS1, PS2, ..., PSn) into the Parallel Operational Area (POA), where unions and transformations (PS1t, PS2t) are applied; the Distributed Query Processor (4) joins the ParSets into the final results ParSet (PSr).]
Fig. 1: Squerall Architecture (Mappings, Query and Config are user inputs).
2 Architecture
Squerall (Semantically query all) is built following the OBDA principles [23]. The
latter were originally devised for accessing relational data but do not impose a
restriction on the type or size of the data they deal with. We project them to large
and heterogeneous data sources contained in a Data Lake.
2.1 Preliminaries
In order to guide the subsequent discussions, we first define the following terms:
Data Attribute represents all concepts used by data sources to characterize a
particular stored datum, e.g., a column in a tabular database like Cassandra, or
a field in a document database like MongoDB.
Data Entity and Relevant Entity: an entity represents all concepts that are
used by data sources to group together similar data, e.g., a table in a tabular
database or a collection in a document database. An entity has one or multi-
ple data attributes. An entity is relevant to a query if it contains information
matching a part of the query (similarly found in federated systems, e.g., [25]).
ParSet and Joinable ParSets: from Parallel dataSet, ParSet refers to a
data structure that is partitioned and distributed, and that is queried in par-
allel. A ParSet is populated on-the-fly and not materialized. Joinable ParSets are
ParSets that store inter-matching values. For example, if the ParSet has a tabular
representation, it has the same meaning as joinable tables in relational algebra,
i.e., tables sharing common attribute values.
Parallel Operational Area (POA) is the parallel distributed environment
where ParSets are loaded, joined and transformed, in response to a query. It has
its own internal data structure, with which ParSets comply.
Data Source refers to any storage medium, e.g., plain file storage or a database.
Data Lake is a repository of multiple data sources where data is stored and
accessed directly in its original form and format, without prior transformation.
[Figure 2: the example BGP below (left of the figure) is decomposed into star-shaped sub-BGPs, one per subject variable (?k, ?a, ?in, ?e, ?c, ?r):
?k bibo:isbn ?i .
?k dc:title ?t .
?k schema:author ?a .
?k bibo:editor ?e .
?a foaf:firstName ?fn .
?a foaf:lastName ?ln .
?a rdf:type nlon:Author .
?a drm:worksFor ?in .
?in rdfs:label ?n1 .
?in rdf:type vivo:Institute .
?e rdfs:label ?n2 .
?e rdf:type saws:Editor .
?c rdf:type swc:Chair .
?c dul:introduces ?r .
?c foaf:firstName ?cfn .
?r schema:reviews ?k .
The stars yield ParSet join pairs, e.g., (PS(k), PS(a)), (PS(k), PS(e)), (PS(a), PS(in)), (PS(c), PS(r)), (PS(r), PS(k)), which are incrementally joined, starting from the first pair, into the results ParSet.]
Fig. 2: ParSets extraction and Join (for clarity ParSet(x) is shortened to PS(x)).
2.2 OBDA Building Blocks
A typical OBDA system is composed of five main components:
Data. A Data Lake is a collection of multiple heterogeneous data sources, be it
raw files (e.g., CSV), structured file formats (e.g., Parquet) or databases (e.g.,
MongoDB). Currently, Squerall does not support querying unstructured data,
although such data can still be part of the Data Lake.
Schema. A Data Lake is by definition a schema-less repository of data. Schemata
exist at the level of the individual data sources.
Ontology. Ontologies are used to define a common domain conceptualization
across Data Lake entities. At least class and property definitions are required.
Mappings. Mappings are association links between elements of the data schema
and ontology terms (i.e., classes and properties). Three mapping elements need
to be provided as input for an entity to be queried:
1. Class mapping: associates an entity to an ontology class.
2. Property mappings: associate entity attributes to ontology properties.
3. Entity ID: specifies an attribute to be used as the identifier of the entity.
For example, Author(AID,first_name,last_name) is an entity in a table of
a Cassandra database. In order to enable finding this entity, the user must
provide the three mapping elements. For example, (1) Class mapping: (Author,
nlon:Author), (2) Property mappings: (first_name, foaf:firstName), (last_name,
foaf:lastName), and (3) Entity ID: AID. firstName and lastName are properties
from the foaf ontology (http://xmlns.com/foaf/spec) and Author is a class
from the nlon ontology (http://lod.nl.go.kr/page/ontology).
Query. The purpose of using a single top query language, SPARQL in our case,
is mainly to join data coming from multiple sources. Therefore, we assume that
certain query forms and constructs are of less concern to Data Lake users, e.g.,
multi-level nested queries, queries with variable properties, CONSTRUCT queries.
2.3 Architecture Components
Squerall consists of four main components (cf. Figure 1). Because of Squerall's
extensible design, and for clarity, we hereafter use the generic ParSet and POA
- Input:  ParSetJoinsArray // An Array of all join pairs [ParSet, ParSet]
- Output: ResultsParSet    // A ParSet joining all ParSets
ResultsParSet = ParSetJoinsArray.head // First join pair
iterate ParSetJoinsArray: current-pair
  if current-pair joinable_with ResultsParSet
    ResultsParSet = ResultsParSet join current-pair
  else add current-pair to PendingJoinsQueue
// Next, iterate through PendingJoinsQueue like ParSetJoinsArray
Listing 1.1: ParSet Join.
concepts instead of Squerall's underlying equivalent concrete terms, which differ
from engine to engine. The latter are presented in Section 3.
(1) Query Decomposer. This component is commonly found in OBDA and
query federation systems (e.g., [12]). It here decomposes the query’s Basic Graph
Pattern (BGP, conjunctive set of triple patterns in the where clause) into a set
of star-shaped sub-BGPs, where each sub-BGP contains all the triple patterns
sharing the same subject variable. We refer to these sub-BGPs as stars for brevity
(see Figure 2 left, where stars are shown in distinct colored boxes). Query decomposi-
tion is subject-based (variable subjects), because the focus of query execution is
on bringing in and joining entities from different sources, not on retrieving a specific
known entity. Retrieving a specific entity, i.e., one whose subject is constant, would
require parsing the full data and creating an index in a pre-processing phase. This
defies the Data Lake definition of accessing original data without a pre-processing
phase. A specific entity can nonetheless be obtained by filtering on its attributes.
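To make the decomposition concrete, the following minimal Scala sketch groups a BGP's triple patterns by subject variable; the TriplePattern class and names are illustrative, not Squerall's actual types.

case class TriplePattern(subject: String, predicate: String, obj: String)

// Each group of triple patterns sharing a subject variable is one star.
def decompose(bgp: Seq[TriplePattern]): Map[String, Seq[TriplePattern]] =
  bgp.groupBy(_.subject)

// Two of the stars of Figure 2:
val stars = decompose(Seq(
  TriplePattern("?k", "schema:author", "?a"),
  TriplePattern("?k", "dc:title", "?t"),
  TriplePattern("?a", "foaf:firstName", "?fn")))
// stars("?k") and stars("?a") each hold one star's triple patterns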
(2) Relevant Entity Extractor. For every extracted star, this component
looks in the Mappings for entities whose attributes are mapped to each of the
properties of the star. Such entities are relevant to the star.
(3) Data Wrapper. In classical OBDA, the SPARQL query has to be translated
to the query language of the relevant data sources. This is in practice hard to
achieve in the highly heterogeneous Data Lake settings. Therefore, numerous
recent publications (e.g., [4,29]) advocated for the use of an intermediate query
language. In our case, the intermediate query language is POA’s query language,
dictated by its internal data structure. The Data Wrapper generates data in
POA’s data structure at query-time, which allows for the parallel execution
of expensive operations, e.g., joins. There must exist wrappers to convert data
entities from the source to POA's data structure, either fully, or partially if parts
of the query can be pushed down to the original source. Each identified star from
step (1) generates exactly one ParSet. If more than one entity is relevant,
the ParSet is formed as a union (see the sketch below). An auxiliary user input,
Config, is used to guide the conversion process, e.g., with authentication or
deployment specifications.
(4) Distributed Query Processor. Finally, ParSets are joined together form-
ing the final results. ParSets in the POA can undergo any query operation, e.g.,
selection, aggregation, ordering, etc. However, since our focus is on querying
multiple data sources, the emphasis is on the join operation. Joins between stars
translate into joins between ParSets (Figure 2 phase I). Next, ParSet pairs are all
iteratively joined to form the Results ParSet (Figure 2 phase II) using Listing 1.1
1  <#AuthorMap>
2    rml:logicalSource [
3      rml:source "../authors.parquet" ; nosql:store nosql:parquet ] ;
4    rr:subjectMap [
5      rr:template "http://exam.pl/../{AID}" ; rr:class nlon:Author ] ;
6    rr:predicateObjectMap [
7      rr:predicate foaf:firstName ; rr:objectMap [ rml:reference "Fname" ] ] ;
8    rr:predicateObjectMap [ rr:predicate drm:worksFor ; rr:objectMap <#FunctionMap> ] .
9  <#FunctionMap>
10   fnml:functionValue [ rml:logicalSource "../authors.parquet" ; # Same as above
11     rr:predicateObjectMap [ rr:predicate fno:executes ;
12       rr:objectMap [ rr:constant grel:string_toUppercase ] ] ;
13     rr:predicateObjectMap [
14       rr:predicate grel:inputString ; rr:objectMap [ rml:reference "InID" ]
15   ] ] . # Transform "InID" attribute using grel:string_toUppercase
Listing 1.2: Mapping an entity using RML and FnO.
algorithm. In short, extracted join pairs are initially stored in an array. After
the first pair is joined, the algorithm iterates through each remaining pair to attempt
further joins or, otherwise, adds the pair to a queue¹. Next, the queue is iterated
similarly; when a pair is joined, it is unqueued. The algorithm completes when the
queue is empty. As the Results ParSet is itself a ParSet, it can also undergo query
operations. The join capability of ParSets in the POA compensates for the lack of
join support common to many NoSQL databases, e.g., Cassandra, MongoDB [20].
Sometimes ParSets cannot be readily joined due to a syntactic mismatch between
attribute values. Squerall allows users to declare Transformations, which are atomic
operations applied to textual or numeral values; details are given in subsection 3.2.
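The following Scala sketch restates Listing 1.1 in the engine's terms, with Spark DataFrames standing in for ParSets. JoinPair and joinableWith are hypothetical helpers (Squerall's actual code differs), and, as in the setting above, every pair is assumed to eventually become joinable.

import scala.collection.mutable
import org.apache.spark.sql.DataFrame

case class JoinPair(left: DataFrame, right: DataFrame,
                    leftKey: String, rightKey: String)

def joinAll(pairs: Seq[JoinPair],
            joinableWith: (DataFrame, JoinPair) => Boolean): DataFrame = {
  val first = pairs.head                  // first join pair seeds the results
  var results = first.left.join(first.right,
    first.left(first.leftKey) === first.right(first.rightKey))
  val pending = mutable.Queue[JoinPair]()
  for (p <- pairs.tail)                   // attempt each remaining pair
    if (joinableWith(results, p))
      results = results.join(p.right, results(p.leftKey) === p.right(p.rightKey))
    else pending.enqueue(p)               // defer pairs not yet joinable
  while (pending.nonEmpty) {              // retry deferred pairs until none remain
    val p = pending.dequeue()
    if (joinableWith(results, p))
      results = results.join(p.right, results(p.leftKey) === p.right(p.rightKey))
    else pending.enqueue(p)
  }
  results
}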
3 Implementation
Squerall² is written in Scala. It uses RML and FnO to declare data mappings
and transformations, and Spark [35] and Presto³ as query engines.
3.1 Data Mapping
Squerall accepts entity and attribute mappings declared in RML [10], a map-
ping language extending the W3C R2RML [7] to allow mapping heterogeneous
sources. The following fragment is expected (e.g., #AuthorMap in Listing 1.2):
– rml:logicalSource: used to specify the entity source and type.
– rr:subjectMap: used (only) to extract the entity ID (in brackets).
– rr:predicateObjectMap: used for all entity attributes; maps an attribute
using rml:reference to an ontology term using rr:predicate.
¹ We used a queue data structure simply to be able to dynamically pull (unqueue)
elements from it iteratively until it has no more elements.
² Available at https://github.com/EIS-Bonn/Squerall (Apache-2.0 license).
³ http://prestodb.io
We complement RML with the property nosql:store (line 3, Listing 1.2) from our NoSQL
ontology⁴, to enable specifying the entity's store type, e.g., Cassandra, MongoDB, etc.
3.2 Data Transformation
To enable data joinability, Squerall allows users to declare transformations. Two
requirements should be met: (1) the transformation specification should be decoupled
from its technical implementation, and (2) transformations should be performed
on-the-fly at query-time, complying with the Data Lake definition.
We incorporate the Function Ontology (FnO, [8]), which allows declaring
machine-processable high-level functions, abstracting from the concrete tech-
nology used. We use FnO in conjunction with RML similarly to the approach
in [9] applied to the DBpedia Extraction Framework. However, we do not phys-
ically generate RDF triples but only apply FnO transformations on-the-fly at
query-time. Instead of directly referencing an entity attribute with rml:reference
(e.g., line 7, Listing 1.2), we reference an FnO function that alters the values
of the attribute (line 8, Listing 1.2). For example, in Listing 1.2 the attribute
InID (line 14) is indirectly mapped to the ontology term drm:worksFor via the
#FunctionMap. This implies that the attribute values are to be transformed us-
ing the function represented by the #FunctionMap, grel:string_toUppercase
(line 12). The latter sets the InID attribute values to uppercase.
Squerall visits the mappings at query-time and triggers specific Spark and
Presto operations over the query's intermediate results whenever a transformation
declaration is met. In Spark, a map() transformation is used; in Presto, corre-
sponding string or numeral SQL operations are used. For the uppercase example,
in Spark the upper(column) function is used inside a map(); in Presto,
the SQL upper() string function is used.
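As a minimal Spark sketch of what such a triggered operation amounts to, assuming authorsDF is the DataFrame wrapping the authors entity (withColumn here plays the role of the map()-based rewrite described above):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.upper

// Rewrite the mapped attribute's column before it is used as a join key:
def uppercaseAttribute(df: DataFrame, attribute: String): DataFrame =
  df.withColumn(attribute, upper(df(attribute)))

// e.g., uppercaseAttribute(authorsDF, "InID") applies grel:string_toUppercase
// on-the-fly; in Presto the same declaration triggers SQL's upper().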
3.3 Data Wrapping and Querying
We implement the Squerall engine using two popular frameworks: Apache Spark
and Presto. Spark is a general-purpose processing engine and Presto a dis-
tributed SQL query engine for interactive querying; both base their computa-
tions primarily in memory. We leverage Spark's and Presto's connector concept,
a wrapper able to load data from an external source into their internal
data structure (ParSet), flattening any non-tabular representa-
tions. Spark's internal data structure is called DataFrame, a tabular
structure programmatically queried in SQL. Its schema corresponds to the
schema of the ParSet, with a column per star predicate. As explained with ParSets,
DataFrames are created from the relevant entities, and incrementally joined.
Other non-join operations found in the SPARQL query (e.g., selection, aggrega-
tion, ordering) are translated to equivalent SQL operations; they are applied
either at the level of individual DataFrames or at the level of the final results
DataFrame, whichever is more efficient. As an optimization, in order to reduce
⁴ URL: http://purl.org/db/nosql#; details are out of the scope of this article.
(a) SPARQL UI. (b) Mapping UI.
Fig. 3: Screenshots from Squerall GUIs.
the intermediate results and, thus, the data to join with, we push selections
and transformations down to the level of individual DataFrames. We leave aggrega-
tion and ordering to the final results DataFrame, as those have a results-wide
effect. Presto also loads data into its internal native data structure. However,
unlike Spark, it does so transparently; it is not possible to manipulate those
data structures. Rather, Presto accepts one self-contained SQL query with refer-
ences to all the relevant data sources, e.g., SELECT * FROM cassandra.cdb.product C
JOIN mongo.mdb.producer M ON C.producerID = M.ID. ParSets in this case
are views (SELECT sub-queries), which we create, join and optimize similarly
to DataFrames.
Spark and Presto make using connectors very convenient: users only provide
values for a pre-defined list of options. Spark DataFrames are created as fol-
lows: spark.read.format(sourceType).options(options).load. In Presto,
options are added to a simple file. Leveraging this simplicity, Squerall sup-
ports five data sources out-of-the-box: Cassandra, MongoDB, Parquet, CSV and
JDBC (MySQL tested). We chose Spark and Presto as they strike a good balance
between the number of connectors, ease of use and performance [33,17].
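For illustration, a sketch of how two of the supported sources could be wrapped and joined via Spark connectors; the option values (keyspace, table, URI) are placeholders, and the exact format strings depend on the connector versions used:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("squerall-sketch").getOrCreate()

// Wrap a Cassandra table as a DataFrame (spark-cassandra-connector):
val product = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "cdb", "table" -> "product")).load()

// Wrap a MongoDB collection (MongoDB Spark connector):
val producer = spark.read.format("com.mongodb.spark.sql")
  .option("uri", "mongodb://localhost:27017/mdb.producer").load()

// Both are now ParSets in the POA and can be joined in a distributed manner:
val joined = product.join(producer, product("producerID") === producer("ID"))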
3.4 User Interfaces
Squerall is provided with three user interfaces for generating its three
required input files (config, mappings and query), respectively described as follows:
Connect UI shows and receives from users the required options that enable the
connection to a data source, e.g., host, port, password, cluster settings, etc.
Mapping UI uses the connection options to extract the data schema (entities and
attributes). It then allows users to fill in or search existing ontology catalogues
for equivalent ontology terms (cf. Figure 3b).
SPARQL UI guides non-SPARQL experts to build correct SPARQL queries
by means of widgets offered for the different SPARQL constructs (cf. Figure 3a).
                         Product    Offer    Review   Person  Producer
Generated Data (BSBM)    Cassandra  MongoDB  Parquet  CSV     MySQL
# of tuples, scale 0.5M  0.5M       10M      5M       26K     10K
# of tuples, scale 1.5M  1.5M       30M      15M      77K     30K
# of tuples, scale 5M    5M         100M     50M      2.6M    100K
Table 1: Data sources and corresponding number of tuples loaded.
4 Performance Analysis
4.1 Setup
Datasets: As there is no dedicated Data Lake benchmark with SPARQL sup-
port, we opt for BSBM [3], a benchmark conceived for comparing the perfor-
mance of RDF triple stores with SPARQL-to-SQL rewriters. We use its data
generator to generate SQL dumps. We pick five tables: Product, Producer, Of-
fer, Review, and Person, pre-process them to extract tuples, and load them
into five different data sources (cf. Table 1). Those were chosen to enable up to
4-chain joins across five different data sources. We generate three scales: 0.5M, 1.5M
and 5M (number of products)⁵.
Queries: Since we only populate a subset of the generated BSBM tables, we
alter the initial queries accordingly. We discard joins with tables we
do not consider, e.g., Vendors, and replace them with populated ones. All
queries⁶ result in a cross-source join, from 1 join (between 2 sources, e.g., Q3) to 4 joins
(between 5 sources, e.g., Q4). Queries with yet-unsupported syntax, e.g., DESCRIBE,
CONSTRUCT, are omitted.
Metrics: We evaluate results accuracy and query performance. For accuracy, we
compare query results against a centralized relational database (MySQL), as it
represents data at its highest level of consistency. For performance, we measure
query execution time w.r.t. the three generated scales. A particular emphasis is
put on the impact of the number of joins on query time. We run each query three
times and report the mean value. The timeout threshold is set to 3600s.
Environment: We ran our experiments on a cluster of three DELL PowerEdge
R815 machines, each with 2x AMD Opteron 6376 (16-core) CPUs and
256GB RAM. In order to exclude any effect of caching, all queries are run on a
cold cache. Also, no optimization techniques of the query engines were used.
4.2 Results and Discussion
We compare the performance of Squerall's two underlying query engines, Spark
and Presto, in the absence of comparable work able to query all five data sources
and the SPARQL fragment that Squerall supports.
Accuracy. The number of results returned by Squerall was, in all queries at scale
0.5M, identical to MySQL's, i.e., 100% accuracy. MySQL timed out with data of
⁵ The 1.5M scale factor generates 500M RDF triples, and the 5M factor 1.75B triples.
⁶ See https://github.com/EIS-Bonn/Squerall/tree/master/evaluation
[Figure 4 comprises three bar charts of per-query execution times (seconds) for Q1-Q8 and Q10, comparing Presto-based and Spark-based Squerall at (a) scale 0.5M, (b) scale 1.5M and (c) scale 5M; panel (c) additionally plots Presto over the transformed data.]
Fig. 4: Query execution time (seconds). The labels on top of Presto's columns
show the percentage between Presto's and Spark's execution times, e.g., in (a)
in Q2, 178 means that Presto-based Squerall is 178% faster than Spark-based.
scale 1.5M; from that scale onward we compared the two engines against each
other, and their results were also identical.
Performance. The results (cf. Figure 4) suggest that Squerall overall exhibits
reasonable performance throughout the various queries, i.e., different numbers
of joins, with and without filtering and ordering. Presto-based Squerall exhibits
significantly better performance than Spark-based, up to an order of magni-
tude. With the 0.5M data scale, query performance is superior across all the
queries, with an increase of up to 800%. With scales 1.5M and 5M, Presto-based
Squerall is superior in all queries other than Q1, with an increase of up to 1300%. This
superiority is due to a number of factors. Presto is built following MPP
(Massively Parallel Processing) principles specifically to optimize SQL querying.
Another factor is that Presto has far less preparation overhead (e.g., mid-query
fault-tolerance, resource negotiation) than Spark. Spark, on the other hand, is
a general-purpose system, basing its SQL library on its native in-memory tab-
ular data structure, which was not originally conceived for ad hoc querying. Also, it incurs
overhead to guarantee query resiliency and manage resources.
Query performance was homogeneous across all the scales and between the
two engines, except for Q2, which was among the fastest in Presto, contrary
to Spark. This query is special in that it projects out most Product attributes
while joining the largest entity, Offer, without any filtering, which may indicate that
Presto handles intermediate results of unselective queries better. Q3 was the
[Figure 5 comprises two bar charts of per-query execution times for Q1-Q8 and Q10 across scales 0.5M, 1.5M and 5M: (a) Spark-based Squerall query times, (b) Presto-based Squerall query times.]
Fig. 5: Numbers above the bars denote time percentage differences between the
scales, e.g., in (a) Q2's execution time at scale 0.5M is 32% of that at scale 1.5M,
which in turn is 32% of that at scale 5M. On average, percentages are ≈30% in all cases (both
engines), which is proportional to the data scale (0.5M ≈ 30% of 1.5M ≈ 30% of 5M).
fastest, as it has the lowest number of joins. It is followed by Q1 and Q8, which
contain more joins but significantly filter the number of products. The rest
of the queries are notably slower because they join with the large entity Offer.
The presence of the LIMIT clause did not have a direct effect on query time; it is
present in Q1-Q5, Q8 and Q10, across which the performance varies significantly.
Although the current data distribution does not represent the best-case sce-
nario (e.g., query performance would be better if Review data were loaded into
Cassandra instead), we intentionally stored the large data in the better-per-
forming data sources. Our purpose in this experiment series was to observe
Squerall's behavior using Spark and Presto across the various scales and queries.
For example, we observed that, although the largest data entity was loaded into
a very efficient database, MongoDB, the queries involving this entity were the
slowest anyway. This gives an indication of the performance of those queries if
the same entity were loaded into a less capable source, e.g., Parquet or CSV.
Increasing the data size did not diminish query performance; query times were
approximately proportional to the data size (cf. Figure 5) and remained under
the threshold. In order to evaluate the effect of the query-time data transforma-
tions, we intentionally introduce variations into the data so it becomes unjoinable.
In table Product, we decrease the column pr values by 71; in table Producer,
we append the string “-A” to all values of column pr; and in table Review, we
prefix the values of column person with the character “P”. We declare the nec-
essary transformations accordingly. The results show that there is a negligible
cost in the majority of the cases. This is attributed to the fact that both Spark
and Presto base their computations in memory. In Spark, those transformations in-
volve only the map function, which is executed locally very efficiently, not requiring
any data movement. Only a few queries at scale 5M in Presto-based Squerall exhibited
noticeable but not significant costs, e.g., Q1 and Q10. Due to the insignificant
differences, and to improve readability, we only add the results of the transfor-
mation cost at scale 5M (Figure 4c). Our results could be considered a
performance comparison between Spark and Presto, of which few exist [17,33].
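In Spark terms, the three declared transformations boil down to column rewrites along the lines of the following sketch; the DataFrame and column names follow the description above, while the declarations themselves are made via FnO in the mappings (cf. Section 3.2):

import org.apache.spark.sql.functions.{col, regexp_replace}

// Undo the introduced variations at query-time so the join keys match again:
val productJoinable  = product.withColumn("pr", col("pr") + 71)
val producerJoinable = producer.withColumn("pr",
  regexp_replace(col("pr"), "-A$", ""))      // strip the appended "-A"
val reviewJoinable   = review.withColumn("person",
  regexp_replace(col("person"), "^P", ""))   // strip the prefixed "P"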
5 Availability, Sustainability, Usability and Extensibility
Squerall has been integrated⁷ into SANSA [18] since version 0.5, a large framework for
distributed querying, inference and analytics over knowledge graphs. SANSA has
been used across a range of funded projects, such as BigDataEurope, SLIPO,
and BETTER. Via the integration into SANSA, Squerall becomes available to
a large user base. It benefits from SANSA's various deployment options, e.g.,
Maven Central integration and runnable notebooks. SANSA has an active devel-
oper community (≈20 developers), an active mailing list, an issue tracking system,
a website and social media channels. Prior to its integration, Squerall's features were
recurrently requested in SANSA, to allow it to also access large non-RDF data.
Squerall's sustainability is ensured until 2022 thanks to a number of contributing
innovation projects including Simple-ML, BETTER, SLIPO, and MLwin. Fur-
ther, Squerall is being adopted and extended in an internal use-case of a large
industrial company; we plan to report on this in a separate, more adequate submission.
The development of Squerall was driven by technical challenges identified by
the H2020 BigDataEurope project⁸, whose main technical result, the Big Data
Integrator platform, retains a significant amount of interest from the open-source
community. In the absence of appropriate technical solutions supporting Data
Lake scenarios, the platform's development was constrained to transforming and
centralizing most of the data in the use-cases. The need to invest in architectures,
tools and methodologies allowing for decentralized Big Data management was
highly emphasized by the project. Further, demonstrating the feasibility and
effectiveness of OBDA on top of the ever-increasing NoSQL movement has
a positive impact on the adoption of Semantic Web principles. This indicates
clear evidence of Squerall's value and role in the community.
Squerall is openly available under Apache-2.0 terms; it is hosted on GitHub⁹
and registered in Zenodo¹⁰ (DOI: 10.5281/zenodo.2636436). Squerall makes use
of open standards and ontologies, including SPARQL, RML, and FnO. It can eas-
ily be built and used thanks to its detailed documentation. Its usage is facilitated
by the accompanying user interfaces, which are demonstrated in a walkthrough
screencast¹¹. Further, we provide a Docker image allowing to easily set up Squer-
all and reproduce the presented evaluation. Squerall was built with extensibility
in mind; it can be programmatically extended¹² by (1) adding a new query en-
gine, e.g., Drill, thanks to its modular code design (cf. Figure 6), and (2) supporting
more data sources with minimal effort by leveraging Spark/Presto connectors.
A mailing list and a Gitter community are made available for the users.
⁷ https://github.com/SANSA-Stack/SANSA-DataLake
⁸ www.big-data-europe.eu & https://github.com/big-data-europe
⁹ https://github.com/EIS-Bonn/Squerall
¹⁰ https://zenodo.org/record/2636436
¹¹ https://github.com/EIS-Bonn/Squerall/tree/master/evaluation/screencasts
¹² https://github.com/EIS-Bonn/Squerall/wiki/Extending-Squerall
[Figure 6 depicts Squerall's class call hierarchy among the classes Run, Query Analyser, Planner, Mapper, Main and Query Executor, with Spark Executor, Presto Executor, etc. as subclasses of Query Executor.]
Fig. 6: Squerall class call hierarchy. Engine classes (colored) are decoupled from
the rest (grey). A new engine can be added by extending only the Query Executor
class (implementing its methods).
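A hypothetical Scala sketch of such an extension; the trait below mirrors the role of the Query Executor class in Figure 6, while the actual interface (documented in the wiki¹²) differs:

// Engine-agnostic contract: wrap sources into ParSets and join them.
trait QueryExecutor[P] {
  def query(source: String, options: Map[String, String]): P
  def join(left: P, right: P, leftKey: String, rightKey: String): P
}

// A new engine, e.g., Drill, plugs in by implementing the same methods:
class DrillExecutor extends QueryExecutor[java.sql.ResultSet] {
  def query(source: String, options: Map[String, String]): java.sql.ResultSet = ???
  def join(left: java.sql.ResultSet, right: java.sql.ResultSet,
           leftKey: String, rightKey: String): java.sql.ResultSet = ???
}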
6 Related Work
There are several solutions for mapping relational databases to RDF [28], and
OBDA over relational databases [34], e.g., Ontop, Morph, Ultrawrap, Mastro,
Stardog. Although we share the OBDA concepts, our focus is on heterogeneous
non-relational and distributed scalable databases. On the non-relational
side, there have been a number of efforts, which we can classify into ontology-
based and non-ontology-based.
For non-ontology-based access, [6] defines a mapping language to express ac-
cess links to NoSQL databases. It proposes an intermediate query language to
transform SQL to Java methods accessing NoSQL databases. However, query
processing is neither elaborated nor evaluated, e.g., cross-database joins are not
mentioned. [13] suggests that computation performance can be improved if
data is shifted at query-time between multiple databases; the suitable database
is decided on a case-by-case basis. Although it demonstrates that the overall
performance, including the planning and data movement, is higher than when using
a single database, this is not proven to hold for large data. In real large-scale
settings, data movement and I/O can dominate the query time. [26] allows
running CRUD operations over NoSQL databases; beyond that, the same authors in [27]
enable joins as follows: if the join involves entities in the same database, it is
performed locally; if not, or if the database lacks join capability, data is moved
to another, capable database. This implies that no intra-source distributed join
is possible and, similarly to [13], moving data can become a bottleneck at large
scales. [1] proposes a unifying programming model to interface with different
NoSQL databases. It allows direct access to individual databases using get, put
and delete primitives. Joins between databases are not addressed. [16] proposes a
SQL-like language containing invocations to the native query interfaces of rela-
tional and NoSQL databases. The learning curve of this query language is higher
than that of other efforts suggesting to query solely using plain (or minimally adapted)
SQL, JSONPath or SPARQL. Although its architecture is distributed, it is not
explicitly stated whether intra-source joins are also distributed. Besides, the
source code is unfortunately not available. A number of efforts, e.g., [29,30,4], aim at
bridging the gap between relational and NoSQL databases, but only one database
is evaluated. Given the high semantic and structural heterogeneity found across
NoSQL databases, a single database cannot be representative of the whole family.
Among those, [30] adopts JSON as both the conceptual and physical data model.
This requires physically transforming the query's intermediate results, costing the
engine a transformation price (a limitation also observed in other efforts). More-
over, the prototype is evaluated with only small data on a single machine. [22]
presents SQL++, an ambitious general query language based on SQL
and JSON. It covers a vast portion of the capabilities of the query languages found
across NoSQL databases. However, the focus is on the query language, and the
prototype is only minimally validated using a single database, MongoDB. [31]
considers data duplicated in multiple heterogeneous sources and identifies the
best source to send a query to. Thus, joins between sources are not explored. For
ontology-based access, Optique [14] is a reference platform, with consideration
also for dynamic streaming data. Although based on the open-source Ontop,
the sources of its Big Data instance are not publicly available. Ontario¹³ is a very
similar (unpublished) work; however, we were not able to run it due to the lack
of documentation, and it appears that wrappers are manually created. [5] con-
siders simple query examples, where joins are minimally addressed; a distributed
implementation is future work.
In all the surveyed solutions, support for data source variety is limited or faces
bottlenecks. Only support for a few data sources (1-3) is observed, and wrappers
are manually created or hard-coded. In contrast, Squerall does not reinvent the
wheel and makes use of the many wrappers of existing engines. This makes it the
solution with the broadest support of the Big Data Variety dimension in terms
of data sources. Additionally, Squerall has among the richest query capabilities
(see the full supported fragment¹⁴), from joining and aggregation to various query modifiers.
7 Conclusion and Future Work
In this article, we presented Squerall, a framework realizing the Semantic Data
Lake, i.e., querying heterogeneous and large data sources using Semantic Web
techniques. It performs distributed cross-source join operations and allows users to
declare transformations that enable joinability on-the-fly at query-time. Squerall
is built using state-of-the-art Big Data technologies, Spark and Presto. Relying
on the latter's connectors to wrap the data, Squerall relieves users from hand-
crafting wrappers, a major bottleneck in supporting data variety throughout
the literature. This also makes Squerall easily extensible; e.g., in addition to the
five sources evaluated here, Couchbase and Elasticsearch were also tested. There
are dozens of connectors already available¹⁵. Furthermore, due to its modu-
lar code design, Squerall can also be programmatically extended to use other
query engines. In the future, we plan to support more SPARQL operations, e.g.,
OPTIONAL and UNION, and also to exploit the query engines' own optimizations
to accelerate query performance. Finally, in such a heterogeneous environment,
there is a natural need for retaining provenance at the data and query results levels.
¹³ https://github.com/SDM-TIB/Ontario
¹⁴ https://github.com/EIS-Bonn/Squerall/tree/master/evaluation
¹⁵ https://spark-packages.org and https://prestodb.io/docs/current/connector
Acknowledgment
This work is partly supported by the EU H2020 projects BETTER (GA 776280)
and QualiChain (GA 822404); and by the ADAPT Centre for Digital Con-
tent Technology funded under the SFI Research Centres Programme (Grant
13/RC/2106) and co-funded under the European Regional Development Fund.
References
1. Atzeni, P., Bugiotti, F., Rossi, L.: Uniform access to non-relational database sys-
tems: The SOS platform. In: Ralyté, J., Franch, X., Brinkkemper, S., Wrycza, S.
(eds.) In CAiSE. vol. 7328, pp. 160–174. Springer (2012)
2. Auer, S., Scerri, S., Versteden, A., Pauwels, E., Charalambidis, A., Konstantopou-
los, S., Lehmann, J., Jabeen, H., Ermilov, I., Sejdiu, G., et al.: The BigDataEurope
platform–supporting the variety dimension of big data. In: International Confer-
ence on Web Engineering. pp. 41–59. Springer (2017)
3. Bizer, C., Schultz, A.: The berlin SPARQL benchmark. International Journal on
Semantic Web and Information Systems (IJSWIS) 5(2), 1–24 (2009)
4. Botoeva, E., Calvanese, D., Cogrel, B., Corman, J., Xiao, G.: A generalized frame-
work for ontology-based data access. In: International Conference of the Italian
Association for Artificial Intelligence. pp. 166–180. Springer (2018)
5. Curé, O., Kerdjoudj, F., Faye, D., Le Duc, C., Lamolle, M.: On the potential inte-
gration of an ontology-based data access approach in NoSQL stores. International
Journal of Distributed Systems and Technologies (IJDST) 4(3), 17–30 (2013)
6. Curé, O., Hecht, R., Le Duc, C., Lamolle, M.: Data integration over NoSQL stores
using access path based mappings. In: International Conference on Database and
Expert Systems Applications. pp. 481–495. Springer (2011)
7. Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF mapping language.
W3C Recommendation, Sept. 2012
8. De Meester, B., Dimou, A., Verborgh, R., Mannens, E.: An ontology to semanti-
cally declare and describe functions. In: ISWC. pp. 46–49. Springer (2016)
9. De Meester, B., Maroy, W., Dimou, A., Verborgh, R., Mannens, E.: Declarative
data transformations for linked data generation: the case of DBpedia. In: European
Semantic Web Conference. pp. 33–48. Springer (2017)
10. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de
Walle, R.: RML: A generic language for integrated RDF mappings of heterogeneous
data. In: LDOW (2014)
11. Dixon, J.: Pentaho, Hadoop, and Data Lakes (2010), https://jamesdixon.
wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes, online; accessed
27-January-2019
12. Endris, K.M., Galkin, M., Lytra, I., Mami, M.N., Vidal, M.E., Auer, S.: MULDER:
querying the linked data web by bridging RDF molecule templates. In: Int. Conf.
on Database and Expert Systems Applications. pp. 3–18. Springer (2017)
13. Gadepally, V., Chen, P., Duggan, J., Elmore, A., Haynes, B., Kepner, J., Madden,
S., Mattson, T., Stonebraker, M.: The BigDAWG polystore system and architecture.
In: High Performance Extreme Computing Conference. pp. 1–6. IEEE (2016)
14. Giese, M., Soylu, A., Vega-Gorgojo, G., Waaler, A., Haase, P., Jiménez-Ruiz, E.,
Lanti, D., Rezk, M., Xiao, G., Özçep, Ö., et al.: Optique: Zooming in on big data.
Computer 48(3), 60–67 (2015)
15. Harris, S., Seaborne, A., Prud’hommeaux, E.: SPARQL 1.1 query language. W3C
recommendation 21(10) (2013)
16. Kolev, B., Valduriez, P., Bondiombouy, C., Jiménez-Peris, R., Pau, R., Pereira, J.:
CloudMdsQL: querying heterogeneous cloud data stores with a common language.
Distributed and Parallel Databases 34(4), 463–503 (2016)
17. Kolychev, A., Zaytsev, K.: Research of the effectiveness of SQL engines working in
HDFS. Journal of Theoretical & Applied Information Technology 95(20) (2017)
18. Lehmann, J., Sejdiu, G., Bühmann, L., Westphal, P., Stadler, C., Ermilov, I., Bin,
S., Chakraborty, N., Saleem, M., Ngomo, A.C.N.: Distributed semantic analytics
using the SANSA stack. In: ISWC. pp. 147–155. Springer (2017)
19. Mami, M.N., Graux, D., Scerri, S., Jabeen, H., Auer, S.: Querying data lakes using
spark and presto. To appear in The WebConf - Demonstrations (2019)
20. Michel, F., Faron-Zucker, C., Montagnat, J.: A mapping-based method to query
MongoDB documents with SPARQL. In: International Conference on Database
and Expert Systems Applications. pp. 52–67. Springer (2016)
21. Miloslavskaya, N., Tolstoy, A.: Application of big data, fast data, and data lake
concepts to information security issues. In: International Conference on Future
Internet of Things and Cloud Workshops. pp. 148–153. IEEE (2016)
22. Ong, K.W., Papakonstantinou, Y., Vernoux, R.: The SQL++ unifying semi-
structured query language, and an expressiveness benchmark of SQL-on-Hadoop,
NoSQL and NewSQL databases. CoRR, abs/1405.3631 (2014)
23. Poggi, A., Lembo, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Rosati, R.:
Linking data to ontologies. In: Journal on Data Semantics X. Springer (2008)
24. Quix, C., Hai, R., Vatov, I.: GEMMS: A generic and extensible metadata manage-
ment system for data lakes. In: CAiSE Forum. pp. 129–136 (2016)
25. Saleem, M., Ngomo, A.C.N.: Hibiscus: Hypergraph-based source selection for
SPARQL endpoint federation. In: Ext. Semantic Web Conf. Springer (2014)
26. Sellami, R., Bhiri, S., Defude, B.: Supporting multi data stores applications in
cloud environments. IEEE Trans. Services Computing 9(1), 59–71 (2016)
27. Sellami, R., Defude, B.: Complex queries optimization and evaluation over rela-
tional and NoSQL data stores in cloud environments. IEEE Trans. Big Data 4(2),
217–230 (2018)
28. Spanos, D., Stavrou, P., Mitrou, N.: Bringing relational databases into the semantic
web: A survey. Semantic Web pp. 1–41 (2010)
29. Unbehauen, J., Martin, M.: Executing SPARQL queries over mapped document
stores with SparqlMap-M. In: 12th Int. Conf. on Semantic Systems (2016)
30. Vathy-Fogarassy, Á., Hugyák, T.: Uniform data access platform for SQL and
NoSQL database systems. Information Systems 69, 93–105 (2017)
31. Vogt, M., Stiemer, A., Schuldt, H.: Icarus: Towards a multistore database system.
2017 IEEE International Conference on Big Data (Big Data) pp. 2490–2499 (2017)
32. Walker, C., Alrehamy, H.: Personal data lake with data gravity pull. In: 5th Inter-
national Conf. on Big Data and Cloud Computing. pp. 160–167. IEEE (2015)
33. Wiewiórka, M.S., Wysakowicz, D.P., Okoniewski, M.J., Gambin, T.: Benchmark-
ing distributed data warehouse solutions for storing genomic variant information.
Database 2017 (2017)
34. Xiao, G., Calvanese, D., Kontchakov, R., Lembo, D., Poggi, A., Rosati, R., Za-
kharyaschev, M.: Ontology-based data access: A survey. IJCAI (2018)
35. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: Cluster
computing with working sets. HotCloud 10(10-10), 95 (2010)