Squerall: Virtual Ontology-Based Access to
Heterogeneous and Large Data Sources
Mohamed Nadjib Mami1,2, Damien Graux2,3, Simon Scerri1,2, Hajira
Jabeen1, Sören Auer4, and Jens Lehmann1,2
1Smart Data Analytics (SDA) Group, Bonn University, Germany
2Enterprise Information Systems, Fraunhofer IAIS, Germany
3ADAPT Centre, Trinity College of Dublin, Ireland
4TIB & Hannover University, Germany
{mami,scerri,jabeen,jens.lehmann}@cs.uni-bonn.de
damien.graux@iais.fraunhofer.de, auer@l3s.de
Abstract. The last two decades witnessed a remarkable evolution in
terms of data formats, modalities, and storage capabilities. Instead of
having to adapt one’s application needs to the, earlier limited, available
storage options, today there is a wide array of options to choose from to
best meet an application’s needs. This has resulted in vast amounts of
data available in a variety of forms and formats which, if interlinked and
jointly queried, can generate valuable knowledge and insights. In this
article, we describe Squerall: a framework that builds on the principles
of Ontology-Based Data Access (OBDA) to enable the querying of dis-
parate heterogeneous sources using a unique query language, SPARQL.
In Squerall, original data is queried on-the-fly without prior data materi-
alization or transformation. In particular, Squerall allows the aggregation
and joining of large data in a distributed manner. Squerall supports out-
of-the-box ve data sources and moreover, it can be programmatically
extended to cover more sources and incorporate new query engines. The
framework provides user interfaces for the creation of necessary inputs, as
well as guiding non-SPARQL experts to write SPARQL queries. Squerall
is integrated into the popular SANSA stack and available as open-source
software via GitHub and as a Docker image.
Software Framework: https://eis-bonn.github.io/Squerall
1 Introduction
For over four decades, relational data management remained a dominant paradigm
for storing and managing structured data. However, the advent of extremely
large-scale applications revealed the weakness of relational data management
at dynamically and horizontally scaling the storage and querying of massive
amounts of data. This prompted a paradigm shift, calling for a new breed of
databases capable of managing large data volumes without jeopardising query
performance by reducing query expressivity and consistency requirements. From 2008 to date, a wide array of so-called non-relational or NoSQL (Not only SQL)
databases emerged (e.g., Cassandra, MongoDB, Couchbase, Neo4j). This het-
erogeneity contributed to one of the main Big Data challenges: variety. The
integration of heterogeneous data is the key rationale for the development of se-
mantic technologies over the past two decades. Local data schemata are mapped
to global ontology terms, using mapping languages that have been standard-
ized for a number of popular data representations, e.g., relational data, JSON,
CSV or XML. Heterogeneous data can then be accessed in a uniform manner
by means of queries in a standardized query language, SPARQL [15], employing
terms from the ontology. Such data access is commonly referred to as Ontology-
Based Data Access (OBDA) [23]. The term Data Lake [11] refers to the schema-
less pool of heterogeneous and large data residing in its original formats on a
horizontally-scalable cluster infrastructure. It comprises databases (e.g., NoSQL
stores) or scale-out file/block storage infrastructure (e.g., Hadoop Distributed
File System), and requires dealing with the original data without prior physical
transformation or pre-processing. After emerging in industry, the concept has
increasingly been discussed in the literature [24,21,32]. The integration of se-
mantic technologies into Data Lakes led to the Semantic Data Lake concept,
briey introduced in our earlier work [2]. By adopting the OBDA paradigm to
the NoSQL and Data Lake technology space, we realize the Semantic Data Lake
concept and present in this article a comprehensive implementation.
Implementing an OBDA architecture atop Big Data raises three challenges:
1. Query translation. SPARQL queries must be translated into the query lan-
guage of each of the respective data sources. A generic and dynamic transla-
tion between data models is challenging (even impossible in some cases e.g.,
join operations are unsupported in Cassandra and MongoDB [20]).
2. Federated Query Execution. In Big Data scenarios it is common to have non-
selective queries with large intermediate results, so joining or aggregation
cannot be performed on a single node, but only distributed across a cluster.
3. Data silos. Data coming from various sources can be connected to generate
new insights, but it may not be readily ‘joinable’ (cf. definition below).
To target the aforementioned challenges we build Squerall [19], an extensible
framework for querying Data Lakes.
– It allows ad hoc querying of large and heterogeneous data sources virtually without any data transformation or materialization.
– It allows distributed query execution, in particular the joining of disparate heterogeneous sources.
– It enables users to declare query-time transformations for altering join keys and thus making data joinable.
Squerall integrates the state-of-the-art Big Data engines Apache Spark and Presto with the semantic technologies RML and FnO.
The article is structured as follows. Squerall’s architecture is presented in Sec-
tion 2 and its implementation in Section 3. The performance is evaluated in Sec-
tion 4 and its sustainability, availability and extensibility aspects are discussed
in Section 5. Related Work is presented in Section 6 and Section 7 concludes
with an outlook on possible future work.
Fig. 1: Squerall Architecture (Mappings, Query and Config are user inputs). The Query Decomposer, Relevant Entity Extractor, Data Wrapper and Distributed Query Processor operate over the relevant data sources of the Data Lake inside the Parallel Operational Area (POA), producing ParSets that are transformed, unioned and joined into the final results.
2 Architecture
Squerall (Semantically query all) is built following the OBDA principles [23]. The
latter were originally devised for accessing relational data but do not impose a
restriction on the type or size of the data they deal with. We project them to large
and heterogeneous data sources contained in a Data Lake.
2.1 Preliminaries
In order to guide the subsequent discussions, we first define the following terms:
Data Attribute represents all concepts used by data sources to characterize a
particular stored datum, e.g., a column in a tabular database like Cassandra, or
aeld in a document database like MongoDB.
Data Entity and Relevant Entity: an entity represents all concepts that are
used by data sources to group together similar data, e.g., a table in a tabular
database or a collection in a document database. An entity has one or multi-
ple data attributes. An entity is relevant to a query if it contains information
matching a part of the query (similarly found in federated systems, e.g., [25]).
ParSet and Joinable ParSets: from Parallel dataSet, ParSet refers to a
data structure that is partitioned and distributed, and that is queried in par-
allel. A ParSet is populated on-the-fly and is not materialized. Joinable ParSets are
ParSets that store inter-matching values. For example, if the ParSet has a tabular
representation, it has the same meaning as joinable tables in relational algebra,
i.e., tables sharing common attribute values.
Parallel Operational Area (POA) is the parallel distributed environment
where ParSets are loaded, joined and transformed, in response to a query. It has
its internal data structure, which ParSets comply with.
Data Source refers to any storage medium, e.g., plain le storage or a database.
Data Lake is a repository of multiple data sources where data is stored and
accessed directly in its original form and format, without prior transformation.
?k bibo:isbn ?i .
?k dc:title ?t .
?k schema:author ?a .
?k bibo:editor ?e .
?a foaf:firstName ?fn .
?a foaf:lastName ?ln .
?a rdf:type nlon:Author .
?a drm:worksFor ?in .
?in rdfs:label ?n1 .
?in rdf:type vivo:Institute .
?e rdfs:label ?n2 .
?e rdf:type saws:Editor .
?c rdf:type swc:Chair .
?c dul:introduces ?r .
?c foaf:firstName ?cfn .
?r schema:reviews ?k .
Fig. 2: ParSets extraction and join (for clarity, ParSet(x) is shortened to PS(x)). The query BGP shown above is decomposed into star-shaped sub-BGPs; the figure further shows the extracted ParSet join pairs and their incremental join into the Results ParSet.
2.2 OBDA Building Blocks
A typical OBDA system is composed of five main components:
Data. A Data Lake is a collection of multiple heterogeneous data sources, be it
raw les (e.g., CSV), structured le formats (e.g., Parquet) or databases (e.g.,
MongoDB). Currently, Squerall does not support unstructured data but it can
be part of the Data Lake.
Schema. A Data Lake is by definition a schema-less repository of data. Schemata
exist at the level of the individual data sources.
Ontology. Ontologies are used to define a common domain conceptualization across Data Lake entities. At least class and property definitions are required.
Mappings. Mappings are association links between elements of the data schema
and ontology terms (i.e., classes and properties). Three mapping elements need
to be provided as input for an entity to be queried:
1. Class mapping: associates an entity to an ontology class.
2. Property mappings: associate entity attributes to ontology properties.
3. Entity ID: species an attribute to be used as identier of the entity.
For example, Author(AID,first_name,last_name) is an entity in a table of
a Cassandra database. In order to enable finding this entity, the user must provide the three mapping elements, for example: (1) Class mapping: (Author, nlon:Author); (2) Property mappings: (first_name, foaf:firstName), (last_name, foaf:lastName); and (3) Entity ID: AID. firstName and lastName are properties
from the foaf ontology (http://xmlns.com/foaf/spec) and Author is a class
from the nlon ontology (http://lod.nl.go.kr/page/ontology).
Query. The purpose of using a top query language, SPARQL in our case, is
mainly to join data coming from multiple sources. Therefore, we assume that
certain query forms and constructs are of less concern to Data Lake users, e.g.,
multi-level nested queries, queries with variable properties, CONSTRUCT queries.
2.3 Architecture Components
Squerall consists of four main components (cf. Figure 1). Because of Squerall’s extensible design, and also for clarity, we hereafter use the generic ParSet and POA concepts instead of Squerall’s underlying equivalent concrete terms, which differ from engine to engine; the latter are presented in Section 3.

Input:  ParSetJoinsArray  // an array of all join pairs [ParSet, ParSet]
Output: ResultsParSet     // a ParSet joining all ParSets

ResultsParSet = ParSetJoinsArray.head        // first join pair
iterate ParSetJoinsArray: current-pair
    if current-pair joinable_with ResultsParSet
        ResultsParSet = ResultsParSet join current-pair
    else add current-pair to PendingJoinsQueue
// Next, iterate through PendingJoinsQueue like ParSetJoinsArray

Listing 1.1: ParSet Join.
(1) Query Decomposer. This component is commonly found in OBDA and
query federation systems (e.g., [12]). It here decomposes the query’s Basic Graph
Pattern (BGP, conjunctive set of triple patterns in the where clause) into a set
of star-shaped sub-BGPs, where each sub-BGP contains all the triple patterns
sharing the same subject variable. We refer to these sub-BGPs as stars for brevity (see Figure 2 left, where stars are shown in distinct colored boxes). Query decomposition is subject-based (variable subjects), because the focus of query execution is on bringing and joining entities from different sources, not on retrieving a specific known entity. Retrieving a specific entity, i.e., where the subject is constant, would require parsing the full data and creating an index in a pre-processing phase. This defies the Data Lake definition of accessing original data without a pre-processing phase. A specific entity can be obtained, nonetheless, by filtering on its attributes.
(2) Relevant Entity Extractor. For every extracted star, this component
looks in the Mappings for entities that have attribute mappings to each of the
properties of the star. Such entities are relevant to the star.
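A minimal sketch of this lookup, assuming the mappings have already been parsed into a map from each entity to the set of ontology properties its attributes are mapped to (a hypothetical simplified representation of the Mappings input):

  // An entity is relevant to a star if it has an attribute mapping for every
  // property appearing in that star.
  def relevantEntities(starProperties: Set[String],
                       mappings: Map[String, Set[String]]): Set[String] =
    mappings.collect {
      case (entity, mappedProps) if starProperties.subsetOf(mappedProps) => entity
    }.toSet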
(3) Data Wrapper. In classical OBDA, the SPARQL query has to be translated to the query language of the relevant data sources. This is in practice hard to
achieve in the highly heterogeneous Data Lake settings. Therefore, numerous
recent publications (e.g., [4,29]) advocated for the use of an intermediate query
language. In our case, the intermediate query language is POA’s query language,
dictated by its internal data structure. The Data Wrapper generates data in
POA’s data structure at query-time, which allows for the parallel execution
of expensive operations, e.g., join. There must exist wrappers to convert data
entities from the source to POA’s data structure, either fully, or partially if parts
of the data can be pushed down to the original source. Each identified star from step (1) will generate exactly one ParSet. If more than one entity is relevant, the ParSet is formed as a union. An auxiliary user input, Config, is used to guide the conversion process, e.g., authentication or deployment specifications.
(4) Distributed Query Processor. Finally, ParSets are joined together form-
ing the nal results. ParSets in the POA can undergo any query operation, e.g.,
selection, aggregation, ordering, etc. However, since our focus is on querying
multiple data sources, the emphasis is on the join operation. Joins between stars
translate into joins between ParSets (Figure 2 phase I). Next, ParSet pairs are all
iteratively joined to form the Results ParSet (Figure 2 phase II) using Listing 1.1
 1  <#AuthorMap>
 2    rml:logicalSource [
 3      rml:source "../authors.parquet" ;
 4      nosql:store nosql:parquet ] ;
 5    rr:subjectMap [
 6      rr:template "http://exam.pl/../{AID}" ; rr:class nlon:Author ] ;
 7    rr:predicateObjectMap [
 8      rr:predicate foaf:firstName ; rr:objectMap [ rml:reference "Fname" ] ] ;
 9    rr:predicateObjectMap [
10      rr:predicate drm:worksFor ; rr:objectMap <#FunctionMap> ] .
11  <#FunctionMap>
12    fnml:functionValue [
13      rml:logicalSource "../authors.parquet" ;  # same source as above
14      rr:predicateObjectMap [
15        rr:predicate fno:executes ;
16        rr:objectMap [ rr:constant grel:string_toUppercase ] ] ;
17      rr:predicateObjectMap [
18        rr:predicate grel:inputString ; rr:objectMap [ rr:reference "InID" ] ]
19    ] .  # transform the "InID" attribute values using grel:string_toUppercase
Listing 1.2: Mapping an entity using RML and FnO.
algorithm. In short, the extracted join pairs are initially stored in an array. After the first pair is joined, the algorithm iterates through each remaining pair to attempt further joins; a pair that cannot yet be joined is added to a queue¹. Next, the queue is similarly iterated: when a pair is joined, it is unqueued. The algorithm completes when the queue is empty. As the Results ParSet is itself a ParSet, it can also undergo query operations. The join capability of ParSets in the POA compensates for the lack of join support common in many NoSQL databases, e.g., Cassandra, MongoDB [20]. Sometimes ParSets cannot be readily joined due to a syntactic mismatch between attribute values. Squerall allows users to declare Transformations, which are atomic operations applied to textual or numerical values; details are given in Subsection 3.2.
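To make the algorithm of Listing 1.1 concrete, the following is a minimal Scala sketch of the incremental join over ParSets represented as Spark DataFrames; the JoinPair type and the assumption that ParSet column names are unique across stars are simplifications, not Squerall's actual internal types:

  import org.apache.spark.sql.DataFrame
  import scala.collection.mutable

  // Hypothetical join pair: the subject variables of two stars and the columns
  // holding their matching join keys.
  case class JoinPair(left: String, right: String, leftKey: String, rightKey: String)

  // parSets maps each star variable to the DataFrame wrapped for it.
  def incrementalJoin(parSets: Map[String, DataFrame],
                      pairs: Seq[JoinPair]): DataFrame = {
    val queue  = mutable.Queue(pairs: _*)
    val first  = queue.dequeue()                      // first join pair
    val joined = mutable.Set(first.left, first.right)
    var results = parSets(first.left).join(parSets(first.right),
      parSets(first.left)(first.leftKey) === parSets(first.right)(first.rightKey))

    // Drain the queue while progress is possible; pairs not yet joinable
    // with the Results ParSet are re-queued for a later pass.
    var progress = true
    while (queue.nonEmpty && progress) {
      progress = false
      for (_ <- 1 to queue.size) {
        val p = queue.dequeue()
        if (joined(p.left) && joined(p.right)) {
          // both stars already joined: the pair only adds a filter condition
          results = results.filter(results(p.leftKey) === results(p.rightKey))
          progress = true
        } else if (joined(p.left) || joined(p.right)) {
          // attach the star that is not yet part of the results
          val (newStar, newKey, oldKey) =
            if (joined(p.left)) (p.right, p.rightKey, p.leftKey)
            else                (p.left,  p.leftKey,  p.rightKey)
          results = results.join(parSets(newStar),
            results(oldKey) === parSets(newStar)(newKey))
          joined += newStar
          progress = true
        } else queue.enqueue(p)
      }
    }
    results
  }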
3 Implementation
Squerall² is written in Scala. It uses RML and FnO to declare data mappings and transformations, and Spark [35] and Presto³ as query engines.
3.1 Data Mapping
Squerall accepts entity and attribute mappings declared in RML [10], a map-
ping language extending the W3C R2RML [7] to allow mapping heterogeneous
sources. The following fragment is expected (e.g., #AuthorMap in Listing 1.2):
– rml:logicalSource, used to specify the entity source and type.
– rr:subjectMap, used (only) to extract the entity ID (in brackets).
– rr:predicateObjectMap, used for all entity attributes; maps an attribute using rml:reference to an ontology term using rr:predicate.
¹ We used a queue data structure simply to be able to dynamically pull (unqueue) elements from it iteratively until it has no more elements.
² Available at https://github.com/EIS-Bonn/Squerall (Apache-2.0 license).
³ http://prestodb.io
We complement RML with the property nosql:store (line 4 of Listing 1.2) from our NoSQL ontology⁴, to enable specifying the store type of the entity, e.g., Cassandra, MongoDB, etc.
3.2 Data Transformation
To enable data joinability, Squerall allows users to declare transformations. Two
requirements should be met: (1) the transformation specification should be decoupled from the technical implementation, and (2) transformations should be performed on-the-fly at query time, complying with the Data Lake definition.
We incorporate the Function Ontology (FnO, [8]), which allows declaring
machine-processable high-level functions, abstracting from the concrete tech-
nology used. We use FnO in conjunction with RML similarly to the approach
in [9] applied to the DBpedia Extraction Framework. However, we do not phys-
ically generate RDF triples but only apply FnO transformations on-the-fly at
query time. Instead of directly referencing an entity attribute with rml:reference (e.g., line 8 of Listing 1.2), we reference an FnO function that alters the values of the attribute (line 10 of Listing 1.2). For example, in Listing 1.2 the attribute InID (line 18) is indirectly mapped to the ontology term drm:worksFor via the #FunctionMap. This implies that the attribute values are to be transformed using the function represented by the #FunctionMap, grel:string_toUppercase (line 16). The latter sets the InID attribute values to uppercase.
Squerall visits the mappings at query time and triggers specific Spark and Presto operations over the query’s intermediate results whenever a transformation declaration is met. In Spark, a map() transformation is used; in Presto, corresponding string or numerical SQL operations are used. For the uppercase example, in Spark the upper(DataFrame column) function inside a map() is used, and in Presto the SQL upper() string function is used.
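As an illustration of the Spark side, a minimal sketch of applying such a declared uppercase transformation to the join-key column of an intermediate DataFrame; the entry point is hypothetical, since Squerall resolves the actual FnO function (here grel:string_toUppercase) from the mappings at query time:

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.{col, upper}

  // Rewrite the join-key column in place before the join is attempted.
  def applyUppercase(parSet: DataFrame, keyColumn: String): DataFrame =
    parSet.withColumn(keyColumn, upper(col(keyColumn)))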
3.3 Data Wrapping and Querying
We implement the Squerall engine using two popular frameworks: Apache Spark and Presto. Spark is a general-purpose processing engine and Presto is a distributed SQL query engine for interactive querying; both base their computations primarily in memory. We leverage Spark’s and Presto’s connector concept: a wrapper able to load data from an external source into their internal data structure (ParSet), performing flattening of any non-tabular representations. Spark’s internal data structure is called DataFrame, a tabular structure programmatically queried in SQL. A DataFrame’s schema corresponds to the schema of the ParSet, with a column per star predicate. As explained with ParSets,
DataFrames are created from the relevant entities, and incrementally joined.
Other non-join operations found in the SPARQL query (e.g., selection, aggregation, ordering) are translated to equivalent SQL operations; they are applied either at the level of individual DataFrames or at the level of the final results DataFrame, whichever is more optimal. As an optimization, in order to reduce the intermediate results and, thus, the data to join with, we push the selection and transformation down to the level of individual DataFrames. We leave aggregation and ordering to the final results DataFrame, as those have a results-wide effect.
⁴ URL: http://purl.org/db/nosql#; details are out of the scope of this article.
Fig. 3: Screenshots from Squerall GUIs: (a) SPARQL UI, (b) Mapping UI.
Presto also loads data into its internal native data structure. However,
unlike Spark, it does it transparently; it is not possible to manipulate those
data structures. Rather, Presto accepts one self-contained SQL query with refer-
ences to all the relevant data sources, e.g., SELECT ... FROM cassandra.cdb.product C JOIN mongo.mdb.producer M ON C.producerID = M.ID. ParSets in this case
are views (SELECT sub-queries), which we create, join and optimize similarly
to DataFrames.
Spark and Presto make using connectors very convenient; users only provide values for a pre-defined list of options. Spark DataFrames are created as follows: spark.read.format(sourceType).options(options).load. In Presto, options are added to a simple file. Leveraging this simplicity, Squerall supports out-of-the-box five data sources: Cassandra, MongoDB, Parquet, CSV and JDBC (MySQL tested). We chose Spark and Presto as they have a good balance
between the number of connectors, ease of use and performance [33,17].
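For illustration, a minimal sketch of wrapping two relevant entities through Spark connectors and joining them; the keyspace, collection and option names mirror the SQL example above but are assumptions, and the connector format strings depend on the connector versions deployed:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("squerall-sketch").getOrCreate()

  // Wrap a Cassandra entity and a MongoDB entity as DataFrames (ParSets).
  val product = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "cdb", "table" -> "product"))
    .load()

  val producer = spark.read
    .format("com.mongodb.spark.sql.DefaultSource")
    .options(Map("uri" -> "mongodb://localhost/mdb.producer"))
    .load()

  // The cross-source join is then executed inside the POA.
  val joined = product.join(producer, product("producerID") === producer("ID"))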
3.4 User Interfaces
Squerall is provided with three user interfaces allowing to generate its three needed input files (config, mappings and query), respectively described as follows:
Connect UI shows and receives from users the required options that enable the connection to a data source, e.g., host, port, password, cluster settings, etc.
Mapping UI uses the connection options to extract the data schema (entities and attributes). It then allows the users to fill in or search existing ontology catalogues for equivalent ontology terms (cf. Figure 3a).
SPARQL UI guides non-SPARQL experts to build correct SPARQL queries by means of widgets offered for different SPARQL constructs (cf. Figure 3b).
Product Oer Review Person Producer
Generated Data (BSBM) Cassandra MongoDB Parquet CSV MySQL
# of tuples Scale 0.5M 0.5M 10M 5M 26K 10K
# of tuples Scale 1.5M 1.5M 30M 15M 77K 30K
# of tuples Scale 5M 5M 100M 50M 2.6M 100K
Table 1: Data sources and corresponding number of tuples loaded.
4 Performance Analysis
4.1 Setup
Datasets: As there is no Data Lake dedicated benchmark with SPARQL sup-
port, we opt for BSBM [3], a benchmark conceived for comparing the perfor-
mance of RDF triple stores with SPARQL-to-SQL rewriters. We use its data
generator to generate SQL dumps. We pick five tables (Product, Producer, Offer, Review, and Person), pre-process them to extract tuples and load them into five different data sources (cf. Table 1). Those were chosen to enable up to 4-chain joins across five different data sources. We generate three scales: 500k, 1.5M and 5M (number of products)⁵.
Queries: Since we only populate a subset of the generated BSBM tables, we
have to alter the initial queries accordingly. We discard joining with tables we
do not consider, e.g., Vendors, and replace them with other populated ones. All queries⁶ result in a cross-source join, from 1 join (between 2 sources, e.g., Q3) to 4 joins (between 5 sources, e.g., Q4). Queries with yet unsupported syntax, e.g., DESCRIBE,
CONSTRUCT, are omitted.
Metrics: We evaluate results accuracy and query performance. For accuracy, we
compare query results against a centralized relational database (MySQL), as it
represents data at its highest level of consistency. For performance, we measure
query execution time wrt. the three generated scales. A particular emphasis is
put on the impact of the number of joins on query time. We run each query three
times and calculate their mean value. The timeout threshold is set to 3600s.
Environment: We ran our experiments on a cluster of three machines, each a DELL PowerEdge R815 with 2x AMD Opteron 6376 (16 cores) CPUs and 256 GB RAM. In order to exclude any effect of caching, all queries are run on a
cold cache. Also, no optimization techniques from the query engines were used.
4.2 Results and Discussion
We compare the performance of Squerall’s two underlying query engines: Spark
and Presto, in the absence of similar work able to query all five data sources and the SPARQL fragment that Squerall supports.
Accuracy. The number of results returned by Squerall was, for all queries at scale 0.5M, identical to MySQL, i.e., 100% accuracy. MySQL timed out with data of scale 1.5M, starting from which we compared the performance between the two engines; the results were also identical.

⁵ The 1.5M scale factor generates 500M RDF triples, and the 5M factor 1.75B triples.
⁶ See https://github.com/EIS-Bonn/Squerall/tree/master/evaluation

(a) Scale 0.5M. (b) Scale 1.5M. (c) Scale 5M.
Fig. 4: Query execution time (seconds) per query (Q1–Q8, Q10) for Presto-based and Spark-based Squerall; panel (c) additionally reports Presto over the transformed data. The labels on top of Presto’s columns show the percentage between Presto’s and Spark’s execution times, e.g., in (a) for Q2, 178 means that Presto-based Squerall is 178% faster than Spark-based.
Performance. The results (cf. Figure 4) suggest that Squerall overall exhibits
reasonable performance throughout the various queries, i.e., different numbers of joins, with and without filtering and ordering. Presto-based Squerall exhibits significantly better performance than Spark-based, up to an order of magnitude. With the 0.5M data scale, query performance is superior across all the
queries with an increase of up to 800%. With scales 1.5M and 5M, Presto-based
is superior in all queries other than Q1, with an increase of up to 1300%. This
superiority is due to a number of factors. Presto is built following the MPP
(Massively Parallel Processing) principles specically to optimize SQL querying.
Another factor is that Presto has far less preparation overhead (e.g., mid-query
fault-tolerance, resource negotiation) than Spark. Spark, on the other hand, is a general-purpose system, basing its SQL library on its native in-memory tab-
ular data structure not originally conceived for ad hoc querying. Also, it incurs
overhead to guarantee query resiliency and manage resources.
Query performance was homogeneous across all the scales and between the
two engines, except for Q2, which was among the fastest in Presto, contrary to Spark. This query is special in that it projects most Product attributes while joining the largest entity, Offer, without any filtering, which may indicate that Presto handles intermediate results of unselective queries better.
(a) Spark-based Squerall query times. (b) Presto-based Squerall query times.
Fig. 5: Query times per query (Q1–Q8, Q10) at scales 0.5M, 1.5M and 5M. Numbers above the bars denote time percentage differences between the scales, e.g., in (a) Q2’s execution time at scale 0.5M is 32% of that at scale 1.5M, which in turn is 32% of that at 5M. On average, the percentages are 30% in all cases (both engines), which is proportional to the data scale (each scale is roughly 30% of the next: 0.5M, 1.5M, 5M).
Q3 was the
fastest, as it has the lowest number of joins, followed by Q1 and Q8, which contain more joins but significantly filter the number of products. The rest of the queries are notably slower because they join with the large entity Offer. The presence of the LIMIT clause did not have a direct effect on query time; it is present in Q1-Q5, Q8 and Q10, across which the performance varies significantly.
Although the current data distribution does not represent the best-case sce-
nario (e.g., query performance would be better if Review data was loaded into
Cassandra instead), we intentionally stored the large data into the better per-
forming data sources. Our purpose in this experiment series was to observe
Squerall behavior using Spark and Presto across the various scales and queries.
For example, we observed that, although the largest data entity was loaded into
a very ecient database, MongoDB, the queries implicating this entity were the
slowest anyway. This gives an indication of the performance of those queries if
the same entity was loaded into a less capable source, e.g., Parquet or CSV.
Increasing data size did not diminish query performance; query times were
approximately proportional to the data size (cf. Figure 5), and remained under
the threshold. In order to evaluate the effect of the query-time data transformations, we intentionally introduce variations to the data so it becomes unjoinable. In table Product, we decrease the column pr values by 71; in table Producer, we append the string “-A” to all values of column pr; and in table Review we prefix the values of column person with the character “P”. We declare the necessary transformations accordingly. The results show that there is a negligible cost in the majority of the cases. This is attributed to the fact that both Spark and Presto base their computations in memory. In Spark, those transformations involve only a map function, which is executed locally very efficiently, not requiring any data movement. Only a few queries at scale 5M in Presto-based Squerall exhibited noticeable but not significant costs, e.g., Q1 and Q10. Due to the insignificant differences and to improve readability, we only add the results of the transformation cost for the 5M scale (Figure 4c). Our results could be considered as a
performance comparison between Spark and Presto, of which few exist [17,33].
5 Availability, Sustainability, Usability and Extensibility
Squerall is integrated⁷ into SANSA [18] since version 0.5, a large framework for
distributed querying, inference and analytics over knowledge graphs. SANSA has
been used across a range of funded projects, such as BigDataEurope, SLIPO,
and BETTER. Via the integration into SANSA, Squerall becomes available to
a large user base. It benets from SANSA’s various deployment options, e.g.,
Maven Central integration and runnable notebooks. SANSA has an active devel-
oper community (20 developers), an active mailing list, issue tracking system,
website and social media channels. Prior to its integration, Squerall features were
recurrently requested in SANSA, to allow it to also access large non-RDF data.
Squerall sustainability is ensured until 2022 thanks to a number of contributing
innovation projects including Simple-ML, BETTER, SLIPO, and MLwin. Fur-
ther, Squerall is being adopted and extended in an internal use-case of a large
industrial company; we plan to report on this in a separate submission.
The development of Squerall was driven by technical challenges identied by
the H2020 BigDataEurope project⁸, whose main technical result, the Big Data
Integrator platform, retains a significant amount of interest from the open-source
community. In the absence of appropriate technical solutions supporting Data
Lake scenarios, the platform development was constrained to transform and
centralize most of the data in the use-cases. The need to invest in architectures,
tools and methodologies to allow for decentralized Big Data management was
highly emphasized by the project. Further, demonstrating the feasibility and effectiveness of OBDA on top of the ever-increasing NoSQL movement has
a positive impact on the adoption of Semantic Web principles. This indicates
some clear evidence of Squerall’s value and role in the community.
Squerall is openly available under Apache-2.0 terms; it is hosted on GitHub⁹ and registered in Zenodo¹⁰ (DOI: 10.5281/zenodo.2636436). Squerall makes use
of open standards and ontologies, including SPARQL, RML, and FnO. It can eas-
ily be built and used thanks to its detailed documentation. Its usage is facilitated
by the accompanying user interfaces, which are demonstrated in a walkthrough screencast¹¹. Further, we provide a Docker image allowing to easily set up Squerall and reproduce the presented evaluation. Squerall was built with extensibility
in mind; it can be programmatically extended¹² by (1) adding a new query engine, e.g., Drill, due to its modular code design (cf. Figure 6; a sketch of this extension point follows the figure), and (2) supporting more data sources with minimal effort by leveraging Spark/Presto connectors.
A mailing list and a Gitter community are made available for the users.
⁷ https://github.com/SANSA-Stack/SANSA-DataLake
⁸ www.big-data-europe.eu & https://github.com/big-data-europe
⁹ https://github.com/EIS-Bonn/Squerall
¹⁰ https://zenodo.org/record/2636436
¹¹ https://github.com/EIS-Bonn/Squerall/tree/master/evaluation/screencasts
¹² https://github.com/EIS-Bonn/Squerall/wiki/Extending-Squerall
Fig. 6: Squerall class call hierarchy, comprising Main, Run, Query Analyser, Planner, Mapper and Query Executor; Spark Executor, Presto Executor and further executors are subclasses of Query Executor. Engine classes are decoupled from the rest; a new engine can be added by extending only the Query Executor class (implementing its methods).
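The extension point suggested by Figure 6 can be sketched as a common query-executor abstraction that engine classes implement; the trait and method names below are illustrative and do not reproduce Squerall's exact signatures:

  trait QueryExecutor[P] {
    def load(entitySource: String, options: Map[String, String]): P   // wrap one entity
    def join(left: P, right: P, leftKey: String, rightKey: String): P // cross-source join
    def materialize(results: P): Seq[Map[String, Any]]                // final results
  }

  // A toy executor over an in-memory ParSet stand-in; a real Drill (or other
  // engine) executor would substitute its own distributed data structure for P.
  final case class LocalParSet(rows: Seq[Map[String, Any]])

  class LocalExecutor extends QueryExecutor[LocalParSet] {
    def load(entitySource: String, options: Map[String, String]): LocalParSet =
      LocalParSet(Seq.empty)   // placeholder: a real engine reads the source here
    def join(left: LocalParSet, right: LocalParSet,
             leftKey: String, rightKey: String): LocalParSet =
      LocalParSet(for {
        l <- left.rows
        r <- right.rows
        if l.get(leftKey).exists(v => r.get(rightKey).contains(v))
      } yield l ++ r)
    def materialize(results: LocalParSet): Seq[Map[String, Any]] = results.rows
  }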
6 Related Work
There are several solutions for mapping relational databases to RDF [28], and
OBDA over relational databases [34], e.g., Ontop, Morph, Ultrawrap, Mastro,
Stardog. Although we share the OBDA concepts, our focus is on heterogeneous, non-relational, and distributed scalable databases. On the non-relational side, there have been a number of efforts, which we can classify into ontology-based and non-ontology-based.
For non-ontology-based access, [6] defines a mapping language to express ac-
cess links to NoSQL databases. It proposes an intermediate query language to
transform SQL to Java methods accessing NoSQL databases. However, query
processing is neither elaborated nor evaluated, e.g., cross-database join is not
mentioned. [13] suggests that computation performance can be improved if data is shifted at query time between multiple databases; the suitable database is decided on a case-by-case basis. Although it demonstrates that the overall performance, including the planning and data movement, is higher than when using one database, this is not proven to be true with large data. In real large-scale
settings, data movement and I/O can dominate the query time. [26] allows running CRUD operations over NoSQL databases; beyond that, the same authors in [27] enable joins as follows: if the join involves entities in the same database, it is performed locally; if not, or if the database lacks join capability, data is moved to another, capable database. This implies that no intra-source distributed join
is possible, and, similarly to [13], moving data can become a bottleneck in large
scales. [1] proposes a unifying programming model to interface with different NoSQL databases. It allows direct access to individual databases using get, put
and delete primitives. Join between databases is not addressed. [16] proposes a
SQL-like language containing invocations to the native query interface of rela-
tional and NoSQL databases. The learning curve of this query language is higher
than other eorts suggesting to query solely using plain (or minimally adapted)
SQL, JSONPath or SPARQL. Although its architecture is distributed, it is not
explicitly stated whether intra-source join is also distributed. Besides, the code-
source is unfortunately not available. A number of eorts, e.g., [29,30,4], aim at
bridging the gap between relational and NoSQL databases, but only one database
is evaluated. Given the high semantic and structural heterogeneity found across
NoSQL databases, a single database cannot be representative of the whole family. Among those, [30] adopts JSON as both conceptual and physical data model. This requires physically transforming the query’s intermediate results, costing the engine a transformation price (a limitation also observed in other efforts). More-
over, the prototype is evaluated with only small data on a single machine. [22]
presents SQL++, an ambitious general query language that is based on SQL
and JSON. It covers a vast portion of the capabilities of query languages found
across NoSQL databases. However, the focus is on the query language, and the
prototype is only minimally validated using a single database: MongoDB. [31]
considers data duplicated in multiple heterogeneous sources, and identifies the
best source to send a query to. Thus, joins between sources are not explored. For
ontology-based access, Optique [14] is a reference platform with consideration
also for dynamic streaming data. Although based on the open-source Ontop,
the sources of the Big Data instance are not publicly available. Ontario¹³ is a very
similar (unpublished) work; however, we were not able to run it due to the lack
of documentation, and it appears that wrappers are manually created. [5] con-
siders simple query examples, where joins are minimally addressed. Distributed
implementation is future work.
In all the surveyed solutions, support for data source variety is limited or faces
bottlenecks. Only support for a few data sources (1-3) is observed, and wrappers
are manually created or hard-coded. In contrast, Squerall does not reinvent the
wheel and makes use of the many wrappers of existing engines. This makes it the
solution with the broadest support of the Big Data Variety dimension in terms
of data sources. Additionally, Squerall has among the richest query capabilities
(see the full fragment in¹⁴), from joining and aggregation to various query modifiers.
7 Conclusion and Future Work
In this article, we presented Squerall, a framework realizing the Semantic Data Lake, i.e., querying heterogeneous and large data sources using Semantic Web techniques. It performs distributed cross-source join operations and allows users to declare transformations that enable joinability on-the-fly at query time. Squerall
is built using state-of-the-art Big Data technologies, Spark and Presto. Relying
on the latter’s connectors to wrap the data, Squerall relieves users from hand-crafting wrappers, a major bottleneck in supporting data variety throughout the literature. It also makes Squerall easily extensible, e.g., in addition to the five sources evaluated here, Couchbase and Elasticsearch were also tested. There are dozens of connectors already available¹⁵. Furthermore, due to its modu-
lar code design, Squerall can also be programmatically extended to use other
query engines. In the future, we plan to support more SPARQL operations, e.g.,
OPTIONAL and UNION, and also to exploit the query engines’ own optimizations
to accelerate query performance. Finally, in such a heterogeneous environment,
there is a natural need for retaining provenance at data and query results levels.
¹³ https://github.com/SDM-TIB/Ontario
¹⁴ https://github.com/EIS-Bonn/Squerall/tree/master/evaluation
¹⁵ https://spark-packages.org and https://prestodb.io/docs/current/connector
Acknowledgment
This work is partly supported by the EU H2020 projects BETTER (GA 776280)
and QualiChain (GA 822404); and by the ADAPT Centre for Digital Con-
tent Technology funded under the SFI Research Centres Programme (Grant
13/RC/2106) and co-funded under the European Regional Development Fund.
References
1. Atzeni, P., Bugiotti, F., Rossi, L.: Uniform access to non-relational database sys-
tems: The SOS platform. In: Ralyté, J., Franch, X., Brinkkemper, S., Wrycza, S.
(eds.) In CAiSE. vol. 7328, pp. 160–174. Springer (2012)
2. Auer, S., Scerri, S., Versteden, A., Pauwels, E., Charalambidis, A., Konstantopou-
los, S., Lehmann, J., Jabeen, H., Ermilov, I., Sejdiu, G., et al.: The BigDataEurope
platform–supporting the variety dimension of big data. In: International Confer-
ence on Web Engineering. pp. 41–59. Springer (2017)
3. Bizer, C., Schultz, A.: The Berlin SPARQL benchmark. International Journal on
Semantic Web and Information Systems (IJSWIS) 5(2), 1–24 (2009)
4. Botoeva, E., Calvanese, D., Cogrel, B., Corman, J., Xiao, G.: A generalized frame-
work for ontology-based data access. In: International Conference of the Italian
Association for Articial Intelligence. pp. 166–180. Springer (2018)
5. Curé, O., Kerdjoudj, F., Faye, D., Le Duc, C., Lamolle, M.: On the potential inte-
gration of an ontology-based data access approach in NoSQL stores. International
Journal of Distributed Systems and Technologies (IJDST) 4(3), 17–30 (2013)
6. Curé, O., Hecht, R., Le Duc, C., Lamolle, M.: Data integration over NoSQL stores
using access path based mappings. In: International Conference on Database and
Expert Systems Applications. pp. 481–495. Springer (2011)
7. Das, S., Sundara, S., Cyganiak, R.: R2RML: RDB to RDF mapping language. W3C Working Group Recommendation, Sept. 2012
8. De Meester, B., Dimou, A., Verborgh, R., Mannens, E.: An ontology to semanti-
cally declare and describe functions. In: ISWC. pp. 46–49. Springer (2016)
9. De Meester, B., Maroy, W., Dimou, A., Verborgh, R., Mannens, E.: Declarative
data transformations for linked data generation: the case of DBpedia. In: European
Semantic Web Conference. pp. 33–48. Springer (2017)
10. Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., Van de
Walle, R.: RML: A generic language for integrated RDF mappings of heterogeneous
data. In: LDOW (2014)
11. Dixon, J.: Pentaho, Hadoop, and Data Lakes (2010), https://jamesdixon.
wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes, online; accessed
27-January-2019
12. Endris, K.M., Galkin, M., Lytra, I., Mami, M.N., Vidal, M.E., Auer, S.: MULDER:
querying the linked data web by bridging RDF molecule templates. In: Int. Conf.
on Database and Expert Systems Applications. pp. 3–18. Springer (2017)
13. Gadepally, V., Chen, P., Duggan, J., Elmore, A., Haynes, B., Kepner, J., Madden,
S., Mattson, T., Stonebraker, M.: The bigdawg polystore system and architecture.
In: High Performance Extreme Computing Conference. pp. 1–6. IEEE (2016)
14. Giese, M., Soylu, A., Vega-Gorgojo, G., Waaler, A., Haase, P., Jiménez-Ruiz, E.,
Lanti, D., Rezk, M., Xiao, G., Özçep, Ö., et al.: Optique: Zooming in on big data.
Computer 48(3), 60–67 (2015)
16 M. N. Mami et al.
15. Harris, S., Seaborne, A., Prud’hommeaux, E.: SPARQL 1.1 query language. W3C
recommendation 21(10) (2013)
16. Kolev, B., Valduriez, P., Bondiombouy, C., Jiménez-Peris, R., Pau, R., Pereira, J.:
CloudMdsQL: querying heterogeneous cloud data stores with a common language.
Distributed and Parallel Databases 34(4), 463–503 (2016)
17. Kolychev, A., Zaytsev, K.: Research of the effectiveness of SQL engines working in
HDFS. Journal of Theoretical & Applied Information Technology 95(20) (2017)
18. Lehmann, J., Sejdiu, G., Bühmann, L., Westphal, P., Stadler, C., Ermilov, I., Bin,
S., Chakraborty, N., Saleem, M., Ngomo, A.C.N.: Distributed semantic analytics
using the SANSA stack. In: ISWC. pp. 147–155. Springer (2017)
19. Mami, M.N., Graux, D., Scerri, S., Jabeen, H., Auer, S.: Querying data lakes using
spark and presto. To appear in The WebConf - Demonstrations (2019)
20. Michel, F., Faron-Zucker, C., Montagnat, J.: A mapping-based method to query
MongoDB documents with SPARQL. In: International Conference on Database
and Expert Systems Applications. pp. 52–67. Springer (2016)
21. Miloslavskaya, N., Tolstoy, A.: Application of big data, fast data, and data lake
concepts to information security issues. In: International Conference on Future
Internet of Things and Cloud Workshops. pp. 148–153. IEEE (2016)
22. Ong, K.W., Papakonstantinou, Y., Vernoux, R.: The SQL++ unifying semi-
structured query language, and an expressiveness benchmark of SQL-on-Hadoop,
NoSQL and NewSQL databases. CoRR, abs/1405.3631 (2014)
23. Poggi, A., Lembo, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Rosati, R.:
Linking data to ontologies. In: Journal on Data Semantics X. Springer (2008)
24. Quix, C., Hai, R., Vatov, I.: GEMMS: A generic and extensible metadata manage-
ment system for data lakes. In: CAiSE Forum. pp. 129–136 (2016)
25. Saleem, M., Ngomo, A.C.N.: Hibiscus: Hypergraph-based source selection for
SPARQL endpoint federation. In: Ext. Semantic Web Conf. Springer (2014)
26. Sellami, R., Bhiri, S., Defude, B.: Supporting multi data stores applications in
cloud environments. IEEE Trans. Services Computing 9(1), 59–71 (2016)
27. Sellami, R., Defude, B.: Complex queries optimization and evaluation over rela-
tional and NoSQL data stores in cloud environments. IEEE Trans. Big Data 4(2),
217–230 (2018)
28. Spanos, D., Stavrou, P., Mitrou, N.: Bringing relational databases into the semantic
web: A survey. Semantic Web pp. 1–41 (2010)
29. Unbehauen, J., Martin, M.: Executing SPARQL queries over mapped document
stores with SparqlMap-M. In: 12th Int. Conf. on Semantic Systems (2016)
30. Vathy-Fogarassy, Á., Hugyák, T.: Uniform data access platform for SQL and
NoSQL database systems. Information Systems 69, 93–105 (2017)
31. Vogt, M., Stiemer, A., Schuldt, H.: Icarus: Towards a multistore database system.
2017 IEEE International Conference on Big Data (Big Data) pp. 2490–2499 (2017)
32. Walker, C., Alrehamy, H.: Personal data lake with data gravity pull. In: 5th Inter-
national Conf. on Big Data and Cloud Computing. pp. 160–167. IEEE (2015)
33. Wiewiórka, M.S., Wysakowicz, D.P., Okoniewski, M.J., Gambin, T.: Benchmark-
ing distributed data warehouse solutions for storing genomic variant information.
Database 2017 (2017)
34. Xiao, G., Calvanese, D., Kontchakov, R., Lembo, D., Poggi, A., Rosati, R., Za-
kharyaschev, M.: Ontology-based data access: A survey. IJCAI (2018)
35. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: Cluster
computing with working sets. HotCloud 10(10-10), 95 (2010)
... Nevertheless, semantic annotation procedure is described only at a high abstraction level. Authors in [19] model data sources in a Data Lake according to attributes and entities, the latter meant for grouping similar attributes and used as a starting point for easing querying data sources. The work presented in [20] models each data source in the Data Lake as a network-based structure, and data sources attributes are semantically annotated using DBpedia to build thematic views on the Data Lake sources. ...
... Then, semantic annotation through domain ontologies ensures a formal representation of the meaning associated with data source attributes. Instead of relying on BabelNet as done in [20,21], which is more suitable for Natural Language Processing applications and, generally, for multilingual word-sense disambiguation, we resort to a set of publicly available domain ontologies as suggested by [19], but introducing lexical enrichment of attributes before annotation. Regarding the support offered to domain experts throughout the annotation process, the work in [22] does not provide any tool and the focus of the approach is not on creating a semantic metadata catalog for data sources attributes, but on collecting general-purpose metadata to support the treatment and management of multiple data sources, belonging to different types. ...
Article
Full-text available
The increasing availability of Big Data is changing the way data exploration for Business Intelligence is performed, due to the volume, velocity and uncontrolled variety of data on which exploration relies. In particular, data exploration is required in Data Lakes that have been proposed to host heterogeneous data sources, given their flexibility to cope with cumbersome properties of Big Data. However, as data grows, new methods and techniques are required for extracting value and knowledge from data stored within Data Lakes, aggregating data into indicators according to multiple analysis dimensions, to enable a large number of users with different roles and competencies to capitalise on available information. In this paper, we propose PERSEUS (PERSonalised Exploration by User Support), a computer-aided approach for data exploration on top of a Data Lake, structured over three phases: (1) the construction of a semantic metadata catalog on top of the Data Lake, leveraging tools and metrics to ease the annotation of the Data Lake metadata; (2) modelling of indicators and analysis dimensions, guided by an openly available Multi-Dimensional Ontology to enable conformance checking of indicators and let users explore Data Lake contents; (3) enrichment of the definition of indicators with personalisation aspects, based on users’ profiles and preferences, to make easier and more usable the exploration of data for a large number of users. Results of an experimental evaluation in the Smart City domain are presented with the aim of demonstrating the feasibility of the approach.
... Squerall [58], similar to our approach, is source independent and also relies on intermediate data structures that are populated on the fly, namely ParSets. However, there are significant differences between both solutions. ...
Article
Full-text available
Virtual knowledge graphs (VKGs) have been widely applied to access relational data with a semantic layer by using an ontology in use cases that are dynamic in nature. However, current VKG techniques focus mainly on accessing a single relational database and remain largely unstudied for data integration with several heterogeneous data sources. To overcome this limitation, we propose intermediate triple table (ITT), a general VKG architecture to access multiple and diverse data sources. Our proposal is based on data shipping and addresses heterogeneity by adopting a schema-oblivious graph representation that intervenes between the sources and the queries. We minimize data computation by just materializing a relevant subgraph for a specific query. We employ star-shaped query processing and extend this technique to mapping candidate selection. For rapid materialization of the ITT, we apply a mapping partitioning technique to parallelize mapping execution, which also guarantees duplicate-free subgraphs and reduces memory consumption. We use SPARQL-to-SQL query translation to homogeneously evaluate queries over the ITT and execute them with an in-process analytical store. We implemented ITT on top of a knowledge graph materialization engine and evaluated it with two VKG benchmarks. The experimental results show that our proposal outperforms state-of-the-art techniques for complex graph queries in terms of execution time. It also decreases the number of timeouts although it uses more memory as a trade-off. The experiments also demonstrate the source independence of the architecture on a mixed distribution of data with SQL and document stores together with various file formats.
... The query is translated into Spark-SQL and then Spark reads the relevant data and performs necessary transformations to return the query results in a Dataframe. To facilitate this type of semantics-based on-demand data access, we have incorporated Squerall [22], a scalable engine that enables querying multiple databases simultaneously. Squerall employs wrappers to directly query heterogeneous data sources in their original formats. ...
Conference Paper
Full-text available
Data lakes have emerged as a solution for managing vast and di-verse datasets for modern data analytics. To prevent them frombecoming ungoverned, semantic data management techniques arecrucial, which involve connecting metadata with knowledge graphs,following the principles of Linked Data. This semantic layer en-ables more expressive data management, integration from varioussources and enhances data access utilizing the concepts and re-lations to semantically enrich the data. Some frameworks havebeen proposed, but requirements like data versioning, linking ofdatasets, managing machine learning projects, automated seman-tic modeling and ontology-based data access are not supportedin one uniform system. We demonstrate SEDAR, a comprehensivesemantic data lake that includes support for data ingestion, storage,processing, and governance with a special focus on semantic datamanagement. The demo will showcase how the system allows forvarious ingestion scenarios, metadata enrichment, data source link-ing, profiling, semantic modeling, data integration and processinginside a machine learning life cycle.
... 16 However, data processing usually shows performance deficiencies when the dataset exceeds the memory size of a single machine, and distributed computing frameworks can be employed to address this limitation. 17,18 Developing an efficient system capable of processing large amounts of data and performing interoperable GBAs could significantly simplify the design process. In addition, utilizing this system as an external service could reduce costs and efforts, enabling the use of GBAs as a Service (GBAaaS). ...
Article
Full-text available
During the last few years, there has been increasing attention paid to serious games (SGs), which are games used for non‐entertainment purposes. SGs offer the potential for more valid and reliable assessments compared to traditional methods such as paper‐and‐pencil tests. However, the incorporation of assessment features into SGs is still in its early stages, requiring specific design efforts for each game and adding significant time to the design of Game‐based Assessments (GBAs). In this research, we present a completely novel framework that aims to perform interoperable GBAs by: (a) integrating a common GBA ontology model to process RDF data; (b) developing in‐game metrics to infer useful information and assess learners; (c) integrating a service API to provide an easy way to interact with the framework. We then validate our approach through performance evaluation and two use cases, demonstrating its effectiveness in real‐world scenarios with large‐scale datasets. Our results show that the developed framework achieves excellent performance, replicating metrics from previous literature. We anticipate that our work will help alleviate current limitations in the field and facilitate the deployment of GBAs as a Service.
Preprint
Ever since the vision was formulated, the Semantic Web has inspired many generations of innovations. Semantic technologies have been used to share vast amounts of information on the Web, enhance them with semantics to give them meaning, and enable inference and reasoning on them. Throughout the years, semantic technologies, and in particular knowledge graphs, have been used in search engines, data integration, enterprise settings, and machine learning. In this paper, we recap the classical concepts and foundations of the Semantic Web as well as modern and recent concepts and applications, building upon these foundations. The classical topics we cover include knowledge representation, creating and validating knowledge on the Web, reasoning and linking, and distributed querying. We enhance this classical view of the so-called ``Semantic Web Layer Cake'' with an update of recent concepts that include provenance, security and trust, as well as a discussion of practical impacts from industry-led contributions. We conclude with an outlook on the future directions of the Semantic Web.
Article
With Knowledge Graphs (KGs) at the center of numerous applications such as recommender systems and question-answering, the need for generalized pipelines to construct and continuously update such KGs is increasing. While the individual steps that are necessary to create KGs from unstructured sources (e.g., text) and structured data sources (e.g., databases) are mostly well researched for their one-shot execution, their adoption for incremental KG updates and the interplay of the individual steps have hardly been investigated in a systematic manner so far. In this work, we first discuss the main graph models for KGs and introduce the major requirements for future KG construction pipelines. Next, we provide an overview of the necessary steps to build high-quality KGs, including cross-cutting topics such as metadata management, ontology development, and quality assurance. We then evaluate the state of the art of KG construction with respect to the introduced requirements for specific popular KGs, as well as some recent tools and strategies for KG construction. Finally, we identify areas in need of further research and improvement.
Article
The significant increase in data volume in recent years has prompted the adoption of knowledge graphs as valuable data structures for integrating diverse data and metadata. However, this surge in data availability has brought to light challenges related to standardization, interoperability, and data quality. Knowledge graph creation faces complexities from large data volumes, data heterogeneity, and high duplicate rates. This work addresses these challenges and proposes data management techniques to scale up the creation of knowledge graphs specified using the RDF Mapping Language (RML). These techniques are integrated into SDM-RDFizer, transforming it into a two-fold solution designed to address the complexities of generating knowledge graphs. Firstly, we introduce a reordering approach for RML triples maps, prioritizing the evaluation of the most selective maps first to reduce memory usage. Secondly, we employ an RDF compression strategy, along with optimized data structures and novel operators, to prevent the generation of duplicate RDF triples and optimize the execution of RML operators. We assess the performance of SDM-RDFizer through established benchmarks. The evaluation showcases the effectiveness of SDM-RDFizer compared to state-of-the-art RML engines, emphasizing the benefits of our techniques. Furthermore, the paper presents real-world projects where SDM-RDFizer has been utilized, providing insights into the advantages of declaratively defining knowledge graphs and efficiently executing these specifications using this engine.
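The reordering idea can be illustrated with a small sketch (in Scala, using hypothetical names and a simplistic selectivity estimate based on distinct subjects per source row; SDM-RDFizer itself is implemented in Python and its actual estimation may differ): triples maps are sorted so that the most selective ones are evaluated first, keeping intermediate duplicate-elimination structures small.

// Hypothetical, simplified representation of an RML triples map and its source statistics.
case class TriplesMap(name: String, logicalSource: String, estimatedRows: Long, distinctSubjects: Long)

// Assumed selectivity estimate: fewer distinct subjects per source row means the map
// produces a smaller, more selective set of triples and should be evaluated earlier.
def selectivity(tm: TriplesMap): Double =
  if (tm.estimatedRows == 0) 0.0 else tm.distinctSubjects.toDouble / tm.estimatedRows

// Evaluate the most selective triples maps first to reduce memory usage.
def reorder(maps: Seq[TriplesMap]): Seq[TriplesMap] =
  maps.sortBy(selectivity)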
Article
Modern organizations are wrestling with strenuous challenges relating to the management of heterogeneous big data, which combines data from various sources and varies in type, format, and content. This heterogeneity makes the data difficult to analyze and integrate. This paper presents big data warehousing and federation as viable approaches for handling big data complexity and discusses their respective advantages and disadvantages as strategies for integrating, managing, and analyzing heterogeneous big data. Data integration is crucial for organizations to exploit their data, and they have to weigh the benefits and drawbacks of both integration approaches to identify the one that best responds to their needs and objectives. The paper also analyzes these two data integration approaches in detail and identifies challenges associated with selecting either of them. A thorough understanding of the merits and demerits of these two approaches is crucial for practitioners, researchers, and decision-makers seeking to select the approach that enables them to handle complex data, improve their decision-making, and best align with their needs and expectations.
Article
In a GraphQL Web API, a so-called GraphQL schema defines the types of data objects that can be queried, and so-called resolver functions are responsible for fetching the relevant data from underlying data sources. Thus, we can expect to use GraphQL not only for data access but also for data integration, provided that the GraphQL schema reflects the semantics of data from multiple data sources and the resolver functions can obtain data from these sources and structure it according to the schema. However, no semantics-aware approach exists for employing GraphQL for data integration, and there are no formal methods for defining a GraphQL API based on an ontology. In this work, we introduce a framework for using GraphQL in which a global domain ontology informs the generation of a GraphQL server that answers requests by querying heterogeneous data sources. The core of this framework consists of an algorithm to generate a GraphQL schema based on an ontology and a generic resolver function based on semantic mappings. We provide a prototype, OBG-gen, of this framework, and we evaluate our approach over a real-world data integration scenario in the materials design domain and two synthetic benchmark scenarios (Linköping GraphQL Benchmark and GTFS-Madrid-Bench). The experimental results indicate that: (i) our approach can feasibly generate GraphQL servers for data access and integration over heterogeneous data sources, thus avoiding the manual construction of GraphQL servers, and (ii) our data access and integration approach is general and applicable to different domains where data is shared or queried in different ways.
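The core step of deriving GraphQL types from ontology terms can be sketched as follows (a minimal Scala illustration with hypothetical names, not OBG-gen's actual algorithm): each ontology class becomes a GraphQL object type, and each datatype property becomes a field whose scalar type is derived from the property's XSD range.

// Hypothetical, simplified view of an ontology class and its datatype properties (property -> XSD range).
case class OntClass(name: String, dataProperties: Map[String, String])

// Map common XSD datatypes to GraphQL scalars; default to String.
def toGraphQLScalar(xsd: String): String = xsd match {
  case "xsd:integer" | "xsd:int"                  => "Int"
  case "xsd:decimal" | "xsd:double" | "xsd:float" => "Float"
  case "xsd:boolean"                              => "Boolean"
  case _                                          => "String"
}

// Emit a GraphQL object type definition for one ontology class.
def toGraphQLType(c: OntClass): String = {
  val fields = c.dataProperties
    .map { case (property, xsdType) => s"  $property: ${toGraphQLScalar(xsdType)}" }
    .mkString("\n")
  s"type ${c.name} {\n$fields\n}"
}

For example, toGraphQLType(OntClass("Product", Map("label" -> "xsd:string", "price" -> "xsd:decimal"))) yields a "type Product" definition with a String field and a Float field; object properties and the generic resolver would be handled analogously via the semantic mappings.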
Conference Paper
Squerall is a tool that allows the querying of heterogeneous, large-scale data sources by leveraging state-of-the-art Big Data processing engines: Spark and Presto. Queries are posed on-demand against a Data Lake, i.e., directly on the original data sources without requiring prior data transformation. We showcase Squerall's ability to query five different data sources, including inter alia the popular Cassandra and MongoDB. In particular, we demonstrate how it can jointly query heterogeneous data sources, and how interested developers can easily extend it to support additional data sources. Graphical user interfaces (GUIs) are offered to support users in (1) building intra-source queries, and (2) creating required input files.
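The kind of on-the-fly, distributed joining described here can be illustrated with plain Spark in Scala. The snippet below is only a sketch of the general approach, not Squerall's internal code; the connector format strings, option keys, keyspace/database/collection names, and column names are assumptions that depend on the deployed connector versions and data.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("heterogeneous-join").getOrCreate()

// Load a Cassandra table as a DataFrame (Spark-Cassandra connector; hypothetical keyspace/table).
val products = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "shop", "table" -> "products"))
  .load()

// Load a MongoDB collection as a DataFrame (MongoDB Spark connector; the format string and
// option keys differ between connector versions, the ones below follow the 10.x naming).
val offers = spark.read
  .format("mongodb")
  .option("spark.mongodb.read.connection.uri", "mongodb://localhost:27017")
  .option("spark.mongodb.read.database", "shop")
  .option("spark.mongodb.read.collection", "offers")
  .load()

// Join the two sources on demand, in a distributed manner, without materializing them beforehand.
products.createOrReplaceTempView("products")
offers.createOrReplaceTempView("offers")
val joined = spark.sql(
  "SELECT p.id, p.label, o.price FROM products p JOIN offers o ON p.id = o.productId")
joined.show()

In Squerall itself, the SPARQL query together with the mapping and configuration files determines which sources are loaded and how their attributes are joined; the DataFrame-level join above only illustrates the execution model.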
Conference Paper
A major research challenge is to perform scalable analysis of large-scale knowledge graphs to facilitate applications like link prediction, knowledge base completion and reasoning. Analytics methods which exploit expressive structures usually do not scale well to very large knowledge bases, and most analytics approaches which do scale horizontally (i.e., can be executed in a distributed environment) work on simple feature-vector-based input. This software framework paper describes the ongoing Semantic Analytics Stack (SANSA) project, which supports expressive and scalable semantic analytics by providing functionality for distributed computing on RDF data.
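As a minimal illustration of horizontally scalable processing over RDF (a plain Spark sketch in Scala, not SANSA's actual API; the input path is hypothetical and the N-Triples parsing is deliberately naive), one can load a large RDF file into a distributed collection of triples and compute simple statistics over it:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdf-stats").getOrCreate()

// Naive N-Triples parsing: split each non-empty, non-comment line into subject, predicate, object.
// (A real RDF parser handles literals, escapes, and blank nodes more carefully.)
val triples = spark.sparkContext
  .textFile("hdfs:///data/dataset.nt")
  .filter(line => line.trim.nonEmpty && !line.startsWith("#"))
  .map { line =>
    val parts = line.trim.stripSuffix(".").trim.split("\\s+", 3)
    (parts(0), parts(1), parts(2))
  }

// Distributed predicate-usage count, a typical first step of RDF analytics.
val predicateCounts = triples
  .map { case (_, p, _) => (p, 1L) }
  .reduceByKey(_ + _)

predicateCounts.take(10).foreach(println)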
Chapter
The database (DB) landscape has been significantly diversified during the last decade, resulting in the emergence of a variety of non-relational (also called NoSQL) DBs, e.g., XML- and JSON-document DBs, key-value stores, and graph DBs. To enable access to such data, we generalize the well-known ontology-based data access (OBDA) framework so as to allow for querying arbitrary data sources using SPARQL. We propose an architecture for a generalized OBDA system implementing the virtual approach. Then, to investigate the feasibility of OBDA over non-relational DBs, we compare an implementation of an OBDA system over MongoDB, a popular JSON-document DB, with a triple store.
Conference Paper
We present the framework of ontology-based data access, a semantic paradigm for providing a convenient and user-friendly access to data repositories, which has been actively developed and studied in the past decade. Focusing on relational data sources, we discuss the main ingredients of ontology-based data access, key theoretical results, techniques, applications and future challenges.
Conference Paper
Recent years have seen a vast diversification of the database market. In contrast to the "one-size-fits-all" paradigm according to which systems were designed in the past, today's database management systems (DBMSs) are tuned for particular workloads. This has led to DBMSs optimized for high-performance, high-throughput read/write workloads in online transaction processing (OLTP) and systems optimized for complex analytical queries (OLAP). However, this approach reaches a limit when systems have to deal with mixed workloads that are neither pure OLAP nor pure OLTP. In such cases, polystores are increasingly gaining popularity. Rather than supporting a single database paradigm and addressing one particular workload, polystores encompass several DBMSs that store data in different schemas and allow requests to be routed, on a per-query level, to the most appropriate system. In this paper, we introduce the polystore Icarus. In our evaluation, based on a workload that combines OLTP and OLAP elements, we show that Icarus is able to speed up queries by up to a factor of 3 by properly routing them to the best underlying DBMS.
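The per-query routing idea can be sketched as follows (a minimal Scala illustration under an assumed keyword heuristic, not Icarus's actual routing logic): each incoming query is classified and dispatched to the engine expected to handle it best.

// Hypothetical engine identifiers in a polystore deployment.
sealed trait Engine
case object RowStoreOLTP extends Engine    // e.g., a row store tuned for transactional workloads
case object ColumnStoreOLAP extends Engine // e.g., a column store tuned for analytical workloads

// Very rough workload heuristic: aggregation-heavy queries go to the OLAP engine,
// point lookups and updates go to the OLTP engine. A real polystore would rely on
// cost models, statistics, and data placement information instead.
def route(sql: String): Engine = {
  val q = sql.toUpperCase
  val analytical = Seq("GROUP BY", "SUM(", "AVG(", "COUNT(").exists(k => q.contains(k))
  if (analytical) ColumnStoreOLAP else RowStoreOLTP
}

// route("SELECT region, SUM(amount) FROM sales GROUP BY region")  -> ColumnStoreOLAP
// route("UPDATE accounts SET balance = 0 WHERE id = 42")          -> RowStoreOLTP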
Article
The rapid growth of data at the beginning of the 21st century gave impetus to the development of big data technologies, with the distributed platform Hadoop becoming one of their key elements. Initially, it was difficult to use Hadoop for processing tabular data, on which many modern industrial information systems are built. Therefore, a variety of SQL tools for Hadoop began to appear, which gave rise to the problem of choosing a specific solution. The aim of this work is to identify the most efficient SQL tools for tabular data processing in a distributed Hadoop system. For this purpose, a comparative analysis of the six most popular tools has been carried out: Apache Hive, Cloudera Impala, Spark SQL, Presto, Apache Drill, and Apache HAWQ. The tools were compared with respect to the completeness of the functions they support, their performance, and their level of SQL standard support. Summarizing the results across all of these comparison dimensions, Presto proved to be the most effective tool.