How to feed the Squerall with RDF and other data nuts?
Mohamed Nadjib Mami1,2, Damien Graux2,3, Simon Scerri1,2, Hajira Jabeen1,
Sören Auer4, and Jens Lehmann1,2
1Smart Data Analytics (SDA) Group, Bonn University, Germany
2Enterprise Information Systems, Fraunhofer IAIS, Germany
3ADAPT Centre, Trinity College of Dublin, Ireland
4TIB & L3S Research Center, Hannover University, Germany
{mami,scerri,jabeen,jens.lehmann}@cs.uni-bonn.de
damien.graux@iais.fraunhofer.de,auer@l3s.de
Abstract. Advances in Data Management methods have resulted in a wide array of storage solutions with varying query capabilities, supporting different data formats. Traditionally, heterogeneous data was transformed offline into a unique format and migrated to a unique data management system before being uniformly queried. However, with the increasing number of heterogeneous data sources, many of which are dynamic, modern applications prefer to directly access the original, fresh data. Addressing this requirement, we designed and developed Squerall, a software framework that enables querying original large and heterogeneous data on-the-fly without prior data transformation. Squerall is built from the ground up with extensibility in mind, e.g., to support more data sources. Here, we explain Squerall's extensibility aspect and demonstrate step-by-step how to add support for RDF data, a new extension to the previously supported range of data sources.
1 Introduction
Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
The term Data Lake [1] denotes a repository of schema-less data stored in its original form and format without prior transformations. We have built Squerall [2], a software framework implementing the so-called Semantic Data Lake concept, which enables querying Data Lakes in a uniform manner using Semantic Web techniques. In essence, the Semantic Data Lake incorporates a 'virtual' schema over the schema-less data repository by mapping data schemata into high-level ontologies, which can then be queried in a uniform manner using SPARQL.
The value of a Data Lake-accessing system lies in its ability to query as much data as possible. To this end, Squerall was built from the ground up with extensibility in mind, so as to allow and facilitate supporting more data sources. As we recognize the burden of creating a wrapper for every needed data source, we resort to leveraging the wrappers that data source providers themselves offer for many state-of-the-art processing engines. For example, Squerall uses Apache Spark and Presto as underlying query engines, both of which benefit from a wide range of connectors accessing the most popular data sources.
In this demonstration, we complement the published5 work about Squerall [3] by (1) providing more details on the data source extensibility aspect, and (2) demonstrating extensibility by supporting a new data source: RDF.
2 Squerall and its Extensibility
2.1 Squerall: a Semantic Data Lake
Squerall is an implementation of the Semantic Data Lake concept, i.e., querying original large and heterogeneous data using established Semantic Web techniques and technologies. It is built following the Ontology-Based Data Access principles [5], where elements from the data schema (entities/attributes) are associated with elements from an ontology (classes/properties) by means of a mapping language, forming a virtual schema against which SPARQL queries can be posed.
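For illustration, a SPARQL query posed against such a virtual schema could look as follows; the ex: ontology terms are hypothetical, and at query time the mappings relate them to concrete schema elements, e.g., a table and its columns:

PREFIX ex: <http://example.com/ontology#>
SELECT ?name ?price
WHERE {
  ?product a ex:Product ;
           ex:name  ?name ;
           ex:price ?price .
}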
2.2 Squerall Extensibility
As we recognize the burden of creating wrappers for the variety of data sources, we chose not to reinvent the wheel and instead rely on the wrappers often offered by the developers of the data sources themselves or by specialized experts. How a connector is used depends on the query engine:
Spark: the connector's role is to load a specific data entity into a DataFrame using the Spark SQL API. Its usage is simple: it only requires providing access values for a predefined list of options inside a simple connection template:
spark.read.format(sourceType).options(options).load
where sourceType designates the data source type to access, and options is a simple key-value list storing, e.g., username, password, host, cluster settings, etc. The template is similar across most data source types. There are dozens of connectors6 already available for a multitude of data sources; a concrete sketch follows this list.
Presto: access options are stored in a plain-text file in a key-value fashion. Presto directly exposes an SQL interface to query heterogeneous data, e.g., SELECT * FROM cassandra.cdb.product C JOIN mongo.mdb.producer M ON C.producerID = M.ID; there is no direct interaction with the connectors. Presto internally and transparently uses the access options to load the necessary data at query time. Similarly, there are already several ready-to-use connectors for Presto7.
Hence, while Squerall supports MongoDB, Cassandra, Parquet, CSV and various JDBC sources by default, interested users can easily provide access to other data sources by leveraging Spark and Presto connectors8.
5 At the ISWC Resources track.
6 https://spark-packages.org/
7 https://prestosql.io/docs/current/connector.html
8 Tutorial: https://github.com/EIS-Bonn/Squerall/wiki/Extending-Squerall
3 Supporting a New Data Source: Case of RDF Data
In this section, we show the principles of supporting a new data source for which no ready-made connector is available. The procedure concerns Spark as the query engine, where the connector's role is to generate a DataFrame from an underlying data entity. Squerall did not previously have a wrapper for RDF data. With the wealth of RDF data available today as part of the Linked Data and Knowledge Graph movements, supporting RDF data is paramount. Contrary to the previously supported data sources, RDF does not require a schema, neither fixed nor flexible. As a result, much RDF data is generated without a schema. In this case, the schema has to be exhaustively extracted from the data on-the-fly during query execution. Also, as per the Data Lake requirements, no pre-processing may be applied, and the original data must be accessed directly. If an entity inside an RDF dataset is detected as relevant to (part of) a query, a set of transformations is applied to flatten the (subject, property, object) triples and extract the schema elements needed to generate the DataFrame(s). The full procedure is shown in Figure 1 and described as follows (a minimal Spark sketch follows the list):
1. First, the triples are loaded into a Spark distributed dataset9 of the schema (subject: String, property: String, object: String).
2. Using Spark transformations, we generate a new dataset. We map (s, p, o) triples to pairs (s, (p, o)), then group the pairs by subject: (s, (p, o)+), then find the class from p (where p = rdf:type) and map the pairs to new pairs (class, (s, (p, o)+)), then group them by class: (class, (s, (p, o)+)+). Each class has one or more instances, identified by s, each containing one or more (p, o) pairs.
3. The new dataset is partitioned into a set of class-based DataFrames, whose columns are the properties and whose tuples are the objects. This corresponds to the so-called property table partitioning [6].
4. The XSD data types, if present as part of the object, are detected and used to type the DataFrame attributes; otherwise, string is used.
5. Only the relevant entity/ies (matching their attributes against the query properties), detected using the mappings, is/are retained; the rest are discarded.
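The following Spark (Scala) sketch illustrates steps 1-3. It is a simplified illustration rather than Squerall's actual implementation: it assumes naive N-Triples parsing and single-valued properties, materializes per-class data on the driver, and omits the typing and relevance filtering of steps 4-5:

  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  val spark = SparkSession.builder().appName("rdf-sketch").getOrCreate()
  val rdfType = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

  // 1. Load triples as (subject, property, object); a real connector would
  // use a proper N-Triples parser instead of splitting on spaces.
  val triples = spark.sparkContext.textFile("input.nt")
    .map(_.split(" ", 3))
    .map(t => (t(0), t(1), t(2).stripSuffix(" .")))

  // 2. (s,p,o) -> (s,(p,o)) -> (s,(p,o)+) -> (class,(s,(p,o)+))
  //            -> (class,(s,(p,o)+)+)
  val byClass = triples
    .map { case (s, p, o) => (s, (p, o)) }
    .groupByKey()
    .flatMap { case (s, pos) =>
      pos.find(_._1 == rdfType).map { case (_, cls) => (cls, (s, pos)) }
    }
    .groupByKey()

  // 3. One property-table DataFrame per class: columns are the class's
  // properties (full URIs here; a real connector would rename them via
  // the mappings), one row per instance.
  val dataFrames = byClass.collect().map { case (cls, instances) =>
    val props  = instances.flatMap(_._2.map(_._1)).filter(_ != rdfType).toSeq.distinct
    val schema = StructType(StructField("ID", StringType) +: props.map(StructField(_, StringType)))
    val rows   = instances.toSeq.map { case (s, pos) =>
      val byProp = pos.toMap // assumes single-valued properties
      Row.fromSeq(s +: props.map(p => byProp.getOrElse(p, null)))
    }
    cls -> spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
  }.toMap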
This procedure generates a (typed) DataFrame that can be joined with DataFrames generated by other data connectors from other data sources. The procedure is part of our previously published effort, SeBiDa [4]. We made the usage of the new RDF connector as simple as that of the other Spark connectors:
val rdf = new NTtoDF()
val df = rdf.options(options).read(filePath, sparkURI).toDF
where NTtoDF is the connector instance, and options holds the access information, including the RDF file path and the specific RDF class to load into the DataFrame.
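For instance, the connector could be invoked as follows; the option key, file path and Spark master URI are hypothetical placeholders (the actual parameter names are documented in the Squerall wiki, footnote 8):

  // Hypothetical option key, for illustration only.
  val options = Map("class" -> "http://example.com/Product")
  val df = new NTtoDF().options(options)
    .read("hdfs://host/data/products.nt", "local[*]").toDF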
9 Called RDD: Resilient Distributed Dataset, a distributed tabular data structure.
[Figure 1: the RDF Connector loads RDF triples into a triples dataset, flattens them into per-class DataFrames (a relevant Type A DataFrame is retained, an irrelevant Type B DataFrame is discarded), and the result is joined with DataFrames obtained from other data sources (DS1, DS2) via their connectors.]
Fig. 1. RDF Connector. A and B are RDF classes; tn denote data types.
4 Conclusion
In this demonstration article10, we have described in more depth the extensibility aspect of Squerall in supporting more data sources. We have demonstrated the extensibility principles by adding support for RDF data. In the common absence of a schema, RDF triples have to be exhaustively parsed and reformatted into a tabular representation at query time; only then can they be queried. In the future, in order to alleviate the reformatting cost and thus accelerate query processing, we intend to implement a lightweight caching technique that saves the results of the flattening phase across queries. Beyond the Squerall context, we will investigate making the newly created connector (currently supporting the N-Triples syntax) available in the Spark Packages (connectors) hub, so that the public can process large RDF data using Apache Spark.
References
1. Dixon, J.: Pentaho, Hadoop, and Data Lakes (2010), https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes, online; accessed 27-January-2019
2. Mami, M.N., Graux, D., Scerri, S., Jabeen, H., Auer, S.: Querying data lakes using Spark and Presto. In: The World Wide Web Conference. pp. 3574–3578. WWW '19, ACM, New York, NY, USA (2019)
3. Mami, M.N., Graux, D., Scerri, S., Jabeen, H., Auer, S., Lehmann, J.: Squerall: Virtual ontology-based access to heterogeneous and large data sources. In: Proceedings of the 18th International Semantic Web Conference (2019)
4. Mami, M.N., Scerri, S., Auer, S., Vidal, M.E.: Towards semantification of big data technology. In: International Conference on Big Data Analytics and Knowledge Discovery. pp. 376–390. Springer (2016)
5. Poggi, A., Lembo, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Rosati, R.: Linking data to ontologies. In: Journal on Data Semantics X. Springer (2008)
6. Wilkinson, K., Sayers, C., Kuno, H., Reynolds, D.: Efficient RDF storage and retrieval in Jena2. In: Proceedings of the First International Conference on Semantic Web and Databases. pp. 120–139. Citeseer (2003)
10 Screencasts are publicly available from: https://git.io/fjyOO