Content uploaded by Mohamed Nadjib Mami
Author content
All content in this area was uploaded by Mohamed Nadjib Mami on Jan 23, 2018
Content may be subject to copyright.
Towards Semantification of Big Data Technology
Mohamed Nadjib Mami, Simon Scerri, Sören Auer, Maria-Esther Vidal
University of Bonn & Fraunhofer IAIS, Bonn & Sankt Augustin, Germany
{lastname}@cs.uni-bonn.de
Abstract. Much attention has been devoted to supporting the volume and velocity dimensions of Big Data. As a result, a plethora of technology components supporting various data structures (e.g., key-value, graph, relational), modalities (e.g., stream, log, real-time) and computing paradigms (e.g., in-memory, cluster/cloud) are now available. However, systematic support for managing the variety of data, the third dimension in the classical Big Data definition, is still missing. In this article, we present SeBiDA, an approach for managing hybrid Big Data. SeBiDA supports the Semantification of Big Data using the RDF data model, i.e., non-semantic Big Data is semantically enriched by using RDF vocabularies. We empirically evaluate the performance of SeBiDA for two dimensions of Big Data, i.e., volume and variety; the Berlin Benchmark is used in the study. The results suggest that even in large datasets, query processing time is not affected by data variety.
1 Introduction
Before ’Big Data’ became a phenomenon and a tremendous marketplace backed by ever-increasing research efforts, Gartner suggested a novel data management model termed 3-D Data Management [3]. At the basis of this model are three major challenges to the potential of data management systems: increasing volume, accelerated data flows, and diversified data types. These three dimensions have since become known as Big Data’s ‘three V’s’: volume, velocity, and variety. Other dimensions, such as veracity and value, have since been added to the V-family to cover a broader range of emerging challenges.
In the last few years, a number of efforts have sought to design generic reusable architectures that tackle the aforementioned Big Data challenges. However, they all miss explicit support for semantic data integration, querying, and exposure. The Big Data Semantification introduced in this paper is an umbrella term that covers these operations at a large scale. It also includes the ability to semantically enrich non-semantic input data using Resource Description Framework (RDF) ontologies. In this paper, we propose a realization of this vision by presenting a blueprint of a generic semantified Big Data architecture, all while preserving semantic information, so that semantic data can be exported in its natural form, i.e., RDF. Although the effective management of volume and velocity has gained considerable attention from both academia and industry, the variety dimension has not been adequately tackled, even though it has been reported as a top challenge by many industrial players and stakeholders1.
In this paper we target the lack of variety support by suggesting a unified data model, storage, and querying interface for both semantic and non-semantic data. SeBiDA provides particular support for Big Data semantification by enabling the semantic lifting of non-semantic datasets. Experimental results show that (1) SeBiDA is not impacted by the variety dimension, even in the presence of an increasingly large volume of data, and (2) it outperforms a state-of-the-art centralized triple store in several aspects.
Our contributions can be summarized as follows:
– The definition of a blueprint for a semantified Big Data architecture that enables the ingestion, querying and exposure of heterogeneous data with varying levels of semantics (hybrid data), while ensuring the preservation of semantics (section 3).
– SeBiDA: A proof-of-concept implementation of the architecture using Big Data components such as Apache Spark, Parquet, and MongoDB (section 4).
– An evaluation of the benefits of using Big Data technology for the storage and processing of hybrid data (section 5).
The rest of this paper is structured as follows: Section 2 presents a motivating example and the requirements of a Semantified Big Data Architecture. Section 3 presents a blueprint for a generic Semantified Big Data Architecture. The SeBiDA implementation is described in Section 4, while the experimental study is reported in Section 5. Section 6 summarizes related approaches. In Section 7, we conclude and present an outlook on our future work.
2 Motivating Example and Requirements
Suppose there are three datasets (Figure 1) that are large in size (i.e., volume) and different in type (i.e., variety): (1) Mobility: an RDF graph containing transport information about buses; (2) Regions: JSON-encoded data about one country’s regions, semantically described using ontology terms in JSON-LD format; and (3) Stops: structured (GTFS-compliant2) data describing stops in CSV format. The problem to be solved is to provide a unified data model to store and query these datasets, independently of their dissimilar types.
A Semantified Big Data Architecture (SBDA) allows for efficiently ingesting and processing such heterogeneous data at large scale. Previous work has focused on achieving efficient Big Data loading and querying for RDF data or other structured data separately. The support for variety that we claim in this paper is achieved through providing (1) a unified data model and storage that is adapted and optimized for RDF data as well as structured and semi-structured non-RDF data, and (2) a unified query interface over the whole of the stored data.
1 http://newvantage.com/wp-content/uploads/2014/12/Big-Data-Survey-2014-Summary-Report-110314.pdf, https://www.capgemini-consulting.com/resource-file-access/resource/pdf/cracking_the_data_conundrum-big_data_pov_13-1-15_v2.pdf
2 https://developers.google.com/transit/gtfs/
Fig. 1: Motivating Example. Mobility: semantic RDF graph for buses; Regions: semantic JSON-LD data about a country’s regions; Stops: non-semantic data about stops, presented in the CSV format.
SBDA meets the next requirements to provide the above-mentioned features:
R1: Ingest semantic and non-semantic data. SBDAs must be able to process arbitrary types of data. However, we should distinguish between semantic and non-semantic data. In this paper, semantic data is all data which is either originally represented according to the RDF data model or has an associated mapping that allows converting the data to RDF. Non-semantic data is then all data that is represented in other formalisms, e.g., CSV, JSON, XML, without associated mappings. The semantic lifting of non-semantic data can be achieved through the integration of mapping techniques, e.g., R2RML3, CSVW4 annotation models or JSON-LD contexts5. This integration can lead either to a representation of the non-semantic data in RDF, or to its annotation with semantic mappings so as to enable full conversion at a later stage. In our example, Mobility and Regions are semantic data: the former is originally in the RDF model, while the latter is not but has associated mappings. Stops, on the other hand, is a non-semantic dataset on which semantic lifting can be applied.
R2: Preserve semantics and metadata in Big Data processing chains. Once data is preprocessed, semantically enriched and ingested, it is paramount to preserve the semantic enrichment as much as possible. RDF-based data representations and mappings have the advantage (e.g., compared to XML) of using fine-grained formalisms (e.g., RDF triples or R2RML triple maps) that persist even when the data itself is significantly altered or aggregated. Semantics preservation can be broken down as follows: 1) Preserve IRIs and literals. The most atomic
3 http://www.w3.org/TR/r2rml/
4 http://www.w3.org/2013/csvw/wiki/Main_Page
5 http://www.w3.org/TR/json-ld/
components of RDF-based data representation are IRIs and literals6. Best practices and techniques to enable the storage and indexing of IRIs (e.g., by separately storing and indexing namespaces and local names) as well as literals (along with their XSD or custom datatypes and language tags) in an SBDA need to be defined. In the dataset Mobility, the literals "Alex Alion" and "12,005"^^xsd:long, and the IRI http://xmlns.com/foaf/0.1/name (shortened to foaf:name in the figure), should be stored in an optimal way in the Big Data storage. 2) Preserve triple structure. Atomic IRI and literal components are organized in triples. Various existing techniques can be applied to preserve RDF triple structures in SBDA components (e.g., HBase [14,6,2]). In the dataset Mobility, the triple (prs:Alex mb:drives mb:Bus1) should be preserved by adopting a storage scheme that keeps the connection between the subject prs:Alex, the property mb:drives and the object mb:Bus1. 3) Preserve mappings. Although conversion of the original data into RDF is ideal, it must not be a requirement, as it is not always feasible (due to limitations in the storage, or to time-critical use cases). However, it is beneficial to at least annotate the original data with mappings, so that a transformation of the (full or partial) data can be performed on demand. R2RML, JSON-LD contexts, and CSV annotation models are examples of such mappings, which are usually composed of fine-grained rules that define how a certain column, property or cell can be transformed to RDF. The (partial) preservation of such data structures throughout processing pipelines means that the resulting views can also be directly transformed to RDF. In the Regions dataset, the semantic annotations defined by the JSON @context object should be persisted in association with the actual data they describe: RegionA.
R3: Scalable and Efficient Query Processing. Data management techniques like data caching and query optimization have to be exploited to ensure scalable and efficient performance during query processing.
3 A Blueprint for a Semantified Big Data Architecture
In this section, we provide a formalisation for an SBDA blueprint.
Definition 1 (Heterogeneous Input Superset). We define a heterogeneous input superset HIS as the union of the following three types of datasets:
– Dn = {dn1, . . . , dnm} is a set of non-semantic, structured or semi-structured, datasets in any format (e.g., relational database, CSV files, Excel sheets, JSON files).
– Da = {da1, . . . , daq} is a set of semantically annotated datasets, consisting of pairs of non-semantic datasets with corresponding semantic mappings (e.g., JSON-LD context, metadata accompanying CSV).
– Ds = {ds1, . . . , dsp} is a set of semantic datasets consisting of RDF triples.
In our running example, Stops, Regions, and Mobility correspond to Dn, Da, and Ds, respectively.
6 We disregard blank nodes, which can be avoided or replaced by IRIs [4].
Definition 2 (Dataset Schemata). Given HIS = Dn ∪ Da ∪ Ds, the dataset schemata of Dn, Da, and Ds are defined as follows:
– Sn = {sn1, . . . , snm} is a set of non-semantic schemata structuring Dn, where each sni is defined as follows:
sni = {(T, AT) | T is an entity type and AT is the set of all the attributes of T}
– Ss = {ss1, . . . , ssp} is a set of semantic schemata behind Ds, where each ssi is defined as follows:
ssi = {(C, PC) | C is an RDF class and PC is the set of all the properties of C7}
– Sa = {sa1, . . . , saq} is a set of semantic schemata annotating Da, where each sai is defined in the same way as the elements of Ss.
In the running example, the semantic schema of the dataset8 Mobility is:
ss1 = {(mb:Bus, {mb:matric, mb:stopsBy}), (mb:Driver, {foaf:name, mb:drives})}
Definition 3 (Semantic Mapping). A semantic mapping is a relation linking two semantically-equivalent schema elements. There are two types of semantic mappings:
– mc = (e, c) is a relation mapping an entity type e from Sn onto a class c.
– mp = (a, p) is a relation mapping an attribute a from Sn onto a property p.
An SBDA facilitates the lifting of non-semantic data to semantically annotated data by mapping non-semantic schemata to RDF vocabularies. The following are possible mappings: (stop_name, rdfs:label), (stop_lat, geo:lat), (stop_lon, geo:long).
Definition 4 (Semantic Lifting Function). Given a set of mappings M and a non-semantic dataset dn, a semantic lifting function SL returns a semantically-annotated dataset da with semantic annotations of the entities and attributes in dn.
In the motivating example, the dataset Stops can be semantically lifted using the following set of mappings: {(stop_name, rdfs:label), (stop_lat, geo:lat), (stop_lon, geo:long)}, thus generating a semantically annotated dataset.
Definition 5 (Ingestion Function). Given an element d ∈ HIS, an ingestion function In(d) returns a set of triples of the form (RT, AT, f), where:
– T is an entity type or class for the data in d,
– AT is a set of attributes A1, . . . , An of T,
– RT ⊆ type(A1) × type(A2) × · · · × type(An) ⊆ d, where type(Ai) = Ti indicates that Ti is the data type of the attribute Ai in d, and
– f : RT × AT → ∪Ai∈AT type(Ai) such that f(t, Ai) = ti indicates that ti in tuple t ∈ RT is the value of the attribute Ai.
The result of applying the ingestion function In over all d ∈ HIS is the final dataset, which we refer to as the Transformed Dataset TD:
TD = ∪di∈HIS In(di)
Fig. 2: A Semantified Big Data Architecture Blueprint
The above definitions are illustrated in Figure 2. The SBDA blueprint handles a representation (TD) of the relations resulting from the ingestion of multiple heterogeneous datasets in HIS (ds, da, dn). The ingestion (In) generates relations or tables (denoted RT) corresponding to the data, supported by a schema for interpretation (denoted T and AT). The ingestion of semantic (ds) and semantically-annotated (da) data is direct (denoted by the solid and dashed lines, respectively); a non-semantic dataset (dn) can optionally be semantically lifted (SL) given an input set of mappings (M). This explains the two dotted lines outgoing from dn: one denotes the option to directly ingest the data without semantic lifting, and the other the option to apply semantic lifting. Query-driven processing can then generate a number (|Q|) of results over TD.
Next, we validate our blueprint through the description of a proof-of-concept implementation.
4 SeBiDA: A Proof-of-concept Implementation
The SeBiDA architecture (Figure 3) comprises three main components:
–Schema Extractor: performs schema knowledge extraction from input data
sources, and supports semantic lifting based on a provided set of mappings.
7 A set of properties PC of an RDF class C where: ∀p ∈ PC (p rdfs:domain C).
8 mb and foaf are prefixes for the mobility and friend-of-a-friend vocabularies, respectively.
Fig. 3: The SeBiDA Architecture
–Data Loader : creates tables based on extracted schemata and loads input data.
–Data Server: receives queries; generates results as tuples or RDF triples.
The first two components jointly realise the ingestion function In from section 3. The resulting tables TD can then be queried using the Data Server to generate the required views. Next, these components are described in more detail.
4.1 Schema Extractor
We extract the structure of both semantic and non-semantic data to transform it into a tabular format that can be easily handled (stored and queried) by existing Big Data technologies.
(A) From each semantic or semantically-annotated input dataset (ds or da), we extract the classes and properties describing the data (cf. section 3). This is achieved by first reformatting the RDF data into the following representation:
(class, (subject, (property, object)+)+)
which reads: "each class has one or more instances, where each instance can be described using one or more (property, object) pairs", and then retaining the classes and properties. The XSD datatypes, if present, are leveraged to type the properties; otherwise9 string is used.
The reformatting operation is performed using Apache Spark10, a popular Big Data processing engine.
(B) From each non-semantic input dataset (dn), we extract entities and attributes (cf. section 3). For example, in a relational database, table and column names can be returned using particular SQL queries; in a CSV file, the entity and its attributes can be extracted from the file's name and header, respectively. Similarly, whenever possible, attribute datatypes are extracted, and otherwise cast to string. Schemata that do not natively have a tabular format, e.g., XML and JSON, are also flattened into entity-attribute pairs.
9 When the object is occasionally not typed or is a URL.
10 https://spark.apache.org/
As depicted in Figure 3, the results are stored in an instance of MongoDB11, an efficient document-based database that can be distributed across a cluster. As the schema can be extracted automatically (in the case of Dn), it is essential to store the schema information separately and expose it. This enables a form of discovery, as one can navigate through the schema, visualize it, and formulate queries accordingly.
4.2 Semantic Lifter
In this step, SeBiDA targets the lifting of non-semantic elements (entities/attributes) to existing semantic representations (classes/properties), by leveraging the LOV catalogue API12. The lifting operation is supervised by the user and is optional, i.e., the user can choose to ingest non-semantic data in its original format. The equivalent classes/properties are first fetched automatically from the LOV catalogue based on syntactic similarities with the original entities/attributes. The user then validates the suggested mappings or adjusts them, either manually or by keyword-based searching of the LOV catalogue. If semantic counterparts are undefined, internal IRIs are created by attaching a base IRI. Example 1 shows a result of this process, where four attributes of the GTFS entity ’Stop’13 have been mapped to existing vocabularies, and the fifth converted to an internal IRI. Semantic mappings across the cluster are stored in the same MongoDB instance, together with the extracted schemata.
Example 1 (Property mapping and typing).
Source          Target                                                  Datatype
stop_name       http://xmlns.com/foaf/0.1/name                          string
stop_lat        http://www.w3.org/2003/01/geo/wgs84_pos#lat             double
stop_lon        http://www.w3.org/2003/01/geo/wgs84_pos#long            double
parent_station  http://example.com/sebida/20151215T1708/parent_station  string
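The suggestion step can be approximated as follows. Here a hard-coded candidate list stands in for the LOV API response, the similarity measure and threshold are illustrative choices (not SeBiDA's actual ones), and the base IRI is taken from Example 1:

```python
# Illustrative sketch of mapping suggestion: rank candidate vocabulary
# terms by syntactic similarity to the source attribute; below a
# threshold, mint an internal IRI. Candidates stand in for LOV results.
from difflib import SequenceMatcher

BASE_IRI = "http://example.com/sebida/20151215T1708/"

def local_name(iri):
    # ".../wgs84_pos#lat" -> "lat", ".../foaf/0.1/name" -> "name"
    return iri.split("#")[-1].rsplit("/", 1)[-1]

def suggest_mapping(attribute, candidates, threshold=0.5):
    def sim(iri):
        return SequenceMatcher(None, attribute.lower(), local_name(iri).lower()).ratio()
    best = max(candidates, key=sim)
    if sim(best) >= threshold:
        return best
    return BASE_IRI + attribute  # no close match: mint an internal IRI

candidates = ["http://xmlns.com/foaf/0.1/name",
              "http://www.w3.org/2003/01/geo/wgs84_pos#lat"]
```

With these candidates, stop_name and stop_lat are matched to vocabulary terms, while an attribute with no close counterpart falls back to an internal IRI, as the fifth row of Example 1 does.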
4.3 Data Loader
This component loads data from the source HIS into the final dataset TD by generating and populating tables, as described below. These procedures are also realised using Apache Spark. Historically, storing RDF triples in tabular layouts (e.g., the Jena Property Table14) has been avoided due to the resulting large number of null values in wide tables. This concern has largely been alleviated by the emergence of NoSQL databases, e.g., HBase, and of columnar storage formats on top of HDFS (the Hadoop Distributed File System), e.g., Apache Parquet and ORC files. HDFS is the de facto storage for Big Data applications and is thus supported by the vast majority of Big Data processing engines.
11 https://www.mongodb.org
12 http://lov.okfn.org/dataset/lov/terms
13 https://developers.google.com/transit/gtfs/reference
14 http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html
We use Apache Parquet15, a column-oriented tabular format, as the storage technology. One advantage of this kind of storage is schema projection, whereby only the projected columns are read and returned following a query. Further, columns are stored consecutively on disk; thus, Parquet tables are very compression-friendly (e.g., via Snappy and LZO), as the values in each column are guaranteed to be of the same type. Parquet also supports composite and nested columns, i.e., saving multiple values in one column and storing hierarchical data in one table cell. This is very important since RDF properties are frequently used to refer to multiple objects for the same instance. State-of-the-art encoding algorithms are also supported, e.g., bit-packing, run-length, and dictionary encoding; the latter in particular is useful for storing long string IRIs.
Table Generation. A corresponding table template is created for each derived class or entity as follows: (A) Following the representation described in subsection 4.1, a table with the same label is created for each class (e.g., a table ’Bus’ from the RDF class :bus). A default column ’ID’ (of type string) is created to store the triple’s subject. For each property describing the class, an additional column is created, typed according to the property’s extracted datatype. (B) For each entity, a table is created similarly, taking the entity label as the table name and creating a typed column for each attribute.
Table Population. In this step, each table is populated as follows:
(1) For each extracted RDF class (cf. subsection 4.1), we iterate through its instances and insert a new row for each instance into the corresponding table: the instance IRI is stored in the ’ID’ column, whereas the corresponding objects are saved under the columns representing their properties. For example, the following semantic description (formatted for clarity):
(dbo:bus, [
    (dbn:bus1, [(foaf:name, "OP354"), (dc:type, "mini")]),
    (dbn:bus2, [(foaf:name, "OP355"), (dc:type, "larg")])
])
is flattened into a table "dbo:bus" in this manner:
ID        foaf:name  dc:type
dbn:bus1  "OP354"    "mini"
dbn:bus2  "OP355"    "larg"
(2) The population of tables in the case of non-semantic data varies depending on its type. For example, for CSV we iterate through each line and save its values into the corresponding columns of the corresponding table; XPath can be used to iteratively select the needed nodes in an XML file.
15 https://parquet.apache.org
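Step (1) above can be sketched as follows, using the dbo:bus example; the function and data structures are illustrative simplifications of what the Spark-based loader does:

```python
# Illustrative sketch of table population step (1): flatten a
# (class, (subject, (property, object)+)+) entry into rows, with the
# instance IRI under 'ID' and each property under its own column.

def flatten_class(class_entry):
    class_name, instances = class_entry
    # One column per distinct property, plus the default 'ID' column.
    columns = ["ID"] + sorted({p for _, props in instances for p, _ in props})
    rows = []
    for subject, props in instances:
        row = dict.fromkeys(columns)   # missing properties stay None (null)
        row["ID"] = subject
        row.update(dict(props))
        rows.append(row)
    return class_name, columns, rows

entry = ("dbo:bus",
         [("dbn:bus1", [("foaf:name", "OP354"), ("dc:type", "mini")]),
          ("dbn:bus2", [("foaf:name", "OP355"), ("dc:type", "larg")])])
table, columns, rows = flatten_class(entry)
```

The output corresponds to the "dbo:bus" table shown above, one row per instance.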
4.4 Data Server
Data loaded into tables is persistent, so one can access the data and perform analytical operations by way of ad-hoc queries.
Querying Interface. The current implementation utilizes SQL queries, as the internal TD representation corresponds to a tabular structure, and because the query technology used, i.e., Spark, provides only an SQL-like query interface.
Multi-Format Results. As shown in Figure 3, query results can be returned as a set of tables or as an RDF graph. RDFisation is achieved as follows: a triple is created from each projected column, casting the column name and value to the triple’s predicate and object, respectively. If the result includes the ID column, its value is cast as the triple’s subject; otherwise, the subject is set to base IRI/i, where the base IRI is defined by the user and i is an incremental integer.
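The RDFisation rule just described can be sketched as follows; the function name and the default base IRI are illustrative:

```python
# Illustrative sketch of RDFisation: each projected column becomes a
# predicate with its value as object; the 'ID' column (if present)
# supplies the subject, otherwise an incremental IRI under a
# user-supplied base is used.

def rdfise(rows, base_iri="http://example.com/result/"):
    triples = []
    for i, row in enumerate(rows):
        subject = row.get("ID") or f"{base_iri}{i}"
        for column, value in row.items():
            if column != "ID" and value is not None:
                triples.append((subject, column, value))
    return triples

rows = [{"ID": "dbn:bus1", "foaf:name": "OP354"},
        {"foaf:name": "OP355"}]  # no ID: falls back to base IRI + index
triples = rdfise(rows)
```

The first row keeps its own IRI as subject, while the second receives a minted subject under the base IRI.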
5 Experimental Study
The goal of this study is to evaluate whether SeBiDA16 meets requirements R1, R2, and R3 (cf. section 2). We evaluate whether data management techniques, e.g., data caching, allow SeBiDA to speed up both data loading and query execution time whenever semantic and non-semantic data are combined.
Datasets: The datasets have been created using the Berlin Benchmark generator17. Table 1 describes them in terms of number of triples and file size. We chose XML as the non-semantic data input to demonstrate the ability of SeBiDA to ingest and query semi-structured data (requirement R1). Parquet and Spark are used to store the XML data and query nested data in tables.
Metrics: We measure the loading time of the datasets, the size of the datasets
after the loading, as well as the query execution time over the loaded datasets.
Implementation: We ran our experiments on a small-size cluster of three DELL PowerEdge R815 machines, each with a 2x AMD Opteron 6376 (16-core) CPU, 256 GB RAM, and 3 TB of SATA RAID-5 disk. We cleared the cache before running each query. To run on warm cache, we executed the same query five times, dropping the cache just before running the first iteration of the query; thus, data temporarily stored in the cache during the execution of iteration i could be used in iteration i+1.
Discussion: As can be observed in Table 2, loading takes almost three hours for the largest dataset. This is because one step of the algorithm involves sending part of the data back from the workers to the master (the collect function in Spark), which incurs significant network transfer. This can, however, be overcome by saving the data to a distributed database, e.g., Cassandra, which we intend to implement in the next version of the system. On the other hand, we achieved a huge gain
16 https://github.com/EIS-Bonn/SeBiDA
17 Using the command line: ./generate -fc -pc [scaling factor] -s [file format] -fn [file name], where file format is nt for RDF data and xml for XML data. More details at: http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/BenchmarkRules/#datagenerator
Table 1: Description of the Berlin Benchmark RDF Datasets.
RDF Dataset  Size (no. of triples)  Type  Scaling Factor
Dataset1     48.9 GB (200M)         RDF   569,600
Dataset2     98.0 GB (400M)         RDF   1,139,200
Dataset3     8.0 GB (100M)          XML   284,800
Table 2: Benchmark of RDF Data Loading. For each dataset, the loading time and the obtained data size, together with the compression ratio.
RDF Dataset  Loading Time  New Size  Ratio
Dataset1     1.1h          389MB     1:0.015
Dataset2     2.9h          524MB     1:0.018
Dataset3     0.5h          188MB     1:0.023
in terms of disk space, which is expected from the adopted data model that
avoids the repetition of data (in case of RDF data) and the adopted file format,
i.e. Parquet, which performs high compression rates (cf. subsection 4.3).
Tables 3 and 4 report the results of executing the 12 Berlin Benchmark queries against Dataset1 and Dataset2 in two ways: first, each dataset alone, and second, the dataset combined with Dataset3 (using UNION in the query). Queries are run in cold cache and warm cache. We notice that caching can improve query performance significantly in the case of large hybrid data (entries highlighted in bold). Among the 12 queries, the most expensive are Q7 and Q9: Q7 scans a large number of tables (five), while Q9 produces a large number of intermediate results. These results suggest that SeBiDA is able to scale to large hybrid data without deteriorating query performance. Further, these results provide evidence of the benefits of loading intermediate query results into the cache.
Centralized vs. Distributed Triple Stores. We can regard SeBiDA as a distributed triple store, as it can load and query RDF triples separately. We therefore compare it against one of the fastest centralized triple stores: RDF-3X18. Comparative results can be found in Table 5 and Table 6.
Discussion: Table 5 shows that RDF-3X loaded Dataset1 within 93 minutes, compared to 66 minutes for SeBiDA, while it timed out loading Dataset2: we set a timeout of 12 hours, and RDF-3X had run for more than 24 hours before we terminated it manually. Table 6 shows no definitive winner, but suggests that SeBiDA never exceeds a threshold of 20s in any query, while RDF-3X does so in four queries, even reaching the order of minutes. We do not report query times for Dataset2 using RDF-3X because of the prohibitive time of loading it.
18 https://github.com/gh-rdf3x/gh-rdf3x
Table 3: Benchmark Query Execution Times (secs.) in Cold and Warm Caches. Significant differences are highlighted in bold.
Dataset1 - Only Semantic Data (RDF)
            Q1    Q2    Q3    Q4    Q5    Q6    Q7     Q8    Q9     Q10   Q11    Q12    Geom. Mean
Cold Cache  3.00  2.20  1.00  4.00  3.00  0.78  11.3   6.00  16.00  7.00  11.07  11.00  4.45
Warm Cache  1.00  1.10  1.00  2.00  3.00  0.58  6.10   5.00  14.00  6.00  10.04  9.30   3.14
Dataset1 ∪ Dataset3 - RDF & XML Data
            Q1    Q2    Q3    Q4    Q5    Q6    Q7     Q8    Q9     Q10   Q11    Q12    Geom. Mean
Cold Cache  3.00  2.94  2.00  5.00  3.00  0.90  11.10  7.00  25.20  8.00  11.00  11.5   5.28
Warm Cache  2.00  1.10  1.00  5.00  3.00  1.78  8.10   6.00  20.94  7.00  11.00  9.10   4.03
Table 4: Benchmark Query Execution Times (secs.) in Cold and Warm Caches. Significant differences are highlighted in bold.
Dataset2 - Only Semantic Data (RDF)
            Q1     Q2    Q3    Q4     Q5    Q6    Q7     Q8     Q9     Q10    Q11    Q12    Geom. Mean
Cold Cache  5.00   3.20  3.00  8.00   3.00  1.10  20.00  7.00   18.00  7.00   13.00  11.40  6.21
Warm Cache  4.00   3.10  2.00  7.00   3.00  1.10  18.10  6.00   17.00  6.00   12.04  11.2   5.55
Dataset2 ∪ Dataset3 - RDF & XML Data
            Q1     Q2    Q3    Q4     Q5    Q6    Q7     Q8     Q9     Q10    Q11    Q12    Geom. Mean
Cold Cache  11.00  3.20  7.20  17.00  3.00  1.10  23.10  16.00  20.72  10.00  14.10  13.20  8.75
Warm Cache  4.00   3.20  2.00  8.00   3.00  1.10  21.20  8.00   18.59  7.00   12.10  11.10  5.96
6 Related Work
There have been many works on the combination of Big Data and Semantic technologies. They can be classified into two categories: MapReduce-only-based and non-MapReduce-based, where in the first category only the Hadoop framework is used, for storage (HDFS) and for processing (MapReduce), and in the second, other storage and/or processing solutions are used. The major limitation of the MapReduce framework is the overhead caused by materializing data between Map and Reduce, and between two subsequent jobs. Thus, works in the first category, e.g., [1,8,12], try to minimize the number of join operations, maximize the number of joins executed in the Map phase, or additionally use indexing techniques for triple lookup. To cope with this, works in the second category suggest storing RDF triples in NoSQL databases (e.g., HBase and Accumulo) instead, where a variety of physical representations, join patterns, and partitioning schemes are suggested. For processing, either MapReduce is used
Table 5: Loading Time of RDF-3X.
RDF Dataset  Loading Time  New Size  Ratio
Dataset1     93min         21GB      1:2.4
Dataset2     Timed out     -         -
Table 6: SeBiDA vs. RDF-3X Query Execution Times (secs.) in Cold Cache only on Dataset1. Significant differences are highlighted in bold.
         Q1    Q2    Q3      Q4     Q5       Q6    Q7     Q8      Q9     Q10    Q11    Q12
SeBiDA   3.00  2.20  1.00    4.00   3.00     0.78  11.3   6.00    16.00  7.00   11.07  11.00
RDF-3X   0.01  1.10  29.213  0.145  1175.98  2.68  77.80  610.81  0.23   0.419  0.13   1.58
on top (e.g., [9,11]), or the internal operations of the NoSQL store are utilized in conjunction with triple stores (e.g., [6,10]). Basically, these works suggest storing triples in three-column tables, called triple tables, using column-oriented stores. The latter offer enhanced compression performance and efficient distributed indexes. Nevertheless, using the tied three-column table still entails a significant overhead because of the inter-join operations required to answer most queries. A few approaches do not fall into either of the two categories. In [7], the authors focus on providing real-time RDF querying, combining both live and historical data. Instead of plain RDF, the RDF data is stored in the binary space-efficient format RDF/HDT. Although called Big Semantic Data and compared against the so-called Lambda Architecture, nothing is said about the scalability of the approach when the storage and querying of the data exceed single-machine capacities. In [13], RDF data is loaded into property tables using Parquet. Impala, a distributed SQL query engine, is used to query those tables, and a query compiler from SPARQL to SQL is devised. Our approach is similar in that we store data in property tables using Parquet. However, we do not store all RDF data in only one table, but rather create a table for each detected RDF class. For a more comprehensive survey, we refer to [5].
In all the presented works, storage and querying were optimized for storing RDF data only. We, on the other hand, not only aim to optimize the storage and querying of RDF, but also to make the same underlying storage and query engine available for non-RDF data, structured and semi-structured. Therefore, our work is the first to propose a blueprint for an end-to-end semantified Big Data architecture, and to realize it with a framework that supports semantic data integration, storage, and exposure alongside non-semantic data.
7 Conclusion and Future Work
The current version of semantic data loading does not consider an attached schema; rather, it extracts the schema entirely from the data instances, a cost that can scale proportionally with data size and thus put a burden on the data integration process. Currently, for instances of multiple classes, the selected class is the last one in lexical order. In the future, we will consider the schema, even if incomplete, to select the most specific class instead. Additionally, as semantic data is currently stored in isolation from other data, we could use a more natural language, such as SPARQL, to query RDF data only. Thus, we envision conceiving a SPARQL-to-SQL converter for this purpose. Such converters exist already, but due to the particularity of our storage model, which requires instances of multiple types to be stored in only one table while adding references to the other types (tables), a revised version is required. This effort is supported by and contributes to the H2020 BigDataEurope project.
References
1. Jin-Hang Du, Hao-Fen Wang, Yuan Ni, and Yong Yu. Hadooprdf: A scalable se-
mantic data analytical engine. In Intelligent Computing Theories and Applications,
pages 633–641. Springer, 2012.
2. Craig Franke, Samuel Morin, Artem Chebotko, John Abraham, and Pearl Brazier.
Distributed semantic web data management in hbase and mysql cluster. In Cloud
Computing (CLOUD), 2011, pages 105–112. IEEE, 2011.
3. Doug Laney. 3-D data management: Controlling data volume, velocity
and variety. Gartner, 6 February 2001.
4. Aidan Hogan. Skolemising blank nodes while preserving isomorphism. In 24th Int.
Conf. on World Wide Web, WWW 2015, 2015.
5. Zoi Kaoudi and Ioana Manolescu. Rdf in the clouds: A survey. The VLDB Journal,
24(1):67–91, 2015.
6. Vaibhav Khadilkar, Murat Kantarcioglu, Bhavani Thuraisingham, and Paolo
Castagna. Jena-hbase: A distributed, scalable and efficient rdf triple store. In
11th Int. Semantic Web Conf. Posters & Demos, ISWC-PD, 2012.
7. Miguel A. Martínez-Prieto, Carlos E. Cuesta, Mario Arias, and Javier D. Fernández.
The solid architecture for real-time management of big semantic data. Future
Generation Computer Systems, 47:62–79, 2015.
8. Zhi Nie, Fang Du, Yueguo Chen, Xiaoyong Du, and Linhao Xu. Efficient sparql
query processing in mapreduce through data partitioning and indexing. In Web
Technologies and Applications, pages 628–635. Springer, 2012.
9. Nikolaos Papailiou, Ioannis Konstantinou, Dimitrios Tsoumakos, Panagiotis Kar-
ras, and Nectarios Koziris. H2rdf+: High-performance distributed joins over large-
scale rdf graphs. In BigData Conference. IEEE, 2013.
10. Roshan Punnoose, Adina Crainiceanu, and David Rapp. Rya: a scalable rdf triple
store for the clouds. In Proceedings of the 1st International Workshop on Cloud
Intelligence, page 4. ACM, 2012.
11. Alexander Schätzle, Martin Przyjaciel-Zablocki, Christopher Dorner, Thomas Hornung, and Georg Lausen. Cascading map-side joins over hbase for scalable join
processing. SSWS+ HPCSW, page 59, 2012.
12. Alexander Schätzle, Martin Przyjaciel-Zablocki, Thomas Hornung, and Georg Lausen. Pigsparql: A sparql query processing baseline for big data. In International Semantic Web Conference (Posters & Demos), pages 241–244, 2013.
13. Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, and Georg Lausen.
Sempala: Interactive sparql query processing on hadoop. In The Semantic Web–
ISWC 2014, pages 164–179. Springer, 2014.
14. Jianling Sun and Qiang Jin. Scalable rdf store based on hbase and mapreduce. In
3rd Int. Conf. on Advanced Computer Theory and Engineering. IEEE, 2010.