Towards Semantification of Big Data Technology
Mohamed Nadjib Mami, Simon Scerri, Sören Auer, Maria-Esther Vidal
University of Bonn & Fraunhofer IAIS, Bonn & Sankt Augustin, Germany
{lastname}@cs.uni-bonn.de
Abstract. Much attention has been devoted to support the volume
and velocity dimensions of Big Data. As a result, a plethora of tech-
nology components supporting various data structures (e.g., key-value,
graph, relational), modalities (e.g., stream, log, real-time) and comput-
ing paradigms (e.g., in-memory, cluster/cloud) are meanwhile available.
However, systematic support for managing the variety of data, the third
dimension in the classical Big Data definition, is still missing. In this
article, we present SeBiDA, an approach for managing hybrid Big Data.
SeBiDA supports the Semantification of Big Data using the RDF data
model, i.e., non-semantic Big Data is semantically enriched by using RDF
vocabularies. We empirically evaluate the performance of SeBiDA for two
dimensions of Big Data, i.e., volume and variety; the Berlin Benchmark
is used in the study. The results suggest that even in large datasets,
query processing time is not affected by data variety.
1 Introduction
Before ’Big Data’ became a phenomenon and a tremendous marketplace backed
by ever-increasing research efforts, Gartner suggested a novel data management
model termed 3-D Data Management [3]. At the basis of this model are three
major challenges to the potential of data management systems: increasing volume,
accelerated data flows, and diversified data types. These three dimensions
have since become known as Big Data's 'three V's': volume, velocity, and variety.
Other dimensions have been added to the V-family to cover a broader range of
emerging challenges such as veracity and value.
In the last few years, a number of efforts have sought to design generic reusable
architectures that tackle the aforementioned Big Data challenges. However, they
all miss explicit support for semantic data integration, querying, and exposure.
The Big Data Semantification introduced in this paper is an umbrella term that
covers these operations at large scale. It also includes the ability to semantically
enrich non-semantic input data using Resource Description Framework (RDF)
ontologies. In this paper, we propose a realization of this vision by presenting a
blueprint of a generic semantified Big Data architecture that preserves semantic
information, so that semantic data can be exported in its natural form, i.e., RDF.
Although the effective management of volume and velocity has gained considerable
attention from both academia and industry, the variety dimension has not been
adequately tackled, even though it has been reported as the top challenge by many
industrial players and stakeholders¹.
In this paper, we target the lack of variety support by suggesting a unified data
model, storage, and querying interface for both semantic and non-semantic data.
SeBiDA provides particular support for Big Data semantification by enabling
the semantic lifting of non-semantic datasets. Experimental results show that
(1) SeBiDA is not impacted by the variety dimension, even in the presence of an
increasingly large volume of data, and (2) it outperforms a state-of-the-art
centralized triple store in several aspects.
Our contributions can be summarized as follows:
- The definition of a blueprint for a semantified Big Data architecture that enables the ingestion, querying, and exposure of heterogeneous data with varying levels of semantics (hybrid data), while ensuring the preservation of semantics (Section 3).
- SeBiDA: a proof-of-concept implementation of the architecture using Big Data components such as Apache Spark, Apache Parquet, and MongoDB (Section 4).
- An evaluation of the benefits of using Big Data technology for the storage and processing of hybrid data (Section 5).
The rest of this paper is structured as follows: Section 2 presents a motivating
example and the requirements of a Semantified Big Data Architecture.
Section 3 presents a blueprint for a generic Semantified Big Data Architecture.
The SeBiDA implementation is described in Section 4, while the experimental
study is reported in Section 5. Section 6 summarizes related approaches. In
Section 7, we conclude and present an outlook on future work.
2 Motivating Example and Requirements
Suppose there are three datasets (Figure 1) that are large in size (i.e., volume)
and different in type (i.e., variety): (1) Mobility: an RDF graph containing
transport information about buses; (2) Regions: JSON-encoded data about a
country's regions, semantically described using ontology terms in JSON-LD
format; and (3) Stops: structured (GTFS-compliant²) data describing stops
in CSV format. The problem to be solved is to provide a unified data model to
store and query these datasets, independently of their dissimilar types.
A Semantified Big Data Architecture (SBDA) allows for efficiently ingesting
and processing such heterogeneous data at large scale. Previous work has
focused on efficient Big Data loading and querying for either RDF data or
other structured data separately. The support for variety that we claim
in this paper is achieved by providing (1) a unified data model and storage
that is adapted and optimized for RDF data as well as for structured and
semi-structured non-RDF data, and (2) a unified query interface over the whole stored data.
¹ http://newvantage.com/wp-content/uploads/2014/12/Big-Data-Survey-2014-Summary-Report-110314.pdf, https://www.capgemini-consulting.com/resource-file-access/resource/pdf/cracking_the_data_conundrum-big_data_pov_13-1-15_v2.pdf
² https://developers.google.com/transit/gtfs/
Fig. 1: Motivating Example. Mobility: semantic RDF graph for buses; Regions:
semantic JSON-LD data about a country's regions; and Stops: non-semantic
data about stops, presented in CSV format.
An SBDA meets the following requirements to provide the above-mentioned features:
R1: Ingest semantic and non-semantic data. SBDAs must be able to
process arbitrary types of data. However, we distinguish between semantic
and non-semantic data. In this paper, semantic data is all data that is
either originally represented according to the RDF data model or has an
associated mapping that allows converting the data to RDF. Non-semantic data is
then all data that is represented in other formalisms, e.g., CSV, JSON, XML,
without associated mappings. The semantic lifting of non-semantic data can be
achieved through the integration of mapping techniques, e.g., R2RML³, CSVW⁴
annotation models, or JSON-LD contexts⁵. This integration can lead either to a
representation of the non-semantic data in RDF, or to its annotation with semantic
mappings so as to enable full conversion at a later stage. In our example, Mobility
and Regions are semantic data: the former is originally in the RDF model,
while the latter is not in the RDF model but has associated mappings. Stops, on the
other hand, is a non-semantic dataset on which semantic lifting can be applied.
³ http://www.w3.org/TR/r2rml/
⁴ http://www.w3.org/2013/csvw/wiki/Main_Page
⁵ http://www.w3.org/TR/json-ld/
R2: Preserve semantics and metadata in Big Data processing chains.
Once data is preprocessed, semantically enriched, and ingested, it is paramount
to preserve the semantic enrichment as much as possible. RDF-based data representations
and mappings have the advantage (e.g., compared to XML) of using
fine-grained formalisms (e.g., RDF triples or R2RML triple maps) that persist
even when the data itself is significantly altered or aggregated. Semantics
preservation can be broken down as follows: 1) Preserve IRIs and literals. The most atomic
components of RDF-based data representation are IRIs and literals⁶. Best practices
and techniques need to be defined to enable the storage and indexing of IRIs (e.g., by
separately storing and indexing namespaces and local names) as well as literals (along with
their XSD or custom datatypes and language tags) in an SBDA. In the dataset Mobility,
the literals "Alex Alion" and "12,005"^^xsd:long, and the IRI
http://xmlns.com/foaf/0.1/name (shortened to foaf:name in the figure), should be
stored in an optimal way in the Big Data storage. 2) Preserve triple structure.
Atomic IRI and literal components are organized in triples. Various existing
techniques can be applied to preserve RDF triple structures in SBDA components
(e.g., HBase [14,6,2]). In the dataset Mobility, the triple
(prs:Alex mb:drives mb:Bus1) should be preserved by adopting a storage scheme
that keeps the connection between the subject prs:Alex, the property mb:drives,
and the object mb:Bus1. 3) Preserve mappings. Although conversion of the original
data into RDF is ideal, it must not be a requirement, as it is not always feasible
(due to limitations in the storage, or to time-critical use cases). However, it is
beneficial to at least annotate the original data with mappings, so that a transformation
of the (full or partial) data can be performed on demand. R2RML,
JSON-LD contexts, and CSV annotation models are examples of such mappings,
which are usually composed of fine-grained rules that define how a certain column,
property, or cell can be transformed to RDF. The (partial) preservation
of such data structures throughout processing pipelines means that the resulting
views can also be directly transformed to RDF. In the Regions dataset, the
semantic annotations defined by the JSON object @context should be persisted
in association with the actual data it describes: RegionA.
R3: Scalable and Efficient Query Processing. Data management tech-
niques like data caching, query optimization, and query processing have to be
exploited to ensure scalable and efficient performance during query processing.
3 A Blueprint for a Semantified Big Data Architecture
In this section, we provide a formalisation for an SBDA blueprint.
Definition 1 (Heterogeneous Input Superset). We define a heterogeneous
input superset HIS as the union of the following three types of datasets:
- Dn = {dn1, ..., dnm} is a set of non-semantic, structured or semi-structured datasets in any format (e.g., relational databases, CSV files, Excel sheets, JSON files).
- Da = {da1, ..., daq} is a set of semantically annotated datasets, consisting of pairs of non-semantic datasets with corresponding semantic mappings (e.g., JSON-LD contexts, metadata accompanying CSV).
- Ds = {ds1, ..., dsp} is a set of semantic datasets consisting of RDF triples.
In our running example, Stops, Regions, and Mobility correspond to Dn, Da, and Ds, respectively.
⁶ We disregard blank nodes, which can be avoided or replaced by IRIs [4].
Definition 2 (Dataset Schemata). Given HIS = Dn ∪ Da ∪ Ds, the dataset
schemata of Dn, Da, and Ds are defined as follows:
- Sn = {sn1, ..., snm} is a set of non-semantic schemata structuring Dn, where each sni is defined as follows:
  sni = {(T, AT) | T is an entity type and AT is the set of all the attributes of T}
- Ss = {ss1, ..., ssp} is a set of semantic schemata behind Ds, where each ssi is defined as follows:
  ssi = {(C, PC) | C is an RDF class and PC is the set of all the properties of C⁷}
- Sa = {sa1, ..., saq} is a set of semantic schemata annotating Da, where each sai is defined in the same way as the elements of Ss.
In the running example, the semantic schema of the dataset⁸ Mobility is:
ss1 = {(mb:Bus, {mb:matric, mb:stopsBy}), (mb:Driver, {foaf:name, mb:drives})}
Definition 3 (Semantic Mapping). A semantic mapping is a relation linking
two semantically equivalent schema elements. There are two types of semantic mappings:
- mc = (e, c) is a relation mapping an entity type e from Sn onto a class c.
- mp = (a, p) is a relation mapping an attribute a from Sn onto a property p.
An SBDA facilitates the lifting of non-semantic data to semantically annotated
data by mapping non-semantic schemata to RDF vocabularies. The following are
possible mappings: (stop_name, rdfs:label), (stop_lat, geo:lat), (stop_lon, geo:long).
Definition 4 (Semantic Lifting Function). Given a set of mappings M and
a non-semantic dataset dn, a semantic lifting function SL returns a semantically
annotated dataset da with semantic annotations of the entities and attributes in dn.
In the motivating example, the dataset Stops can be semantically lifted using the
following set of mappings: {(stop_name, rdfs:label), (stop_lat, geo:lat), (stop_lon, geo:long)},
thus generating a semantically annotated dataset.
Definition 5 (Ingestion Function). Given an element d ∈ HIS, an ingestion
function In(d) returns a set of triples of the form (RT, AT, f), where:
- T is an entity type or class for the data in d,
- AT is a set of attributes A1, ..., An of T,
- RT ⊆ type(A1) × type(A2) × ... × type(An) is a relation representing the data in d, where type(Ai) = Ti indicates that Ti is the data type of the attribute Ai in d, and
- f : RT × AT → ⋃{type(Ai) | Ai ∈ AT} is a function such that f(t, Ai) = ti indicates that ti in tuple t of RT is the value of the attribute Ai.
The result of applying the ingestion function In over all d ∈ HIS is the final
dataset, which we refer to as the Transformed Dataset TD:
TD = ⋃{In(di) | di ∈ HIS}
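For illustration (with hypothetical values; the attribute names follow the GTFS Stops file of the running example), applying In to the lifted Stops dataset could yield a triple (RT, AT, f) with T = Stop, AT = {stop_name, stop_lat, stop_lon, parent_station}, RT ⊆ string × double × double × string, and, for a tuple t = ("Central Station", 52.52, 13.40, "S1") in RT, f(t, stop_name) = "Central Station" and f(t, stop_lat) = 52.52.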
Fig. 2: A Semantified Big Data Architecture Blueprint
The above definitions are illustrated in Figure 2. The SBDA blueprint handles
a representation (TD) of the relations resulting from the ingestion of multiple
heterogeneous datasets in HIS (ds, da, dn). The ingestion (In) generates
relations or tables (denoted RT) corresponding to the data, supported by a
schema for interpretation (denoted T and AT). The ingestion of semantic (ds)
and semantically annotated (da) data is direct (denoted by the solid and dashed
lines, respectively), while a non-semantic dataset (dn) can optionally be semantically
lifted (SL) given an input set of mappings (M). This explains the two dotted lines
outgoing from dn: one denotes the option to directly ingest the data without
semantic lifting, and the other denotes the option to apply semantic lifting.
Query-driven processing can then generate a number (|Q|) of results over TD.
Next, we validate our blueprint through the description of a proof-of-concept implementation.
4 SeBiDA: A Proof-of-concept Implementation
The SeBiDA architecture (Figure 3) comprises three main components:
- Schema Extractor: performs schema knowledge extraction from the input data sources, and supports semantic lifting based on a provided set of mappings.
- Data Loader: creates tables based on the extracted schemata and loads the input data.
- Data Server: receives queries and generates results as tuples or RDF triples.
The first two components jointly realise the ingestion function In from Section 3.
The resulting tables TD can then be queried using the Data Server to generate
the required views. Next, these components are described in more detail.
⁷ A set of properties PC of an RDF class C, where ∀p ∈ PC: (p rdfs:domain C).
⁸ mb and foaf are prefixes for the mobility and friend-of-a-friend (FOAF) vocabularies, respectively.
Fig. 3: The SeBiDA Architecture
4.1 Schema Extractor
We extract the structure of both semantic and non-semantic data to transform it
into a tabular format that can be easily handled (stored and queried) by existing
Big Data technologies.
(A) From each semantic or semantically-annotated input dataset (ds or da), we
extract the classes and properties describing the data (cf. Section 3). This is
achieved by first reformatting the RDF data into the following representation:
(class, (subject, (property, object)+)+)
which reads: "each class has one or more instances, where each instance can
be described using one or more (property, object) pairs", and then retaining
the classes and properties. The XSD datatypes, if present, are leveraged to
type the properties; otherwise⁹, string is used. The reformatting operation is
performed using Apache Spark¹⁰, a popular Big Data processing engine.
(B) From each non-semantic input dataset (dn), we extract entities and attributes
(cf. Section 3). For example, in a relational database, table and column names can
be returned using particular SQL queries, while for a CSV file the entity and its
attributes can be extracted from the file's name and header, respectively. Similarly,
whenever possible, attribute datatypes are extracted; otherwise they are cast to
string. Schemata that do not natively have a tabular format, e.g., XML and JSON,
are also flattened into entity-attribute pairs.
⁹ When the object is occasionally not typed or is a URL.
¹⁰ https://spark.apache.org/
As depicted in Figure 3, the results are stored in an instance of MongoDB¹¹,
an efficient document-based database that can be distributed over a cluster.
As the schema can be extracted automatically (in the case of Dn), it is essential
to store the schema information separately and expose it. This enables a form of
discovery, as one can navigate through the schema, visualize it, and formulate
queries accordingly.
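To make step (A) more concrete, the following minimal PySpark sketch shows how N-Triples could be grouped into the (class, (subject, (property, object)+)+) representation. It is an illustration only, not the SeBiDA code: the input file name mobility.nt is hypothetical, and the whitespace-based parsing stands in for a proper RDF parser.

# Sketch: group RDF triples by class into (class, (subject, (property, object)+)+).
from pyspark.sql import SparkSession

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

spark = SparkSession.builder.appName("schema-extractor-sketch").getOrCreate()
sc = spark.sparkContext

def parse(line):
    # naive N-Triples split: subject, predicate, object (drops the trailing " .")
    s, p, o = line.strip().rstrip(" .").split(" ", 2)
    return s, p, o

triples = sc.textFile("mobility.nt").filter(lambda l: l.strip()).map(parse)

# subject -> class, taken from rdf:type triples
subj_class = triples.filter(lambda t: t[1] == RDF_TYPE).map(lambda t: (t[0], t[2]))

# subject -> list of (property, object) pairs, from all remaining triples
subj_po = (triples.filter(lambda t: t[1] != RDF_TYPE)
                  .map(lambda t: (t[0], (t[1], t[2])))
                  .groupByKey().mapValues(list))

# (class, [(subject, [(property, object), ...]), ...])
by_class = (subj_class.join(subj_po)
                      .map(lambda kv: (kv[1][0], (kv[0], kv[1][1])))
                      .groupByKey().mapValues(list))

# extracted schema: class -> sorted list of properties (datatype handling omitted)
schema = by_class.mapValues(lambda insts: sorted({p for _, po in insts for p, _ in po}))
print(schema.take(1))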
4.2 Semantic Lifter
In this step, SeBiDA targets the lifting of non-semantic elements (entities/attributes)
to existing semantic representations (classes/properties) by leveraging
the LOV catalogue API¹². The lifting operation is supervised by the user and is
optional, i.e., the user can choose to ingest non-semantic data in its original format.
Equivalent classes/properties are first fetched automatically from the LOV
catalogue based on syntactic similarity with the original entities/attributes.
The user then validates the suggested mappings or adjusts them, either manually
or through a keyword-based search of the LOV catalogue. If no semantic counterpart
is found, internal IRIs are created by attaching a base IRI. Example 1 shows a
result of this process, where attributes of the GTFS entity 'Stop'¹³ have
been mapped to existing vocabularies and parent_station was converted to an internal IRI.
Semantic mappings are stored across the cluster in the same MongoDB instance,
together with the extracted schemata.
Example 1 (Property mapping and typing).
Source          Target                                                   Datatype
stop_name       http://xmlns.com/foaf/0.1/name                           string
stop_lat        http://www.w3.org/2003/01/geo/wgs84_pos#lat              double
stop_lon        http://www.w3.org/2003/01/geo/wgs84_pos#long             double
parent_station  http://example.com/sebida/20151215T1708/parent_station   string
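The lookup against the LOV catalogue could be wired up roughly as in the following Python sketch. It is illustrative only: the endpoint URL and the JSON field names are assumptions rather than the API SeBiDA actually calls, and the base IRI reuses the one from Example 1.

# Sketch: suggest candidate property IRIs for a non-semantic attribute via LOV,
# falling back to an internal IRI when no counterpart is found.
import requests

LOV_SEARCH = "https://lov.linkeddata.es/dataset/lov/api/v2/term/search"  # assumed endpoint
BASE_IRI = "http://example.com/sebida/20151215T1708/"                    # user-defined base IRI

def suggest_properties(attribute, limit=5):
    """Return up to `limit` candidate property IRIs for an attribute such as 'stop_name'."""
    resp = requests.get(LOV_SEARCH,
                        params={"q": attribute.replace("_", " "), "type": "property"},
                        timeout=10)
    resp.raise_for_status()
    hits = resp.json().get("results", [])        # response field names are assumed
    return [hit.get("uri", [None])[0] for hit in hits[:limit]]

def lift_attribute(attribute):
    candidates = suggest_properties(attribute)
    # in SeBiDA the user validates or adjusts the suggestion; here we take the top hit
    return candidates[0] if candidates else BASE_IRI + attribute

print(lift_attribute("stop_name"))       # expected: an existing 'name' property IRI
print(lift_attribute("parent_station"))  # likely falls back to the internal IRI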
4.3 Data Loader
This component loads data from the source HIS into the final dataset TD by
generating and populating tables as described below. These procedures are also
realised using Apache Spark. Historically, storing RDF triples in tabular layouts
(e.g., the Jena Property Table¹⁴) has been avoided due to the resulting large
number of null values in wide tables. This concern has largely been reduced
with the emergence of NoSQL databases, e.g., HBase, and of columnar storage
formats on top of HDFS (Hadoop Distributed File System), e.g., Apache Parquet
and ORC files. HDFS is the de facto storage for Big Data applications and is thus
supported by the vast majority of Big Data processing engines.
We use Apache Parquet¹⁵, a column-oriented tabular format, as the storage
technology. One advantage of this kind of storage is schema projection,
whereby only the projected columns are read and returned in response to a query.
Further, columns are stored consecutively on disk; thus, Parquet tables are very
compression-friendly (e.g., via Snappy and LZO), as values in each column are
guaranteed to be of the same type. Parquet also supports composite and nested
columns, i.e., saving multiple values in one column and storing hierarchical data
in one table cell. This is very important since RDF properties are frequently
used to refer to multiple objects for the same instance. State-of-the-art encoding
algorithms are also supported, e.g., bit-packing, run-length, and dictionary
encoding; the latter in particular can be useful to store long string IRIs.
¹¹ https://www.mongodb.org
¹² http://lov.okfn.org/dataset/lov/terms
¹³ developers.google.com/transit/gtfs/reference
¹⁴ http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html
Table Generation. A corresponding table template is created for each derived
class or entity as follows: (A) Following the representation described in Subsection 4.1,
a table with the same label is created for each class (e.g., a table
'Bus' for the RDF class :bus). A default column 'ID' (of type string) is created to
store the triple's subject. For each property describing the class, an additional
column is created, typed according to the property's extracted datatype. (B) For
each entity, a table is created similarly, taking the entity label as the table
name and creating a typed column for each attribute.
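As an illustration of the table generation step, the following is a minimal sketch under the assumption that the extracted schema is available as a property-to-datatype dictionary; the helper names, type mapping, and output path are not taken from SeBiDA.

# Sketch: derive an empty, typed table template for one extracted class and
# persist it as a Parquet table.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

spark = SparkSession.builder.appName("table-generation-sketch").getOrCreate()

XSD_TO_SPARK = {                     # partial XSD-to-Spark type mapping, extend as needed
    "xsd:string": StringType(),
    "xsd:double": DoubleType(),
    "xsd:long": LongType(),
}

def table_template(class_label, properties):
    """properties: dict of property name -> XSD datatype; returns an empty typed DataFrame."""
    fields = [StructField("ID", StringType(), nullable=False)]          # triple subject
    fields += [StructField(prop, XSD_TO_SPARK.get(xsd, StringType()), True)
               for prop, xsd in properties.items()]
    return spark.createDataFrame([], StructType(fields))

bus = table_template("Bus", {"mb:matric": "xsd:string", "mb:stopsBy": "xsd:string"})
bus.write.mode("overwrite").parquet("/data/sebida/Bus")   # hypothetical table path
bus.printSchema()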
Table Population. In this step, each table is populated as follows:
(1) For each extracted RDF class (cf. Subsection 4.1), we iterate through its instances
and insert a new row into the corresponding table for each instance: the
instance IRI is stored in the 'ID' column, whereas the corresponding objects are
saved under the columns representing their properties. For example, the following
semantic description (formatted for clarity):
(dbo:bus, [
(dbn:bus1, [(foaf:name, "OP354"), (dc:type, "mini")]),
(dbn:bus2, [(foaf:name, "OP355"), (dc:type, "larg")])
])
is flattened into a table 'dbo:bus' as follows:
ID         foaf:name  dc:type
dbn:bus1   "OP354"    "mini"
dbn:bus2   "OP355"    "larg"
(2) The population of the tables in the case of non-semantic data varies depending
on its type. For example, for CSV we iterate through each line and save its values
into the corresponding columns of the corresponding table; XPath can be used
to iteratively select the needed nodes in an XML file.
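Continuing the earlier sketch, the instances of one class could be flattened into rows and appended to its Parquet table as sketched below. The data and paths are illustrative; multi-valued properties, which SeBiDA keeps in nested Parquet columns, are collapsed to a single value here.

# Sketch: flatten (subject, [(property, object), ...]) descriptions of the class
# dbo:bus into rows and append them to the corresponding Parquet table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-population-sketch").getOrCreate()

# illustrative output of the schema extractor for the class dbo:bus
instances = [
    ("dbn:bus1", [("foaf:name", "OP354"), ("dc:type", "mini")]),
    ("dbn:bus2", [("foaf:name", "OP355"), ("dc:type", "larg")]),
]
columns = ["foaf:name", "dc:type"]

rows = [tuple([subject] + [dict(po).get(col) for col in columns])
        for subject, po in instances]

df = spark.createDataFrame(rows, ["ID"] + columns)
df.write.mode("append").parquet("/data/sebida/dbo_bus")   # hypothetical table path
df.show()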
¹⁵ https://parquet.apache.org
4.4 Data Server
Data loaded into tables is persistent, so one can access the data and perform
analytical operations by way of ad-hoc queries.
Querying Interface. The current implementation uses SQL queries, as the
internal TD representation corresponds to a tabular structure, and because the
underlying query technology, i.e., Spark, provides only an SQL-like query interface.
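For instance, the Parquet-backed class table populated above could be registered and queried through Spark SQL as sketched below; the table path and the query itself are illustrative, not taken from the benchmark.

# Sketch: register a class table and run an ad-hoc SQL query over it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-server-sketch").getOrCreate()

spark.read.parquet("/data/sebida/dbo_bus").createOrReplaceTempView("bus")
result = spark.sql("SELECT ID, `foaf:name` FROM bus WHERE `dc:type` = 'mini'")
result.show()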
Multi-Format Results. As shown in Figure 3, query results can be returned either as
a set of tables or as an RDF graph. RDFisation is achieved as follows: a triple is
created from each projected column, casting the column name and value to the
triple predicate and object, respectively. If the result includes the ID column,
its value is cast as the triple subject; otherwise, the subject is set to 'base IRI/i',
where the base IRI is defined by the user and i is an incremental integer.
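A compact sketch of this RDFisation rule, assuming query results are available as plain Python dictionaries and the base IRI is supplied by the user as described above:

# Sketch: turn tabular query results into RDF triples following the rule above.
from itertools import count

def rdfise(rows, base_iri="http://example.com/sebida/"):
    """rows: iterable of dicts (column -> value); yields (subject, predicate, object) triples."""
    counter = count(1)
    for row in rows:
        # the ID column value becomes the subject; otherwise mint base_iri/i
        subject = row.get("ID") or base_iri + str(next(counter))
        for column, value in row.items():
            if column == "ID" or value is None:
                continue
            yield (subject, column, value)

rows = [{"ID": "dbn:bus1", "foaf:name": "OP354", "dc:type": "mini"}]
for triple in rdfise(rows):
    print(triple)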
5 Experimental Study
The goal of the study is to evaluate whether SeBiDA¹⁶ meets requirements R1, R2,
and R3 (cf. Section 2). We evaluate whether data management techniques, e.g., data
caching, allow SeBiDA to speed up both data loading and query execution time
when semantic and non-semantic data are combined.
Datasets: The datasets have been created using the Berlin benchmark generator¹⁷.
Table 1 describes them in terms of number of triples and file size. We chose
XML as the non-semantic data input to demonstrate the ability of SeBiDA to ingest
and query semi-structured data (requirement R1). Parquet and Spark are used
to store the XML data and to query the nested data in tables.
Metrics: We measure the loading time of the datasets, the size of the datasets
after loading, as well as the query execution time over the loaded datasets.
Implementation: We ran our experiments on a small-size cluster of three DELL
PowerEdge R815 machines, each with 2x AMD Opteron 6376 (16-core) CPUs,
256 GB RAM, and a 3 TB SATA RAID-5 disk. We cleared the cache before
running each query. To run on warm cache, we executed the same query five
times, dropping the cache just before the first iteration of the query;
thus, data temporarily stored in the cache during the execution of iteration i could
be used in iteration i+1.
¹⁶ https://github.com/EIS-Bonn/SeBiDA
¹⁷ Using the command line: ./generate -fc -pc [scaling factor] -s [file format] -fn [file name], where the file format is nt for RDF data and xml for XML data. More details at: http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/BenchmarkRules/#datagenerator

Table 1: Description of the Berlin Benchmark RDF Datasets.
Dataset    Size (no. of triples)  Type  Scaling Factor
Dataset1   48.9 GB (200M)         RDF   569,600
Dataset2   98.0 GB (400M)         RDF   1,139,200
Dataset3   8.0 GB (100M)          XML   284,800

Table 2: Benchmark of RDF Data Loading. For each dataset, the loading time as well as the obtained data size, together with the compression ratio.
Dataset    Loading Time  New Size  Ratio
Dataset1   1.1h          389MB     1:0.015
Dataset2   2.9h          524MB     1:0.018
Dataset3   0.5h          188MB     1:0.023

Discussion: As can be observed in Table 2, loading takes almost three hours for
the largest dataset. This is because one step of the algorithm involves sending
part of the data back from the workers to the master, which incurs significant
network transfer (the collect function in Spark). This can, however, be overcome
by saving the data to a distributed database, e.g., Cassandra, which we intend to
implement in the next version of the system. On the other hand, we achieved a huge gain
in terms of disk space, which is expected from the adopted data model, which
avoids the repetition of data (in the case of RDF data), and from the adopted file
format, i.e., Parquet, which achieves high compression rates (cf. Subsection 4.3).
Tables 3 and 4 report on the results of executing the 12 Berlin Benchmark
queries against Dataset1 and Dataset2 in two ways: first, each dataset alone and,
second, the dataset combined with Dataset3 (using UNION in the query). Queries
are run in cold cache and warm cache. We notice that caching can improve
query performance significantly in the case of large hybrid data (entries highlighted
in bold). Among the 12 queries, the most expensive are Q7 and Q9: Q7
scans a large number of tables (five), while Q9 produces a large number
of intermediate results. These results suggest that SeBiDA is able to scale to
large hybrid data without deteriorating query performance. Further, these results
provide evidence of the benefits of loading intermediate query results into the cache.
Centralized vs. Distributed Triple Stores. SeBiDA can be regarded as a
distributed triple store, since it can load and query RDF triples separately. We
therefore compare it against one of the fastest centralized triple stores,
RDF-3X¹⁸. Comparative results can be found in Table 5 and Table 6.
Discussion. Table 5 shows that RDF-3X loaded Dataset1 within 93 min, compared
to 66 min for SeBiDA, while it timed out loading Dataset2: we set a timeout
of 12 hours, but RDF-3X was still running after more than 24 hours, before we
terminated it manually. Table 6 shows no definitive winner, but it suggests that
SeBiDA does not exceed a threshold of 20 s in any query, while RDF-3X exceeds
it in four queries, even reaching the order of minutes. We do not report query
times for Dataset2 with RDF-3X because of the prohibitive time of loading it.
¹⁸ https://github.com/gh-rdf3x/gh-rdf3x
Table 3: Benchmark Query Execution Times (secs.) in Cold and Warm Caches. Significant differences are highlighted in bold.
Dataset1 - Only Semantic Data (RDF)
            Q1    Q2    Q3    Q4    Q5    Q6    Q7     Q8    Q9     Q10   Q11    Q12    Geom. Mean
Cold Cache  3.00  2.20  1.00  4.00  3.00  0.78  11.30  6.00  16.00  7.00  11.07  11.00  4.45
Warm Cache  1.00  1.10  1.00  2.00  3.00  0.58  6.10   5.00  14.00  6.00  10.04  9.30   3.14
Dataset1 ∪ Dataset3 - RDF & XML Data
            Q1    Q2    Q3    Q4    Q5    Q6    Q7     Q8    Q9     Q10   Q11    Q12    Geom. Mean
Cold Cache  3.00  2.94  2.00  5.00  3.00  0.90  11.10  7.00  25.20  8.00  11.00  11.50  5.28
Warm Cache  2.00  1.10  1.00  5.00  3.00  1.78  8.10   6.00  20.94  7.00  11.00  9.10   4.03

Table 4: Benchmark Query Execution Times (secs.) in Cold and Warm Caches. Significant differences are highlighted in bold.
Dataset2 - Only Semantic Data (RDF)
            Q1    Q2    Q3    Q4    Q5    Q6    Q7     Q8    Q9     Q10   Q11    Q12    Geom. Mean
Cold Cache  5.00  3.20  3.00  8.00  3.00  1.10  20.00  7.00  18.00  7.00  13.00  11.40  6.21
Warm Cache  4.00  3.10  2.00  7.00  3.00  1.10  18.10  6.00  17.00  6.00  12.04  11.20  5.55
Dataset2 ∪ Dataset3 - RDF & XML Data
            Q1     Q2    Q3    Q4     Q5    Q6    Q7     Q8     Q9     Q10    Q11    Q12    Geom. Mean
Cold Cache  11.00  3.20  7.20  17.00  3.00  1.10  23.10  16.00  20.72  10.00  14.10  13.20  8.75
Warm Cache  4.00   3.20  2.00  8.00   3.00  1.10  21.20  8.00   18.59  7.00   12.10  11.10  5.96
6 Related Work
There have been many works on the combination of Big Data and Semantic
technologies. They can be classified into two categories: MapReduce-only-based
and non-MapReduce-based. In the first category, only the Hadoop framework is
used, for storage (HDFS) and for processing (MapReduce); in the second, other
storage and/or processing solutions are used. The major limitation of the
MapReduce framework is the overhead caused by materializing data between Map and
Reduce, and between two subsequent jobs. Thus, works in the first category,
e.g., [1,8,12], try to minimize the number of join operations, to maximize the
number of joins executed in the Map phase, or, additionally, to use indexing
techniques for triple lookup. To cope with this overhead, works in the second
category suggest storing RDF triples in NoSQL databases (e.g., HBase and
Accumulo) instead, where a variety of physical representations, join patterns,
and partitioning schemes is proposed.
Table 5: Loading Time of RDF-3X.
Dataset    Loading Time  New Size  Ratio
Dataset1   93min         21GB      1:2.4
Dataset2   Timed out     -         -

Table 6: SeBiDA vs. RDF-3X Query Execution Times (secs.) in Cold Cache on Dataset1. Significant differences are highlighted in bold.
         Q1    Q2    Q3      Q4     Q5       Q6    Q7     Q8      Q9     Q10    Q11    Q12
SeBiDA   3.00  2.20  1.00    4.00   3.00     0.78  11.30  6.00    16.00  7.00   11.07  11.00
RDF-3X   0.01  1.10  29.213  0.145  1175.98  2.68  77.80  610.81  0.23   0.419  0.13   1.58
For processing, either MapReduce is used on top (e.g., [9,11]), or the internal
operations of the NoSQL store are utilized in conjunction with triple stores
(e.g., [6,10]). Basically, these works suggest storing triples in three-column
tables, called triple tables, using column-oriented stores. The latter offer
enhanced compression performance and efficient distributed indexes. Nevertheless,
using a fixed three-column table still entails a significant overhead because of the
numerous join operations required to answer most of the queries. A few approaches
do not fall into either of the two categories. In [7], the authors focus on
providing real-time RDF querying, combining both live and historical data; instead
of plain RDF, the data is stored in the binary, space-efficient RDF/HDT format.
Although the approach is called Big Semantic Data and is compared against the
so-called Lambda Architecture, nothing is said about its scalability when the
storage and querying of the data exceed single-machine capacities. In [13], RDF
data is loaded into property tables using Parquet, and Impala, a distributed SQL
query engine, is used to query those tables; a query compiler from SPARQL to SQL
is devised. Our approach is similar in that we store data in property tables using
Parquet; however, we do not store all RDF data in a single table but rather create
a table for each detected RDF class. For a more comprehensive survey we refer to [5].
In all the presented works, storage and querying are optimized for RDF data only.
We, on the other hand, aim not only to optimize the storage and querying of RDF
data, but also to make the same underlying storage and query engine available
for non-RDF data, both structured and semi-structured. Therefore, our work is
the first to propose a blueprint for an end-to-end semantified Big Data architecture
and to realize it with a framework that supports semantic data integration,
storage, and exposure alongside non-semantic data.
7 Conclusion and Future Work
The current version of semantic data loading does not consider an attached
schema; rather, it extracts the schema entirely from the data instances, an effort
that grows with the data size and thus puts a burden on the data integration
process. Currently, for instances of multiple classes, the selected class is the
last one in lexical order. In the future, we will consider the schema, even if
incomplete, in order to select the most specific class instead. Additionally, as
semantic data is currently stored in isolation from other data, we could use
a more natural language, such as SPARQL, to query RDF data only. Thus, we
envision building a SPARQL-to-SQL converter for this purpose. Such converters
already exist, but due to the particularity of our storage model, which imposes
that instances of multiple types be stored in a single table with references to the
other types (i.e., other tables), a revised version is required. This effort
is supported by and contributes to the H2020 BigDataEurope project.
References
1. Jin-Hang Du, Hao-Fen Wang, Yuan Ni, and Yong Yu. Hadooprdf: A scalable se-
mantic data analytical engine. In Intelligent Computing Theories and Applications,
pages 633–641. Springer, 2012.
2. Craig Franke, Samuel Morin, Artem Chebotko, John Abraham, and Pearl Brazier.
Distributed semantic web data management in hbase and mysql cluster. In Cloud
Computing (CLOUD), 2011, pages 105–112. IEEE, 2011.
3. Doug Laney (Gartner). 3-D data management: Controlling data volume, velocity
and variety. 6 February 2001.
4. Aidan Hogan. Skolemising blank nodes while preserving isomorphism. In 24th Int.
Conf. on World Wide Web, WWW 2015, 2015.
5. Zoi Kaoudi and Ioana Manolescu. Rdf in the clouds: A survey. The VLDB Journal,
24(1):67–91, 2015.
6. Vaibhav Khadilkar, Murat Kantarcioglu, Bhavani Thuraisingham, and Paolo
Castagna. Jena-hbase: A distributed, scalable and efficient rdf triple store. In
11th Int. Semantic Web Conf. Posters & Demos, ISWC-PD, 2012.
7. Miguel A. Martínez-Prieto, Carlos E. Cuesta, Mario Arias, and Javier D. Fernández.
The solid architecture for real-time management of big semantic data. Future
Generation Computer Systems, 47:62–79, 2015.
8. Zhi Nie, Fang Du, Yueguo Chen, Xiaoyong Du, and Linhao Xu. Efficient sparql
query processing in mapreduce through data partitioning and indexing. In Web
Technologies and Applications, pages 628–635. Springer, 2012.
9. Nikolaos Papailiou, Ioannis Konstantinou, Dimitrios Tsoumakos, Panagiotis Kar-
ras, and Nectarios Koziris. H2rdf+: High-performance distributed joins over large-
scale rdf graphs. In BigData Conference. IEEE, 2013.
10. Roshan Punnoose, Adina Crainiceanu, and David Rapp. Rya: a scalable rdf triple
store for the clouds. In Proceedings of the 1st International Workshop on Cloud
Intelligence, page 4. ACM, 2012.
11. Alexander Schätzle, Martin Przyjaciel-Zablocki, Christopher Dorner, Thomas Hor-
nung, and Georg Lausen. Cascading map-side joins over hbase for scalable join
processing. SSWS+ HPCSW, page 59, 2012.
12. Alexander Schätzle, Martin Przyjaciel-Zablocki, Thomas Hornung, and Georg
Lausen. Pigsparql: A sparql query processing baseline for big data. In Inter-
national Semantic Web Conference (Posters & Demos), pages 241–244, 2013.
13. Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, and Georg Lausen.
Sempala: Interactive sparql query processing on hadoop. In The Semantic Web–
ISWC 2014, pages 164–179. Springer, 2014.
14. Jianling Sun and Qiang Jin. Scalable rdf store based on hbase and mapreduce. In
3rd Int. Conf. on Advanced Computer Theory and Engineering. IEEE, 2010.
... Rule-based (Joshi et al. 2013;Meier 2008;Pichler et al. 2010) and binary (Álvarez-García et al. 2011;Bok et al. 2019;Fernȧndez et al. 2013;Pan et al. 2014) compression techniques for RDF data effectively reduce the size of the data. Distributed and parallel processing frameworks for Big Data are exploited in several approaches Khadilkar et al. 2012;Mami et al. 2016;Nie et al. 2012;Papailiou et al. 2013;Punnoose et al. 2012;Schätzle et al. 2012Schätzle et al. , 2013. Moreover, column-oriented stores (Idreos et al. 2012;MacNicol and French 2004;Stonebraker et al. 2005;Zukowski et al. 2006) apply column-wise compression techniques, and improve query performance by projecting the required columns. ...
... Semantic Web and Big Data communities have been working for better storage and processing of large datasets. RDF compression techniques (Álvarez-García et al. 2011;Bok et al. 2019;Fernȧndez et al. 2013;Meier 2008;Pan et al. 2014;Pichler et al. 2010) are devised, as well as, Big Data tools are exploited in Du et al. (2012), Khadilkar et al. (2012), Mami et al. (2016), Nie et al. (2012), Papailiou et al. (2013), Punnoose et al. (2012), and Schätzle et al. (2013) to efficiently process RDF data. Furthermore, column-oriented stores (Idreos et al. 2012;MacNicol and French 2004;Stonebraker et al. 2005;Zukowski et al. 2006) exploit fully decomposed storage model (Copeland and Khoshafian 1985) to scale-up to large datasets, and data factorization based query optimization techniques are proposed in Bakibayev et al. (2013). ...
... Relational representations of RDF data over big data storage technologies, i.e., Parquet and MongoDB, are presented by Mami et al. (2016), where a table for each RDF class is created, representing class properties as attributes. Du et al. (2012) combine Hadoop framework and an RDF triple store, Sesame, to achieve scalable RDF data analysis. ...
Article
Full-text available
Nowadays, there is a rapid increase in the number of sensor data generated by a wide variety of sensors and devices. Data semantics facilitate information exchange, adaptability, and interoperability among several sensors and devices. Sensor data and their meaning can be described using ontologies, e.g., the Semantic Sensor Network (SSN) Ontology. Notwithstanding, semantically enriched, the size of semantic sensor data is substantially larger than raw sensor data. Moreover, some measurement values can be observed by sensors several times, and a huge number of repeated facts about sensor data can be produced. We propose a compact or factorized representation of semantic sensor data, where repeated measurement values are described only once. Furthermore, these compact representations are able to enhance the storage and processing of semantic sensor data. To scale up to large datasets, factorization based, tabular representations are exploited to store and manage factorized semantic sensor data using Big Data technologies. We empirically study the effectiveness of a semantic sensor’s proposed compact representations and their impact on query processing. Additionally, we evaluate the effects of storing the proposed representations on diverse RDF implementations. Results suggest that the proposed compact representations empower the storage and query processing of sensor data over diverse RDF implementations, and up to two orders of magnitude can reduce query execution time.
... Rule-based (Joshi et al., 2013;Meier, 2008;Pichler et al., 2010) and binary (Álvarez-García et al., 2011;Bok et al., 2019;Fernández et al., 2013;Pan et al., 2014) compression techniques for RDF data effectively reduce the size of the data. Distributed and parallel processing frameworks for Big Data are exploited in several approaches Khadilkar et al., 2012;Mami et al., 2016;Nie et al., 2012;Papailiou et al., 2013;Punnoose et al., 2012;Schätzle et al., 2012;Schätzle et al., 2013). Moreover, column-oriented stores (Idreos et al., 2012;MacNicol and French, 2004;Stonebraker et al., 2005;Zukowski et al., 2006) apply column-wise compression techniques, and improve query performance by projecting the required columns. ...
... RDF compression techniques (Álvarez-García et al., 2011;Bok et al., 2019;Fernández et al., 2013;Meier, 2008;Pan et al., 2014;Pichler et al., 2010) are devised, as well as, Big Data tools are exploited in Khadilkar et al., 2012;Mami et al., 2016;Nie et al., 2012;Papailiou et al., 2013;Punnoose et al., 2012;Schätzle et al., 2013) to efficiently process RDF data. Furthermore, column-oriented stores (Idreos et al., 2012;MacNicol and French, 2004;Stonebraker et al., 2005;Zukowski et al., 2006) exploit fully decomposed storage model (Copeland and Khoshafian, 1985) to scale-up to large datasets, and data factorization based query optimization techniques are proposed in (Bakibayev et al., 2013). ...
... Relational representations of RDF data over big data storage technologies, i.e., Parquet and MongoDB, are presented by Mami et al. (Mami et al., 2016), where a table for each RDF class is created, representing class properties as attributes. Du et al. (Du et al., 2012) combine Hadoop framework and an RDF triple store, Sesame, to achieve scalable RDF data analysis. ...
Preprint
Full-text available
Nowadays, there is a rapid increase in the number of sensor data generated by a wide variety of sensors and devices. Data semantics facilitate information exchange, adaptability, and interoperability among several sensors and devices. Sensor data and their meaning can be described using ontologies, e.g., the Semantic Sensor Network (SSN) Ontology. Notwithstanding, semantically enriched, the size of semantic sensor data is substantially larger than raw sensor data. Moreover, some measurement values can be observed by sensors several times, and a huge number of repeated facts about sensor data can be produced. We propose a compact or factorized representation of semantic sensor data, where repeated measurement values are described only once. Furthermore, these compact representations are able to enhance the storage and processing of semantic sensor data. To scale up to large datasets, factorization based, tabular representations are exploited to store and manage factorized semantic sensor data using Big Data technologies. We empirically study the effectiveness of a semantic sensor's proposed compact representations and their impact on query processing. Additionally, we evaluate the effects of storing the proposed representations on diverse RDF implementations. Results suggest that the proposed compact representations empower the storage and query processing of sensor data over diverse RDF implementations, and up to two orders of magnitude can reduce query execution time.
... Moreover, the W3C has proposed during years a number of standards to implement this vision as RDF, OWL, SPARQL and promoted the use of ontologies to give a formal, common and shared view of data. The union of the Big Data paradigm and the Semantic Web vision is a new and interesting research area called Semantic Big Data [32,35]. The combination of these approaches will define more efficient techniques for storing, organizing and analyzing the huge amount of available information, facilitating the work of data scientists reducing, for example, the information overload and redundancy of data. ...
... While much attention has been devoted to the Volume and Velocity dimensions of Big Data, a systematic support for managing the Variety of data, is only emerging in recent years. In [35], this process is called "Semantification" of Big Data. In this paper the authors present an approach for managing hybrid Big Data using RDF data model, i.e. non-semantic Big Data is semantically enriched by using RDF vocabularies. ...
Chapter
Full-text available
The use of formal representations has a basic importance in the era of big data. This need is more evident in the context of multimedia big data due to the intrinsic complexity of this type of data. Furthermore, the relationships between objects should be clearly expressed and formalized to give the right meaning to the correlation of data. For this reason the design of formal models to represent and manage information is a necessary task to implement intelligent information systems. Approaches based on the semantic web need to improve the data models that are the basis for implementing big data applications. Using these models, data and information visualization becomes an intrinsic and strategic task for the analysis and exploration of multimedia Big Data. In this article we propose the use of a semantic approach to formalize the structure of a multimedia Big Data model. Moreover, the identification of multimodal features to represent concepts and linguistic-semantic properties to relate them is an effective way to bridge the gap between target semantic classes and low-level multimedia descriptors. The proposed model has been implemented in a NoSQL graph database populated by different knowledge sources. We explore a visualization strategy of this large knowledge base and we present and discuss a case study for sharing information represented by our model according to a peer-to-peer(P2P) architecture. In this digital ecosystem, agents (e.g. machines, intelligent systems, robots,.. .) act like interconnected peers exchanging and delivering knowledge with each other.
... Computers have been applied in many areas, and big data technology has realized vigorous development in recent years [10,11]. Network interconnection, the application of smart products, etc., are generating many data all the time. ...
Article
Full-text available
Rural revitalization, as a significant element of the national economy, has become a hot topic at present. The development of digital economies makes big data play become more significant for the national economy. To accelerate the process of rural revitalization, using up big data is vital. To mine the relationship between factors, such as cultural creativity, industrial scale, management methods, and rural revitalization from big data, this study adopts a neural network method. Based on the proposed neural network, a scheme to analyze the relationship between cultural creativity, industrial scale, management methods and the level of rural revitalization is presented. Through case studies, the effectiveness of analyzing the influence of cultural creativity, industrial scale, and management methods on rural revitalization based on big data is demonstrated. Moreover, the results validate that the proposed neural network has good prediction accuracy, which indicates that it is reliable to use neural networks to analyze the relationship between the impacting factors and the rural revitalization level with the assistance of big data.
... The semantic approach to data governance has risen to solve the problems associated with the management of great volume and their variety. There are several references about the ''Semantification'' of big data Technology, like those introduced in [24], [25], [26]. Furthermore, an Ontology-Based Data Management (OBDM) was created to access and use data by means of ontologies [27]. ...
Article
Full-text available
Nowadays, companies and official bodies are using the data as a principal asset to take strategic decisions. The advances in big data processing, storage and analysis techniques have allowed to manage the continuous increase in the volume of data. This increase in the volume of data together with its high variability and the large number of sources lead to a constant growing of the complexity of the data management environment. Data governance is the key for simplifying that complexity: it is the element that controls the decision making and responsibilities for all the processes related to data management. This paper discusses an approach to data governance based on ontological reasoning to reduce data management complexity. The proposed data governance system is built over an autonomous system based on distributed components. It implements semantic techniques and automatic ontology-based reasoning. The different components use a Shared Knowledge Plane to interact. Its fundamental piece is an ontology that represents all the data management processes included in data governance. A prototype of such a system has been implemented and tested for Telefonica’s global video service. The results obtained show the feasibility of using this type of technology to reduce the complexity of managing big data environments.
... A Semantified Big Data Architecture, SBDA [183], allows for ingesting and processing heterogeneous data on a large scale. In the literature, there has been a separate focus on achieving efficient ingestion and querying of large RDF data and for other structured types of data. ...
Thesis
Full-text available
The remarkable advances achieved in both research and development of Data Management as well as the prevalence of high-speed Internet and technology in the last few decades have caused unprecedented data avalanche. Large volumes of data manifested in a multitude of types and formats are being generated and becoming the new norm. In this context, it is crucial to both leverage existing approaches and propose novel ones to overcome this data size and complexity, and thus facilitate data exploitation. In this thesis, we investigate two major approaches to addressing this challenge: Physical Data Integration and Logical Data Integration. The specific problem tackled is to enable querying large and heterogeneous data sources in an ad hoc manner. In the Physical Data Integration, data is physically and wholly transformed into a canonical unique format, which can then be directly and uniformly queried. In the Logical Data Integration, data remains in its original format and form and a middleware is posed above the data allowing to map various schemata elements to a high-level unifying formal model. The latter enables the querying of the underlying original data in an ad hoc and uniform way, a framework which we call Semantic Data Lake, SDL. Both approaches have their advantages and disadvantages. For example, in the former, a significant effort and cost are devoted to pre-processing and transforming the data to the unified canonical format. In the latter, the cost is shifted to the query processing phases, e.g., query analysis, relevant source detection and results reconciliation. In this thesis we investigate both directions and study their strengths and weaknesses. For each direction, we propose a set of approaches and demonstrate their feasibility via a proposed implementation. In both directions, we appeal to Semantic Web technologies, which provide a set of time-proven techniques and standards that are dedicated to Data Integration. In the Physical Integration, we suggest an end-to-end blueprint for the semantification of large and heterogeneous data sources, i.e., physically transforming the data to the Semantic Web data standard RDF (Resource Description Framework). A unified data representation, storage and query interface over the data are suggested. In the Logical Integration, we provide a description of the SDL architecture, which allows querying data sources right on their original form and format without requiring a prior transformation and centralization. For a number of reasons that we detail, we put more emphasis on the virtual approach. We present the effort behind an extensible implementation of the SDL, called Squerall, which leverages state-of-the-art Semantic and Big Data technologies, e.g., RML (RDF Mapping Language) mappings, FnO (Function Ontology) ontology, and Apache Spark. A series of evaluation is conducted to evaluate the implementation along with various metrics and input data scales. In particular, we describe an industrial real-world use case using our SDL implementation. In a preparation phase, we conduct a survey for the Query Translation methods in order to back some of our design choices.
Chapter
Safety on construction sites is the most important aspect that a company should guarantee to its employees. To reduce the risk, it is necessary to analyze HIgh POtential (HIPO) hazards. In this study, we focus on Fall From Height (FFH) risk which is one of the main causes of worker fatalities. In order to improve the prevention plan, artificial intelligence (AI) can help to determine the causes and the safety actions from FFH historical data. This paper aims to: (i) develop and populate an ontology for FFH with the help of domain experts, and (ii) analyze and extract key information from a HIPO database through Natural Language Processing (NLP). Experimental results are conducted in order to evaluate the proposed approach.
Chapter
Big data is data that cannot be handled with the casual methods and tools used at a time due to its excessive volume, velocity, or variety. In the decision-making process, it is important to understand the value-time curve, which characterizes the diminishing value of the data over time. The more data the company collects, the greater the technical challenges of processing it. As a result, not all data is processed, leading to the phenomenon of knowledge gap. It is important to understand that collecting all the data does not mean that our knowledge about the subject matter is complete and correct. You can learn from the chapter why big data poses challenges not only from a technical point of view.
Article
Full-text available
The work presented in this paper is motivated by the acknowledgement that a complete and updated systematic literature review (SLR) that consolidates all the research efforts for Big Data modeling and management is missing. This study answers three research questions. The first question is how the number of published papers about Big Data modeling and management has evolved over time. The second question is whether the research is focused on semi-structured and/or unstructured data and what techniques are applied. Finally, the third question determines what trends and gaps exist according to three key concepts: the data source, the modeling and the database. As result, 36 studies, collected from the most important scientific digital libraries and covering the period between 2010 and 2019, were deemed relevant. Moreover, we present a complete bibliometric analysis in order to provide detailed information about the authors and the publication data in a single document. This SLR reveal very interesting facts. For instance, Entity Relationship and document-oriented are the most researched models at the conceptual and logical abstraction level respectively and MongoDB is the most frequent implementation at the physical. Furthermore, 2.78% studies have proposed approaches oriented to hybrid databases with a real case for structured, semi-structured and unstructured data.
Conference Paper
Full-text available
In this paper we discuss PigSPARQL, a competitive yet easy to use SPARQL query processing system on MapReduce that allows ad-hoc SPARQL query processing on large RDF graphs out of the box. Instead of a direct mapping, PigSPARQL uses the query language of Pig, a data analysis platform on top of Hadoop MapReduce, as an inter-mediate layer between SPARQL and MapReduce. This additional level of abstraction makes our approach independent of the actual Hadoop ver-sion and thus ensures the compatibility to future changes of the Hadoop framework as they will be covered by the underlying Pig layer. We re-visit PigSPARQL and demonstrate the performance improvement when simply switching the underlying version of Pig from 0.5.0 to 0.11.0 with-out any changes to PigSPARQL itself. Because of this sustainability, PigSPARQL is an attractive long-term baseline for comparing various MapReduce based SPARQL implementations which is also underpinned by its competitiveness with existing systems, e.g. HadoopRDF.
Conference Paper
Full-text available
Driven by initiatives like Schema.org, the amount of semantically annotated data is expected to grow steadily towards massive scale, requiring cluster-based solutions to query it. At the same time, Hadoop has become dominant in the area of Big Data processing with large infrastructures being already deployed and used in manifold application fields. For Hadoop-based applications, a common data pool (HDFS) provides many synergy benefits, making it very attractive to use these infrastructures for semantic data processing as well. Indeed, existing SPARQL-on- Hadoop (MapReduce) approaches have already demonstrated very good scalability, however, query runtimes are rather slow due to the underlying batch processing framework. While this is acceptable for data-intensive queries, it is not satisfactory for the majority of SPARQL queries that are typically much more selective requiring only small subsets of the data. In this paper, we present Sempala, a SPARQL-over-SQL-on-Hadoop approach designed with selective queries in mind. Our evaluation shows performance improvements by an order of magnitude compared to existing approaches, paving the way for interactive-time SPARQL query processing on Hadoop.
Article
Big Semantic Data management has become a critical task in many application systems, which usually rely on heavyweight batch processes to manage such large amounts of data. However, batch architectures are not an adequate choice for designing real-time systems in which data updates and reads must be satisfied with very low latency. Thus, gathering and consuming high volumes of data at high velocities is an emerging challenge which we specifically address in the scope of innovative scenarios based on semantic data (RDF) management. The Linked Open Data initiative or emergent projects in the Internet of Things are examples of such scenarios. This paper describes a new architecture (referred to as SOLID) which separates the complexities of Big Semantic Data storage and indexing from real-time data acquisition and consumption. This decision relies on the use of two optimized datastores which respectively store historical (big) data and run-time data. It ensures efficient volume management and high processing velocity, but adds the need to coordinate both datastores. SOLID proposes a 3-tiered architecture in which each responsibility is specifically addressed. Besides its theoretical description, we also propose and evaluate a SOLID prototype built on top of binary RDF and state-of-the-art triplestores. Our experimental results show that SOLID achieves large savings in data storage (it uses up to 5 times less space than the compared triplestores), while providing efficient SPARQL resolution over the Big Semantic Data (in the order of 10-20 milliseconds for the studied queries). These experiments also show that SOLID ensures low-latency operations because the data effectively managed in real-time remains small and thus does not suffer from Big Semantic Data issues.
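The separation between a large historical store and a small run-time store, with a layer that queries both and periodically consolidates, can be sketched roughly as follows. The class and method names are invented for illustration; the actual SOLID prototype relies on compressed binary RDF and production triplestores rather than in-memory sets.

```python
# Rough sketch of the SOLID-style split: a big, read-mostly historical store
# plus a small, mutable run-time store, with a merge layer over both.
# Data structures and names are illustrative assumptions only.

class MergeLayer:
    def __init__(self, historical, runtime):
        self.historical = historical      # large, batch-indexed set of triples
        self.runtime = runtime            # small set of recently acquired triples

    def insert(self, triple):
        self.runtime.add(triple)          # writes only touch the run-time store

    def match(self, s=None, p=None, o=None):
        def ok(t):
            return all(q is None or q == v for q, v in zip((s, p, o), t))
        # Answer = union over both stores; the run-time side stays small,
        # which keeps read/write latency low.
        return [t for t in self.historical if ok(t)] + \
               [t for t in self.runtime if ok(t)]

    def consolidate(self):
        # Periodic batch step: fold run-time data into the historical store.
        self.historical |= self.runtime
        self.runtime = set()

layer = MergeLayer(historical={("ex:s1", "ex:p", "ex:o1")}, runtime=set())
layer.insert(("ex:s2", "ex:p", "ex:o2"))
print(layer.match(p="ex:p"))
```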
Conference Paper
The proliferation of data in RDF format calls for efficient and scalable solutions for their management. While scalability in the era of big data is a hard requirement, modern systems fail to adapt based on the complexity of the query. Current approaches do not scale well when faced with substantially complex, non-selective joins, resulting in exponential growth of execution times. In this work we present H2RDF+, an RDF store that efficiently performs distributed Merge and Sort-Merge joins over a multiple index scheme. H2RDF+ is highly scalable, utilizing distributed MapReduce processing and HBase indexes. Utilizing aggressive byte-level compression and result grouping over fast scans, it can process both complex and selective join queries in a highly efficient manner. Furthermore, it adaptively chooses either single- or multi-machine execution based on the join complexity estimated through index statistics. Our extensive evaluation demonstrates that H2RDF+ answers non-selective joins an order of magnitude faster than both current state-of-the-art distributed and centralized stores, while being only tenths of a second slower on simple queries, scaling linearly with the amount of available resources.
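Since both join inputs arrive sorted on the join key when they are produced by scans over sorted indexes, the join itself can be a single linear merge pass. The following is a generic, self-contained sort-merge join sketch in that spirit; it is an illustration of the technique, not H2RDF+ code.

```python
# Toy sort-merge join: both inputs are sorted on the join key (as index scans
# would deliver them), so the join is one linear pass with no hashing.

def sort_merge_join(left, right):
    """left, right: lists of (key, value) tuples, each sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of all pairs sharing this key.
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                j2 = j
                while j2 < len(right) and right[j2][0] == lk:
                    out.append((lk, left[i2][1], right[j2][1]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out

left = [("s1", "labelA"), ("s2", "labelB")]
right = [("s1", "feat10"), ("s3", "feat11")]
print(sort_merge_join(left, right))   # [('s1', 'labelA', 'feat10')]
```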
Conference Paper
In this paper, we propose and evaluate a scheme to produce canonical labels for blank nodes in RDF graphs. These labels can be used as the basis for a Skolemisation scheme that gets rid of the blank nodes in an RDF graph by mapping them to globally canonical IRIs. Assuming no hash collisions, the scheme guarantees that two Skolemised graphs will be equal if and only if the two input graphs are isomorphic. Although the proposed scheme is exponential in the worst case, we claim that such cases are unlikely to be encountered in practice. To support these claims, we present the results of applying our Skolemisation scheme over a diverse collection of 43.5 million real-world RDF graphs (BTC-2014); we also provide results for some nasty synthetic cases.
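A deliberately naive, hypothetical version of hash-based Skolemisation is sketched below: each blank node is labelled by hashing the triples it appears in and is then replaced by a deterministic IRI. The actual scheme iterates the hashing and handles automorphic and symmetric graphs to guarantee canonical labels under isomorphism; this one-pass sketch does not, and the base IRI is an assumption.

```python
# Naive sketch of hash-based Skolemisation: label each blank node by hashing
# its surrounding triples, then replace it with a deterministic IRI.
# One-pass only; it does NOT guarantee canonical labels for symmetric graphs.

import hashlib

def skolemise(triples, base="https://example.org/.well-known/genid/"):
    def is_bnode(term):
        return term.startswith("_:")

    labels = {}
    for node in {t for triple in triples for t in triple if is_bnode(t)}:
        # Hash the node's context: sorted triples it occurs in, with the node
        # itself replaced by a placeholder so the label is position-independent.
        context = sorted(
            tuple("{}" if x == node else x for x in triple)
            for triple in triples if node in triple
        )
        digest = hashlib.sha256(repr(context).encode()).hexdigest()[:16]
        labels[node] = base + digest

    return [tuple(labels.get(x, x) for x in triple) for triple in triples]

g = [("_:b0", "foaf:knows", "ex:alice"), ("_:b0", "foaf:name", '"Bob"')]
print(skolemise(g))
```

Two Skolemised graphs produced this way can be compared for equality with a plain set comparison, which is the practical benefit the canonical labelling aims at.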
Article
The Resource Description Framework (RDF) pioneered by the W3C is increasingly being adopted to model data in a variety of scenarios, in particular data to be published or exchanged on the Web. Managing large volumes of RDF data is challenging, due to the sheer size, the heterogeneity, and the further complexity brought by RDF reasoning. To tackle the size challenge, distributed storage architectures are required. Cloud computing is an emerging paradigm massively adopted in many applications for the scalability, fault-tolerance, and elasticity features it provides, enabling the easy deployment of distributed and parallel architectures. In this article, we survey RDF data management architectures and systems designed for a cloud environment, and more generally, those large-scale RDF data management systems that can be easily deployed therein. We first give the necessary background, then describe the existing systems and proposals in this area, and classify them according to dimensions related to their capabilities and implementation techniques. The survey ends with a discussion of open problems and perspectives.
Conference Paper
With the rapid growth of the scale of semantic data, analyzing this large-scale data has become a hot topic. Traditional triple stores deployed on a single machine have proved effective for the storage and retrieval of RDF data. However, their scalability is limited and they cannot handle billions of ever-growing triples. On the other hand, Hadoop is an open-source project which provides HDFS as a distributed file storage system and MapReduce as a computing framework for distributed processing, and it has proved to perform well for large-scale data analysis. In this paper, we propose HadoopRDF, a system that combines both worlds (triple stores and Hadoop) to provide a scalable data analysis service for RDF data. It benefits from the scalability of Hadoop and from the ability of traditional triple stores to support flexible analytical queries such as SPARQL. Experimental evaluation results show the effectiveness and efficiency of the approach.
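To illustrate the analytical side of such a hybrid setup, the following toy map/reduce pass counts triples per predicate over an N-Triples-like file, the kind of bulk aggregation that would run on Hadoop while a triple store serves selective SPARQL lookups. Plain Python stands in for MapReduce here; nothing in this sketch is HadoopRDF code.

```python
# Toy map/reduce over RDF triples: count occurrences per predicate.
# Plain Python functions stand in for the MapReduce map and reduce phases.

from collections import defaultdict

def map_phase(lines):
    # Map: emit one (predicate, 1) pair per line of the form "s p o .".
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            yield parts[1], 1

def reduce_phase(pairs):
    # Reduce: sum the counts per predicate.
    counts = defaultdict(int)
    for pred, n in pairs:
        counts[pred] += n
    return dict(counts)

triples = [
    '<ex:p1> <rdfs:label> "Product 1" .',
    '<ex:p1> <bsbm:productFeature> <bsbm:Feature10> .',
    '<ex:p2> <rdfs:label> "Product 2" .',
]
print(reduce_phase(map_phase(triples)))   # predicate frequencies
```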
Conference Paper
Processing SPARQL queries on a single node is obviously not scalable, considering the rapid growth of RDF knowledge bases. This calls for scalable solutions for SPARQL query processing over Web-scale RDF data. There have been attempts to apply SPARQL query processing techniques in MapReduce environments. However, no study has been conducted on finding optimal partitioning and indexing schemes for distributing RDF data in MapReduce. In this paper, we investigate RDF data partitioning techniques that provide effective indexing schemes to support efficient SPARQL query processing in MapReduce. Our extensive experiments over a huge real-life RDF dataset demonstrate the performance of the proposed partitioning and indexing schemes.
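As a generic illustration of why partitioning matters, the sketch below vertically partitions triples by predicate so that a triple pattern with a bound predicate needs to scan only one partition. This is a common partitioning strategy used for illustration here; the specific schemes evaluated in the paper may differ.

```python
# Toy vertical (predicate-based) partitioning of RDF triples: one partition
# per predicate, so a pattern with a bound predicate scans a single partition.
# Generic illustration only; not the paper's exact partitioning scheme.

from collections import defaultdict

def partition_by_predicate(triples):
    partitions = defaultdict(list)
    for s, p, o in triples:
        partitions[p].append((s, o))   # predicate is implicit in the partition
    return dict(partitions)

def lookup(partitions, p, s=None, o=None):
    # With the predicate bound, only one partition needs to be scanned.
    return [(s2, p, o2) for s2, o2 in partitions.get(p, [])
            if (s is None or s == s2) and (o is None or o == o2)]

data = [("ex:p1", "rdfs:label", "Product 1"),
        ("ex:p1", "rdf:type", "bsbm:Product"),
        ("ex:p2", "rdfs:label", "Product 2")]
parts = partition_by_predicate(data)
print(lookup(parts, "rdfs:label"))
```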