
Translation of Relational and Non-Relational Databases into RDF with xR2RML



Franck Michel1, Loïc Djimenou1, Catherine Faron-Zucker and Johan Montagnat1
1Univ. Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
Keywords: Linked Data, RDF, R2RML, NoSQL
Abstract: With the growing amount of data being continuously produced, it is crucial to come up with solutions to
expose data from ever more heterogeneous databases (e.g. NoSQL systems) as linked data. In this paper
we present xR2RML, a language designed to describe the mapping of various types of databases to RDF.
xR2RML flexibly adapts to heterogeneous query languages and data models while remaining free from any
specific language or syntax. It extends R2RML, the W3C recommendation for the mapping of relational
databases to RDF, and relies on RML for the handling of various data representation formats. We analyse data
models of several modern databases as well as the format in which query results are returned, and we show
that xR2RML can translate any data element within such results into RDF, relying on existing languages such
as XPath and JSONPath if needed. We illustrate some features of xR2RML such as the generation of RDF
collections and containers, and the ability to deal with mixed content.
The web of data is now emerging through the pub-
lication and interlinking of various open data sets
in RDF. Initiatives such as the W3C Data Activity1
and the Linking Open Data (LOD) project2 aim at
Web-scale data integration and processing, assuming
that making heterogeneous data available in a common
machine-readable format should create opportunities
for novel applications and services. Their success
largely depends on the ability to reach data from the
deep web (He et al., 2007), a part of the web content
consisting of documents and databases hardly linked
with other data sources and hardly indexed by stan-
dard search engines. Furthermore, the integration of
heterogeneous data sources is a major challenge in
several domains (Field et al., 2013). As data seman-
tics is often poorly captured in database schemas, or
encoded in application logics, data integration tech-
niques have to capture and expose database semantics
in an explicit and machine-readable manner.
The deep web keeps on growing as data is continu-
ously being accumulated in ever more heterogeneous
databases. In particular, NoSQL systems have gained
a remarkable success during recent years. Driven by
major web companies, they have been developed to
meet requirements of web 2.0 services, that relational
databases (RDB) could not achieve (flexible schema,
high throughput, high availability, horizontal elastic-
ity on commodity hardware). Thus, NoSQL systems
should be considered as potential major contributors to
linked open data. Other types of databases have
been developed over time, either for generic purpose
or specific domains, such as XML databases (no-
tably used in edition and digital humanities), object-
oriented databases or directory-based databases.
Significant efforts have been invested in the def-
inition of methods to translate various kinds of data
sources into RDF. R2RML (Das et al., 2012), for in-
stance, is the W3C recommendation to describe RDB-
to-RDF mappings. RML extends R2RML for the in-
tegration of heterogeneous data formats (Dimou et al.,
2014a), but it does not address the constraints that
arise when dealing with different types of databases
and query languages. In particular, to our knowledge,
no method has been proposed yet to tackle NoSQL-
to-RDF translation.
In this paper, we present xR2RML, a mapping lan-
guage designed as an extension of R2RML and RML.
Besides relational databases, xR2RML addresses the
mapping of a large and extensible scope of non-
relational databases to RDF. It is designed to flex-
ibly adapt to various data models and query lan-
guages. xR2RML can translate data with mixed for-
mats and generate RDF collections and containers.
Our primary focus includes some NoSQL and XML
native databases but the approach can equally apply
to other types of database such as object-oriented and
directory-based databases.
In the rest of this section we draw a picture
of other works related to the translation of various
data sources to RDF, and we scope the objectives of
xR2RML. Section 2 explores in more detail the capabilities
required to reach these goals. In section 3 we
summarize the main characteristics of R2RML and
RML, and in section 4 we describe xR2RML specific
extensions. Section 5 presents a working implementation
of the language. Finally, sections 6 and 7 discuss
xR2RML applicability in different contexts and
conclude by outlining some perspectives.
1.1 Related Works
Wrapper-based data integration systems like Garlic
(Roth and Schwartz, 1997) and SQL/MED (Melton
et al., 2002) generally have similar architectures: a
global data model is described using specific mod-
elling languages (e.g. Garlic’s GDL), a query feder-
ation engine handles user queries expressed in terms
of a global data model and determines a query plan, a
per-data source wrapper implements a specific wrap-
per interface and performs the mapping with the data
source schema. No guideline is provided as to how a
wrapper should describe and implement the mapping.
The same global architecture holds in data inte-
gration systems based on semantic web technologies.
Existing works focus on efficient query planning and
distribution, such as FedX (Schwarte et al., 2011),
Anapsid (Acosta et al., 2011) and KGRAM-DQP
(Gaignard, 2013). The global data model is expressed
by domain ontologies using common languages, e.g.
RDFS or OWL. User queries, expressed in terms
of the domain ontologies, are written in SPARQL.
SPARQL is also used as the wrapper interface. Each
data source wrapper is a SPARQL endpoint that per-
forms the schema mapping with the source schema.
Our work, as well as most related works listed be-
low, focuses on the mapping step: the rationale is to
standardize the schema mapping description, so that a
mapping description can be written once and applied with
different wrapper implementations.
RDB-to-RDF mapping has been an active field of
research during the last ten years (Spanos et al., 2012;
Sequeda et al., 2011; Michel et al., 2014b). Sev-
eral mapping methods and languages have been pro-
posed over time, based either on the materialization
of RDF data sets or on the SPARQL-based access
to relational data. Published in 2012, R2RML, the
W3C RDB-to-RDF mapping language recommenda-
tion, has reached a notable consensus3.
Similarly, various solutions exist to map XML
data to RDF. The XSPARQL (Bischof et al., 2012)
query language combines XQuery and SPARQL
for bidirectional transformations between XML and
RDF. Several other solutions are based on the XSLT
technology such as XML Scissor-lift (Fennell, 2014)
that describes mapping rules in Schematron XML val-
idation language, and AstroGrid-D (Breitling, 2009).
SPARQL2XQuery (Bikakis et al., 2013) applies XML
Schema to RDF/OWL translation rules.
Much work has already been accomplished re-
garding the translation of CSV, TSV and spreadsheets
to RDF. Tools have been developed such as XLWrap
(Langegger and Wöss, 2009) and RDF Refine4. The
Linked CSV5 format is a proposal to embed metadata
in a CSV file, which makes it easy to link on the Web
and eventually to translate to RDF or JSON. However,
this approach assumes that CSV data is made
compliant with the format in the first place, before it
can be translated to RDF. The CSV on the Web W3C
Working Group6, created in 2014, intends to propose
a recommendation for the description of and access
to CSV data on the Web. In this context, RDF is one
of the formats envisaged either to represent metadata
about CSV data, or as a format to translate CSV data to.
Several tools are designed as frameworks for the
integration of sources with heterogeneous data for-
mats. XSPARQL, cited above, provides an R2RML-
compliant extension. Thus it can simultaneously
translate relational, XML and RDF data to XML or
RDF. TARQL7 is a SPARQL-based mapping language
that can convert from RDF, CSV/TSV and
JSON formats to RDF, but it does not focus on how
the data is retrieved from different types of databases.
Datalift (Scharffe et al., 2012) provides an integrated
set of tools for the publication in RDF of raw struc-
tured data (RDB, CSV, XML) and the interlinking of
resulting data sets.
RML (Dimou et al., 2014b; Dimou et al., 2014a)
is an extension of R2RML that tackles the mapping
of data sources with heterogeneous data formats such
as CSV/TSV, XML or JSON. Most approaches cre-
ate links between data sets after they were trans-
lated to RDF, e.g. using properties rdfs:seeAlso or
owl:sameAs. This is sometimes not adequate as log-
ical resources having different identifiers in different
data sets cannot easily be reconciled. RML creates
linked data sets at mapping time by enabling the si-
multaneous mapping of multiple data sources, thus
allowing for cross-references between resources de-
fined in various data sources. However, RML does
not investigate the constraints that arise when deal-
ing with different types of databases. It proposes a
solution to reference data elements within query re-
sults using expressive languages such as XPath and
JSONPath. But it does not clearly distinguish be-
tween such languages and the actual query language
of a database. In some cases they might be the
same, e.g. XPath can be used to query an XML
native database, and later on to reference data ele-
ments from query results. But in the general case, the
query language and the language used to reference el-
ements within query results must be dissociated, e.g.
NoSQL document stores use proprietary query languages,
while results are JSON documents that can be eval-
uated against JSONPath expressions. Furthermore,
RML explicitly refers to known evaluation languages
(ql:JSONPath, ql:XPath). In this context, supporting
a new evaluation language requires changing the
mapping language definition. To achieve more flexibility,
we believe that such characteristics should be
implementation-dependent, leaving the mapping language
free from any explicit dependency.
1.2 Objectives of this Work
The works presented in section 1.1 address various
types of data sources. Some of them could be ex-
tended to new data sources by developing ad-hoc ex-
tensions, although they are generally not designed to
easily support new data models and query languages.
Only RML comes with this flexibility as its design
aims at adapting to new data models. Our goal with
xR2RML is to define a generic mapping language
able to equally apply to most common relational and
non-relational databases. We make a specific focus
on NoSQL and XML native databases, and we argue
that our work can be generalized to some other types
of database, for instance object-oriented and directory
(LDAP) databases. In section 2 we explore the capa-
bilities required by xR2RML to reach these goals.
Different kinds of databases typically differ in several
aspects: the query language used to retrieve data, the
data model that underlies the data structures retrieved
and the cross-data referencing scheme, if any. Below
we explore in further detail the capabilities that we
want xR2RML to provide.
Query languages. The landscape of modern
database systems shows a vast diversity of query
languages. Relational databases generally support
ANSI SQL, and most native XML databases sup-
port XPath and XQuery. By contrast, NoSQL is
a catch-all term referring to very diverse systems
(Hecht and Jablonski, 2011; Gajendran, 2013). They
have heterogeneous access methods ranging from
low-level APIs to expressive query languages. Despite
several proposals for a common query language
(N1QL8, UnQL9, SQL++ (Ong et al., 2014),
ArangoDB QL10, CloudMdsQL (Kolev et al., 2014)),
no consensus that would fit most NoSQL databases
has emerged yet. Therefore, until a standard eventually
arises, xR2RML must be agile enough to cope
with various query languages and protocols in a trans-
parent manner.
Data models. Similarly to the case of query lan-
guages, we observe a large heterogeneity in data mod-
els of modern databases. To describe their translation
to RDF, a mapping language must be able to reference
any data element from their data models. Below
we list the most common data models, briefly analyse
the formats in which data is retrieved, and work out
how a mapping language can reference data elements
within retrieved data.
Relational databases comply with a row-based
model in which column names uniquely reference
cells in a row. NoSQL extensible column stores
(also known as column family stores, column-oriented
stores, etc.) also comply with the row-based model,
with the difference that not all rows necessarily share
the same columns. For such systems, referencing data
elements is simply achieved using column names.
Other non-relational systems, such as XML native
databases, NoSQL key-value stores, document stores
or graph stores, have heterogeneous data models that
can hardly be reduced to a row-based model:
- In databases relying on a specific data representation
format like JSON (notably in NoSQL document
stores) and XML, data is stored and retrieved as documents
consisting of tree-like compound values. Referencing
data elements within such documents can be
achieved thanks to languages such as JSONPath and
XPath.
- Object-oriented databases conventionally provide
methods to serialize objects, typically as key-value associations:
keys are attribute names while values are
objects (composition or aggregation relationship), or
compound values (collection, map, etc). Serialization
is typically done in XML or JSON, thus here again
we can apply XPath or JSONPath expressions.
- A directory data model is organised as a tree: each
node has an identifier and a set of attributes repre-
sented as name=value. Each entry retrieved from an
LDAP request is named using an LDAP path expres-
sion, e.g. cn=Franck Michel,ou=cnrs,o=fr. Refer-
encing data elements within such entries can be sim-
ply achieved using attribute names.
- In graph databases, the abstract data model basically
consists of nodes and edges. Query capabilities generally
allow retrieving either values matching certain
patterns (like the SPARQL SELECT clause), or a set
of nodes and edges representing a result graph (like
the SPARQL CONSTRUCT clause). Whatever the
type of result though, graph databases commonly pro-
vide APIs to manipulate query results. For instance a
SPARQL SELECT result set has a row-based format:
each row of a result set consists of columns typically
named after query variable names. The Neo4J graph
database provides a JDBC interface to process a query
result, and its REST interface returns result graphs as
JSON documents. Thus, although a graph may be a
somewhat complex data structure, query results can be
fairly easy to manipulate using well-known formats:
a row-column model, a serialization in JSON or some
other representation syntax, etc.
Finally, the way a mapping language can refer-
ence data elements within query results depends more
on the API capabilities than the data model itself. To
be effective, xR2RML must transparently accept any
type of data element reference expression. This in-
cludes a column name (applicable not only to row-
based data models but also to any row-based query
result), JSONPath, XPath or LDAP path expressions,
etc. An xR2RML processing engine must be able to
evaluate such expressions against query results, but
the mapping language itself must remain free from
any reference to specific expression syntaxes.
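As an illustration of this requirement, the following Python sketch shows how a processing engine could dispatch data element references to pluggable evaluators, so that the mapping itself never names a syntax. It is not taken from the xR2RML implementation: the function names, the EVALUATORS registry and the deliberately tiny JSONPath subset are assumptions made for the example. XML references rely on the limited XPath support of Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET
from typing import Any, Callable, Dict, List


def eval_column(row: Dict[str, Any], ref: str) -> List[Any]:
    """Evaluate a column name against one row of a row-based result."""
    return [row[ref]] if ref in row else []


def eval_jsonpath_subset(doc: Any, ref: str) -> List[Any]:
    """Evaluate a tiny JSONPath subset: '$', '.field' and '.*' steps only.

    This is an illustrative stand-in, not a full JSONPath implementation.
    """
    nodes = [doc]
    for step in ref.lstrip("$").strip(".").split("."):
        if not step:
            continue
        selected: List[Any] = []
        for node in nodes:
            if step == "*":
                if isinstance(node, dict):
                    selected.extend(node.values())
                elif isinstance(node, list):
                    selected.extend(node)
            elif isinstance(node, dict) and step in node:
                selected.append(node[step])
        nodes = selected
    return nodes


def eval_xpath_subset(xml_text: str, ref: str) -> List[Any]:
    """Evaluate an ElementTree-supported XPath expression, relative to the root."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(ref)]


# Hypothetical registry: supporting a new syntax only means adding an entry
# here (an implementation change), not changing the mapping language.
EVALUATORS: Dict[str, Callable[[Any, str], List[Any]]] = {
    "Column": eval_column,
    "JSONPath": eval_jsonpath_subset,
    "XPath": eval_xpath_subset,
}


def evaluate(formulation: str, entry: Any, ref: str) -> List[Any]:
    """Evaluate a data element reference against one entry of a query result."""
    return EVALUATORS[formulation](entry, ref)
```

A mapping would then carry only the reference string (for instance `NAME`, `$.actors.*` or `movie`), while the choice of evaluator is made by the engine configuration.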
Collections. Many data models support the rep-
resentation of collections: these can be sets, arrays
or maps of all kinds (sorted or not sorted, with or
without duplicates, etc.). Although the RDF data
model supports such data structures, to the best of our
knowledge, existing mapping languages do not allow
for the production of RDF collections (rdf:List) or
RDF containers (rdf:Bag, rdf:Seq, rdf:Alt), except
TARQL, which is able to convert a JSON array into an
rdf:List. In all other cases, structured values such as
collections or key-value associations are flattened into
multiple RDF triples. Listing 1 is an example XML
collection consisting of two “movie” elements.
Its translation into two triples is illustrated in Listing
2. Assuming that the order of “movie” elements im-
plicitly represents the chronological order in which
movies were shot, triples in Listing 2 lose this infor-
mation. Using an RDF sequence may be more appro-
priate in this case, as illustrated in Listing 3.
<director name="Woody Allen">
    <movie>Annie Hall</movie>
    <movie>Manhattan</movie>
</director>
Listing 1: Example of XML collection
<http://example.org/dir/Woody%20Allen>
    ex:directed "Annie Hall".
<http://example.org/dir/Woody%20Allen>
    ex:directed "Manhattan".
Listing 2: Translation to multiple RDF triples
<http://example.org/dir/Woody%20Allen>
    ex:movieList [ a rdf:Seq;
        rdf:_1 "Annie Hall";
        rdf:_2 "Manhattan" ].
Listing 3: Translation to an RDF sequence
Consequently, to map heterogeneous data to RDF
while preserving concepts such as collections, bags,
alternates or sequences, xR2RML must be able to
map data elements to RDF collections and containers.
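The difference between the two translations can be sketched in a few lines of Python. The sketch below is only illustrative and is not how an xR2RML engine is specified to work: the function names are hypothetical, the Turtle is built by plain string formatting, and literal escaping is ignored. It parses the XML collection of Listing 1 and emits either flattened triples or a single rdf:Seq that preserves element order.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import quote

XML = (
    '<director name="Woody Allen">'
    "<movie>Annie Hall</movie>"
    "<movie>Manhattan</movie>"
    "</director>"
)


def subject_iri(name: str) -> str:
    """Build the director IRI, percent-encoding the name."""
    return "<http://example.org/dir/" + quote(name) + ">"


def flattened(xml_text: str) -> List[str]:
    """One triple per movie: the order of the elements is not represented."""
    root = ET.fromstring(xml_text)
    subject = subject_iri(root.get("name"))
    return [f'{subject} ex:directed "{movie.text}" .' for movie in root.findall("movie")]


def as_sequence(xml_text: str) -> str:
    """A single rdf:Seq whose rdf:_n members keep the document order."""
    root = ET.fromstring(xml_text)
    subject = subject_iri(root.get("name"))
    members = "; ".join(
        f'rdf:_{position} "{movie.text}"'
        for position, movie in enumerate(root.findall("movie"), start=1)
    )
    return f"{subject} ex:movieList [ a rdf:Seq; {members} ] ."
```

The flattened form yields independent statements, whereas the sequence form makes the position of each movie explicit through the rdf:_1, rdf:_2 membership properties.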
Cross-references. Cross-references are com-
monly implemented as foreign key constraints in re-
lational data models, or aggregation and composi-
tion relationships in object-oriented models. Cross-
referencing is even the primary goal of graph-based
databases. More generally, it is possible to cross-
reference logical entities in any type of database. For
instance, a JSON document of a NoSQL document
store may refer to another document by its identifier
or any other field that identifies it uniquely, even if
this is generally not recommended for the sake of performance.
A cross-referenced logical resource may be
mapped alternatively as the subject or the object of
triples. This may entail joint queries between tables
or documents. Therefore, xR2RML must (i) allow a
modular description so that the mapping of a logical
resource can be written once and easily reused as a
subject or an object, and (ii) allow the description of
joint queries to retrieve cross-referenced logical resources.
Summary. Finally, we draw up the list of key ca-
pabilities expected from xR2RML as follows:
1. It makes it possible to describe the mapping of various
relational and non-relational databases to RDF.
2. It is flexible enough to accommodate new databases,
query languages and data models in an agile manner:
supporting a new system, query language and/or data
model only requires changes in the implementation
(adaptor, plug-in, etc.), but no changes are required in
the mapping language itself.
3. It makes it possible to generate RDF collections (rdf:List)
or containers (rdf:Seq, rdf:Bag, rdf:Alt) from one-to-many
relations modelled as compound values or as
cross-references. RDF collections and containers can
be nested.
4. It makes it possible to perform joint queries following
cross-references between logical resources, and it allows the
modular reuse of mapping definitions.
Taken the other way round, data sources to be
mapped to RDF using xR2RML need to fulfil some
requirements entailed by xR2RML’s capabilities:
1. The data source interface should provide a declar-
ative query language. If not, it must be possible to
fetch the whole data at once, like a CSV or XML file
returned by a Web service.
2. There must exist technical means to parse query
results and reference data elements: this ranges from
simple column names to expressive languages like
3. In case of large data sets, the database interface
should provide ways to iterate on query results, simi-
larly to SQL cursors in RDBs.
Notice that the last two requirements are quite natural
features of most decent database systems.
To help in the design of xR2RML we chose to
leverage R2RML, a standard, well-adopted mapping
language for relational databases. R2RML already
provides some of the requirements listed above: mod-
ularity, management of cross-references, as well as
rich features such as the ability to define target named
graphs. To facilitate its understanding and adop-
tion, xR2RML is designed as a backward compatible
extension of R2RML. Besides, to address the map-
ping of heterogeneous data formats such as CSV/TSV,
XML and JSON, we leverage propositions of RML
that is itself an extension of R2RML.
R2RML is a generic language meant to describe cus-
tomized mappings that translate data from a relational
database into an RDF data set. An R2RML map-
ping is expressed as an RDF graph written in Tur-
tle syntax12. An R2RML mapping graph consists
of triples maps, each one specifying how to map
rows of a logical table to RDF triples. A triples
map is composed of exactly one logical table (prop-
erty rr:logicalTable), one subject map (property
rr:subjectMap) and any number of predicate-object
maps (property rr:predicateObjectMap). A logi-
cal table may be a table, an SQL view (property
rr:tableName), or the result of a valid SQL query
(property rr:sqlQuery). A predicate-object map con-
sists of predicate maps (property rr:predicateMap)
and object maps (property rr:objectMap). For each
row of the logical table, the subject map generates
a subject IRI, while each predicate-object map cre-
ates one or more predicate-object pairs. Triples are
produced by combining the subject IRI with each
predicate-object pair. Additionally, triples are gener-
ated either in the default graph or in a named graph
specified using graph maps (property rr:graphMap).
Subject, predicate, object and graph maps are all
R2RML term maps. A term map is a function that
generates RDF terms (either a literal, an IRI or a
blank node) from elements of a logical table row. A
term map must be exactly one of the following: a
constant-valued term map (property rr:constant) al-
ways generates the same value; a column-valued term
map (property rr:column) produces the value of a
given column in the current row; a template-valued
term map (property rr:template) builds a value from
a template string that references columns of the cur-
rent row.
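To make the role of term maps concrete, here is a minimal Python sketch of the three kinds of term map applied to one row of a logical table. It is a simplification under stated assumptions rather than a conforming R2RML processor: the helper names are hypothetical, RDF terms are plain strings, datatypes and graph maps are ignored, and the percent-encoding done by urllib.parse.quote only approximates the IRI-safe encoding that R2RML requires in templates. The birth date in the usage example is a placeholder value.

```python
import re
from typing import Callable, Dict, List, Tuple
from urllib.parse import quote

Row = Dict[str, str]
TermMap = Callable[[Row], str]


def constant_map(value: str) -> TermMap:
    """Constant-valued term map: always generates the same term."""
    return lambda row: value


def column_map(column: str) -> TermMap:
    """Column-valued term map: the value of a column in the current row."""
    return lambda row: row[column]


def template_map(template: str) -> TermMap:
    """Template-valued term map: replaces each {COLUMN} with the encoded value."""
    def generate(row: Row) -> str:
        return re.sub(
            r"\{([^{}]+)\}",
            lambda match: quote(str(row[match.group(1)]), safe=""),
            template,
        )
    return generate


def triples_for_row(
    row: Row,
    subject_map: TermMap,
    predicate_object_maps: List[Tuple[TermMap, TermMap]],
) -> List[Tuple[str, str, str]]:
    """Combine the subject with each predicate-object pair for one row."""
    subject = subject_map(row)
    return [(subject, p(row), o(row)) for p, o in predicate_object_maps]


# Usage with a made-up row; "1900-01-01" is a placeholder, not real data.
row = {"NAME": "Woody Allen", "BIRTH_DATE": "1900-01-01"}
triples = triples_for_row(
    row,
    template_map("http://example.org/dir/{NAME}"),
    [(constant_map("ex:birthdate"), column_map("BIRTH_DATE"))],
)
```

Each term map is a function of the current row, which is why the same triples map can be applied unchanged to every row of the logical table.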
When a logical resource is cross-referenced, typi-
cally by means of a foreign key relationship, it may
be used as the subject of some triples and the ob-
ject of some others. In such cases, a referencing ob-
ject map uses IRIs produced by the subject map of a
(parent) triples map as the objects of triples produced
by another (child) triples map. In case both triples
maps do not share the same logical table, a joint
query must be performed. A join condition (property
rr:joinCondition) names the columns from the par-
ent and child triples maps, that must be joined (prop-
erties rr:parent and rr:child).
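The behaviour of a referencing object map with a join condition can be sketched as follows in Python. This is only an illustration of the principle: a real processor would typically delegate the join to the database, and the tables, column names and the ex:directedBy predicate used here are hypothetical.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import quote

Row = Dict[str, str]
Triple = Tuple[str, str, str]


def referencing_object_map(
    child_rows: List[Row],
    parent_rows: List[Row],
    child_column: str,
    parent_column: str,
    child_subject: Callable[[Row], str],
    parent_subject: Callable[[Row], str],
    predicate: str,
) -> List[Triple]:
    """Use the subjects of the parent triples map as objects of the child one.

    A triple is produced for each (child, parent) pair whose join columns
    hold equal values.
    """
    # Index the parent subjects by the value of the parent join column.
    parents: Dict[str, List[str]] = {}
    for parent in parent_rows:
        parents.setdefault(parent[parent_column], []).append(parent_subject(parent))
    return [
        (child_subject(child), predicate, obj)
        for child in child_rows
        for obj in parents.get(child[child_column], [])
    ]


# Hypothetical tables: MOVIES is the child, DIRECTORS is the parent.
DIRECTORS = [{"NAME": "Woody Allen"}]
MOVIES = [
    {"CODE": "Manh", "DIRECTOR": "Woody Allen"},
    {"CODE": "m2046", "DIRECTOR": "Wong Kar-wai"},
]

triples = referencing_object_map(
    MOVIES,
    DIRECTORS,
    child_column="DIRECTOR",
    parent_column="NAME",
    child_subject=lambda r: "http://example.org/movie/" + r["CODE"],
    parent_subject=lambda r: "http://example.org/dir/" + quote(r["NAME"]),
    predicate="ex:directedBy",
)
```

Child rows without a matching parent row (the second movie above) produce no triple, as with an inner join.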
Below we provide a short illustrative example.
Triples map <#R2RML_Directors> uses table
DIRECTORS to create triples linking movie directors
(whose IRIs are built from column NAME) with their
birth date (column BIRTH_DATE).
<#R2RML_Directors>
    rr:logicalTable [
        rr:tableName "DIRECTORS"; ];
    rr:subjectMap [ rr:template
        "http://example.org/dir/{NAME}"; ];
    rr:predicateObjectMap [
        rr:predicate ex:birthdate;
        rr:objectMap [
            rr:column "BIRTH_DATE";
            rr:datatype xsd:date; ]; ].
RML is an extension of R2RML that targets
the simultaneous mapping of heterogeneous data
sources with various data formats, in particular
hierarchical data formats. An RML logical source
(property rml:logicalSource) extends the R2RML logical
table and points to the data source (property
rml:source): this may be a file on the local
file system, or data returned from a web service
for instance. A reference formulation (property
rml:referenceFormulation) names the syntax used to
reference data elements within the logical source. As
of today, possible values are ql:JSONPath for JSON
data, ql:XPath for XML data, and rr:SQL2008 for
relational databases. Data elements are referenced with
property rml:reference that extends rr:column. Its
object is an expression whose syntax matches the
reference formulation. Similarly, the definition of property
rr:template is extended to allow such reference
expressions to be enclosed within curly braces (’{’
and ’}’). Below we provide an RML example. It is
very similar to the R2RML example above, with the
difference that data now comes from a JSON file
“directors.json”.
<#RML_Directors>
    rml:logicalSource [
        rml:source "directors.json";
        rml:referenceFormulation ql:JSONPath;
        rml:iterator "$.*"; ];
    rr:subjectMap [ rr:template
        "http://example.org/dir/{$.*.name}"; ];
    rr:predicateObjectMap [
        rr:predicate ex:birthdate;
        rr:objectMap [
            rml:reference "$.*.birthdate";
            rr:datatype xsd:date; ]; ].
Collection "directors":
{ "name": "Woody Allen", "directed": ["Manhattan", "Interiors"] },
{ "name": "Wong Kar-wai", "directed": ["2046", "In the Mood for Love"] }
Collection "movies":
{ "decade": "2000s", "movies": [
    { "name": "2046", "code": "m2046", "actors": ["T. Leung", "G. Li"] },
    { "name": "In the Mood for Love", "code": "Mood", "actors": ["M. Cheung"] } ] }
{ "decade": "1970s", "movies": [
    { "name": "Manhattan", "code": "Manh", "actors": ["Woody Allen", "Diane Keaton"] },
    { "name": "Interiors", "code": "Int01", "actors": ["D. Keaton", "G. Page"] } ] }
Listing 4: Example Database
In this section we briefly describe the elements of the
xR2RML language. A complete specification is pro-
vided in (Michel et al., 2014a). We illustrate the de-
scriptions with a running example: Listing 4 shows
JSON documents stored in a MongoDB database, in
two collections: a “directors” collection with doc-
uments on movie directors, and a “movies” collec-
tion in which movies are grouped in per-decade doc-
uments. Listing 5 shows an xR2RML mapping graph
to translate those documents into RDF. Director IRIs
are built using director names, while IRIs of resources
representing movies use movie codes. We assume the
following namespace prefix definitions (the @prefix
keyword is not displayed for readability):
xrr: <>.
rr: <>.
rml: <>.
xsd: <>.
ex: <>.
4.1 Describing A Logical Source
To reach its genericity objective, xR2RML must avoid
explicitly referring to specific query languages or data
models. Keeping this in mind, we define logical
sources as a means to represent a data set from any
kind of database. In conformance with R2RML prin-
ciples, we keep database connection details out of the
scope of the mapping language. In RML on the other
hand, a logical source points to the data to be mapped
typically using a file URL (property rml:source).
This difference makes it difficult for xR2RML to
extend RML’s logical source concept. Instead,
xR2RML extends the R2RML logical source while
commonalities are addressed by using or extending
some RML properties (rml:referenceFormulation,
xR2RML triples maps extend R2RML triples
maps by referencing a logical source (property
xrr:logicalSource) which is the result of a request
applied to the input database. It is either an xR2RML
base table or an xR2RML view. The xR2RML base
table extends the concept of R2RML table or view to
tabular databases beyond relational databases (extensible
column store, CSV/TSV, etc.). It refers to a table
by its name (property rr:tableName). An xR2RML
view represents the result of executing a query against
the input database. It has exactly one xrr:query property
that extends RML property rml:query (which itself
extends rr:sqlQuery13). Its value is a valid expression
with regard to the query language supported
by the input database. No assumption is made whatsoever
as to the query language used.
<#Movies>
    xrr:logicalSource [
        xrr:query "db.movies.find({decade: {$exists: true}})";
        rml:iterator "$.movies.*";
    ];
    rr:subjectMap [ rr:template "http://example.org/movie/{$.code}"; ];
    rr:predicateObjectMap [
        rr:predicate ex:starring;
        rr:objectMap [
            rr:termType xrr:RdfBag;
            xrr:reference "$.actors.*";
            xrr:nestedTermMap [ rr:datatype xsd:string; ]; ]; ].
<#Directors>
    xrr:logicalSource [ xrr:query "db.directors.find()"; ];
    rr:subjectMap [ rr:template "http://example.org/dir/{$.name}"; ];
    rr:predicateObjectMap [
        rr:predicate ex:directed;
        rr:objectMap [
            rr:parentTriplesMap <#Movies>;
            rr:joinCondition [
                rr:child "$.directed.*";
                rr:parent "$.name";
            ]; ]; ].
Listing 5: xR2RML Example Mapping Graph
Reference formulation. Retrieving values from a
query result set requires evaluating data element refer-
ences against the query result, as discussed in section
1.2. Relational database APIs (such as JDBC drivers)
natively support the evaluation of a column name
against the current row of a result set. Conversely,
some databases come with simple APIs that provide
low-level evaluation features. For instance, APIs of
most NoSQL document stores return JSON docu-
ments but hardly support JSONPath. Therefore, the
responsibility of evaluating data element references
may fall back on the xR2RML processing engine. To
do so, it needs to know which syntax is being used.
To this end, RML introduced the reference formulation
concept (property rml:referenceFormulation of
a logical source) to name the syntax of data element
references. As underlined above, xR2RML adheres to
R2RML’s principle that database-specific details are
out of the scope of the mapping language. Moreover,
we want the mapping language to remain free from
explicit reference to specific syntaxes. As a result, we
amend the R2RML processor definition as follows: an
xR2RML processor must be provided with a database
connection and the reference formulation applicable
to results of queries run against the connection. If the
reference formulation is not provided, it defaults to
column name, in order to ensure backward compatibility
with R2RML.
13 rml:query also subsumes rml:xmlQuery and
rml:queryLanguage, although none of those properties
are described or exemplified in the RML language
specification and articles at the time of writing.
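The amended processor definition can be pictured with a small Python sketch. The class and attribute names below are hypothetical and do not come from the xR2RML specification or its implementation; the sketch only illustrates that the reference formulation is supplied alongside the database connection, outside the mapping graph, and that it falls back to column names when absent.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ProcessorConfig:
    """Inputs given to a processor in addition to the mapping graph."""

    connection: Any  # opaque handle to the input database
    reference_formulation: Optional[str] = None  # e.g. "JSONPath" or "XPath"

    def effective_formulation(self) -> str:
        """Default to column names, as a plain R2RML processor would assume."""
        return self.reference_formulation or "Column"
```

With this arrangement, an unmodified R2RML mapping run without any reference formulation is still interpreted with column-name references.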
Iteration model. In R2RML, the row-based iter-
ation occurs on a set of rows read from a logical ta-
ble. xR2RML applies this principle to other systems
returning row-based result sets: CSV/TSV files, ex-
tensible column stores, but also some graph databases
as underlined in section 1.2, e.g. a SPARQL SELECT result
set is a table in which columns are named after the
variables in the SELECT clause. In the context of
non row-based result sets, the model is implicitly ex-
tended to a document-based iteration model: a docu-
ment is basically one entry of a result set returned by
the database, e.g. a JSON document retrieved from
a NoSQL document store, or an XML document re-
trieved from an XML native database. In the case of
data sources whose access interface does not provide
built-in iterators, e.g. a web service returning an XML
response at once, then a single iteration occurs on the
whole retrieved document.
Yet, some specific needs may not be fulfilled. For
instance, it may be needed to iterate on explicitly
specified entries of a JSON document or elements of
an XML tree. To this end, we leverage the concept
of iterator introduced in RML. An iterator (property
rml:iterator) specifies the iteration pattern to apply
to data read from the input database. Its value is a
valid expression written using the syntax specified in
the reference formulation. The iterator can be either
omitted or empty when the reference formulation is a
column name.
Listing 5 presents two logical source definition ex-
amples. Both consist of a MongoDB query (property
xrr:query). We assume that the JSONPath reference
formulation is provided along with the database con-
nection. In collection “directors” (Listing 4), each
document describes exactly one director. By contrast,
in collection “movies” each document refers to sev-
eral movies grouped by decade. To avoid mixing up
multiple movies of a single document, an iterator with
JSONPath expression $.movies.* is associated with
triples map <#Movies>: thus, the triples map applies
separately on each movie of each document.
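The effect of an iterator can be sketched as follows. This is a minimal illustration and not the prototype's code: `eval_path` is a hypothetical helper that supports only a tiny subset of JSONPath (`$`, `.name` steps and `.*`), and the sample document, including the field names `decade`, `movies` and `name`, is an assumption made for the example.

```python
from typing import Any, Iterator, List


def eval_path(doc: Any, path: str) -> List[Any]:
    """Evaluate a tiny JSONPath subset: '$', '.name' steps and '.*' wildcards."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = [doc]
    for step in filter(None, path[1:].split(".")):
        matched = []
        for node in current:
            if step == "*":
                if isinstance(node, dict):
                    matched.extend(node.values())
                elif isinstance(node, list):
                    matched.extend(node)
            elif isinstance(node, dict) and step in node:
                matched.append(node[step])
        current = matched
    return current


def iterate(documents: List[Any], iterator: str = "") -> Iterator[Any]:
    """Yield one item per triples map iteration.

    Without an iterator, each document returned by the database is one
    iteration; with an iterator, each match of the expression is one iteration.
    """
    for doc in documents:
        if iterator:
            yield from eval_path(doc, iterator)
        else:
            yield doc


# Hypothetical result of a query on collection "movies": one document per decade.
movies_docs = [
    {"decade": "2000s",
     "movies": [{"name": "2046"}, {"name": "In the Mood for Love"}]},
]

# With the iterator, references such as $.name are evaluated per movie.
names = [eval_path(movie, "$.name") for movie in iterate(movies_docs, "$.movies.*")]
```

Without the iterator, `$.movies.*.name` would be evaluated once per document and the names of all movies of a decade would be mixed in a single iteration.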
4.2 Referencing Data Elements
In section 3 we have seen that RML properties
rml:reference and rr:template both allow data el-
ement references expressed according to the refer-
ence formulation (column name, XPath, JSONPath).
xR2RML uses these RML definitions as a starting
point to a broader set of use cases.
In real world use cases, databases commonly store
values written in a data format that they cannot inter-
pret. For instance, in key-value stores and in most
extensible column stores, values are stored as binary
objects whose content is opaque to the system. Likewise, a
developer may choose to embed JSON, CSV or XML
values in the column of a relational table, for performance
reasons or due to application design constraints.
We call such cases mixed content.
xR2RML proposes to apply the principle of data
element references defined in RML, and to extend it to
allow referencing data elements within mixed content.
An xR2RML mixed-syntax path consists of the
concatenation of several path expressions, each
enclosed in a path constructor that makes the
path syntax explicit. Existing constructors are:
Column(), CSV(), TSV(), JSONPath() and XPath().
For example, in a relational table, a text column
NAME stores JSON-formatted values containing people’s
first and last names, e.g.: {"First":"John",
"Last":"Smith"}. Field First can be referenced
with the following mixed-syntax path:
Column(NAME)/JSONPath($.First). An xR2RML
processing engine evaluates a mixed-syntax path from
left to right, passing the result of each path construc-
tor on to the next one. In this example, the first path
retrieves the value associated with column NAME. Then
the value is passed on to the next path constructor that
evaluates JSONPath expression “$.First” against the
value. The resulting value is finally translated into an
RDF term according to the current term map definition.
xR2RML defines property xrr:reference as an
extension of RML property rml:reference, and ex-
tends the definition of property rr:template. Both
properties accept either simple references (illustrated
in Listing 5) or mixed-syntax path expressions.
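The left-to-right evaluation of a mixed-syntax path can be sketched as below. This is an illustration under stated assumptions rather than the actual xR2RML engine: only the Column() and JSONPath() constructors are handled, `parse_mixed_path`, `eval_json_path` and `evaluate` are hypothetical names, the JSONPath support is limited to `$.field` steps, and the parser assumes that expressions do not themselves contain the sequence `)/`.

```python
import json
import re
from typing import Any, Dict, List, Tuple

# One path constructor, e.g. "Column(NAME)" or "JSONPath($.First)".
_CONSTRUCTOR = re.compile(r"(Column|JSONPath)\((.*?)\)(?:/|$)")


def parse_mixed_path(path: str) -> List[Tuple[str, str]]:
    """Split a mixed-syntax path into (constructor, expression) pairs."""
    steps, pos = [], 0
    while pos < len(path):
        match = _CONSTRUCTOR.match(path, pos)
        if match is None:
            raise ValueError(f"cannot parse {path!r} at offset {pos}")
        steps.append((match.group(1), match.group(2)))
        pos = match.end()
    return steps


def eval_json_path(value: Any, expr: str) -> List[Any]:
    """Tiny JSONPath subset: '$' followed by '.field' steps only."""
    nodes = [value]
    for field in filter(None, expr.lstrip("$").split(".")):
        nodes = [n[field] for n in nodes if isinstance(n, dict) and field in n]
    return nodes


def evaluate(row: Dict[str, Any], path: str) -> List[Any]:
    """Evaluate the path left to right, feeding each result to the next step."""
    values: List[Any] = [row]
    for constructor, expr in parse_mixed_path(path):
        if constructor == "Column":
            values = [v[expr] for v in values if expr in v]
        elif constructor == "JSONPath":
            # The previous step produced text: parse it before applying JSONPath.
            values = [r for v in values
                      for r in eval_json_path(json.loads(v), expr)]
    return values


row = {"NAME": '{"First": "John", "Last": "Smith"}'}
first_names = evaluate(row, "Column(NAME)/JSONPath($.First)")
```

Each constructor only needs to know how to interpret the value handed over by the previous one, which is why the path can chain syntaxes freely.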
4.3 Producing RDF Terms and (Nested)
RDF Collections/Containers
In a row-based logical source, a valid column name
reference returns zero or one value during each triples
map iteration. In turn an R2RML term map gener-
ates zero or one RDF term per iteration. By con-
trast, JSONPath and XPath expressions used with
properties xrr:reference and rr:template allow ad-
dressing multiple values. For instance, XPath expres-
sion //movie/name returns all <name> elements of all
<movie> elements. Therefore, reference-valued and
template-valued term maps can return multiple RDF
terms at once. This change entails the definition of
two strategies with regards to how triples maps com-
bine RDF terms to build triples: the Cartesian product
strategy, and the collection/container strategy.
Cartesian product strategy. During each iter-
ation of an xR2RML triples map, triples are gener-
ated as the Cartesian product between RDF terms pro-
duced by the subject map and each predicate-object
pair. Predicate-object pairs result from the Cartesian
product between RDF terms produced by the predi-
cate maps and object maps of each predicate-object
map. Like any other term map, a graph map may
also produce multiple terms. The Cartesian product
strategy equally applies in that case, therefore triples
are produced simultaneously in all target graphs cor-
responding to the multiple RDF terms produced by
the graph map.
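The Cartesian product strategy can be sketched in a few lines. This is a simplified illustration, assuming a single predicate-object map and terms already rendered as strings; the function name `cartesian_triples` and the example IRIs are hypothetical.

```python
from itertools import product
from typing import Iterable, List, Optional, Tuple

Quad = Tuple[str, str, str, Optional[str]]


def cartesian_triples(
    subjects: Iterable[str],
    predicates: Iterable[str],
    objects: Iterable[str],
    graphs: Iterable[Optional[str]] = (None,),
) -> List[Quad]:
    """Combine every subject, predicate, object and graph term of one iteration."""
    return list(product(subjects, predicates, objects, graphs))


# Two object terms and two graph terms yield four (s, p, o, g) combinations.
quads = cartesian_triples(
    subjects=["ex:movie1"],
    predicates=["ex:starring"],
    objects=['"Tony Leung"', '"Gong Li"'],
    graphs=["ex:graphA", "ex:graphB"],
)
```

If any term map produces no term during an iteration, the product is empty and no triple is generated for that iteration.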
Collection/container strategy. Multiple val-
ues returned by properties xrr:reference and
rr:template are combined into an RDF collection or
container. This is achieved using new xR2RML val-
ues of the rr:termType property: a term map with
term type xrr:RdfList generates an RDF term of
type rdf:List, term type xrr:RdfSeq corresponds to
rdf:Seq, xrr:RdfBag to rdf:Bag and xrr:RdfAlt to
rdf:Alt. Listing 5 illustrates this use case. Instead
of generating multiple triples relating each movie to
one actor, triples map <#Movies> relates each movie
to a bag of actors starring in that movie. For instance:
<> ex:starring [
a rdf:Bag;
rdf:_1 "Tony Leung"; rdf:_2 "Gong Li" ].
At this point, two important needs must still be
addressed in the collection/container strategy: (i) like
in a regular term map, it must be possible to assign
a term type, language tag or data type to the mem-
bers of an RDF collection or container; and (ii) it
must be possible to nest any number of RDF collections
and containers inside each other. Both needs
are fulfilled using xR2RML Nested Term Maps. A
nested term map (property xrr:nestedTermMap) very
much resembles a regular term map, with the excep-
tion that it can be defined only in the context of a term
map that produces RDF collections or containers.
In a column-valued or reference-valued term map, a
nested term map describes how to translate values
read from the logical source into RDF terms, by specifying
optional properties rr:termType, rr:language
and rr:datatype. Similarly, in a template-valued
term map, a nested term map applies to values pro-
duced by applying the template string to input values.
Listing 5 illustrates the usage of nested term maps by
the production of bags of literals representing actor
names: the nested term map assigns each actor name
an xsd:string datatype. For instance:
<> ex:starring [
a rdf:Bag;
rdf:_1 "Tony Leung"^^xsd:string;
rdf:_2 "Gong Li"^^xsd:string ].
Finally, properties xrr:reference and rr:template
can be used within a nested term map to recursively
parse structured values while producing nested RDF
collections and containers.
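The following sketch illustrates how multiple values could be turned into a (possibly nested) RDF container whose members receive a datatype, as a nested term map would specify. It is an assumption-laden illustration: the function names `literal` and `container` are hypothetical, blank node labels are generated by a simple counter, and only containers (rdf:Bag, rdf:Seq, rdf:Alt) are covered, not rdf:List, which relies on rdf:first and rdf:rest instead of membership properties.

```python
from itertools import count
from typing import Any, List, Optional, Tuple

Triple = Tuple[str, str, str]
_bnode_ids = count(1)


def literal(value: Any, datatype: Optional[str] = None) -> str:
    """Render a value as a Turtle literal, optionally typed."""
    text = '"' + str(value).replace('"', '\\"') + '"'
    return f"{text}^^{datatype}" if datatype else text


def container(values: List[Any], kind: str = "rdf:Bag",
              datatype: Optional[str] = None) -> Tuple[str, List[Triple]]:
    """Build an RDF container; nested Python lists become nested containers."""
    node = f"_:b{next(_bnode_ids)}"
    triples: List[Triple] = [(node, "rdf:type", kind)]
    for index, value in enumerate(values, start=1):
        if isinstance(value, list):
            member, nested = container(value, kind, datatype)
            triples.append((node, f"rdf:_{index}", member))
            triples.extend(nested)
        else:
            triples.append((node, f"rdf:_{index}", literal(value, datatype)))
    return node, triples


# A bag of actor names typed as xsd:string, as in the example above.
bag, bag_triples = container(["Tony Leung", "Gong Li"], "rdf:Bag", "xsd:string")
```

The returned node would then be used as the object of the triple generated by the enclosing term map, e.g. the object of ex:starring.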
4.4 Reference Relationships Between
Logical Sources
A cross-referenced logical resource usually serves as
the subject of some triples and the object of other
triples. In R2RML, this is achieved using a referencing
object map. xR2RML extends R2RML referencing
object maps in two ways. Firstly, when a
joint query is needed (i.e. the parent and child triples
maps do not share the same logical source), properties
rr:child and rr:parent of the join condition contain
data element references (section 4.2), possibly including
mixed-syntax paths. As underlined in section
4.3, such data element references may produce multi-
ple terms. Consequently, the equivalent joint query
of a referencing object map must deal with multi-
valued child and parent references. More precisely,
a join condition between two multi-valued references
should be satisfied if at least one data element of the
child reference matches one data element of the par-
ent reference. This is described in Definition 1 using
an SQL-like syntax and first order logic for the de-
scription of WHERE conditions.
Definition 1: If a referencing object map has at
least one join condition, then its equivalent joint query is:
SELECT * FROM (child-query) AS child,
(parent-query) AS parent WHERE
∃ c1 ∈ eval(child, {child-ref1}),
∃ p1 ∈ eval(parent, {parent-ref1}), c1 = p1
AND ∃ c2 ∈ eval(child, {child-ref2}),
∃ p2 ∈ eval(parent, {parent-ref2}), c2 = p2
AND ...
where “{child-refi}” and “{parent-refi}” are the child
and parent references of the i-th join condition, and
“eval(child, {ref})” and “eval(parent, {ref})” are the
results of evaluating data element reference “{ref}”
on the results of the child and parent queries.
Listing 5 depicts a simple example: in triples
map <#Directors>, the object map uses movie IRIs
generated by parent triples map <#Movies>. When
processing director “Wong Kar-wai”, the child
reference ($.directed.*) returns values “2046” and
“In the Mood for Love”, while the parent reference
($.name) returns a single movie name. The condition
is satisfied if the parent reference returns one of
“2046” or “In the Mood for Love”. Generated triples
use movie codes to build movie IRIs, such as:
ex:directed <>.
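The join semantics described above, where a condition holds as soon as one child value equals one parent value, can be sketched as follows. This is an illustration only: `join_satisfied` and `lookup` are hypothetical names, `lookup` is a stand-in evaluator that treats a reference as a plain dictionary key instead of evaluating JSONPath, and the movie "Hero" is invented to provide a non-matching parent.

```python
from typing import Any, Callable, Iterable, List, Tuple

Evaluator = Callable[[Any, str], Iterable[Any]]


def join_satisfied(child: Any, parent: Any,
                   conditions: List[Tuple[str, str]],
                   evaluate: Evaluator) -> bool:
    """True if, for every join condition, some child value equals some parent value."""
    return all(
        set(evaluate(child, child_ref)) & set(evaluate(parent, parent_ref))
        for child_ref, parent_ref in conditions
    )


def lookup(doc: dict, ref: str) -> list:
    """Stand-in evaluator: the reference is a plain key; always returns a list."""
    value = doc.get(ref, [])
    return value if isinstance(value, list) else [value]


director = {"name": "Wong Kar-wai", "directed": ["2046", "In the Mood for Love"]}
movies = [{"name": "2046"}, {"name": "Hero"}]

# Keep the parent documents whose name matches one of the directed movies.
matches = [m["name"] for m in movies
           if join_satisfied(director, m, [("directed", "name")], lookup)]
```

Using set intersection keeps the check independent of how many values each reference returns, which is the point of the existential formulation.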
Secondly, the objects produced by a referencing
object map can be grouped in an RDF collection
or container, instead of being the objects of multiple
triples. To do so, an xR2RML referencing object
map may have an rr:termType property with value
xrr:RdfList, xrr:RdfSeq, xrr:RdfBag or xrr:RdfAlt.
Results of the joint query are grouped by child value,
i.e. objects generated by the parent triples map, refer-
ring to the same child value, are grouped as members
of an RDF collection or container. An interesting con-
sequence of this use case is the ability, in the case of a
regular relational database, to build an RDF collection
or container reflecting a one-to-many relation.
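The grouping step can be sketched as below for a one-to-many relation. This is a simplified illustration with invented data: the department and employee IRIs and the property ex:hasEmployee are assumptions, results are keyed by the child subject as a stand-in for the child value, and members are serialized with Turtle's collection syntax, which denotes an rdf:List.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_child(join_rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group parent-generated objects by the child they are joined with."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for child_subject, parent_object in join_rows:
        groups[child_subject].append(parent_object)
    return dict(groups)


def as_rdf_list(members: List[str]) -> str:
    """Serialize members with Turtle's collection syntax, i.e. an rdf:List."""
    return "( " + " ".join(members) + " )" if members else "()"


# Hypothetical joint query result: (child subject, object from the parent map).
rows = [
    ("ex:dept10", "ex:emp1"),
    ("ex:dept10", "ex:emp2"),
    ("ex:dept20", "ex:emp3"),
]
turtle = [f"{child} ex:hasEmployee {as_rdf_list(objs)} ."
          for child, objs in group_by_child(rows).items()]
```

Instead of three triples, one per employee, each department ends up with a single triple whose object is the list of its employees.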
To evaluate the effectiveness of xR2RML, we have
developed an open source prototype implementation
available on GitHub. It is developed in Scala and
based on Morph-RDB (Priyatna et al., 2014), an
R2RML implementation that we have extended to
support xR2RML specificities.
In a first step, we upgraded Morph-RDB to sup-
port xR2RML features in the context of relational
databases. This included the support of logical
sources, mixed contents (JSON, XML, CSV or TSV
data embedded in cells) and RDF collections/contain-
ers. In a second step, we developed a connector to
the MongoDB document store, to translate MongoDB
JSON documents into RDF. A MongoDB shell query
string is specified in each triples map logical source
(property xrr:query). The connector executes the
query and iterates over result documents returned by
the database. Subsequently, results are passed to the
xR2RML processor that applies the optional iterator
(rml:iterator) and evaluates JSONPath expressions
in each xrr:reference and rr:template property of
all term maps. The support of RDF collections and
containers was validated, in particular in the case of
cross-references (referencing object map) that entail
a joint query between two JSON documents.
Software architecture. The prototype architec-
ture derives from the initial Morph-RDB architec-
ture. To deal with the heterogeneity of databases,
Morph follows the object factory design pattern. An
abstract runner factory class provides abstract meth-
ods to build a runner, the core object that performs
the translation of an input database with regards to
an xR2RML mapping graph. A concrete runner fac-
tory class copes with database specificities through
a set of objects: (i) a generic connection wraps a
database connection; (ii) a query unfolder builds a
concrete query object reflecting each defined triples
map; (iii) a data translator runs the query against the
database connection and generates triples according
to the triples map definitions; (iv) finally, a data mate-
rializer writes created triples into a target file accord-
ing to the chosen RDF serialization.
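The factory-based architecture can be pictured with the following sketch. It is written in Python for brevity, whereas the prototype is developed in Scala, and it does not reproduce the Morph code: all class and method names are hypothetical, the four components are reduced to an unfolder and a translator, and an in-memory dictionary stands in for the database connection.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
TriplesMap = Dict[str, str]
Unfolder = Callable[[TriplesMap], str]
Translator = Callable[[str, TriplesMap], Iterable[Triple]]


class Runner:
    """Applies all triples maps sequentially and serializes the triples."""

    def __init__(self, unfold: Unfolder, translate: Translator) -> None:
        self.unfold = unfold
        self.translate = translate

    def run(self, mapping: List[TriplesMap]) -> List[str]:
        lines = []
        for triples_map in mapping:
            query = self.unfold(triples_map)
            for s, p, o in self.translate(query, triples_map):
                lines.append(f"{s} {p} {o} .")  # plays the materializer's role
        return lines


class RunnerFactory(ABC):
    """Abstract factory: concrete subclasses supply database-specific parts."""

    @abstractmethod
    def create_unfolder(self) -> Unfolder: ...

    @abstractmethod
    def create_translator(self) -> Translator: ...

    def create_runner(self) -> Runner:
        return Runner(self.create_unfolder(), self.create_translator())


class InMemoryFactory(RunnerFactory):
    """Toy concrete factory backed by Python dicts instead of a database."""

    def __init__(self, tables: Dict[str, List[Dict[str, str]]]) -> None:
        self.tables = tables  # stands in for the generic connection

    def create_unfolder(self) -> Unfolder:
        return lambda tm: tm["source"]  # the "query" is just a table name

    def create_translator(self) -> Translator:
        def translate(query: str, tm: TriplesMap) -> Iterable[Triple]:
            for row in self.tables[query]:
                subject = "<" + tm["subject"].format(**row) + ">"
                yield subject, tm["predicate"], '"' + row[tm["column"]] + '"'
        return translate


factory = InMemoryFactory({"director": [{"id": "1", "name": "Wong Kar-wai"}]})
output = factory.create_runner().run([{
    "source": "director",
    "subject": "http://example.org/director/{id}",
    "predicate": "ex:name",
    "column": "name",
}])
```

Supporting another kind of database then amounts to writing another concrete factory, while the runner logic stays unchanged.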
Currently, we provide two factory implementations:
the RDB implementation extends the
original Morph-RDB code, while the new MongoDB
implementation relies on the MongoDB API and the
Jongo API for the management of MongoDB shell
queries. In the RDB context, the unfolder builds
an SQL query from the table name (logical table
definition), named columns (properties rr:column and
rr:template) and the optional join conditions (ref-
erencing object maps). In the MongoDB case, the
query string is provided in the mapping. Further-
more, since MongoDB does not support joint queries,
the xR2RML processing engine has to perform two
queries and join results afterwards. As a result, the
unfolder is fairly simple: it checks the correctness of the
query string and returns an appropriate API object.
Evaluation. We evaluated the prototype using
two simple databases: a MySQL relational database
and a MongoDB database with two collections. In
both cases, the data and associated xR2RML map-
pings were written to cover most mapping situations
addressed by xR2RML: strategies for handling mul-
tiple RDF terms, JSONPath and XPath expressions,
mixed-syntax paths with mixed contents (relational,
JSON, XML, CSV/TSV), cross-references, produc-
tion of RDF collection/containers, management of
UTF-8 characters. A dump of both databases as well
as the example mappings are available on the same
GitHub repository. The prototype currently
applies the data materialization approach, i.e. RDF
data is generated by sequentially applying all triples
maps. The query rewriting approach (SPARQL to
database specific query rewriting) may be considered
in future work as suggested in section 7. At the time
of writing the prototype has two limitations: (i) only
one level of RDF collections and containers can be
generated (no nested collections/containers), and (ii)
the result of a joint query in a relational database can-
not be translated into an RDF collection or container.
xR2RML relies on the assumption that databases
to translate into RDF provide a declarative query
language, such that queries can be expressed di-
rectly in a mapping description. This complies with
the equivalent assumption of R2RML that all RDBs
support ANSI SQL. However, this is somewhat restrictive.
Some NoSQL key-value stores, like DynamoDB
and Riak, have no declarative query language;
instead, they provide APIs for common programming
languages to describe queries in an imperative
manner. For xR2RML to work with those systems,
a query language would have to be designed along
with a compiler that transforms queries into imperative
code. Interestingly, this is already the case for
some systems supporting the MapReduce program-
ming model. MapReduce is conventionally supported
through APIs for programming languages, however
more and more systems now propose an SQL or SQL-
like query language on top of a MapReduce frame-
work (e.g. Apache Hive). Queries are compiled into
MapReduce jobs. This approach is often referred to
as SQL-on-Hadoop (Floratou et al., 2014).
To achieve the targeted flexibility, xR2RML
comes with features that are applicable independently
of the type of database used. Nevertheless, probably not
all features should be applied with all kinds
of database. For instance, join conditions entail joint
queries. Whereas RDBs are optimized to support
joins very efficiently, it is not recommended to make
cross-references within NoSQL document or extensible
column stores, as this may lead to poor performance.
Similarly, translating a JSON element into
an RDF collection is quite straightforward, but trans-
lating the result of an SQL joint query into an RDF
collection is likely to be quite inefficient. In other
words, the fact that the language makes a mapping possible
does not mean that it should be applied regardless
of the context (database type, data model, query ca-
pabilities). Consequently, mapping designers should
be aware of how databases work in order to write ef-
ficient mappings of big databases to RDF.
Like R2RML, xR2RML assumes that well-
defined domain ontologies exist beforehand, whereof
classes and properties will be used to translate a data
source into RDF triples. In the context of RDBs, an
alternative approach, the Direct Mapping, translates
relational data into RDF in a straightforward man-
ner, by converting tables to classes and columns to
properties (Sequeda et al., 2011; Arenas et al., 2012).
The direct mapping comes up with an ad-hoc on-
tology that reflects the relational schema. R2RML
implementations often provide a tool to automati-
cally generate an R2RML direct mapping from the
relational schema (e.g. Morph-RDB (de Medeiros
et al., 2015)). The same principles could be extended
to automatically generate an xR2RML mapping for
other types of data source, as long as they comply
with a schema: column names in CSV/TSV files and
extensible column stores, XSD or DTD for XML
data, JSON schema15 or a JSON-LD16 description for
JSON data. Nevertheless, such schemas do not neces-
sarily exist, and some databases like the DynamoDB
key-value store are schemaless. In such cases, au-
tomatically generating an xR2RML direct mapping
should involve different methods aimed at learning
the database schema from the data itself.
More generally, how to automate the generation
of xR2RML mappings may become a concern to map
large and/or complex schemas. There exists signifi-
cant work related to schema mapping and matching
(Shvaiko and Euzenat, 2005). For instance, Clio (Fa-
gin et al., 2009) generates a schema mapping based
on the discovery of queries over the source and tar-
get schemas and a specification of their relationships.
Karma (Knoblock et al., 2012) semi-automatically maps
structured data sources to existing domain ontolo-
gies. It produces a Global-and-Local-As-View map-
ping that can be used to translate the data into RDF.
xR2RML does not directly address the question of
how mappings are written, but can be complementary
to approaches like Clio and Karma. In particular,
the Karma authors suggest that their tool could easily
export mapping rules as an R2RML mapping graph.
A similar approach could be applied to discover map-
pings between a non-relational database and domain
ontologies, and export the result as an xR2RML map-
ping graph.
In this paper we have presented xR2RML, a language
designed to describe the mapping of various types of
databases to RDF, by flexibly adapting to heteroge-
neous query languages and data models. We have
analysed data models of several modern databases as
well as the format in which query results are returned,
and we have shown that xR2RML can translate any
data element within such results into RDF, relying
when necessary on existing languages such as XPath
and JSONPath. We have illustrated some features of
xR2RML such as the generation of RDF collections
and containers, and the ability to deal with mixed con-
tent, e.g. when a relational table stores data formatted
in another syntax like XML, JSON or CSV.
Principles of the xR2RML mapping language
have been validated in a prototype implementation
supporting several RDBs and the MongoDB NoSQL
document store. The development of connectors to
other types of database shall be considered based on
concrete use cases. Depending on the target system,
different optimizations shall be studied, notably re-
garding the computation of joint queries. Further-
more, the data materialization approach we imple-
mented is effective but it does not scale to big data
sets. Dealing with big data sets requires the data to re-
main in legacy databases, and that translation to RDF
be performed on demand through the xR2RML-based
rewriting of SPARQL queries into the source database
query language. In this regard, existing works related
to RDBs should be leveraged (Priyatna et al., 2014;
Sequeda and Miranker, 2013).
References
Acosta, M., Vidal, M., Lampo, T., Castillo, J., and Ruck-
haus, E. (2011). ANAPSID: an adaptive query pro-
cessing engine for SPARQL endpoints. In Proc. of
ISWC’11, pages 18–34.
Arenas, M., Bertails, A., Prud’hommeaux, E., and Sequeda,
J. (2012). A direct mapping of relational data to RDF.
Bikakis, N., Tsinaraki, C., Stavrakantonakis, I., Gi-
oldasis, N., and Christodoulakis, S. (2013). The
SPARQL2XQuery interoperability framework. CoRR,
Bischof, S., Decker, S., Krennwallner, T., Lopes, N., and
Polleres, A. (2012). Mapping between RDF and
XML with XSPARQL. Journal on Data Semantics,
Breitling, F. (2009). A standard transformation from XML
to RDF via XSLT. Astronomical Notes, 330:755.
Das, S., Sundara, S., and Cyganiak, R. (2012). R2RML:
RDB to RDF mapping language.
de Medeiros, L. F., Priyatna, F., and Corcho, O. (2015).
MIRROR: Automatic R2RML mapping generation
from relational databases. In Submission to ICWE
Dimou, A., Sande, M. V., Slepicka, J., Szekely, P., Man-
nens, E., Knoblock, C., and Walle, R. V. d. (2014a).
Mapping hierarchical sources into RDF using the
RML mapping language. In Proc. of ICSC’2014,
pages 151–158. IEEE.
Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R.,
Mannens, E., and Van de Walle, R. (2014b). RML: A
generic language for integrated RDF mappings of het-
erogeneous data. In Proc. of the 7th LDOW workshop.
Fagin, R., Haas, L. M., Hernández, M., Miller, R. J., Popa,
L., and Velegrakis, Y. (2009). Clio: Schema mapping
creation and data exchange. In Conceptual Model-
ing: Foundations and Applications, pages 198–236.
Fennell, P. (2014). Schematron - more useful than you’d
thought. In Proc. of the XML London 2014 Confer-
ence, pages 103–112.
Field, L., Suhr, S., Ison, J., Wittenburg, P., Los, W., Broeder,
D., Hardisty, A., Repo, S., and Jenkinson, A. (2013).
Realising the full potential of research data: common
challenges in data management, sharing and integra-
tion across scientific disciplines.
Floratou, A., Minhas, U. F., and Ozcan, F. (2014). SQL-on-Hadoop:
Full circle back to shared-nothing database
architectures. Proc. of the VLDB Endowment, 7(12).
Gaignard, A. (2013). Distributed knowledge sharing and
production through collaborative e-science platforms.
PhD thesis.
Gajendran, S. K. (2013). A survey on NoSQL databases
(technical report).
He, B., Patel, M., Zhang, Z., and Chang, K. C.-C. (2007).
Accessing the deep web. Communications of the
ACM, 50(5):94–101.
Hecht, R. and Jablonski, S. (2011). NoSQL evaluation: A
use case oriented survey. In Proc. of CSC’2011, pages
336–341. IEEE Computer Society.
Knoblock, C. A., Szekely, P., Ambite, J. L., Goel, A.,
Gupta, S., Lerman, K., Muslea, M., Taheriyan, M.,
and Mallick, P. (2012). Semi-automatically mapping
structured sources into the semantic web. In Proc. of
ESWC’2012, pages 375–390. Springer.
Kolev, B., Valduriez, P., Jimenez-Peris, R., Martínez Bazan,
N., and Pereira, J. (2014). CloudMdsQL: Querying
heterogeneous cloud data stores with a common lan-
guage. In Proc. of the BDA’2014 Conference.
Langegger, A. and Wöss, W. (2009). XLWrap - querying
and integrating arbitrary spreadsheets with SPARQL.
In Proc. of ISWC’2009.
Melton, J., Michels, J. E., Josifovski, V., Kulkarni, K., and
Schwarz, P. (2002). SQL/MED: a status report. ACM
SIGMOD Record, 31(3):81–89.
Michel, F., Djimenou, L., Faron-Zucker, C., and Montagnat,
J. (2014a). xR2RML: Relational and non-relational
databases to RDF mapping language. Research report.
ISRN I3S/RR 2014-04-FR v3.
Michel, F., Montagnat, J., and Faron-Zucker, C. (2014b).
A survey of RDB to RDF translation approaches and
tools. Research report. ISRN I3S/RR 2013-04-FR.
Ong, K. W., Papakonstantinou, Y., and Vernoux, R. (2014).
The SQL++ unifying semi-structured query language,
and an expressiveness benchmark of SQL-on-Hadoop,
NoSQL and NewSQL databases (submitted). CoRR,
Priyatna, F., Corcho, O., and Sequeda, J. (2014). Formal-
isation and experiences of R2RML-based SPARQL
to SQL query translation using Morph. In Proc. of
Roth, M. T. and Schwartz, P. (1997). Don’t scrap it, wrap
it! A wrapper architecture for legacy data sources. In
Proc. of VLDB’1997, pages 266–275.
Scharffe, F., Atemezing, G., Troncy, R., Gandon, F., Villata,
S., Bucher, B., Hamdi, F., Bihanic, L., Kéklian, G.,
Cotton, F., and others (2012). Enabling linked data
publication with the Datalift platform. In Proc. of the
AAAI workshop on semantic cities.
Schwarte, A., Haase, P., Hose, K., Schenkel, R., and
Schmidt, M. (2011). FedX: Optimization techniques
for federated query processing on linked data. In Proc.
of ISWC’11, pages 601–616.
Sequeda, J., Tirmizi, S. H., Corcho, O., and Miranker,
D. P. (2011). Survey of directly mapping SQL
databases to the semantic web. Knowledge Eng. Re-
view, 26(4):445–486.
Sequeda, J. F. and Miranker, D. P. (2013). Ultrawrap:
SPARQL execution on relational data. Web Seman-
tics: Science, Services and Agents on the WWW,
Shvaiko, P. and Euzenat, J. (2005). A survey of schema-
based matching approaches. In Journal on Data Se-
mantics IV, pages 146–171. Springer.
Spanos, D.-E., Stavrou, P., and Mitrou, N. (2012). Bringing
relational databases into the semantic web: A survey.
Semantic Web Journal, 3(2):169–209.
... Although RML has already been extended with additional constructs to enable complex operations (e.g., FnO [19] and FunUL [16] for transformation functions, or RML fields [21] and mixed-syntax paths [36] for nested data), relying on SQL may ease the development of mappings by data engineers who know this query language well and are generally unfamiliar with semantic web technologies. Moreover, current implementations of these RML extensions, such as RMLMapper [38], RocketRML [39] and RMLStreamer [40], do not scale to large volumes of data [10]. ...
... Mixed Content [36]. Tabular sources in real data integration use cases usually present composite data values: values such as JSON or lists are embedded in cells. ...
... Tabular sources in real data integration use cases usually present composite data values: values such as JSON or lists are embedded in cells. This has been referred to as mixed content [36]. RML does not allow for mixed content, although solutions such as fields [21] or mixed-syntax paths [36] addressed this limitation. ...
Full-text available
A large amount of data is available in tabular form. RML is commonly used to declare how such data can be transformed into RDF. However, RML presents limitations that lead, in many cases, to the need for additional preprocessing using scripting. Although some proposed extensions (e.g., FnO or RML fields) address some of these limitations, they are verbose, unfamiliar to most data engineers, and implemented in systems that do not scale up when large volumes of data need to be processed. In this work, we expand RML views to tabular sources so as to address the limitations of this mapping language. In this way, transformation functions, complex joins, or mixed syntax can be defined directly in SQL queries. We present our extension of Morph-KGC to efficiently support RML views for tabular sources. We validate our implementation adapting R2RML test cases with views and compare it against state-of-the-art RML+FnO systems showing that our system is significantly more scalable. Moreover, we present specific examples of a real use case in the public procurement domain where basic RML mappings could not be used without additional preprocessing.Resource type: Software frameworkLicense: Apache 2.0DOI: 10.5281/zenodo.7385488URL: GraphRMLCSVData Integration
... The original data can be expressed in a variety of formats such as tabular, JSON, or XML. Due to the heterogeneous nature of data, the wide variety of techniques, and specific requirements that some scenarios may impose, an increasing number of mapping languages have been proposed [8][9][10]. The differences among them are usually based on three aspects: (a) the focus on one or more data formats, e.g., the W3C Recommenda- tions R2RML focuses on SQL tabular data [11]; (b) a specific requirement they address, e.g., SPARQL-Generate [12] allows the definition of functions in a mapping for cleaning or linking the generated RDF data; or (c) if they are designed for a scenario that has special requirements, e.g., the WoT-mappings [13] were designed as an extension of the WoT standard [14] and used as part of the Thing Descriptions [15]. ...
... SML [49], OBDA mappings from Ontop [50]) and several extensions of R2RML were developed in the following years after its release: R2RML-f [32] extends R2RML to include functions to be applied over the data; RML [8] and its userfriendly compact syntax YARRRML [51] provide the possibility of covering additional data formats (CSV, XML and JSON); this language also considers the use of functions for data transformation (e.g. lowercase, replace, trim) by using the Function Ontology (FnO) 4 [17]; FunUL [31] proposes an extension to also incorporate functions, but focusing on the CSV format; KR2RML [30] is also an extension for CSV, XML and JSON, with the addition of representing all sources with the Nested Relational Model as an intermediate model and the possibility of cleaning data with Python functions; xR2RML [9] extends R2RML and RML to include NoSQL databases and incorporates more features to handle tree-like data; D2RML [33], also based on R2RML and RML, is able to transform data from XML, JSON, CSVs and REST/SPARQL endpoints, and enables functions and conditions to create triples. ...
... The following RDF-based languages are included: R2RML [11], RML [8], KR2RML [30], xR2RML [9], R2RML-F [32], FunUL [31], XLWrap [36], WoT mappings [13], CSVW [38], and D2RML [33]. The SPARQL-based languages that were analyzed are: XS-PARQL [40], TARQL [42], SPARQL-Generate [12], Facade-X [43] and SMS2 [45]. ...
Full-text available
Knowledge Graphs are currently created using an assortment of techniques and tools: ad hoc code in a programming language, database export scripts, OpenRefine transformations, mapping languages, etc. Focusing on the latter, the wide variety of use cases, data peculiarities, and potential uses has had a substantial impact in how mappings have been created, extended, and applied. As a result, a large number of languages and their associated tools have been created. In this paper, we present the Conceptual Mapping ontology, that is designed to represent the features and characteristics of existing declarative mapping languages to construct Knowledge Graphs. This ontology is built upon the requirements extracted from experts experience, a thorough analysis of the features and capabilities of current mapping languages presented as a comparative framework; and the languages' limitations discussed by the community and denoted as Mapping Challenges. The ontology is evaluated to ensure that it meets these requirements and has no inconsistencies, pitfalls or modelling errors, and is publicly available online along with its documentation and related resources.
... Virtualization uses M to translate SPARQL queries into the native query language of S, i.e., data integration is performed on-the-fly during query processing [41]. There are many techniques and associated implementations that can be used to create knowledge graphs integrating heterogeneous data sources using declarative mapping rules [3,12,15,26,31,33,38,39]. In the specific case of materialization, different optimizations have been proposed to speed up the materialization process in complex data integration scenarios (e.g., high rate of duplicates, large data sources, or transformation functions). ...
... RML [14] is a well-known superset of R2RML that removes specific references to the relational data model and enables data formats beyond RDBs. In addition, other R2RML-related proposals have addressed transformation functions [11,12], mixed content and RDF collections [33], usability [20] or scalability [40]. ...
Knowledge graphs are often constructed from heterogeneous data sources, using declarative rules that map them to a target ontology and materializing them into RDF. When these data sources are large, the materialization of the entire knowledge graph may be computationally expensive and not suitable for those cases where a rapid materialization is required. In this work, we propose an approach to overcome this limitation, based on the novel concept of mapping partitions. Mapping partitions are defined as groups of mapping rules that generate disjoint subsets of the knowledge graph. Each of these groups can be processed separately, reducing the total amount of memory and execution time required by the materialization process. We have included this optimization in our materialization engine Morph-KGC, and we have evaluated it over three different benchmarks. Our experimental results show that, compared with state-of-the-art techniques, the use of mapping partitions in Morph-KGC presents the following advantages: (i) it decreases significantly the time required for materialization, (ii) it reduces the maximum peak of memory used, and (iii) it scales to data sizes that other engines are not capable of processing currently.
... These tools focus on providing a SPARQL endpoint and translating the SPARQL queries received into one or more languages. OBDA are tools that only translate from SPARQL to just one language [4,6,12,54,66,72], instead OBDI tools that ...
Building and publishing knowledge graphs (KG) as Linked Data, either on the Web or in private companies, has become a relevant and crucial process in many domains. This process requires that users perform a wide number of tasks conforming to the life cycle of a KG, and these tasks usually involve different unrelated research topics, such as RDF materialisation or link discovery. There is already a large corpus of tools and methods designed to perform these tasks; however, the lack of one tool that gathers them all leads practitioners to develop ad-hoc pipelines that are not generic and, thus, non-reusable. As a result, building and publishing a KG is becoming a complex and resource-consuming process. In this paper, a generic framework called Helio is presented. The framework aims to cover a set of requirements elicited from the KG life cycle and provide a tool capable of performing the different tasks required to build and publish KGs. As a result, Helio aims at providing users with the means for reducing the effort required to perform this process and, also, Helio aims to prevent the development of ad-hoc pipelines. Furthermore, the Helio framework has been applied in many different contexts, from European projects to research work.
... Generation and publication of the KG. The translation into RDF of the outputs of each step is carried out using Morph-xR2RML, an implementation of the xR2RML mapping language [27] for MongoDB databases. Thus, the next steps consist of importing the outputs into MongoDB, pre-processing them to filter out unneeded or invalid data, and applying the translation rules with Morph-xR2RML. ...
Faced with the ever-increasing number of scientific publications, researchers struggle to keep up, find and make sense of articles relevant to their own research. Scientific open archives play a central role in helping deal with this deluge, yet keyword-based search services often fail to grasp the richness of the semantic associations between articles. In this paper, we present the methods, tools and services implemented in the ISSA project to tackle these issues. The project aims to (1) provide a generic, reusable and extensible pipeline for the analysis and processing of articles of an open scientific archive, (2) translate the result into a semantic index stored and represented as an RDF knowledge graph; (3) develop innovative search and visualization services that leverage this index to allow researchers, decision makers or scientific information professionals to explore thematic association rules, networks of co-publications, articles with co-occurring topics, etc. To demonstrate the effectiveness of the solution, we also report on its deployment and user-driven customization for the needs of an institutional open archive of 110,000+ resources. Fully in line with the open science and FAIR dynamics, the presented work is available under an open license with all the accompanying documents necessary to facilitate its reuse. The knowledge graph produced on our use-case is compliant with common linked open data best practices.
... We first downloaded from Météo-France's portal the list of SYNOP weather stations in GeoJSON format, as well as the monthly observation reports generated by these stations as CSV files. Then, we implemented a reproducible pipeline to generate WeaKG-MF in compliance with the proposed semantic model where the mapping is performed by the Morph-xR2RML tool [5]. The first version of WeaKG-MF covers the period from January 2019 to the end of December 2021. ...
In this paper, we present the WeKG-MF Knowledge Graph constructed from open weather observations published by the Météo-France institution. WeKG-MF relies on a semantic model that formalizes knowledge about meteorological observational data. The model is generic enough to be adopted and extended by meteorological data providers to publish and integrate their sources while complying with Linked Data principles. WeKG-MF offers access to a large number of meteorological variables described through spatial and temporal dimensions and thus has the potential to serve several scientific case studies from different domains including agriculture, agronomy, environment, climate change and natural disasters. Keywords: Knowledge graph, Semantic modelling, Observational data, Linked data, Meteorology
... Then, we implemented a reproducible software pipeline to generate WeaKG-MF in compliance with the proposed model. The core of the pipeline is the mapping task that is performed with the Morph-xR2RML tool, an implementation of the xR2RML mapping language [7] for MongoDB databases. Pipeline scripts as well as xR2RML mapping triples are available in our GitHub repository. ...
To study and predict meteorological phenomena and to include them in broader studies, the ability to represent and exchange meteorological data is of paramount importance. A typical approach to integrating and publishing such data now is to formalize a knowledge graph relying on Linked Data and Semantic Web standard models and practices. In this paper, we first discuss the semantic modelling issues related to spatio-temporal data such as meteorological observational data. We motivate the reuse of a network of existing ontologies to define a semantic model in which meteorological parameters are semantically defined, described and integrated. The model is generic enough to be adopted and extended by meteorological data providers to publish and integrate their sources while complying with Linked Data principles. Finally, we present a meteorological knowledge graph of weather observations based on our proposed model, published in the form of an RDF dataset, that we produced by transforming observation records made by Météo-France weather stations. It covers a large number of meteorological variables described through spatial and temporal dimensions and thus has the potential to serve several scientific case studies from different domains including agriculture, agronomy, environment, climate change and natural disasters. Keywords: Knowledge graph, Semantic modelling, Observational data, Linked Data, Meteorology
... Subsequently, new needs arose to support formats other than relational databases. As a result, RML [8] and xR2RML [9] mapping languages were proposed to deal with XML, CSVs and JSON data sources, and MongoDB document database, respectively. ...
Data integration is the dominant use case for RDF Knowledge Graphs. However, Web resources come in formats with weak semantics (for example CSV and JSON), or formats specific to a given application (for example BibTeX, HTML, and Markdown). To solve this problem, Knowledge Graph Construction (KGC) is gaining momentum due to its focus on supporting users in transforming data into RDF. However, using existing KGC frameworks results in complex data processing pipelines, which mix structural and semantic mappings, whose development and maintenance constitute a significant bottleneck for KG engineers. Such frameworks force users to rely on different tools, sometimes based on heterogeneous languages, for inspecting sources, designing mappings, and generating triples, thus making the process unnecessarily complicated. We argue that it is possible and desirable to equip KG engineers with the ability to interact with Web data formats by relying on their expertise in RDF and the well-established SPARQL query language [2]. In this article, we study a unified method for data access to heterogeneous data sources with Facade-X, a meta-model implemented in a new data integration system called SPARQL Anything. We demonstrate that our approach is theoretically sound, since it allows a single meta-model, based on RDF, to represent data from (a) any file format expressible in BNF syntax, as well as (b) any relational database. We compare our method to state-of-the-art approaches in terms of usability (cognitive complexity of the mappings) and general performance. Finally, we discuss the benefits and challenges of this novel approach by engaging with the reference user community.
In the past few years, database-related cloud services such as IBM cloud services, Microsoft services, Amazon Web Services and Google Cloud Platform services have been growing rapidly all over the world. A relational database handles only a small amount of data easily. To overcome this shortcoming, this paper proposes a Hospital Management System (HMS) using MongoDB as a database. MongoDB is a document-oriented database that belongs to the category of NoSQL databases, also referred to as non-relational databases. MongoDB aims to reduce the gap between two different types of scalable key-value databases such as fast and highly scalable key-value databases. This database also reduces the time delay of four operations: selection, insertion, deletion, and update. Furthermore, an Enhanced Entity-Relationship Model (EERM) is designed for the HMS to manage all sections of the hospital, including the reception section, the casualty section, details regarding medical treatment, and employees. The HMS stores three million tuples, which are loaded into the MongoDB database, and the selection, insertion, deletion, and update operations are evaluated. The experimental analysis shows that the proposed HMS using MongoDB is highly efficient with low time delay.
Technical Report
This document is the specification of xR2RML, a language for expressing customized mappings from various types of databases (XML, object-oriented, NoSQL) to RDF datasets. xR2RML flexibly adapts to heterogeneous query languages and data models while remaining free from any specific language or syntax. It extends R2RML, the W3C recommendation for the mapping of relational databases to RDF, and relies on RML for the handling of various data representation formats. This research report led to a conference article that you may wish to cite instead: Michel F., Djimenou L., Faron-Zucker C. & Montagnat J. (2015). Translation of Relational and Non-Relational Databases into RDF with xR2RML. In Proceedings of the 11th International Conference on Web Information Systems and Technologies (WebIST), pp. 443–454. Lisbon, Portugal.
Conference Paper
Two W3C recommendations exist for the transformation of RDB content into RDF: Direct Mapping (DM) and R2RML. The DM recommendation specifies a set of fixed transformation rules, whilst R2RML allows customising them. Here we describe the MIRROR system, which generates two sets of R2RML mappings. First, it creates a set of mappings that allow any R2RML engine to generate a set of RDF triples homomorphic to the ones that a DM engine would generate (they differ only in the URIs used). This allows R2RML engines to exhibit a similar behaviour to that of DM engines. Second, it produces an additional set of R2RML mappings that allow generating triples resulting from the implicit knowledge encoded in relational database schemas, such as subclass-of and M-N relationships. We demonstrate the behaviour of MIRROR using the W3C DM Test Case together with an extended version of one of its databases.
This thesis addresses the issues of coherent distributed knowledge production and sharing in the Life-science area. In spite of the continuously increasing computing and storage capabilities of computing infrastructures, the management of massive scientific data through centralized approaches became inappropriate, for several reasons: (i) they do not guarantee the autonomy property of data providers, constrained, for either ethical or legal concerns, to keep control over the data they host, (ii) they do not scale and adapt to the massive scientific data produced through e-Science platforms. In the context of the NeuroLOG and VIP Life-science collaborative platforms, we address, on the one hand, distribution and heterogeneity issues underlying, possibly sensitive, resource sharing; and on the other hand, automated knowledge production through the usage of these e-Science platforms, to ease the exploitation of the massively produced scientific data. We rely on an ontological approach for knowledge modeling and propose, based on Semantic Web technologies, to (i) extend these platforms with efficient, static and dynamic, transparent federated semantic querying strategies, and (ii) to extend their data processing environment, from both provenance information captured at run-time and domain-specific inference rules, to automate the semantic annotation of in silico experiment results. The results of this thesis have been evaluated on the Grid'5000 distributed and controlled infrastructure.
They contribute to addressing three of the main challenging issues faced in the area of computational science platforms through (i) a model for secured collaborations and a distributed access control strategy allowing for the setup of multi-centric studies while still considering competitive activities, (ii) semantic experiment summaries, meaningful from the end-user perspective, aimed at easing the navigation into massive scientific data resulting from large-scale experimental campaigns, and (iii) efficient distributed querying and reasoning strategies, relying on Semantic Web standards, aimed at sharing capitalized knowledge and providing connectivity towards the Web of Linked Data.
In the context of the emergent Web of Data, a large number of organizations, institutes and companies (e.g., DBpedia, GeoNames, PubMed) adopt the Linked Data practices. Utilizing the Semantic Web (SW) technologies, they publish their data and offer SPARQL endpoints (i.e., SPARQL-based search services). On the other hand, the dominant standard for information exchange in the Web today is XML. Additionally, many international standards (e.g., Dublin Core, MPEG-7, METS, TEI, IEEE LOM) in several domains (e.g., Digital Libraries, GIS, Multimedia, e-Learning) have been expressed in XML Schema. These factors have led to an increasing emphasis on XML data, accessed using the XQuery query language. The SW and XML worlds and their developed infrastructures are based on different data models, semantics and query languages. Thus, it is crucial to develop interoperability mechanisms that allow the Web of Data users to access XML datasets, using SPARQL, from their own working environments. It is unrealistic to expect that all the existing legacy data (e.g., Relational, XML, etc.) will be transformed into SW data. Therefore, publishing legacy data as Linked Data and providing SPARQL endpoints over them has become a major research challenge. In this direction, we introduce the SPARQL2XQuery Framework which creates an interoperable environment, where SPARQL queries are automatically translated to XQuery queries, in order to access XML data across the Web. The SPARQL2XQuery Framework provides a mapping model for the expression of OWL–RDF/S to XML Schema mappings as well as a method for SPARQL to XQuery translation. To this end, our Framework supports both manual and automatic mapping specification between ontologies and XML Schemas. In the automatic mapping specification scenario, the SPARQL2XQuery exploits the XS2OWL component which transforms XML Schemas into OWL ontologies.
Finally, extensive experiments have been conducted in order to evaluate the schema transformation, mapping generation, query translation and query evaluation efficiency, using both real and synthetic datasets.
SQL-on-Hadoop, NewSQL and NoSQL databases provide semi-structured data models (typically JSON based) and respective query languages. Lack of formal syntax and semantics, idiomatic (non-SQL) language constructs and large variations in syntax, semantics and actual capabilities pose problems even to database experts: It is hard to understand, compare and use these languages. It is especially tedious to write software that interoperates between two of them or an SQL database and one of them. Towards solving these problems, first we formally specify the syntax and semantics of SQL++. It consists of a semi-structured data model (which extends both JSON and the relational data model) and a query language that is fully backwards compatible with SQL. SQL++ is "unifying" in the sense that it is explicitly designed to encompass the data model and query language capabilities of current SQL-on-Hadoop, NoSQL and NewSQL databases. Then, we itemize fifteen SQL++ data model and query language features and benchmark eleven databases on their support of the multiple options associated with each feature, leading to feature matrices and commentary. Each feature matrix is the result of empirical validation through sample queries. Since SQL itself is a subset of SQL++, the SQL-aware reader will easily identify in which ways each of the surveyed databases provides more or less than SQL. The eleven databases are Hive, Jaql, Pig, Cassandra, JSONiq, MongoDB, Couchbase, SQL, AsterixDB, BigQuery and UnityJDBC. They were selected due to their market adoption or because they present cutting edge, advanced query language abilities. Finally, we briefly discuss the use of SQL++ as the query language of the FORWARD middleware query processor, which executes SQL++ queries over SQL and non-SQL databases. FORWARD provides a proof-of-concept of SQL++'s applicability as a unifying data model and query language.
SQL query processing for analytics over Hadoop data has recently gained significant traction. Among many systems providing some SQL support over Hadoop, Hive is the first native Hadoop system that uses an underlying framework such as MapReduce or Tez to process SQL-like statements. Impala, on the other hand, represents the new emerging class of SQL-on-Hadoop systems that exploit a shared-nothing parallel database architecture over Hadoop. Both systems optimize their data ingestion via columnar storage, and promote different file formats: ORC and Parquet. In this paper, we compare the performance of these two systems by conducting a set of cluster experiments using a TPC-H like benchmark and two TPC-DS inspired workloads. We also closely study the I/O efficiency of their columnar formats using a set of micro-benchmarks. Our results show that Impala is 3.3X to 4.4X faster than Hive on MapReduce and 2.1X to 2.8X faster than Hive on Tez for the overall TPC-H experiments. Impala is also 8.2X to 10X faster than Hive on MapReduce and about 4.3X faster than Hive on Tez for the TPC-DS inspired experiments. Through detailed analysis of experimental results, we identify the reasons for this performance gap and examine the strengths and limitations of each system.
Conference Paper
Incorporating structured data in the Linked Data cloud is still complicated, despite the numerous existing tools. In particular, hierarchical structured data (e.g., JSON) are underrepresented, due to their processing complexity. A uniform mapping formalization for data in different formats, which would enable reuse and exchange between tools and applied data, is missing. This paper describes a novel approach of mapping heterogeneous and hierarchical data sources into RDF using the RML mapping language, an extension over R2RML (the W3C standard for mapping relational databases into RDF). To facilitate those mappings, we present a toolset for producing RML mapping files using the Karma data modelling tool, and for consuming them using a prototype RML processor. A use case shows how RML facilitates the mapping rules' definition and execution to map several heterogeneous sources.
Conference Paper
Despite the significant number of existing tools, incorporating data from multiple sources and different formats into the Linked Open Data cloud remains complicated. No mapping formalisation exists to define how to map such heterogeneous sources into RDF in an integrated and interoperable fashion. This paper introduces the RML mapping language, a generic language based on an extension over R2RML, the W3C standard for mapping relational databases into RDF. Broadening RML’s scope, the language becomes source-agnostic and extensible, while facilitating the definition of mappings of multiple heterogeneous sources. This leads to higher integrity within datasets and richer interlinking among resources.