Data Integration over NoSQL Stores Using
Access Path Based Mappings
Olivier Curé1, Robin Hecht2, Chan Le Duc3, Myriam Lamolle3
1 Université Paris-Est, LIGM, Marne-la-Vallée, France
ocure@univ-mlv.fr
2 University of Bayreuth, Bayreuth, Germany
robin.hecht@uni-bayreuth.de
3 LIASD, Université Paris 8 - IUT de Montreuil
chan.leduc,myriam.lamolle@iut.univ-paris8.fr
Abstract. Due to the large amount of data generated by user interac-
tions on the Web, some companies are currently innovating in the domain
of data management by designing their own systems. Many of them are
referred to as NoSQL databases, standing for ’Not only SQL’. With their
wide adoption will emerge new needs and data integration will certainly
be one of them. In this paper, we adapt a framework encountered for the
integration of relational data to a broader context where both NoSQL
and relational databases can be integrated. One important extension
consists in the efficient answering of queries expressed over these data
sources. The highly denormalized aspect of NoSQL databases results in
varying performance costs for several possible query translations. Thus a
data integration targeting NoSQL databases needs to generate an opti-
mized translation for a given query. Our contributions are to propose (i)
an access path based mapping solution that takes benefit of the design
choices of each data source, (ii) integrate preferences to handle conflicts
between sources and (iii) a query language that bridges the gap between
the SQL query expressed by the user and the query language of the data
sources. We also present a prototype implementation, where the target
schema is represented as a set of relations and which enables the in-
tegration of two of the most popular NoSQL database models, namely
document and column family stores.
1 Introduction
In [8], several database experts argued that Relational Data Base Management
Systems (RDBMS) can no longer handle all the data management issues en-
countered by many current applications. This is mostly due to (i) the high, and
ever increasing, volume of data needed to be stored by many (web) companies,
(ii) the extreme query workload required to access and analyze these data and
(iii) the need for schema flexibility.
Several systems have already emerged to propose an alternative to RDBMS
and many of them are categorized under the term NoSQL, standing for ’Not only
SQL’. Many of these databases are based on the Distributed Hash Table (DHT)
model which provides hash table access semantics. That is, in order to access
or modify a data object, a client is required to provide the key of that object,
and the database then looks up the object using an equality match on the
required key attribute. The first implementations were developed by companies
like Google and Amazon with respectively Bigtable [1] and Dynamo [3]. These
systems influenced the implementation of several open source systems such as
Cassandra4, HBase5, etc. Nowadays, the NoSQL ecosystem is relatively rich with
several categories of databases: column family (e.g. Bigtable, HBase, Cassandra),
key/value ( e.g. Dynamo, Riak6), document (e.g. MongoDB7, CouchDB8) and
graph oriented (e.g. InfiniteGraph9, Neo4J10). Most of these systems share com-
mon characteristics by aiming to support scalability, availability, flexibility and
to ensure fast access times for storage, data retrieval and analysis.
In order to meet some of these requirements, NoSQL database instances are
designed to reply efficiently to the precise needs of a given application. Note
that a similar approach, named denormalization [4], is frequently encountered
for applications using relational databases. Nevertheless, it may be required to
combine the data stored in several NoSQL database instances into a single ap-
plication while letting each of them evolve with its own application.
This combination of data coming from different sources corresponds to the no-
tion of a data integration system presented in [5]. Yet, several issues emerge
due to the following NoSQL characteristics: (i) NoSQL categories are based on
different data models and each implementation within a category may have its
own specificities. (ii) There does not exist a common query language for all
NoSQL databases. Moreover, most systems only support a procedural definition
of queries. (iii) The NoSQL ecosystem is characterized by a set of heteroge-
neous data management systems, e.g. not all databases support indexing. (iv)
The denormalized aspect of NoSQL databases makes query performance highly
dependent on access paths.
In this paper, we present a data integration system which is based on the
assumptions of Figure 1. The target schema corresponds to a standard relational
model. This is motivated by the familiarity of most end-users with this data
model and by the possibility of querying it with SQL. The sources can
either correspond to a set of column family, document and key/value stores as
well as to standard RDBMS.
To enable the querying of NoSQL databases within a data integration frame-
work we propose the following contributions. (1) We define a mapping language
between the target and the sources which takes into account the denormaliza-
tion aspect of NoSQL databases. This is materialized by storing preferred access
4http://cassandra.apache.org/
5http://hbase.apache.org/
6http://www.basho.com/
7http://www.mongodb.org/
8http://couchdb.apache.org/
9http://www.infinitegraph.com/
10 http://neo4j.org/
Fig. 1. Data integration overview
paths for a given mapping assertion. Moreover, this mapping language incor-
porates features dealing with conflicting data. (2) We propose a Bridge Query
Language (BQL) that enables a transformation from an SQL query defined over
the target to the query executed over a given source. (3) We present a proto-
type implementation which generates query programs for a popular document
oriented database, namely MongoDB, and Cassandra, a column family store.
This paper is organized as follows. In Section 2, we present related works in
the domain of querying NoSQL databases. In Section 3, we provide background
knowledge on two feature-rich and popular NoSQL databases: document and
column family stores. Section 4 presents our data integration framework with a
presentation of the syntax and semantics of the mapping language. In Section 5,
query processing in our data integration system is presented and BQL is detailed.
Section 6 concerns aspects of the prototype implementation. Finally, Section 7
concludes this paper.
2 Related work
In this section, we present some related works in the domain of querying non
relational databases in the context of the cloud and Map/Reduce.
Decoupling query semantics from the underlying data store is a widely spread
technique to support multiple data sources in one framework. Therefore, various
systems offer a common abstraction layer on top of their data storage layer.
Hadoop11 is a framework that supports data-intensive applications. On top
of a distributed, scalable, and portable filesystem (HDFS, [1]), Hadoop provides
a column-oriented database called HBase for real-time read and write access to
very large datasets.
In order to support queries against these large datasets, a programming
model called MapReduce [2] is provided by the system. MapReduce divides
workloads into suitable units, which can be distributed over many nodes and
therefore can be processed in parallel. However, the advantage of fast pro-
cessing of large datasets also has its catch, because writing MapReduce programs
is a very time-consuming business. There is a lot of overhead even for simple
11 http://hadoop.apache.org/
tasks. Working out how to fit data processing into the MapReduce pattern can
be a challenge. Therefore, Hadoop offers three different abstraction layers for its
MapReduce implementation, called Hive, Pig and Cascading.
Hive12 is a data warehouse infrastructure, which aims to bridge the gap be-
tween SQL and MapReduce queries. Therefore, it provides its own SQL like
query language called HiveQL [9]. It has traditional SQL constructs like joins,
GROUP BY, WHERE, SELECT and FROM clauses. These commands are translated
into MapReduce functions afterwards. Hive insists that all data has to be stored
in tables, with a schema under its management. Hive also allows traditional MapRe-
duce programs to plug in their own mappers and reducers to perform more
sophisticated analysis.
Like Hive, Pig13 tries to raise the level of abstraction for processing large data
sets with Hadoop’s MapReduce implementation. The Pig platform consists of a
high level language called Pig Latin [6] for constructing data pipelines, where
operations on an input relation are executed one after the other. These Pig Latin
data pipelines are translated into a sequence of MapReduce jobs by a compiler,
which is also included in the Pig framework.
Cascading14 is an API for data processing on Hadoop clusters. It is not a
new text based query syntax like Pig or another complex system that must be
installed and maintained like Hive. Cascading offers a collection of operations
like functions, filters and aggregators, which can be wired together into complex,
scale-free and fault tolerant data processing workflows as opposed to directly
implementing MapReduce algorithms.
In contrast to the missing standards for a NoSQL query language, standards
for persisting Java objects already exist. With the Java Persistence API
(JPA15) and Java Data Objects (JDO16) it is possible to map Java objects
into different databases. The DataNucleus implementation of these two standards
provides a mapping layer on top of HBase, BigTable [1], Amazon S317, MongoDB
and Cassandra. Google's App Engine18 uses this framework for persistence.
A powerful data query and administration tool which is used extensively
within the Oracle community is Quest Software's Toad19. Since 2010, a prototype
which also offers support for column family stores has been available. At the
time of writing this paper, the beta version 1.2 can be connected to Azure
Table Services20, Cassandra, SimpleDB21 , HBase and every ODBC compliant
relational database.
12 http://hive.apache.org/
13 http://pig.apache.org
14 http://www.cascading.org/
15 http://www.oracle.com/technetwork/java/javaee/tech/persistence-jsp-140049.html
16 http://www.oracle.com/technetwork/java/index-jsp-135919.html
17 http://aws.amazon.com/de/s3/
18 http://code.google.com/intl/de-DE/appengine/
19 http://toadforcloud.com
20 http://msdn.microsoft.com/en-us/library/dd179423.aspx
21 http://aws.amazon.com/de/simpledb/
Toad for Cloud consists of two components. The first is the Toad client,
which can be installed on a Microsoft Windows computer. It can be used to
access different databases, the Amazon EC2 console, and to write and execute
SQL statements. The second component is the Data Hub. It translates SQL
statements submitted through the Toad client into a language understood by all
supported databases and returns results in the familiar tabular row and column
format.
In order to use SQL on column family stores like Cassandra and HBase,
the column families, rows and columns have to be mapped into virtual tables.
Afterwards, the user has full MySQL support on these data, including
inserts, updates and deletes. Furthermore, it is possible to perform virtual data
integration with different data sources.
One reason why only column family stores are supported by Toad is their
easy mapping to relational databases. To do the same with a document store
containing objects with a deeply nested structure, or to squeeze a graph into a
relational schema, is a much more complicated task. Even if a suitable solution
were found, powerful and easy-to-use query languages, tools and interfaces like
the Traverser API for Neo4j would be missing in the SQL layer of Toad, which
queries the mapped tables.
Due to their different data models and their relatively young history, NoSQL
databases still lack a common query language. However, Quest Software and
Hadoop demonstrate that it is possible to use SQL (Toad) or an SQL-like query
language (Hive) on top of column family stores. A mapping to document stores
and graph databases is still missing.
3 Background
We have seen that each category of NoSQL databases has its own data model. In
this section, we present details concerning document oriented and column family
categories.
3.1 Document oriented databases
Document oriented databases correspond to an extension of the well-known key-
value concept where in this case the value consists of a structured document.
A document contains hierarchically organized data similar to XML and JSON.
This makes it possible to represent one-to-one as well as one-to-many relationships in
a single document. Therefore a complex document can be retrieved or stored
without using joins. Since document oriented databases are aware of the stored data,
they make it possible to define document field indexes as well as to propose advanced query
features. The most popular document oriented databases are MongoDB (10gen)
and CouchDB (Apache).
Example 1: In the following a document oriented database stores drug
information aimed at an application targeting the general public. According
to features proposed by the application, two so-called collections are defined:
drugD and therapD. drugD includes documents describing drug-related informa-
tion whereas therapD contains documents with information about the therapeutic
classes and the drugs used to treat them. Each drugD document is identified by a drug iden-
tifier. In this example, its attributes are limited to the name of the product, its
price, pharmaceutical lab and a list of therapeutic classes. The key for therapD
documents is a string corresponding to the therapeutic class name. It contains
a single attribute corresponding to the list of identifiers of the drugs treating this ther-
apeutic class. Figure 2 presents an extract of this database. Finally, in order to
ensure an efficient search for patients, an index on the name attribute of the drugD
documents is defined.
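To make the layout of Example 1 concrete, the following sketch populates and indexes such a collection with the legacy MongoDB Java driver (2.x series). The field names and values mirror Figure 2 and the running example (drug 3295935, Advil), but the exact document layout, the use of the drug identifier as _id and the driver version are assumptions on our part, not part of the prototype described later.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

import java.util.Arrays;

public class DrugDocumentExample {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        DB docDB = mongo.getDB("docDB");

        // drugD: one document per drug, keyed by the drug identifier (assumed stored as _id).
        DBCollection drugD = docDB.getCollection("drugD");
        drugD.insert(new BasicDBObject("_id", 3295935)
                .append("name", "Advil")
                .append("lab", "Pfizer")
                .append("price", 1.88)
                .append("therap", Arrays.asList("NSAID")));

        // therapD: one document per therapeutic class, listing its drug identifiers.
        DBCollection therapD = docDB.getCollection("therapD");
        therapD.insert(new BasicDBObject("_id", "NSAID")
                .append("drugs", Arrays.asList(3295935, 3575994)));

        // Secondary index on the drug name to support efficient patient searches.
        drugD.createIndex(new BasicDBObject("name", 1));

        mongo.close();
    }
}
```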
Fig. 2. Extract of drug document oriented database
3.2 Column-family databases
Column family stores correspond to persistent, sparse, distributed multilevel
hash maps. In column family stores, arbitrary keys (rows) are associated with arbitrary
key/value pairs (columns). These columns can be extended with further arbitrary
key/value pairs. Afterwards, these key/value pair lists can be organized into
column families and keyspaces. Finally, column family stores can appear in a very
similar shape to relational databases on the surface. The most popular systems
are HBase and Cassandra. All of them are influenced by Google's Bigtable.
Example 2: Figure 3 presents some column families defined in a medical
application. Since Cassandra works best when its data model is denormalized,
the data is divided into three column families: drugC, drugNameC and therapC.
The columns drugName, contra, composition and lab are integrated into drugC
and identified by the row key drugId. drugNameC contains the row key drugName and
a drugId column in order to provide an efficient search for patients. Since end-
users of this database need to search products by therapeutic classes, therapC
contains therapName as row key and a column for each drugId with a timestamp
as value.
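The multilevel hash map view of a column family can be illustrated with a plain Java sketch (no Cassandra API involved): column family → row key → (column name → value). The structure below mirrors Figure 3; the concrete composition and timestamp values are placeholders of ours.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** In-memory illustration of the column family layout of Figure 3:
 *  column family -> row key -> (column name -> column value). */
public class ColumnFamilySketch {
    static Map<String, Map<String, String>> drugC = new LinkedHashMap<>();
    static Map<String, Map<String, String>> drugNameC = new LinkedHashMap<>();
    static Map<String, Map<String, String>> therapC = new LinkedHashMap<>();

    public static void main(String[] args) {
        // drugC: full drug description, keyed by drugId.
        Map<String, String> advil = new LinkedHashMap<>();
        advil.put("drugName", "Advil");
        advil.put("lab", "Wyeth");
        advil.put("compo", "ibuprofen");      // placeholder composition value
        advil.put("price", "2.05");
        drugC.put("3295935", advil);

        // drugNameC: inverted access path from a drug name to its identifiers,
        // where the identifiers are stored as column names.
        Map<String, String> byName = new LinkedHashMap<>();
        byName.put("3295935", "");
        drugNameC.put("Advil", byName);

        // therapC: therapeutic class -> one column per drugId, a timestamp as value.
        Map<String, String> nsaid = new LinkedHashMap<>();
        nsaid.put("3295935", "1303980000");   // placeholder timestamp
        therapC.put("NSAID", nsaid);

        // Key-based access: efficient lookup of all drug identifiers named 'Advil'.
        System.out.println(drugNameC.get("Advil").keySet());
    }
}
```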
Fig. 3. Extract of drug column family database
4 Data integration framework
This section presents the syntax and semantics of our data integration frame-
work. Moreover, it focuses on the mapping language which supports the def-
inition of correspondences between sources and target entities. Our mapping
language integrates some original aspects by considering (i) the query processing
performance of the sources via access paths and (ii) the handling of contradicting
information between sources using preferences. In the rest of this paper,
we consider the following example.
Example 3: A medical application needs to integrate drug data coming from
two different NoSQL stores. The first database, corresponding to a document
store, denoted docDB, is used in a patient oriented application while the other
database, a column family store, denoted colDB, contains information aimed at
health care professionals. In this paper, we concentrate on some extracts of
docDB and colDB which correspond respectively to Figures 2 and 3. The data
stored in both databases present some overlap as well as some discrepancies
both at the tuple and schema level. For instance, at the schema level, both
databases contain French drug identifiers, names, pharmaceutical companies and
prices, but only colDB proposes access to the composition of a drug product.
Considering the tuple level, some drugs may be present in one database but
not in the other. Moreover, information concerning the same drug product (i.e.
identified by the same identifier value) may be contradictory in different
sources. Given these source databases, the target schema is defined as follows.
We consider that relation and attribute names are self-explanatory.
drug(drugId, drugName, lab, composition, price)
therapDrug (drugId, therapeuticName)
Obviously, our next step is to define correspondences between the sources and
the target. This is supported by mapping assertions which are currently being
defined manually by domain experts. In the near future, we aim to discover some
of them automatically by analyzing extensions and intensions of both sources
and the target. Nevertheless, we do not believe that all mapping assertions can
be discovered automatically due to the lack of semantics contained in both the
target and the sources. Next, we present the mapping language enabling the
definitions of mapping assertions.
4.1 Mapping language
Our data integration system takes the form of a triple ⟨T, S, M⟩ where T is the
target schema, S is the source schema and M is the mapping between T and
S. In Section 1, we motivated the fact that the set S could correspond to both
RDBMS and NoSQL stores and that T takes the form of a relational schema. We
consider that the target schema is given, possibly defined by a team of domain
experts or using schema matching techniques [7].
This mapping language adopts a GAV (Global As View) approach with sound
sources [5]. The mapping assertions are thus of the following form: φS ⇝ φT,
where φS is a query over S and φT is a relation of T.
Our system must deal with the heterogeneity of the sources and the highly
denormalized aspect of NoSQL database instances. In order to cope with this
last aspect, our mapping language handles the different access paths proposed
by a given source. This is due to the important performance differences one can
observe between the processing of the same query through different access paths.
For instance, in the context of colDB, retrieving the drug information from a
drug name will be more effective using the drugNameC column family rather than
the drugC column family (which would require a complete scan of all its tuples).
We believe that a mapping assertion corresponds to the ideal place to store
the preferred access paths possible for a target relation. Hence each mapping
assertion is associated with a list of attributes contained in its target relation.
The mapping language enables the use of the ’*’ symbol which, like in SQL,
denotes the complete list of attributes of a given relation. For a given mapping
assertion, an access path with attribute ’a’ is defined when the source entity offers
an efficient access, either using a key or an index, to a collection, set of columns
or tuple. Note that for a source corresponding to an RDBMS, the definition of
access paths is not necessary since computing the most effective query execution
plan will be performed by the system. Hence, definitions of access paths are
mandatory only for mapping assertions whose right hand side corresponds to a
NoSQL database.
Definition 1: General syntax of a mapping assertion with an access path
specification on attribute 'a': RelationT(a,b,c) ⇝a EntityS(<key ; value>),
where RelationT and EntityS respectively denote a relation of the target and
a conjunction of collections, column families or relations of a source. In this
mapping assertion, the attributes of RelationT follow the definition order of
this relation. Due to the schema flexibility of NoSQL databases, we cannot
rely upon any attribute ordering in a collection or column family. Hence, we
must use attribute names to identify distinct portions of a tuple. In order to
map RelationT and EntityS attributes, we introduce an 'AS' keyword to define a
correspondence between attribute symbols of the mapping assertion. Moreover, an
entry in EntityS is defined as a key/value structure using a ’<key ; value>’
syntax, where key is either (i) ’PKEY AS k’ or (ii) a variable name (previously
defined in an EntityS couple of the same mapping) and value is either of the
form (i) nameS AS nameT (where nameS and nameT are resp. attribute names
from the source and the target) or (ii) of the form FOREACH item AS name IN
list (where item corresponds to an element of the set denoted by list and
name is an attribute identifier of the source). Finally, a keyword is introduced to
denote the primary key of the structure (i.e. ’PKEY AS’) and to manipulate it,
e.g. IN KEY.
Note the possibility that some target relation attributes are never associated
with any mapping assertion. This means that there is no efficient way to filter a
query by this attribute due to the denormalization of the source databases. For
instance, a query searching for a given drug composition will be highly inefficient
in the colDB database due to the absence of a predefined structure proposing
a key-based access from this attribute. Finally, for a given source and target
relation, there must be a single mapping assertion with a given attribute.
A second feature of our mapping language consists in handling the data
structures of NoSQL databases that can be multivalued, nested and also con-
tain some valuable information in the key of a key/value structure; e.g. in the
drugNameC column family of Figure 3, the drug identifiers of a drug product are
stored in the key position (i.e. left hand side of the record). A multivalued ex-
ample is present in both docDB and colDB extracts for the therap attribute. This
forces our mapping language to handle access to the information associated with
these constructors, e.g. to enable iteration over lists. Nested attributes are han-
dled by using the standard ’.’ notation found in object oriented programming.
In addition, iterations over lists require the introduction of a 'FOREACH'
construct.
We now present the mapping assertions of our running example (a.p. denotes
an access path):
1. drug(i, l, n, c, p) ⇝* drugD(<PKEY AS i ; name AS n, lab AS l, price AS p>)
2. drug(i, n, l, c, p) ⇝i drugC(<PKEY AS i ; name AS n, lab AS l, compo AS c, price AS p>)
3. drug(i, n, l, c, p) ⇝n drugNameC(<PKEY AS n ; FOREACH id AS i IN KEY>),
   drugC(<id ; lab AS l, compo AS c, price AS p>)
4. therapDrug(i, t) ⇝i drugD(<PKEY AS i ; FOREACH the AS t IN Therap>)
5. therapDrug(i, t) ⇝t therapD(<PKEY AS t ; FOREACH id AS i IN Drugs>)
6. therapDrug(i, t) ⇝i drugC(<PKEY AS i ; FOREACH the AS t IN Therap>)
7. therapDrug(i, t) ⇝t therapC(<PKEY AS t ; FOREACH id AS i IN KEY>)
This set of mapping assertions exemplifies an important part of our mapping
language features:
– Assertion #1 has a '*' access path since we consider that all attributes of the
drugD collection are indexed. Also note that on this mapping assertion, the
c attribute is not mapped to any source attribute since that information is
not available in the docDB database.
– Mapping assertion #3 introduces the fact that several source entities can be
used in a single mapping (i.e. the drugNameC and drugC column families). Intu-
itively, this query accesses a given drugNameC column family entry identified
by a drug name and iterates over its drug identifiers, which are keys (using
the 'IN KEY' expression); it then uses these identifiers to access entries in the
drugC column family (using the i iterator variable in the key position of
drugC).
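As an illustration of how such assertions could be represented inside the integration system, the following Java sketch models a mapping assertion together with its access path and instantiates assertion #3. The class and field names are hypothetical; they are not taken from the prototype.

```java
import java.util.Arrays;
import java.util.List;

/** Minimal object model for a mapping assertion, illustrated on assertion #3. */
public class MappingAssertionSketch {

    /** One source entity reference: the <key ; value> part of a collection or column family. */
    static class SourceEntity {
        final String name;              // e.g. "drugNameC"
        final String keyExpr;           // e.g. "PKEY AS n" or a variable such as "id"
        final List<String> valueExprs;  // e.g. "FOREACH id AS i IN KEY", "lab AS l"
        SourceEntity(String name, String keyExpr, String... valueExprs) {
            this.name = name;
            this.keyExpr = keyExpr;
            this.valueExprs = Arrays.asList(valueExprs);
        }
    }

    /** Target relation, its attribute variables, the access path and the source entities. */
    static class MappingAssertion {
        final String targetRelation;      // "drug"
        final List<String> targetAttrs;   // order follows the target relation definition
        final String accessPath;          // "*" or one attribute variable, e.g. "n"
        final List<SourceEntity> sources;
        MappingAssertion(String targetRelation, List<String> targetAttrs,
                         String accessPath, SourceEntity... sources) {
            this.targetRelation = targetRelation;
            this.targetAttrs = targetAttrs;
            this.accessPath = accessPath;
            this.sources = Arrays.asList(sources);
        }
    }

    public static void main(String[] args) {
        // Assertion #3: drug(i, n, l, c, p) is answered from drugNameC joined with drugC,
        // using the drug name 'n' as the preferred access path.
        MappingAssertion m3 = new MappingAssertion(
                "drug", Arrays.asList("i", "n", "l", "c", "p"), "n",
                new SourceEntity("drugNameC", "PKEY AS n", "FOREACH id AS i IN KEY"),
                new SourceEntity("drugC", "id", "lab AS l", "compo AS c", "price AS p"));
        System.out.println(m3.targetRelation + " via access path '" + m3.accessPath + "'");
    }
}
```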
4.2 Dealing with conflicting data using attribute preferences
In general, data integration and data exchange solutions adopt the certain an-
swers semantics for query answering, i.e. results of a query expressed over the
target contain the intersection of data retrieved from the sources. We believe
that this pessimistic approach is too restrictive and, as a consequence, many
valid results may be missing from the final result.
At the other extreme of the query answering semantics spectrum, we find
the possible answers semantics which provides as the result of a target query the
union of the source results. With this optimistic approach, conflicting results may
be proposed as a final result, leaving the end-users unsatisfied.
In this work, we propose a trade-off between these two semantics which is
based on a preference-based approach. Intuitively, preferences provided over tar-
get attributes define a partial order over mapped sources. Hence, for a given
data object, conflicting information on the same attribute among different data
sources can be handled efficiently and the final result will contain the preferred
values.
Example 4: Consider the queries over docDB and colDB asking for lab and
price information for the drug identified by value 3295935. Given the informa-
tion stored in both sources, respectively the document store (Figure 2) and the column
family store (Figure 3), conflicts arise on the prices, resp. 1.88 and 2.05 euros,
and pharmaceuticals, resp. Pfizer and Wyeth.
When creating the mapping assertions, domain experts can express that drug
prices are more accurate in the document store (docDB) and that information
about the pharmaceutical laboratory is more trustworthy in the column family store (colDB).
Hence the result of this query will contain a single tuple consisting of: {Advil,
Wyeth, 1.88}, i.e. mixing values retrieved from the different sources.
We now define the notion of preferences over mapping assertions.
Definition 2: Consider a set of source databases {DB1, DB2, ..., DBn}. A
preference relation, denoted ≻, is a relation ≻ ⊆ DBi × DBj, with i ≠ j, that
is defined on each non primary key attribute of the target relations. A preference
is total on an attribute A if for every pair {DBi, DBj} of sources that propose
attribute A, either DBi ≻* DBj or DBj ≻* DBi, where ≻* is the transitive closure
of ≻.
Example 5: Consider the drug relation in our running example target
schema. Its definition according to the preferences proposed in Example 4 is
the following: drug(drugId, drugName[docDB ≻ colDB], lab[colDB ≻ docDB],
composition, price[docDB ≻ colDB])
That is, for a given drug, in case of conflict, its docDB drugName attribute
is preferred to the one proposed by colDB and the preferred value for the lab
attribute is colDB over docDB. Note that since the composition attribute can only
be retrieved from the colDB source, it is not necessary to define a preference order
over this attribute.
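A minimal sketch of how the preference orders of Definition 2 could be applied at query answering time is given below, using the orders of Example 5 and the conflicting values of Example 4. The resolution logic and all names are ours, not the prototype's.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Resolves conflicting attribute values using a per-attribute preference order over sources. */
public class PreferenceResolver {
    // Attribute -> ordered list of sources, most preferred first (Example 5).
    static final Map<String, List<String>> PREFERENCES = new HashMap<>();
    static {
        PREFERENCES.put("drugName", Arrays.asList("docDB", "colDB"));
        PREFERENCES.put("lab",      Arrays.asList("colDB", "docDB"));
        PREFERENCES.put("price",    Arrays.asList("docDB", "colDB"));
        // 'composition' is only provided by colDB, so no preference is needed.
    }

    /** valuesBySource: source name -> value it returned for this attribute (may disagree). */
    static String resolve(String attribute, Map<String, String> valuesBySource) {
        List<String> order = PREFERENCES.get(attribute);
        if (order != null) {
            for (String source : order) {
                String v = valuesBySource.get(source);
                if (v != null) {
                    return v;   // first source in preference order that has a value wins
                }
            }
        }
        // No preference defined: fall back to any available value.
        return valuesBySource.values().stream().findFirst().orElse(null);
    }

    public static void main(String[] args) {
        // Conflict of Example 4 on drug 3295935: docDB says 1.88, colDB says 2.05.
        Map<String, String> price = new HashMap<>();
        price.put("docDB", "1.88");
        price.put("colDB", "2.05");
        Map<String, String> lab = new HashMap<>();
        lab.put("docDB", "Pfizer");
        lab.put("colDB", "Wyeth");
        System.out.println(resolve("price", price) + " / " + resolve("lab", lab)); // 1.88 / Wyeth
    }
}
```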
5 Query processing
Once a target relation schema and a set of mapping assertions have been defined,
end-users can express queries in SQL over the target database. Since that
database is virtual, i.e. it does not contain any data, data needs to be retrieved
from the sources and processed to provide a final result. The presence of NoSQL
databases in the set of sources requires transforming the original SQL query into a
query specifically tailored to each NoSQL source. This transformation is based on
the peculiarities of the source database, e.g. whether a declarative query language
exists or only a procedural approach is available to query that database, and on the
mapping assertions. Since most NoSQL stores support only a procedural query
approach, we have decided to implement a query language to bridge the gap
between SQL and some code in a programming language. This section presents
the Bridge Query Language (henceforth BQL) which is used internally by our
data integration system and the query processing semantics.
5.1 Architecture
The overall architecture of query processing within our data integration sys-
tem is presented in Figure 4. First, an end-user writes an SQL query over the
target schema. The expressivity of accepted SQL queries corresponds to Select
Project Join (SPJ) conjunctive queries, e.g. GROUP BY clauses are not accepted
but we are planning to introduce them in future extensions. Note that this limi-
tation is due to a common abstraction of the NoSQL databases we are studying
in this paper (column family and document).
An end-user SQL query is then translated into the BQL internal query lan-
guage of our data integration system. This transformation corresponds to a
rewriting of the SQL query into a BQL query using the mapping assertions. Note
that this translation step is not needed for an RDBMS.
Then, for each BQL query, a second transformation is performed, this time to gen-
erate a query tailored to the NoSQL database system. Thus, for each supported
NoSQL implementation, a set of rules is defined for the translation of a BQL
query. Most of the time, the BQL translation takes the form of a program and
uses a specific API. In Section 6, we provide details on the translation from BQL
to Java programs into MongoDB and Cassandra.
The results obtained from each query are later processed within the data inte-
gration system. Intuitively, each result set takes the form of a list containing the
requested target columns. In order to detect data conflicts, we need to efficiently
identify similar objects. This step is performed by incorporating into the result
set values corresponding to primary keys of target relations of the SQL query. So,
even if primary keys are not supposed to be displayed in the final query result,
they are temporarily stored in the result set. Hence objects returned from the
union of the result sets are easily and unambiguously identified. Similar objects
can then be analyzed using the preference orders defined over target attributes.
The query result contains values retrieved from the preferred source attributes.
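The grouping step described above can be sketched as follows: every per-source row carries the primary key value, and rows are grouped by that key so that, for each attribute, the set of (source, value) pairs can then be handed to the preference resolution of Section 4.2. This is a hypothetical helper of ours, not the actual implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Groups per-source result rows by the target primary key before conflict resolution. */
public class ResultSetMerger {

    /** One row returned by a source, keyed by target attribute name;
     *  the primary key attribute is always present even if not selected by the user. */
    static class SourceRow {
        final String source;                 // e.g. "docDB"
        final Map<String, String> values;    // e.g. {drugId=3295935, price=1.88}
        SourceRow(String source, Map<String, String> values) {
            this.source = source;
            this.values = values;
        }
    }

    /** Returns pkValue -> (attribute -> (source -> value)), ready for preference resolution. */
    static Map<String, Map<String, Map<String, String>>> group(List<SourceRow> rows, String pk) {
        Map<String, Map<String, Map<String, String>>> byObject = new HashMap<>();
        for (SourceRow row : rows) {
            String key = row.values.get(pk);
            Map<String, Map<String, String>> attrs =
                    byObject.computeIfAbsent(key, k -> new HashMap<>());
            for (Map.Entry<String, String> e : row.values.entrySet()) {
                attrs.computeIfAbsent(e.getKey(), a -> new HashMap<>())
                     .put(row.source, e.getValue());
            }
        }
        return byObject;
    }
}
```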
Fig. 4. Query processing
5.2 Bridge Query Language
BQL is the internal query language that bridges the gap between the SQL lan-
guage of the target model and the heterogeneous query languages
of the sources. The syntax of the query language follows the EBNF proposed in
the companion web site. This language contains a set of reserved words whose
semantics is obvious to a programmer. For instance, the get instruction enables
the definition of a set of filter operations and of the distinguished variables of
the query, i.e. the values needed in the result. The foreach ... in ... : instruction
is frequently encountered in many programming languages and its semantics
aligns with them. Intuitively, it supports an iteration over elements of a result set and the
associated processing is performed after the ’:’ symbol. We have implemented
an SQL to BQL translator which parses an SQL query and generates a set of
BQL queries, one for each NoSQL database mapped to the relation of the target
query. This translator takes into account the mapping assertions defined over
the data integration system.
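The core decision made by the SQL-to-BQL translator can be sketched as follows: given the attribute filtered in the WHERE clause, it keeps only the mapping assertions whose access path covers that attribute and emits one get per retained source. The sketch below hard-codes the drug assertions of Section 4.1, prints the generated BQL as plain text and, for brevity, omits the chained foreach step required by assertion #3; it is our own simplification, not the translator's code.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified selection of a mapping assertion based on the filtered attribute. */
public class SqlToBqlSketch {

    static class Assertion {
        final String source;      // "docDB" or "colDB"
        final String entity;      // e.g. "drugD", "drugNameC"
        final String accessPath;  // "*" or the attribute offering key/index access
        Assertion(String source, String entity, String accessPath) {
            this.source = source; this.entity = entity; this.accessPath = accessPath;
        }
    }

    // Assertions mapped to the 'drug' target relation (see Section 4.1).
    static final List<Assertion> DRUG_ASSERTIONS = Arrays.asList(
            new Assertion("docDB", "drugD", "*"),
            new Assertion("colDB", "drugC", "drugId"),
            new Assertion("colDB", "drugNameC", "drugName"));

    /** Emits one BQL get per source that can answer a filter on 'filterAttr' efficiently. */
    static void translate(String filterAttr, String filterValue, String... projection) {
        for (Assertion a : DRUG_ASSERTIONS) {
            if (a.accessPath.equals("*") || a.accessPath.equals(filterAttr)) {
                System.out.println("ans(" + String.join(", ", projection) + ") = "
                        + a.source + "." + a.entity + ".get({" + filterAttr + "="
                        + filterValue + "},{" + String.join(", ", projection) + "})");
            }
        }
    }

    public static void main(String[] args) {
        // SELECT drugId, price FROM drug WHERE drugName LIKE 'Advil';  (Query 2)
        translate("drugName", "'Advil'", "drugId", "price");
    }
}
```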
Example 6: We now introduce a set of queries expressed over our running
example. They correspond to different real-case scenarios and emphasize the
different functionalities of our query language. For each target query (SQL), we
present the BQL generated for both colDB and docDB.
Query 1 accesses a single table via its primary key.
SQL: SELECT drugName, price FROM drug WHERE drugId=3295935;
docDB's BQL: ans(drugName, price) = docDB.drugD.get({PKEY=
3295935},{name, price}) provides answer (Advil, 1.88)
colDB's BQL: ans(drugName, price) = colDB.drugC.get({PKEY=
3295935},{name, price}) provides answer (Advil, 2.05)
Answer: Since the query identifies tuples with the primary key, the real world
object of the answers is supposed to be the same and we apply the preferences
over the union of the results. The processed result is (Advil, 1.88)
Query 2 accesses a single table over a non-primary-key but indexed attribute of
the target.
SQL: SELECT drugId, price FROM drug WHERE drugName LIKE ’Advil’;
docDB’s BQL: ans(drugId, price) = docDB.drugD.get({name=’Advil’},
{PKEY, price}) with answer {(3295935, 1.88)}
colDB’s BQL: temp(drugId) = colDB.drugNameC.get({name=’Advil’},
{KEY})
ans(drugId, price) = foreach id in temp(drugId) : colDB.drugC.get(
{KEY=id},{KEY, price}) with answer {(3295935, 2.05), (3575994, 2.98)}
Answer: The final result set is {(3295935, 1.88), (3575994, 2.98)}, thus mixing
the results and taking advantage of the preference setting.
Query 3 retrieves data from a single relation with a filter over a non-primary
and non-indexed attribute of the target.
SQL: SELECT drugName FROM drug WHERE lab=’Bayer’;
docDB’s BQL: ans(drugName) = docDB.drugD.get({lab=’Bayer’},{name})
colDB’s BQL: No solution
Answer: Since no queries are generated for the colDB store, the results are
retrieved solely from docDB.
Query 4 involves two relations and an access from a single primary key at-
tribute of the target.
SQL: SELECT drugName FROM drug d, therapDrug td WHERE d.drugId=td.drugId
AND therapId LIKE ’NSAID’;
docDB’s BQL: temp(drugs) = docDB.therapD.get({PKEY=’NSAID’},{drugs})
ans(drugName) = foreach id in temp(drugs) : docDB.drugD.get(
{PKEY=id},{name})
colDB’s BQL: temp(drugs) = colDB.therapC.get({PKEY=’NSAID’},{KEY})
ans(drugName) = foreach id in temp(drugs) :
colDB.drugC.get({PKEY=id},{name})
Answer: provides the same result as in Query 2.
6 Implementation
In this section, we sketch our prototype implementation which tackles a docu-
ment store (MongoDB) and a column family store (Cassandra). Together they represent
some of the most popular open source projects in the NoSQL ecosystem. The
platform we have adopted is Java since both MongoDB and Cas-
sandra propose APIs for, and enable the execution of, this programming language.
Moreover, Java is suitable for numerous other NoSQL stores. Nevertheless, in the
near future, we are planning to tackle other systems, e.g. CouchDB or HBase,
and consider other access methods, e.g. JavaScript or Python.
An important task of our prototype is to handle the transformation modules
found in Figure 4. That is to process the translation (i) from SQL to BQL and
(ii) from BQL to the query language supported by each NoSQL store. Due to
the declarative nature of both query languages involved in the former translation, this
task is easy to implement and runs in time linear in the length of
the input SQL query. The latter translation task is more involved since BQL
corresponds to a declarative language while the target queries of our NoSQL stores
correspond to Java methods. The highly denormalized nature of NoSQL stores
imposes that only a limited set of queries can be efficiently processed. In fact,
this results in similarities between the queries expressed over a given NoSQL
database instance. We have extracted these similarities into patterns which are
implemented using Java methods. The main idea is to consider a finite set of
BQL templates and to associate a Java method with each of these templates.
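As an example of such a template, the sketch below implements a key-equality get over a MongoDB collection with the legacy 2.x Java driver; the method name, the dispatch mechanism and the driver version are assumptions of ours. Under the document layout assumed earlier, Query 1's docDB BQL would correspond to get(drugD, "_id", 3295935, "name", "price").

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

/** One BQL template: get({<attr>=<value>}, {<projection>}) over a MongoDB collection. */
public class MongoGetTemplate {

    /** Executes docDB.<collection>.get({attr=value},{fields...}) with the 2.x Java driver. */
    static DBObject get(DBCollection collection, String attr, Object value, String... fields) {
        DBObject query = new BasicDBObject(attr, value);
        BasicDBObject projection = new BasicDBObject();
        for (String field : fields) {
            projection.append(field, 1);   // include only the requested fields
        }
        return collection.findOne(query, projection);
    }
}
```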
One can ask the following question: is this approach still valid and efficient
for more complicated SQL queries, e.g. those involving aggregate functions (MIN,
MAX, etc.), GROUP BY and HAVING clauses, or LIKE constructs? A reply to this question
necessarily needs to consider the particular features of each NoSQL database
supported by the system since, up to now, a common framework for NoSQL
stores does not exist. In the case of MongoDB, the support of regular expressions
enables the execution of complex statements with minimal additional effort. On
the other hand, Cassandra only supports a range query approach on primary
keys. This results in inefficient support for aggregate operations and
queries involving regular expressions.
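For illustration, a LIKE filter could be pushed to MongoDB as a regular expression with the same 2.x driver, something Cassandra cannot evaluate server-side; the SQL-wildcard-to-regex conversion below is deliberately simplified and is an assumption of ours.

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;

import java.util.regex.Pattern;

public class LikeToRegexSketch {

    /** Translates "... LIKE 'Adv%'" into a MongoDB regex query on the 'name' field. */
    static DBCursor findByNamePattern(DBCollection drugD, String sqlLikePattern) {
        // Convert the SQL wildcard '%' into its regex counterpart '.*' (simplified).
        String regex = "^" + sqlLikePattern.replace("%", ".*") + "$";
        return drugD.find(new BasicDBObject("name", Pattern.compile(regex)));
    }
}
```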
7 Conclusions
This paper is a first approach to integrate data coming from NoSQL stores and
relational databases into a single virtualized database. Due to the increasing
popularity of this novel family of databases, we consider that such data integration
systems will be quite useful in the near future. Our system adopts a relational
approach for the target schema which enables end-users to express queries in
the declarative SQL language. Several transformation steps are then required
to obtain results from the data stored at each of the sources. Hence a bridge
query language has been presented as the cornerstone of these transformations.
Another important component of our system is the mapping language which
(i) handles uncertainty and contradicting information at the sources by defining
preferences over mapping assertions and (ii) supports the setting of access path
information in order to generate an efficiently processable query plan.
In preliminary results, the overhead of these transformation steps does not
impact the performance of query answering. Our list of future work is extensive
and, among others, it contains the support of NoSQL stores corresponding to a
graph model and the (semi) automatic discovery of mapping assertions based on
the analysis of the values stored in each source.
References
1. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chan-
dra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured
data. In OSDI, pages 205–218, 2006.
2. J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters.
Commun. ACM, 51(1):107–113, 2008.
3. G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin,
S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: amazon’s highly available
key-value store. In SOSP, pages 205–220, 2007.
4. M. Kifer, A. Bernstein, and P. M. Lewis. Database Systems: An Application Oriented
Approach, Complete Version (2nd Edition). Addison-Wesley Longman Publishing
Co., Inc., Boston, MA, USA, 2005.
5. M. Lenzerini. Data integration: A theoretical perspective. In PODS, pages 233–246,
2002.
6. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-
so-foreign language for data processing. In SIGMOD ’08: Proceedings of the 2008
ACM SIGMOD international conference on Management of data, pages 1099–1110,
New York, NY, USA, 2008. ACM.
7. E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema match-
ing. The VLDB Journal, 10:334–350, December 2001.
8. M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Hel-
land. The end of an architectural era (it’s time for a complete rewrite). In VLDB,
pages 1150–1160, 2007.
9. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyck-
off, and R. Murthy. Hive - a warehousing solution over a map-reduce framework.
PVLDB, 2(2):1626–1629, 2009.