Fig 1 - Data integration overview
Source publication
Conference Paper
Full-text available
Due to the large amount of data generated by user interactions on the Web, some companies are currently innovating in the domain of data management by designing their own systems. Many of them are referred to as NoSQL databases, standing for ’Not only SQL’. With their wide adoption, new needs will emerge, and data integration will certainly be one of...

Context in source publication

Context 1
... [8], several database experts argued that Relational Database Management Systems (RDBMS) can no longer handle all the data management issues encountered by many current applications. This is mostly due to (i) the high, and ever increasing, volume of data that many (web) companies need to store, (ii) the extreme query workload required to access and analyze these data and (iii) the need for schema flexibility. Several systems have already emerged to propose an alternative to RDBMS, and many of them are categorized under the term NoSQL, standing for ’Not only SQL’. Many of these databases are based on the Distributed Hash Table (DHT) model, which provides hash table access semantics: in order to access or modify an object's data, a client is required to provide the key of that object, and the database then looks up the object using an equality match on the required attribute key. The first implementations were developed by companies like Google and Amazon with, respectively, Bigtable [1] and Dynamo [3]. These systems influenced the implementation of several open source systems such as Cassandra 4, HBase 5, etc. Nowadays, the NoSQL ecosystem is relatively rich, with several categories of databases: column family (e.g. Bigtable, HBase, Cassandra), key/value (e.g. Dynamo, Riak 6), document (e.g. MongoDB 7, CouchDB 8) and graph oriented (e.g. InfiniteGraph 9, Neo4J 10). Most of these systems share common characteristics, aiming to support scalability, availability and flexibility, and to ensure fast access times for storage, data retrieval and analysis. In order to meet some of these requirements, NoSQL database instances are designed to reply efficiently to the precise needs of a given application. Note that a similar approach, named denormalization [4], is frequently encountered for applications using relational databases. Nevertheless, it may be required to combine the data stored in several NoSQL database instances into a single application while letting them evolve with their own applications. This combination of data coming from different sources corresponds to the notion of a data integration system presented in [5]. Yet, several issues emerge due to the following NoSQL characteristics: (i) NoSQL categories are based on different data models, and each implementation within a category may have its own specificities. (ii) There does not exist a common query language for all NoSQL databases; moreover, most systems only support a procedural definition of queries. (iii) The NoSQL ecosystem is characterized by a set of heterogeneous data management systems, e.g. not all databases support indexing. (iv) The denormalized aspect of NoSQL databases makes query performance highly dependent on access paths. In this paper, we present a data integration system which is based on the assumptions of Figure 1. The target schema corresponds to a standard relational model. This is motivated by the familiarity of most end-users with this data model and the possibility of querying it with the SQL language. The sources can correspond to a set of column family, document and key/value stores, as well as to standard RDBMS. To enable the querying of NoSQL databases within a data integration framework, we propose the following contributions. (1) We define a mapping language between the target and the sources which takes into account the denormalization aspect of NoSQL databases. This is materialized by storing preferred access paths for a given mapping assertion. 
Moreover, this mapping language incorporates features dealing with conflicting data. (2) We propose a Bridge Query Language (BQL) that enables a transformation from an SQL query defined over the target to the query executed over a given source. (3) We present a prototype implementation which generates query programs for MongoDB, a popular document-oriented database, and Cassandra, a column family store. This paper is organized as follows. In Section 2, we present related works in the domain of querying NoSQL databases. In Section 3, we provide background knowledge on two feature-rich and popular categories of NoSQL databases: document and column family stores. Section 4 presents our data integration framework with a presentation of the syntax and semantics of the mapping language. In Section 5, query processing in our data integration system is presented and BQL is detailed. Section 6 concerns aspects of the prototype implementation. Finally, Section 7 concludes this paper. In this section, we present some related works in the domain of querying non-relational databases in the context of the cloud and Map/Reduce. Decoupling query semantics from the underlying data store is a widespread technique to support multiple data sources in one framework. Therefore, various systems offer a common abstraction layer on top of their data storage layer. Hadoop 11 is a framework that supports data-intensive applications. On top of a distributed, scalable, and portable filesystem (HDFS, [1]), Hadoop provides a column-oriented database called HBase for real-time read and write access to very large datasets. In order to support queries against these large datasets, a programming model called MapReduce [2] is provided by the system. MapReduce divides workloads into suitable units, which can be distributed over many nodes and therefore processed in parallel. However, the fast processing of large datasets has its catch: writing MapReduce programs is a very time-consuming business, and there is a lot of overhead even for simple tasks. Working out how to fit data processing into the MapReduce pattern can be a challenge. Therefore, the Hadoop ecosystem offers three different abstraction layers for its MapReduce implementation, called Hive, Pig and Cascading. Hive 12 is a data warehouse infrastructure which aims to bridge the gap between SQL and MapReduce queries. To this end, it provides its own SQL-like query language called HiveQL [9]. It has traditional SQL constructs like joins, group by, where, select and from clauses, which are afterwards translated into MapReduce functions. Hive insists that all data be stored in tables, with a schema under its management, but allows traditional MapReduce programs to plug in their own mappers and reducers to do more sophisticated analysis. Like Hive, Pig 13 tries to raise the level of abstraction for processing large data sets with Hadoop’s MapReduce implementation. The Pig platform consists of a high-level language called Pig Latin [6] for constructing data pipelines, where operations on an input relation are executed one after the other. These Pig Latin data pipelines are translated into a sequence of MapReduce jobs by a compiler, which is also included in the Pig framework. Cascading 14 is an API for data processing on Hadoop clusters. It is not a new text-based query syntax like Pig, nor another complex system that must be installed and maintained like Hive. 
Cascading offers a collection of operations like functions, filters and aggregators, which can be wired together into complex, scale-free and fault-tolerant data processing workflows, as opposed to directly implementing MapReduce algorithms. In contrast to the missing query language standards for NoSQL databases, standards for persisting Java objects already exist. With the Java Persistence API (JPA 15) and Java Data Objects (JDO 16) it is possible to map Java objects into different databases. The DataNucleus implementation of these two standards provides a mapping layer on top of HBase, BigTable [1], Amazon S3 17, MongoDB and Cassandra. Google’s App Engine 18 uses this framework for persistence. A powerful data query and administration tool used extensively within the Oracle community is Quest Software’s Toad 19 . ...
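
The excerpt above introduces a mapping language with preferred access paths and a Bridge Query Language (BQL) that rewrites SQL posed on the relational target into native source queries. The paper's concrete mapping and BQL syntax is not reproduced in this excerpt, so the following Java sketch is only a hypothetical illustration (the table, collection and attribute names are invented) of the kind of MongoDB driver code such a translation ultimately has to produce for a simple target-side query.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Projections.include;

// Hypothetical illustration: the target-side SQL query
//   SELECT name FROM Users WHERE city = 'Paris'
// is assumed to be mapped to a MongoDB collection "users", with the
// mapping assertion recording "city" as a preferred access path
// (e.g. an indexed attribute) for this source.
public class BridgeQuerySketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                    client.getDatabase("integration").getCollection("users");

            // Equivalent of: SELECT name FROM Users WHERE city = 'Paris'
            for (Document d : users.find(eq("city", "Paris"))
                                   .projection(include("name"))) {
                System.out.println(d.getString("name"));
            }
        }
    }
}
```

For a column family store such as Cassandra, the same target query would translate into entirely different native code, which is why each mapping assertion records a per-source preferred access path.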
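The related-work discussion argues that raw MapReduce programming carries a lot of overhead even for simple tasks. The canonical Hadoop word-count job below, written against the standard org.apache.hadoop.mapreduce API, illustrates the point: two inner classes and a job driver for an aggregation that HiveQL or Pig Latin would express in a line or two.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```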
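The excerpt also points to JPA and JDO as existing standards for persisting Java objects, implemented over several NoSQL stores by DataNucleus. A minimal, hypothetical JPA entity shows the idea: the annotated class stays the same while the configured persistence unit decides which back end (e.g. HBase or MongoDB under DataNucleus) actually stores the data.

```java
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;

@Entity
public class UserProfile {
    @Id private String userId;   // mapped to the store's row/document key
    private String name;
    private String city;

    protected UserProfile() {}   // no-arg constructor required by JPA
    public UserProfile(String userId, String name, String city) {
        this.userId = userId; this.name = name; this.city = city;
    }

    public static void main(String[] args) {
        // "nosqlUnit" is a hypothetical persistence unit; with DataNucleus
        // it would name the target store (HBase, MongoDB, ...) in persistence.xml.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("nosqlUnit");
        EntityManager em = emf.createEntityManager();
        em.getTransaction().begin();
        em.persist(new UserProfile("u1", "Ada", "Paris"));
        em.getTransaction().commit();
        em.close();
        emf.close();
    }
}
```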

Similar publications

Conference Paper
Full-text available
The complexity imposed by data heterogeneity makes it difficult to integrate 'streaming x streaming' and 'streaming x historical' data types. For practical analysis, the enrichment and contextualization process based on historical and streaming data would benefit from approaches that facilitate data integration, abstracting details and formats of t...
Article
Full-text available
In the modern database environment, new non-traditional database types have appeared that are Not-SQL (NoSQL) databases. A NoSQL database does not rely on the principles of the relational database. CouchDB is one of the NoSQL document-oriented databases; in CouchDB the basic element is a document. All types of databases have the same conceptual data mo...
Preprint
Full-text available
A Natural Language Interface (NLI) facilitates users to pose queries to retrieve information from a database without using any artificial language such as the Structured Query Language (SQL). Several applications in various domains including healthcare, customer support and search engines, require elaborating structured data having information on t...
Conference Paper
Full-text available
Currently, the standards and protocols for data access in the Virtual Observatory architecture (DAL) are generally implemented with relational databases based on SQL. In particular, the Astronomical Data Query Language (ADQL), language used by IVOA to represent queries to VO services, was created to satisfy the different data access protocols, such...

Citations

... One of the more recent proposals is SQLtoKeyNoSQL [31], a layer for translation between SQL and key-oriented non-relational databases. In [32] a mapping language is defined in which the schema over which queries are posed corresponds to a relational data model, allowing queries to be specified in SQL, while the source systems can be column, key-value, or document stores. ...
Conference Paper
Full-text available
Modern large-scale information systems often use multiple database management systems, not all of which are necessarily relational. In recent years, NoSQL databases have gained acceptance in certain domains, while relational databases remain the de facto standard in many others. Many "legacy" information systems also use relational databases. Unlike relational database systems, NoSQL databases do not have a common data model or query language, making it difficult for users to access data in a uniform manner when using a combination of relational and NoSQL databases, or simply several different NoSQL database systems. Therefore, the need for uniform data access from such a variety of data sources becomes one of the central problems of data integration. In this paper we provide an overview of the main problems, methods, and solutions for data integration between relational and NoSQL databases, as well as between different NoSQL databases. We focus mainly on the problems of structural, syntactic, and semantic heterogeneity and on proposed solutions for uniform data access, emphasizing some of the more recent proposals.
... Motivated by the mainstream popularity of the SQL query language, [91] poses a relational schema over NoSQL stores. A set of mapping assertions associating the general relational schema with the data source schemata is defined. ...
... This is hard to achieve in practice given the highly heterogeneous nature of a Data Lake. Therefore, numerous recent publications (e.g., [89][90][91]) advocate the use of an intermediate query language to interface between the SPARQL query and the data sources. In our case, the intermediate query language is the query language (e.g., SQL) corresponding to the ParSet data model (e.g., tabular). ...
Thesis
Full-text available
The remarkable advances achieved in both research and development of Data Management, as well as the prevalence of high-speed Internet and technology in the last few decades, have caused an unprecedented data avalanche. Large volumes of data manifested in a multitude of types and formats are being generated and are becoming the new norm. In this context, it is crucial both to leverage existing approaches and to propose novel ones to overcome this data size and complexity, and thus facilitate data exploitation. In this thesis, we investigate two major approaches to addressing this challenge: Physical Data Integration and Logical Data Integration. The specific problem tackled is to enable querying large and heterogeneous data sources in an ad hoc manner. In Physical Data Integration, data is physically and wholly transformed into a canonical unique format, which can then be directly and uniformly queried. In Logical Data Integration, data remains in its original format and form, and a middleware is placed above the data, allowing various schema elements to be mapped to a high-level unifying formal model. The latter enables querying the underlying original data in an ad hoc and uniform way, a framework which we call the Semantic Data Lake (SDL). Both approaches have their advantages and disadvantages. For example, in the former, significant effort and cost are devoted to pre-processing and transforming the data into the unified canonical format; in the latter, the cost is shifted to the query processing phases, e.g., query analysis, relevant source detection and results reconciliation. In this thesis we investigate both directions and study their strengths and weaknesses. For each direction, we propose a set of approaches and demonstrate their feasibility via a proposed implementation. In both directions, we appeal to Semantic Web technologies, which provide a set of time-proven techniques and standards dedicated to Data Integration. In the Physical Integration, we suggest an end-to-end blueprint for the semantification of large and heterogeneous data sources, i.e., physically transforming the data to the Semantic Web data standard RDF (Resource Description Framework). A unified data representation, storage and query interface over the data are suggested. In the Logical Integration, we provide a description of the SDL architecture, which allows querying data sources in their original form and format without requiring prior transformation and centralization. For a number of reasons that we detail, we put more emphasis on the virtual approach. We present the effort behind an extensible implementation of the SDL, called Squerall, which leverages state-of-the-art Semantic and Big Data technologies, e.g., RML (RDF Mapping Language) mappings, the FnO (Function Ontology) ontology, and Apache Spark. A series of evaluations is conducted to assess the implementation across various metrics and input data scales. In particular, we describe an industrial real-world use case using our SDL implementation. In a preparation phase, we conduct a survey of Query Translation methods in order to back some of our design choices.
... We also found a schema implementation for GeoJSON (based on JSON Schema draft-7) that is similar to our proposal on GitHub 6 . However, we do not consider this implementation an official schema, as the official GeoJSON homepage has no link to the code presented on GitHub, which is technical documentation that only shows a schema for geographic data types. Another unofficial geographic data schema proposal is geojson.json, ...
... Second, we demonstrate, through this case study, how JS4Geo can be used to define the equivalence of attributes and real-world entities, which are common problems faced by data integration or data interoperability processes. The related literature contains numerous examples of equivalence problems that could easily be solved by using JS4Geo [6,8]. For the sake of space, we do not detail them. ...
Article
Full-text available
The large volume and variety of data produced in the current Big Data era lead companies to seek solutions for efficient data management. Within this context, NoSQL databases arise as a better alternative to traditional relational databases, mainly in terms of scalability and availability of data. A common feature of NoSQL databases is that they are schemaless, i.e., they do not impose a schema or have only a flexible one. This is interesting for systems that deal with complex data, such as GIS. However, the lack of a schema becomes a problem when applications need to perform processes such as data validation, data integration, or data interoperability, as there is no pattern for schema representation in NoSQL databases. On the other hand, the JSON language stands out as a standard for representing and exchanging data in document NoSQL databases, and JSON Schema is a schema representation language for JSON documents that is also on its way to becoming a standard. However, it does not include spatial data types. Starting from this limitation, this paper proposes an extension to JSON Schema, called JS4Geo, that allows the definition of schemas for geographic data. We demonstrate that JS4Geo is able to represent schemas of any NoSQL data model, as well as other standards for geographic data, like GML and KML. We also present a case study that shows how a data integration system can benefit from JS4Geo to define local schemas for geographic datasets and generate an integrated global schema.
... • NoSQL relationally - has both multi-model and multi-level features. The approach includes, e.g., a multi-model solution considering document and column-oriented DBs integrated through a middleware into a virtual SQL database [4]. • schema and data conversion - includes, e.g., a schema conversion model, in which the SQL database schema is converted to the NoSQL database schema [27]. ...
Article
Full-text available
In today’s multi-model database world there is an effort to integrate databases expressed in different data models. The aim of this article is to show possibilities for integrating relational and graph databases with the help of a functional data model and its formal language, a typed lambda calculus. We suppose the existence of a data schema for both the relational and the graph database. In this approach, relations are considered as characteristic functions and property graphs as sets of single-valued and multi-valued functions. It is then possible to express a query over such an integrated heterogeneous database with a single query expression in a version of the typed lambda calculus. A more user-friendly version of such a language could serve as a powerful query tool in practice. We also discuss queries sent to the integrated system and translated into queries in SQL and Cypher, the graph query language for Neo4j.
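
This abstract treats relations as characteristic functions and graph properties as (multi-valued) functions, so that one typed lambda calculus expression can range over both stores. The paper's actual notation is not shown here; purely as a hedged sketch, a query for the names of employees working in a department whose graph node carries the city "Prague" might take a shape such as:

```latex
\lambda x .\; \exists e\, \exists d\; \big(\mathit{Employee}(e) \;\wedge\; \mathit{WorksIn}(e,d)
  \;\wedge\; \mathit{city}(d) = \text{``Prague''} \;\wedge\; x = \mathit{name}(e)\big)
```

Here Employee and WorksIn stand for relations viewed as Boolean-valued characteristic functions, while city and name stand for single-valued property functions contributed by the graph side; a translator along the lines the abstract describes would compile the relational conjuncts to SQL and the property accesses to Cypher.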
... A data integration model with a NoSQL database can potentially unite medical study data, as an alternative to the most frequently used statistical/machine learning methods. Most NoSQL database systems share common characteristics, supporting scalability, availability and flexibility, and ensuring fast access times for storage, data retrieval and analysis [18,19]. Very often, when applying cluster analysis methods for grouping or joining data, issues occur - mainly with outliers, small classes and, above all, dynamically changing data relatedness. ...
Article
Full-text available
Background: Recently, high-throughput technologies have been massively used alongside clinical tests to study various types of cancer. Data generated in such large-scale studies are heterogeneous, of different types and formats. In the absence of effective integration strategies, novel models are necessary for efficient and operative data integration, where both clinical and molecular information can be effectively joined for storage, access and ease of use. Such models, combined with machine learning methods for accurate prediction of survival time in cancer studies, can yield novel insights into disease development and lead to precise personalized therapies. Results: We developed an approach for intelligent data integration of two cancer datasets (breast cancer and neuroblastoma) provided in the CAMDA 2018 'Cancer Data Integration Challenge', and compared models for prediction of survival time. We developed a novel semantic network-based data integration framework that utilizes NoSQL databases, in which we combined clinical and expression profile data using both raw data records and external knowledge sources. Utilizing the integrated data, we introduced the Tumor Integrated Clinical Feature (TICF) - a new feature for accurate prediction of patient survival time. Finally, we applied and validated several machine learning models for survival time prediction. Conclusion: We developed a framework for semantic integration of clinical and omics data that can borrow information across multiple cancer studies. By linking data with external domain knowledge sources, our approach facilitates enrichment of the studied data by discovery of internal relations. The proposed and validated machine learning models for survival time prediction yielded accurate results. Reviewers: This article was reviewed by Eran Elhaik, Wenzhong Xiao and Carlos Loucera.
... • Using an SQL query language: [5] suggests an intermediate query language that transforms SQL into Java methods accessing NoSQL databases. A dedicated mapping language to express access links to NoSQL databases was defined. ...
Conference Paper
Full-text available
Increasing data volumes have greatly expanded application possibilities. However, accessing this data in an ad hoc manner remains an unsolved problem due to the diversity of data management approaches, formats and storage frameworks, resulting in the need to effectively access and process distributed heterogeneous data at scale. For years, Semantic Web techniques have addressed data integration challenges with practical knowledge representation models and ontology-based mappings. Leveraging these techniques, we provide a solution enabling uniform access to large, heterogeneous data sources without enforcing centralization, thus realizing the vision of a Semantic Data Lake. In this paper, we define the core concepts underlying this vision and the architectural requirements that systems implementing it need to fulfill. Squerall, an example of such a system, is an extensible framework built on top of state-of-the-art Big Data technologies. We focus on Squerall's distributed query execution techniques and strategies, empirically evaluating its performance throughout its various sub-phases.
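
The abstract above emphasizes Squerall's distributed query execution on top of Big Data engines. The following is not Squerall's actual code, but a rough Java sketch of the mechanics it delegates to Apache Spark: each heterogeneous source is loaded as a DataFrame through its connector (the connector format strings and option names below are assumptions and vary by connector version), after which the join itself is source-agnostic.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HeterogeneousJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("heterogeneous-join")
                .getOrCreate();

        // Assumes the spark-cassandra-connector is on the classpath;
        // keyspace and table names are hypothetical.
        Dataset<Row> products = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "shop")
                .option("table", "products")
                .load();

        // Assumes the MongoDB Spark connector; option names differ across versions.
        Dataset<Row> offers = spark.read()
                .format("mongodb")
                .option("connection.uri", "mongodb://localhost:27017")
                .option("database", "shop")
                .option("collection", "offers")
                .load();

        // Once both sides are DataFrames, the join is source-agnostic.
        Dataset<Row> joined = products.join(offers,
                products.col("id").equalTo(offers.col("productId")));
        joined.show();
    }
}
```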
... This structure, defined as Embedded Documents, leads to denormalized data models that allow data manipulation in a single database transaction [24]. This grouping is not only possible, but encouraged in these types of systems [8]. ...
... Most research efforts attempt to implement a schema-independent way to specify and execute queries, but the support is normally restricted to simpler queries, such as the ones requesting a single entity. This is the case of the works by Atzeni, Bugiotti and Rossi [1], with the SOS (Save Our Systems) framework, Sellami, Bhiri and Defude [21], with ODBAPI (OPEN-PaaS-Database API), and Curé et al. [8], with BQL (Bridge Query Language). All these works have successfully employed the generative approach to facilitate the effort of writing queries independently from the data structure; however, they do not properly support complex queries, in particular those requiring "join" operations. ...
Conference Paper
NoSQL databases are designed to fulfill performance and scalability requirements, normally by allowing data to be stored without a fixed schema. For this reason, it is not rare for new usage and performance requirements to appear during a system's life cycle, demanding changes to the schema and challenging the developer with extra adaptation effort to update data access code (database queries). The literature presents some solutions to reduce this effort by making queries independent from the schema, but the solutions are normally restricted to simple queries or a predefined mapping. In this paper, we present evidence showing that a classic ER algebra and a Model Management approach can be used to implement a solution that works with complex queries in any schema. The algebra defines operations that can be used by developers to specify complex queries in terms of Entities and Relationships. We created a language for this algebra, with a concrete syntax and a generative operational semantics targeting a document-oriented database. As in Model Management, the generative semantics is guided by the mapping information between Entities, Relationships and Documents, and is able to generate, for a single ER-based input query, native query code for different schemas, all producing the same results in terms of data structure. Test results show that our implementation is consistent with the algebra's definition, producing evidence that this approach can lead to schema independence in complex NoSQL queries.
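
The ER-based query language described above compiles entity/relationship queries, including joins, into native document-store code according to the mapping; its concrete syntax is not shown in this abstract. As a hypothetical illustration of the kind of native code such a generator must emit, joining Order with Customer entities that a mapping places in separate MongoDB collections corresponds to a $lookup aggregation stage, e.g. through the MongoDB Java driver:

```java
import java.util.Arrays;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

public class ErJoinSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("erdemo").getCollection("orders");

            // Hypothetical mapping: Order.customerId references Customer._id.
            // An ER query joining Order with Customer compiles to $lookup here;
            // under a different mapping (customer embedded in the order
            // document), the generator would emit no join stage at all.
            orders.aggregate(Arrays.asList(
                    Aggregates.lookup("customers", "customerId", "_id", "customer")
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}
```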
... The multi-model solution [8] considers source document and column-oriented databases integrated through a middleware into a virtual SQL database. Its authors propose a Bridge Query Language (BQL) that enables a transformation from an SQL query defined over the target to the query executed over a given source. ...
Article
Full-text available
The analysis of relational and NoSQL databases leads to the conclusion that these data processing systems are to some extent complementary. In current Big Data applications, especially where extensive analyses (so-called Big Analytics) are needed, it turns out to be nontrivial to design an infrastructure involving data and software of both types. Unfortunately, this complementarity negatively influences the integration possibilities of these data stores, both at the data model level and at the data processing level. In terms of performance, it may be beneficial to use polyglot persistence, a multi-model approach or multi-level modeling, or even to transform the SQL database schema into NoSQL and to perform data migration between the relational and NoSQL databases. Another possibility is to integrate a NoSQL database and a relational database with the help of a third data model. The aim of the paper is to show these possibilities and present some new methods of designing such integrated database architectures.
... For non-ontology-based access, [6] defines a mapping language to express access links to NoSQL databases. It proposes an intermediate query language to transform SQL to Java methods accessing NoSQL databases. ...
... The 1.5M scale factor generates 500M RDF triples, and the 5M factor 1.75B triples (see https://github.com/EIS-Bonn/Squerall/tree/master/evaluation). ...
Conference Paper
Full-text available
The last two decades witnessed a remarkable evolution in terms of data formats, modalities, and storage capabilities. Instead of having to adapt one's application needs to the earlier, limited storage options available, today there is a wide array of options to choose from to best meet an application's needs. This has resulted in vast amounts of data available in a variety of forms and formats which, if interlinked and jointly queried, can generate valuable knowledge and insights. In this article, we describe Squerall: a framework that builds on the principles of Ontology-Based Data Access (OBDA) to enable the querying of disparate heterogeneous sources using a unique query language, SPARQL. In Squerall, original data is queried on the fly without prior data materialization or transformation. In particular, Squerall allows the aggregation and joining of large data in a distributed manner. Squerall supports five data sources out of the box and, moreover, can be programmatically extended to cover more sources and incorporate new query engines. The framework provides user interfaces for the creation of necessary inputs, as well as for guiding non-SPARQL experts to write SPARQL queries. Squerall is integrated into the popular SANSA stack and available as open-source software via GitHub and as a Docker image.
... Similar efforts to integrate and query large data sources exist in the literature. For instance, [4] defines a mapping language to express access links to NoSQL databases. [12] allows to run CRUD operations over NoSQL databases. ...
Conference Paper
Full-text available
Squerall is a tool that allows the querying of heterogeneous, large-scale data sources by leveraging state-of-the-art Big Data processing engines: Spark and Presto. Queries are posed on-demand against a Data Lake, i.e., directly on the original data sources without requiring prior data transformation. We showcase Squerall's ability to query five different data sources, including inter alia the popular Cassandra and MongoDB. In particular, we demonstrate how it can jointly query heterogeneous data sources, and how interested developers can easily extend it to support additional data sources. Graphical user interfaces (GUIs) are offered to support users in (1) building intra-source queries, and (2) creating required input files.