Source publication
Due to the large amount of data generated by user interactions on the Web, some companies are currently innovating in the domain of data management by designing their own systems. Many of them are referred to as NoSQL databases, standing for 'Not only SQL'. With their wide adoption, new needs will emerge, and data integration will certainly be one of...
Contexts in source publication
Context 1
... In general, data integration and data exchange solutions adopt the certain answers semantics for query answering, i.e. the result of a query expressed over the target contains the intersection of the data retrieved from the sources. We believe that this pessimistic approach is too restrictive and that, as a consequence, many valid results may be missing from the final result. At the other extreme of the query answering semantics spectrum lies the possible answers semantics, which returns for a target query the union of the source results. With this optimistic approach, conflicting results may be proposed in the final result, leaving end-users unsatisfied. In this work, we propose a trade-off between these two semantics based on a preference-based approach. Intuitively, preferences provided over target attributes define a partial order over mapped sources. Hence, for a given data object, conflicting information on the same attribute among different data sources can be handled efficiently, and the final result will contain the preferred values.

Example 4: Consider the queries over docDB and colDB asking for lab and price information for the drug identified by value 3295935. Given the information stored in both sources, respectively the document store (Figure 2) and the column family store (Figure 3), conflicts arise on the prices, resp. 1.88 and 2.05 euros, and on the pharmaceutical labs, resp. Pfizer and Wyeth. When creating the mapping assertions, domain experts can express that drug prices are more accurate in the document store (docDB) and that information about the pharmaceutical laboratory is more trustworthy in the column family store (colDB). Hence the result of this query will contain a single tuple consisting of: { Advil, Wyeth, 1.88 }, i.e. mixing values retrieved from the different sources.

We now define the notion of preferences over mapping assertions. Definition 2: Consider a set of source databases { DB1, DB2, ..., DBn }. A preference relation, denoted ≻, is a relation ≻ ⊆ DBi × DBj, with i ≠ j, that is defined on each non primary key attribute of the target relations. A preference is total on an attribute A if for every pair DBi, DBj of sources that propose attribute A, either DBi ≻* DBj or DBj ≻* DBi, with ≻* the transitive closure of ≻.

Example 5: Consider the drug relation in our running example target schema. Its definition according to the preferences proposed in Example 3 is the following: drug(drugId, drugName[docDB ≻ colDB], lab[colDB ≻ docDB], ...). That is, for a given drug, in case of conflict, its docDB drugName attribute is preferred to the one proposed by colDB, and the preferred value for the lab attribute is the one from colDB over docDB. Note that since the composition attribute can only be retrieved from the colDB source, it is not necessary to define a preference order over this attribute.

Once a target relation schema and a set of mapping assertions have been defined, end-users can express queries in SQL over the target database. Since that database is virtual, i.e. it does not contain any data, data needs to be retrieved from the sources and processed to provide a final result. The presence of NoSQL databases in the set of sources makes it necessary to transform the initial SQL query into a query specifically tailored to each NoSQL source. This transformation is based on the peculiarities of the source database, e.g. whether a declarative query language exists or only a procedural approach enables querying that database, and on the mapping assertions.
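To make the preference semantics of Definition 2 and Example 4 concrete, the following is a minimal sketch in Java of how tuples for the same drug identifier could be merged attribute by attribute; it is our own illustration, not the paper's implementation, and all class, method and attribute names are hypothetical.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of preference-based merging (hypothetical names throughout).
public class PreferenceMerge {

    // For each non-key target attribute, sources ordered from most to least
    // preferred, mirroring Example 5: drugName and price prefer docDB,
    // lab prefers colDB.
    static final Map<String, List<String>> PREFERENCES = Map.of(
            "drugName", List.of("docDB", "colDB"),
            "lab",      List.of("colDB", "docDB"),
            "price",    List.of("docDB", "colDB"));

    // A source tuple: the source it came from plus its attribute values.
    record SourceTuple(String source, Map<String, String> values) {}

    // Merge tuples that share the same primary key: for every attribute,
    // walk the preference order and keep the first value a source proposes.
    static Map<String, String> merge(List<SourceTuple> tuples) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, List<String>> pref : PREFERENCES.entrySet()) {
            String attr = pref.getKey();
            for (String source : pref.getValue()) {
                tuples.stream()
                      .filter(t -> t.source().equals(source)
                                   && t.values().containsKey(attr))
                      .findFirst()
                      .ifPresent(t -> result.putIfAbsent(attr, t.values().get(attr)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Conflicting data for drug 3295935, as in Example 4.
        SourceTuple doc = new SourceTuple("docDB",
                Map.of("drugName", "Advil", "lab", "Pfizer", "price", "1.88"));
        SourceTuple col = new SourceTuple("colDB",
                Map.of("drugName", "Advil", "lab", "Wyeth", "price", "2.05"));
        // Prints {drugName=Advil, lab=Wyeth, price=1.88}: the preferred mix.
        System.out.println(merge(List.of(doc, col)));
    }
}

Note how the result mixes values from both sources, exactly as in the single tuple { Advil, Wyeth, 1.88 } of Example 4.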
Since most NoSQL stores support only a procedural query approach, we have decided to implement a query language to bridge the gap between SQL and code in a programming language. This section presents the Bridge Query Language (henceforth BQL), which is used internally by our data integration system, and the query processing semantics. The overall architecture of query processing within our data integration system is presented in Figure 4. First, an end-user writes an SQL query over the target schema. The expressivity of accepted SQL queries corresponds to Select-Project-Join (SPJ) conjunctive queries; e.g. GROUP BY clauses are not accepted, but we are planning to introduce them in future extensions. Note that this limitation is due to a common abstraction of the NoSQL databases we are studying in this paper (column family and document). An end-user SQL query is then translated into BQL, the internal query language of our data integration system. This transformation corresponds to a rewriting of the SQL query into a BQL query using the mapping assertions. Note that this translation step is not needed for an RDBMS. Then, for each BQL query, a second transformation is performed, this time to generate a query tailored to the NoSQL database system. Thus, for each supported NoSQL implementation, a set of rules is defined for the translation of a BQL query. Most of the time, the BQL translation takes the form of a program and uses a specific API. In Section 6, we provide details on the translation from BQL to Java programs for MongoDB and Cassandra. The results obtained from each query are later processed within the data integration system. Intuitively, each result set takes the form of a list containing the expected target columns. In order to detect data conflicts, we need to efficiently identify similar objects. This step is performed by incorporating into the result set the values corresponding to the primary keys of the target relations of the SQL query. So, even if primary keys are not supposed to be displayed in the final query result, they are temporarily stored in the result set. Hence objects returned from the union of the result sets are easily and unambiguously identified. Similar objects can then be analyzed using the preference orders defined over target attributes. The query result contains the values retrieved from the preferred source attributes.

BQL is the internal query language that bridges the gap between the SQL language of the target model and the different, heterogeneous query languages of the sources. The syntax of the query language follows the EBNF proposed on the companion web site. This language contains a set of reserved words whose semantics is obvious to a programmer. For instance, the get instruction defines a set of filter operations as well as the distinguished variables of the query, i.e. the values needed in the result. The foreach in : instruction is frequently encountered in many programming languages and its semantics aligns with theirs. Intuitively, it supports an iteration over the elements of a result set, and the associated processing is specified after the ':' symbol. We have implemented an SQL-to-BQL translator which parses an SQL query and generates a set of BQL queries, one for each NoSQL database mapped to a relation of the target query. This translator takes into account the mapping assertions defined over the data integration system.
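As an illustration of the final translation step, here is a minimal sketch of the kind of Java code a BQL query over docDB could compile to for the query of Example 4. It is our own guess, not the paper's generated code: the BQL surface syntax in the comment is hypothetical (only the get and foreach constructs are described above), and we use the current MongoDB Java driver (mongodb-driver-sync), not necessarily the API used in the paper.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Projections.include;

// Hypothetical target of the BQL-to-Java translation for docDB.
// Assumed BQL source (our own guess at the surface syntax):
//   get drugD(name, price) filter _id = "3295935"
//   foreach d in result : emit(d.name, d.price)
public class DocDbQuery {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> drugD =
                    client.getDatabase("docDB").getCollection("drugD");

            // get: filter on the key and project the distinguished variables;
            // the primary key (_id) stays in the result set so that the
            // integration system can detect conflicts across sources.
            for (Document d : drugD.find(eq("_id", "3295935"))
                                   .projection(include("name", "price"))) {
                System.out.println(d.getString("name") + ", " + d.get("price"));
            }
        }
    }
}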
Example 6: We now introduce a set of queries expressed over our running example. They correspond to different real-case scenarios and emphasize the different functionalities of our query language. For each target query (SQL), we present the BQL generated for both colDB and docDB ...
Context 2
... analysis. Like Hive, Pig tries to raise the level of abstraction for processing large data sets with Hadoop's MapReduce implementation. The Pig platform consists of a high-level language called Pig Latin [6] for constructing data pipelines, where operations on an input relation are executed one after the other. These Pig Latin data pipelines are translated into a sequence of MapReduce jobs by a compiler, which is also included in the Pig framework. Cascading is an API for data processing on Hadoop clusters. It is neither a new text-based query syntax like Pig nor another complex system that must be installed and maintained like Hive. Cascading offers a collection of operations like functions, filters and aggregators, which can be wired together into complex, scale-free and fault-tolerant data processing workflows, as opposed to directly implementing MapReduce algorithms.

In contrast to the missing standards for a NoSQL query language, standards for persisting Java objects already exist. With the Java Persistence API (JPA) and Java Data Objects (JDO) it is possible to map Java objects into different databases. The DataNucleus implementation of these two standards provides a mapping layer on top of HBase, BigTable [1], Amazon S3, MongoDB and Cassandra. Google's App Engine uses this framework for persistence.

A powerful data query and administration tool which is used extensively within the Oracle community is Quest Software's Toad. Since 2010, a prototype which also offers support for column family stores is available. At the time of writing this paper, the beta version 1.2 can be connected to Azure Table Services, Cassandra, SimpleDB, HBase and every ODBC compliant relational database. Toad for Cloud consists of two components. The first is the Toad client, which can be installed on a Microsoft Windows computer. It can be used to access different databases and the Amazon EC2 console, and to write and execute SQL statements. The second component is the Data Hub. It translates SQL statements submitted through the Toad client into a language understood by all supported databases and returns results in the familiar tabular row and column format. In order to use SQL on column family stores like Cassandra and HBase, the column families, rows and columns have to be mapped into virtual tables. Afterwards, the user has full MySQL support on these data, including inserts, updates and deletes. Furthermore, it is possible to do virtual data integration with different data sources. One reason why only column family stores are supported by Toad is their easy mapping to relational databases. Doing the same with a document store containing objects with a deeply nested structure, or squeezing a graph into a relational schema, is a much more complicated task. Even if a suitable solution were found, powerful and easy-to-use query languages, tools and interfaces like the Traverser API for Neo4j would be missing from the SQL layer of Toad, which queries the mapped tables.

Due to their different data models and their relatively young history, NoSQL databases still lack a common query language. However, Quest Software and Hadoop demonstrate that it is possible to use SQL (Toad) or an SQL-like query language (Hive) on top of column family stores. A mapping to document stores and graph databases is still missing. We have seen that each category of NoSQL databases has its own data model.
In this section, we present details concerning the document oriented and column family categories. Document oriented databases correspond to an extension of the well-known key-value concept where, in this case, the value consists of a structured document. A document contains hierarchically organized data, similar to XML and JSON. This makes it possible to represent one-to-one as well as one-to-many relationships in a single document. Therefore a complex document can be retrieved or stored without using joins. Since document oriented databases are aware of the structure of the stored data, they make it possible to define indexes on document fields as well as to propose advanced query features. The most popular document oriented databases are MongoDB (10gen) and CouchDB (Apache).

Example 1: In the following, a document oriented database stores drug information aimed at an application targeting the general public. According to the features proposed by the application, two so-called collections are defined: drugD and therapD. drugD includes documents describing drug related information whereas therapD contains documents with information about therapeutic classes and the drugs used for them. Each drugD document is identified by a drug identifier. In this example, its attributes are limited to the name of the product, its price, its pharmaceutical lab and a list of therapeutic classes. The key for therapD documents is a string corresponding to the therapeutic class name. It contains a single attribute corresponding to the list of identifiers of drugs treating this therapeutic class. Figure 2 presents an extract of this database. Finally, in order to ensure an efficient search for patients, an index on the name attribute of the drugD documents is defined.

Column family stores correspond to persistent, sparse, distributed multilevel hash maps. In column family stores, arbitrary keys (rows) are applied to arbitrary key-value pairs (columns). These columns can be extended with further arbitrary key-value pairs. These key-value pair lists can then be organized into column families and keyspaces. As a result, column family stores can appear, on the surface, in a shape very similar to relational databases. The most popular systems are HBase and Cassandra. All of them are influenced by Google's Bigtable.

Example 2: Figure 3 presents some column families defined in a medical application. Since Cassandra works best when its data model is denormalized, the data is divided into three column families: drugC, drugNameC and therapC. The columns drugName, contra, composition and lab are integrated into drugC and identified by the row key drugId. drugNameC has drugName as row key and a drugId column in order to provide an efficient search for patients. Since end-users of this database need to search products by therapeutic classes, therapC has therapName as row key and a column for each drugId, with a timestamp as value.
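To illustrate the document structure of Example 1, the following Java sketch builds documents of the shape described above using the MongoDB driver's Document class. Since Figure 2 is not reproduced in this excerpt, the field names and values are our own hypothetical illustration.

import org.bson.Document;

import java.util.Arrays;

// Illustrative documents for the drugD and therapD collections of Example 1
// (field names and values are hypothetical; Figure 2 is not reproduced here).
public class DocDbExample {
    public static void main(String[] args) {
        // A drugD document: keyed by drug identifier, with name, price,
        // pharmaceutical lab and a list of therapeutic classes, so a drug
        // and its one-to-many relationships live in a single document.
        Document drug = new Document("_id", "3295935")
                .append("name", "Advil")
                .append("price", 1.88)
                .append("lab", "Pfizer")
                .append("therapClasses",
                        Arrays.asList("analgesic", "anti-inflammatory"));

        // A therapD document: keyed by therapeutic class name, holding the
        // list of drug identifiers that treat this class; retrieving it
        // requires no join.
        Document therap = new Document("_id", "analgesic")
                .append("drugs", Arrays.asList("3295935", "3400935"));

        System.out.println(drug.toJson());
        System.out.println(therap.toJson());
    }
}

In the column family layout of Example 2 the same information would instead be spread over the drugC, drugNameC and therapC families, denormalized to match each access pattern.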
This section presents the syntax and semantics of our data integration framework. Moreover, it focuses on the mapping language which supports the definition of correspondences between source and target entities. Our mapping language integrates some original aspects by considering (i) the query processing performance of the sources, via access paths, and (ii) contradicting information found between sources, which is dealt with using preferences. In the rest of this paper, we consider the following example.

Example 3: A medical application needs to integrate drug data coming from two different NoSQL stores. The first database, a document store denoted docDB, is used in a patient oriented application, while the other database, a column family store denoted colDB, contains information aimed at health care professionals. In this paper, we concentrate on extracts of docDB and colDB which correspond to Figures 2 and 3 respectively. The data stored in the two databases presents some overlap as well as some discrepancies, both at the tuple and at the schema level. For instance, at the schema level, both databases contain French drug identifiers, names, pharmaceutical companies and prices, but only colDB provides access to the composition of a drug product. Considering the tuple level, some drugs may be present in one database but not in the other. Moreover, information concerning the same drug product (i.e. identified by the same identifier value) may be contradictory across sources. Given these source databases, the target schema is defined as follows. We consider that relation and attribute names are self-explanatory.

Obviously, our next step is to define correspondences between the sources and the target. This is supported by mapping assertions, which are currently defined manually by domain experts. In the near future, we aim to discover some of them automatically by analyzing the extensions and intensions of both the sources and the target. Nevertheless, we do not believe that all mapping assertions can be discovered automatically, due to the lack of semantics contained in both the target and the sources. Next, we present the mapping language enabling the definition of mapping assertions.

Our data integration system takes the form of a triple ⟨T, S, M⟩, where T is the target schema, S is the source schema and M is the mapping between T and S. In Section 1, we motivated the fact that the set S could correspond to both RDBMSs and NoSQL stores and that T takes the form of a relational schema. We consider that the target schema is given, possibly defined by a team of domain experts or using schema matching techniques [7]. Our mapping language adopts a GAV (Global As View) approach with sound sources [5]. The mapping assertions are thus of the following form: φS ⇝ φT, where φS is a query over S and φT is a relation of T.

Our system must deal with the heterogeneity of the sources and the highly denormalized aspect of NoSQL database instances. In order to cope with this last aspect, our mapping language handles the different access paths proposed by a given source. This is due to the important performance differences one can observe between the processing of the same query through different access paths. For instance, in the context of colDB, retrieving drug information from a drug name will be more effective using the drugNameC column family rather than the drugC column family (which would require a complete scan of all its tuples). We believe that a mapping assertion is the ideal place to store the preferred access paths for a target relation. Hence each ...
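To fix ideas, a GAV mapping assertion for the running example might look as follows. This is a hypothetical sketch: the excerpt does not show the concrete syntax of the mapping language or its access-path annotation, so the notation below, including the attribute names, is ours.

% A hypothetical mapping assertion (requires amsmath and amssymb).
% The drugD collection of docDB, accessed through the index on name,
% populates the target relation drug (GAV with sound sources):
\[
  \underbrace{\mathit{drugD}(id,\, name,\, price,\, lab)}_{\varphi_S
  \text{ over } docDB,\ \text{access path: index on } name}
  \;\leadsto\;
  \underbrace{\mathit{drug}(drugId,\, drugName,\, lab,\, price)}_{\varphi_T}
\]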
Similar publications
NoSQL is a free and open-source, distributed, wide column store database management system intended to handle large amounts of data across many commodity servers, providing high availability and accessibility with no single point of failure. It is the easiest truly big-data database that can scale and replicate data globally in a master-less conf...
Land cover (LC) is a scientific landscape classification based on physical properties of earth materials. This information is usually retrieved through remote sensing techniques (e.g. forest cover, urban, clay content, among others). In contrast, Land use (LU) is defined from an anthropocentric point of view. It describes how a specific area is use...
NoSQL is often used as a successful alternative to relational databases, especially when it is necessary to provide adequate system dimensioning, usage of a variety of data types and high efficiency at a low cost for maintaining consistency. The work is conceived in a manner that covers the general concept of a database, i.e. the concept of relatio...
Patient cohort identification across heterogeneous data sources is a challenging task, which may involve a complicated process of data loading, harmonization and querying. Most existing cohort identification tools use a relational database model implemented in SQL for storing patient data. However, SQL databases have restrictions on the maximum num...
Citations
... One of the more recent proposals is SQLtoKeyNoSQL [31], a layer for translation between SQL and key-oriented non-relational databases. In [32], a mapping language is defined in which the schema over which queries are defined corresponds to a relational data model, allowing queries to be specified in SQL, while the source systems can be column, key-value, or document stores. ...
Modern large-scale information systems often use multiple database management systems, not all of which are necessarily relational. In recent years, NoSQL databases have gained acceptance in certain domains while relational databases remain the de facto standard in many others. Many "legacy" information systems also use relational databases. Unlike relational database systems, NoSQL databases do not have a common data model or query language, making it difficult for users to access data in a uniform manner when using a combination of relational and NoSQL databases, or simply several different NoSQL database systems. Therefore, the need for uniform data access from such a variety of data sources becomes one of the central problems for data integration. In this paper we provide an overview of the main problems, methods, and solutions for data integration between relational and NoSQL databases, as well as between different NoSQL databases. We focus mainly on the problems of structural, syntactic, and semantic heterogeneity and on proposed solutions for uniform data access, emphasizing some of the more recent proposals.
... Motivated by the mainstream popularity of the SQL query language, [91] poses a relational schema over NoSQL stores. A set of mapping assertions that associate the general relational schema with the data source schemata is defined. ...
... This is in practice hard to achieve given the highly heterogeneous nature of Data Lakes. Therefore, numerous recent publications (e.g., [89][90][91]) advocate the use of an intermediate query language to interface between the SPARQL query and the data sources. In our case, the intermediate query language is the query language (e.g., SQL) corresponding to the ParSet data model (e.g., tabular). ...
The remarkable advances achieved in both the research and development of Data Management, as well as the prevalence of high-speed Internet and technology in the last few decades, have caused an unprecedented data avalanche. Large volumes of data manifested in a multitude of types and formats are being generated and are becoming the new norm. In this context, it is crucial to both leverage existing approaches and propose novel ones to overcome this data size and complexity, and thus facilitate data exploitation. In this thesis, we investigate two major approaches to addressing this challenge: Physical Data Integration and Logical Data Integration. The specific problem tackled is enabling the querying of large and heterogeneous data sources in an ad hoc manner.
In Physical Data Integration, data is physically and wholly transformed into a canonical unique format, which can then be directly and uniformly queried. In Logical Data Integration, data remains in its original format and form, and a middleware is posed above the data, allowing various schema elements to be mapped to a high-level unifying formal model. The latter enables querying the underlying original data in an ad hoc and uniform way, a framework which we call the Semantic Data Lake, SDL. Both approaches have their advantages and disadvantages. For example, in the former, significant effort and cost are devoted to pre-processing and transforming the data into the unified canonical format. In the latter, the cost is shifted to the query processing phases, e.g., query analysis, relevant source detection and results reconciliation.
In this thesis we investigate both directions and study their strengths and weaknesses. For each direction, we propose a set of approaches and demonstrate their feasibility via a proposed implementation. In both directions, we appeal to Semantic Web technologies, which provide a set of time-proven techniques and standards dedicated to Data Integration. In the Physical Integration, we suggest an end-to-end blueprint for the semantification of large and heterogeneous data sources, i.e., physically transforming the data into the Semantic Web data standard RDF (Resource Description Framework). A unified data representation, storage and query interface over the data are suggested. In the Logical Integration, we provide a description of the SDL architecture, which allows querying data sources in their original form and format without requiring prior transformation and centralization. For a number of reasons that we detail, we put more emphasis on the virtual approach. We present the effort behind an extensible implementation of the SDL, called Squerall, which leverages state-of-the-art Semantic and Big Data technologies, e.g., RML (RDF Mapping Language) mappings, the FnO (Function Ontology) ontology, and Apache Spark. A series of evaluations is conducted to assess the implementation along various metrics and input data scales. In particular, we describe an industrial real-world use case using our SDL implementation. In a preparation phase, we conduct a survey of Query Translation methods in order to back some of our design choices.
... Moreover, an IoT data storage solution should be able to store the huge amount of data generated and should support horizontal scaling efficiently. Furthermore, the data collected can be structured as well as unstructured and can come from multiple sources; hence, data collection provisions and components should be able to support heterogeneity in terms of data generation and collection [10,11]. For the stated challenges, a storage platform is required which can store and manage a high volume of structured and unstructured data. ...
... We also found, on GitHub, a schema implementation for GeoJSON (based on JSON Schema draft-7) that is similar to our proposal. However, we do not consider this implementation an official schema, as the official GeoJSON homepage has no link to the code presented on GitHub, which is technical documentation that only shows a schema for geographical data types. Another unofficial geographic data schema proposal is geojson.json, ...
... Second, we demonstrate, through this case study, how JS4Geo can be used to define the equivalence of attributes and real-world entities, which are common problems faced by data integration and data interoperability processes. The related literature contains numerous examples of equivalence problems that could be easily solved by using JS4Geo [6,8]. For the sake of space, we do not detail them. ...
The large volume and variety of data produced in the current Big Data era lead companies to seek solutions for efficient data management. In this context, NoSQL databases arise as a better alternative to traditional relational databases, mainly in terms of scalability and availability of data. A usual feature of NoSQL databases is to be schemaless, i.e., they do not impose a schema or they have a flexible schema. This is interesting for systems that deal with complex data, such as GIS. However, the lack of a schema becomes a problem when applications need to perform processes such as data validation, data integration, or data interoperability, as there is no standard for schema representation in NoSQL databases. On the other hand, the JSON language stands out as a standard for representing and exchanging data in document NoSQL databases, and JSON Schema is a schema representation language for JSON documents that is also on its way to becoming a standard. However, it does not include spatial data types. Starting from this limitation, this paper proposes an extension to JSON Schema, called JS4Geo, that allows the definition of schemas for geographic data. We demonstrate that JS4Geo is able to represent schemas of any NoSQL data model, as well as other standards for geographic data, like GML and KML. We also present a case study that shows how a data integration system can benefit from JS4Geo to define local schemas for geographic datasets and generate an integrated global schema.
... • NoSQL relationally - features both multi-model and multi-level aspects. This approach includes, e.g., a multi-model solution considering document and column-oriented databases integrated through a middleware into a virtual SQL database [4]. • schema and data conversion - includes, e.g., a schema conversion model in which the SQL database schema is converted to the NoSQL database schema [27]. ...
In today's multi-model database world there is an effort to integrate databases expressed in different data models. The aim of this article is to show the possibilities of integrating relational and graph databases with the help of a functional data model and its formal language, a typed lambda calculus. We suppose the existence of a data schema for both the relational and the graph database. In this approach, relations are considered as characteristic functions and property graphs as sets of single-valued and multivalued functions. It is then possible to express a query over such an integrated heterogeneous database with one query expression written in a version of the typed lambda calculus. A more user-friendly version of such a language could serve as a powerful query tool in practice. We also discuss queries sent to the integrated system and translated into queries in SQL and Cypher, the graph query language for Neo4j.
... A data integration model with a NoSQL database can potentially unite medical study data, as an alternative to the most frequently used statistical/machine learning methods. Most NoSQL database systems share common characteristics, supporting scalability, availability and flexibility, and ensuring fast access times for storage, data retrieval and analysis [18,19]. Very often, when applying cluster analysis methods for grouping or joining data, issues occur, mainly with outliers, small classes, and above all with dynamically changing data relatedness. ...
Background:
Recently, high-throughput technologies have been massively used alongside clinical tests to study various types of cancer. The data generated in such large-scale studies are heterogeneous, of different types and formats. With a lack of effective integration strategies, novel models are necessary for efficient and operative data integration, where both clinical and molecular information can be effectively joined for storage, access and ease of use. Such models, combined with machine learning methods for accurate prediction of survival time in cancer studies, can yield novel insights into disease development and lead to precise personalized therapies.
Results:
We developed an approach for intelligent data integration of two cancer datasets (breast cancer and neuroblastoma) - provided in the CAMDA 2018 'Cancer Data Integration Challenge', and compared models for prediction of survival time. We developed a novel semantic network-based data integration framework that utilizes NoSQL databases, where we combined clinical and expression profile data, using both raw data records and external knowledge sources. Utilizing the integrated data we introduced Tumor Integrated Clinical Feature (TICF) - a new feature for accurate prediction of patient survival time. Finally, we applied and validated several machine learning models for survival time prediction.
Conclusion:
We developed a framework for semantic integration of clinical and omics data that can borrow information across multiple cancer studies. By linking data with external domain knowledge sources our approach facilitates enrichment of the studied data by discovery of internal relations. The proposed and validated machine learning models for survival time prediction yielded accurate results.
Reviewers:
This article was reviewed by Eran Elhaik, Wenzhong Xiao and Carlos Loucera.
... • Using the SQL query language: [5] suggests an intermediate query language that transforms SQL to Java methods accessing NoSQL databases. A dedicated mapping language to express access links to NoSQL databases was defined. ...
Increasing data volumes have greatly expanded application possibilities. However, accessing this data in an ad hoc manner remains an unsolved problem due to the diversity of data management approaches, formats and storage frameworks, resulting in the need to effectively access and process distributed heterogeneous data at scale. For years, Semantic Web techniques have addressed data integration challenges with practical knowledge representation models and ontology-based mappings. Leveraging these techniques, we provide a solution enabling uniform access to large, heterogeneous data sources, without enforcing centralization; thus realizing the vision of a Semantic Data Lake.
In this paper, we define the core concepts underlying this vision and the architectural requirements that systems implementing it need to fulfill. Squerall, an example of such a system, is an extensible framework built on top of state-of-the-art Big Data technologies. We focus on Squerall's distributed query execution techniques and strategies, empirically evaluating its performance throughout its various sub-phases.
... This structure, referred to as Embedded Documents, leads to denormalized data models that allow data manipulation in a single database transaction [24]. This grouping is not only possible, but encouraged in these types of systems [8]. ...
... Most research efforts attempt to implement a schema-independent way to specify and execute queries, but the support is normally restricted to simpler queries, such as the ones requesting a single entity. This is the case of the works by Atzeni, Bugiotti and Rossi [1], with the SOS framework (Save Our Systems), Sellami, Bhiri and Defude [21], with the ODBAPI (OPEN-PaaS-Database API), and Curé et al. [8], with the BQL (Bridge Query Language). All these works have successfully employed the generative approach to facilitate the effort of writing queries independently from the data structure, however they do not properly support complex queries, in particular those requiring "join" operations. ...
NoSQL databases are designed to fulfill performance and scalability requirements, normally by allowing data to be stored without a fixed schema. For this reason, it is not rare that new usage and performance requirements appear during a system's life cycle, demanding changes to be made in the schema, challenging the developer with extra adaptation effort to update data access code (database queries). The literature presents some solutions to reduce this effort by making queries independent from the schema, but the solutions are normally restricted to simple queries or a predefined mapping. In this paper, we present evidence showing that a classic ER algebra and a Model Management approach can be used to implement a solution that works with complex queries in any schema. The algebra defines operations that can be used by developers to specify complex queries in terms of Entities and Relationships. We created a language for this algebra, with a concrete syntax and a generative operational semantics targeting a document-oriented database. As in Model Management, the generative semantics is guided by the mapping information between Entities, Relationships, and Documents, and is able to generate, for a single ER-based input query, native query code for different schemas, all producing the same results in terms of data structure. Test results show that our implementation is consistent with the algebra's definition, producing evidence that this approach can lead to schema independence in complex NoSQL queries.
... The multi-model solution [8] considers a source document database and a column-oriented database integrated through a middleware into a virtual SQL database. Its authors propose a Bridge Query Language (BQL) that enables a transformation from an SQL query defined over the target to the query executed over a given source. ...
The analysis of relational and NoSQL databases leads to the conclusion that these data processing systems are to some extent complementary. In current Big Data applications, especially where extensive analyses (so-called Big Analytics) are needed, it turns out that it is nontrivial to design an infrastructure involving data and software of both types. Unfortunately, this complementarity negatively influences the integration possibilities of these data stores, both at the data model and at the data processing level. In terms of performance, it may be beneficial to use polyglot persistence, a multi-model approach or multilevel modeling, or even to transform the SQL database schema into NoSQL and to perform data migration between the relational and NoSQL databases. Another possibility is to integrate a NoSQL database and a relational database with the help of a third data model. The aim of the paper is to show these possibilities and present some new methods of designing such integrated database architectures.
... For non-ontology-based access, [6] defines a mapping language to express access links to NoSQL databases. It proposes an intermediate query language to transform SQL to Java methods accessing NoSQL databases. ...
... The 1.5M scale factor generates 500M RDF triples, and the 5M factor 1.75B triples. See https://github.com/EIS-Bonn/Squerall/tree/master/evaluation ...
The last two decades witnessed a remarkable evolution in terms of data formats, modalities, and storage capabilities. Instead of having to adapt one's application needs to the previously limited storage options available, today there is a wide array of options to choose from to best meet an application's needs. This has resulted in vast amounts of data available in a variety of forms and formats which, if interlinked and jointly queried, can generate valuable knowledge and insights. In this article, we describe Squerall: a framework that builds on the principles of Ontology-Based Data Access (OBDA) to enable the querying of disparate heterogeneous sources using a unique query language, SPARQL. In Squerall, original data is queried on the fly without prior data materialization or transformation. In particular, Squerall allows the aggregation and joining of large data in a distributed manner. Squerall supports five data sources out of the box and, moreover, it can be programmatically extended to cover more sources and incorporate new query engines. The framework provides user interfaces for the creation of the necessary inputs, as well as for guiding non-SPARQL experts in writing SPARQL queries. Squerall is integrated into the popular SANSA stack and is available as open-source software via GitHub and as a Docker image.