Conference Paper
PDF Available

Extending In-Memory Relational Database Engines with Native Graph Support

Abstract

The plethora of graph and relational data gives rise to many interesting graph-relational queries in various domains, e.g., finding related proteins retrieved by a relational subquery in a biological network. The maturity of RDBMSs has motivated academia and industry to invest efforts in leveraging RDBMSs for graph processing, where efficiency has been proven for vital graph queries. However, none of these efforts process graphs natively inside the RDBMS, which is particularly challenging due to the impedance mismatch between the relational and the graph models. In this paper, we propose to manage graphs as first-class citizens inside the relational engine. We realize our approach inside VoltDB [6], an open-source in-memory relational database, and name this realization GRFusion. The SQL language and query engine of GRFusion are extended to declaratively define graphs and execute cross-data-model query plans acting on graphs and relations, resulting in up to four orders of magnitude of query-time speedup w.r.t. state-of-the-art approaches.
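The core mechanism, as described here and in the citing works below, is a graph view whose topology lives in native vertex/edge structures that reference the underlying relational tuples. A minimal sketch of that idea, in plain Python with hypothetical names (GRFusion itself realizes this in C++ inside VoltDB's execution engine):

```python
# Minimal sketch of a graph view over relational tables (hypothetical names;
# not GRFusion's actual implementation).

class GraphView:
    """Topology stored natively; vertices and edges keep references back to
    the underlying relational tuples instead of copying their attributes."""

    def __init__(self, vertex_rows, edge_rows, src_col, dst_col, vid_col):
        self.vertex = {row[vid_col]: row for row in vertex_rows}
        self.adj = {vid: [] for vid in self.vertex}             # adjacency-list index
        for row in edge_rows:
            self.adj[row[src_col]].append((row[dst_col], row))  # edge keeps its tuple

    def paths_from(self, start, max_hops):
        """Enumerate simple paths of up to max_hops hops (a much-simplified
        stand-in for a declarative PATHS-style construct)."""
        stack = [(start, [start])]
        while stack:
            v, path = stack.pop()
            yield path
            if len(path) <= max_hops:
                for nbr, _edge_row in self.adj.get(v, []):
                    if nbr not in path:
                        stack.append((nbr, path + [nbr]))

# Toy usage: two user tuples and one relationship tuple.
g = GraphView([{"uid": 1}, {"uid": 2}], [{"src": 1, "dst": 2}],
              "src", "dst", "uid")
print(list(g.paths_from(1, max_hops=2)))   # [[1], [1, 2]]
```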
... A second approach introduces a new graph-specific query processor that co-exists with the existing processor of the RDBMS. This has been most recently adopted by the GR-Fusion system [23,24]. Specifically, SQL is extended with graph-specific constructs, with which users create graphs. ...
... GR-Fusion [23,24] is designed to perform graph querying natively inside an RDBMS. Users define graphs as views over tables; the topology of a graph view is stored natively in an adjacency list index. ...
... Second, we aim to perform an ablation study to demonstrate that each of our optimizations that facilitated different levels of integration has additional benefits. Third, we aim to compare the performance characteristics of our approach against index nested loop join-based implementations that are prevalent in GDBMSs and prior approaches that integrate predefined joins into RDBMSs [23,32]. Fourth, we demonstrate that sip makes DuckDB's optimizer more robust by analyzing the plan spectrum of DuckDB and GRainDB on a suite of queries. ...
Preprint
Full-text available
Joins in native graph database management systems (GDBMSs) are predefined to the system as edges, which are indexed in adjacency list indices and serve as pointers. This contrasts with and can be more performant than value-based joins in RDBMSs, and has led researchers to investigate ways to integrate predefined joins directly into RDBMSs. Existing approaches adopt a strict separation of graph and relational data and processors, where a graph-specific processor uses left-deep and index nested loop joins for a subset of joins. This may be suboptimal and may lead to non-sequential scans of data in some queries. We propose a purely relational approach to integrate predefined joins in columnar RDBMSs that uses row IDs (RIDs) of tuples as pointers. Users can predefine equality joins between any two tables, which leads to materializing RIDs in extended tables and optionally in RID indices. Instead of using the RID index to perform the join directly, we use it primarily in hash joins to generate semi-join filters that can be passed to scans using sideways information passing, ensuring sequential scans. In some settings, we also use RID indices to reduce the number of joins in query plans. Our approach does not introduce any graph-specific system components, can execute predefined joins on any join plan, and can improve performance on any workload that contains equality joins that can be predefined. We integrated our approach into DuckDB and call the resulting system GRainDB. We demonstrate that GRainDB far improves the performance of DuckDB on relational and graph workloads with large many-to-many joins, making it competitive with a state-of-the-art GDBMS, while incurring no major overheads otherwise.
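The RID idea can be illustrated with a small sketch (plain Python with made-up table contents; GRainDB implements this inside DuckDB's columnar operators):

```python
# Illustrative sketch of RID-based predefined joins with a semi-join filter.
# Not GRainDB's code; just the shape of the technique.

# Predefined join: the 'follows' table is extended with materialized row IDs
# (RIDs) of matching 'person' tuples, so the join needs no value comparison.
person  = [{"name": "a"}, {"name": "b"}, {"name": "c"}]          # RID = list index
follows = [{"src_rid": 0, "dst_rid": 1}, {"src_rid": 1, "dst_rid": 2}]

# The hash-join side builds a semi-join filter over the RIDs it will accept...
wanted = {f["dst_rid"] for f in follows if person[f["src_rid"]]["name"] == "a"}

# ...and passes it sideways to the scan of 'person', which stays sequential:
result = [row for rid, row in enumerate(person) if rid in wanted]
print(result)   # [{'name': 'b'}] -- people followed by 'a'
```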
... One prevalent approach for integrating machine learning into a DBMS is to extend the SQL execution engine with iterative concepts, as proposed by [9,11,12], and translate the ML algorithm into a sequence of SQL queries. A main issue of this approach is that SQL engines only offer bulk-synchronous parallelism for ML algorithms; i.e., each iteration of an ML algorithm is implemented by executing an SQL query, potentially in parallel, over the full data set before the next iteration can start. ...
... GRFusion from Hassan et al. [11] extended a relational database engine with graph support. The idea of GRFusion is to add an efficient in-memory graph data structure to the DBMS. ...
... While BISMARK [9] and GRFusion [11] only support a limited set of problems, DB4ML supports a general programming model for iterative ML algorithms. Another aspect of all the aforementioned papers is that they do not support fine-grained parallelism, which has been shown to be a substantial part of effective machine learning [33,38]. ...
Conference Paper
Full-text available
In this paper, we revisit the question of how ML algorithms can be best integrated into existing DBMSs to not only avoid expensive data copies to external ML tools but also to comply with regulatory reasons. The key observation is that database transactions already provide an execution model that allows DBMSs to efficiently mimic the execution model of modern parallel ML algorithms. As a main contribution, this paper presents DB4ML, an in-memory database kernel that allows applications to implement user-defined ML algorithms and efficiently run them inside a DBMS. Thereby, the ML algorithms are implemented using a programming model based on the idea of so called iterative transactions. Our experimental evaluation shows that DB4ML can support user-defined ML algorithms inside a DBMS with the efficiency of modern specialized ML engines. In contrast to DB4ML, these engines not only need to transfer data out of the DBMS but also hard-code the ML algorithms and thus are not extensible.
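A rough sketch of the iterative-transaction idea, using a hypothetical API and a toy update rule (DB4ML's actual programming model runs inside an in-memory database kernel, with many such transactions executing concurrently):

```python
# Hypothetical sketch: each ML iteration is a small transaction over one
# sampled item, rather than one bulk-synchronous SQL query over all data.
import random

def run_iterative_transactions(db, update_one, converged):
    """Repeatedly pick one unit of work and commit its update as its own
    transaction; many such transactions can run in parallel in the real system."""
    while not converged(db):
        key = random.choice(list(db.keys()))   # pick one unit of work
        db[key] = update_one(db, key)          # commit it as its own transaction

# Toy usage: pull each value halfway toward the current mean until stable.
db = {0: 0.0, 1: 10.0, 2: 4.0}
run_iterative_transactions(
    db,
    update_one=lambda d, k: 0.5 * d[k] + 0.5 * (sum(d.values()) / len(d)),
    converged=lambda d: max(d.values()) - min(d.values()) < 1e-3,
)
print(db)   # all three values have converged to (nearly) the same number
```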
... Similar approaches [34,59] are introduced to address the problem in the context of property graphs. GRFusion [34] focuses on filling the gap between the relational and the graph models rather than optimizing the graph schema to achieve better query performance. SQLGraph [59] and Db2 Graph [61] introduce a physical schema design that combines relational storage for adjacency information with JSON storage for vertex and edge attributes. ...
Article
Full-text available
Enterprises create domain-specific knowledge bases (KBs) by curating and integrating their business data from multiple sources. To support a variety of query types over domain-specific KBs, we propose Hermes, an ontology-based system that allows storing KB data in multiple backends, and querying them with different query languages. In this paper, we address two important challenges in realizing such a system: data placement and schema optimization. First, we identify the best data store for any query type and determine the subset of the KB that needs to be stored in this data store, while minimizing data replication. Second, we optimize how we organize the data for best query performance. To choose the best data stores, we partition the data described by the domain ontology into multiple overlapping subsets based on the operations performed in a given query workload, and place these subsets in appropriate data stores according to their capabilities. Then, we optimize the schema on each data store to enable efficient querying. In particular, we focus on the property graph schema optimization, which has been largely ignored in the literature. We propose two algorithms to generate an optimized schema from the domain ontology. We demonstrate the effectiveness of our data placement and schema optimization algorithms with two real-world KBs from the medical and financial domains. The results show that the proposed data placement algorithm generates near-optimal data placement plans with minimal data replication overhead, and the schema optimization algorithms produce high-quality schemas, achieving up to two orders of magnitude speed-up compared to alternative schema designs.
... Many prior works in this area [4][5][6] extend the RDBMS architecture to integrate machine learning algorithms into the database, enabling users to call machine learning methods using SQL queries inside the database. Another major area is the extension of relational algebra to introduce new operators and optimizations into the database, e.g., similarity-aware relational algebra [7,8], graph-relational algebra [3], relational algebra with entity resolution operators [1], etc. ...
... To save time, we prepare the functions before this demo and browse them together with the audience at demo time. If the audience is interested in some details, we explain them further, e.g., how the functions parse a literal "[1,2,3,4]" into a Vector. The second step is registering the SimSelection syntax into the registry. ...
Preprint
With the rapid increase of data scale, in-database analytics and learning has become one of the most studied topics in the data science community, because of its significance in reducing the gap between the management and the analytics of data. By extending the capability of databases for analytics and learning, data scientists can save much time on exchanging data between databases and external analytic tools. Toward this goal, researchers are attempting to integrate more data science algorithms into databases. However, implementing such algorithms in mainstream databases is very time-consuming, especially when it requires a deep dive into the database kernels. Thus there is demand for an easy-to-extend database simulator to help quickly prototype and verify in-database algorithms before implementing them in real databases. In this demo, we present such an extensible relational database simulator, DBSim, to help data scientists prototype their in-database analytics and learning algorithms and verify the effectiveness of their ideas with minimal cost. DBSim simulates a real relational database by integrating all the major components in mainstream RDBMSs, including the SQL parser, relational operators, query optimizer, etc. In addition, DBSim provides various interfaces for users to flexibly plug their custom extension modules into any of the major components, without modifying the kernel. Through those interfaces, DBSim supports easy extensions of SQL syntax, relational operators, query optimizer rules and cost models, and physical plan execution. Furthermore, DBSim provides utilities to facilitate users' development and debugging, such as a query plan visualizer and an interactive analyzer of optimization rules. We develop DBSim in pure Python to support seamless implementation of most data science algorithms, since many of them are written in Python.
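The plug-in style described here (and the SimSelection registration mentioned in the snippet above) can be sketched with a hypothetical registry interface; DBSim's real extension API differs, so treat the names below as illustrative only:

```python
# Hypothetical sketch of a registry-based extension point, in the spirit of
# plugging a custom similarity-selection operator into a simulator kernel.

class Registry:
    """Maps operator keywords to handler functions, so new SQL syntax can be
    supported without modifying the simulator kernel."""
    def __init__(self):
        self.operators = {}

    def register(self, keyword):
        def deco(fn):
            self.operators[keyword] = fn
            return fn
        return deco

registry = Registry()

@registry.register("SIM_SELECT")
def sim_select(rows, column, query_vec, k):
    """Custom operator: return the top-k rows closest to query_vec."""
    def dist(v):  # squared Euclidean distance to the query vector
        return sum((a - b) ** 2 for a, b in zip(v, query_vec))
    return sorted(rows, key=lambda r: dist(r[column]))[:k]

rows = [{"v": [1, 2]}, {"v": [9, 9]}, {"v": [1, 3]}]
print(registry.operators["SIM_SELECT"](rows, "v", [1, 2], k=2))
```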
... View-based approaches in graph databases. With the advances in graph databases, graph view-based approaches [13,16,21,37] have gained increasing attention. For instance, Db2 Graph [37] utilizes a graph overlay approach to define a graph view over the underlying relational data. ...
... Fan et al. [16] implemented graph views for pattern queries based on graph simulation [15]. GRFusion [21] decouples the graph topology from the relational tables and uses pointers to connect the graph topology with the relational attribute data. While the aforementioned methods implement the graph view using relational approaches, G-View proposes an extended graph view, which not only utilizes a native graph approach but also supports subgraph and supergraph query answering. ...
Preprint
Full-text available
Recently, several works have studied the problem of view selection in graph databases. However, existing methods cannot fully exploit the graph properties of views, e.g., supergraph views and common subgraph views, which leads to low view utility and duplicate view content. To address the problem, we propose an end-to-end graph view selection tool, G-View, which can judiciously generate a view set from a query workload by exploring the graph properties of candidate views and considering their efficacy. Specifically, given a graph query set and a space budget, G-View translates each query to a candidate view pattern and checks query containment via a filtering-and-verification framework. G-View then selects the views using a graph gene algorithm (GGA), which relies on a three-phase framework that explores graph view transformations to reduce the view space and optimize the view benefit. Finally, G-View generates extended graph views that persist all the edge-induced subgraphs to answer subgraph and supergraph queries simultaneously. Extensive experiments on real-life and synthetic datasets demonstrate that G-View achieves average query speedups of 21x and 2x over two view-based methods, while having 2x and 5x smaller space overheads, respectively. Moreover, the proposed selection algorithm, GGA, outperforms other selection methods in both effectiveness and efficiency.
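The selection problem itself (pick views under a space budget) is easy to illustrate; the sketch below is a simple greedy benefit-per-size baseline, emphatically not the paper's GGA, which is a genetic algorithm over view transformations:

```python
# Hedged baseline for view selection under a space budget (NOT the GGA).

def select_views(candidates, budget):
    """candidates: list of (name, size, benefit) tuples. Greedily pick views
    by benefit density until the space budget is exhausted."""
    chosen, used = [], 0
    for name, size, benefit in sorted(
            candidates, key=lambda c: c[2] / c[1], reverse=True):
        if used + size <= budget:
            chosen.append(name)
            used += size
    return chosen

views = [("v1", 40, 100), ("v2", 25, 90), ("v3", 50, 60)]
print(select_views(views, budget=70))   # ['v2', 'v1']
```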
... Recently, works such as [28,44,45] address a similar problem in the context of property graphs. GRFusion [28] focuses on filling the gap between the relational and the graph models rather than optimizing the graph schema to achieve better query performance. Szárnyas et al. [45] propose to use incremental view maintenance for property graph queries. ...
Preprint
Full-text available
Enterprises are creating domain-specific knowledge graphs by curating and integrating their business data from multiple sources. The data in these knowledge graphs can be described using ontologies, which provide a semantic abstraction to define the content in terms of the entities and the relationships of the domain. The rich semantic relationships in an ontology contain a variety of opportunities to reduce edge traversals and consequently improve graph query performance. Although there has been a lot of effort to build systems that enable efficient querying over knowledge graphs, the problem of schema optimization for query performance has been largely ignored in the graph setting. In this work, we show that graph schema design has a significant impact on query performance, and then propose optimization algorithms that exploit the opportunities from the domain ontology to generate efficient property graph schemas. To the best of our knowledge, we are the first to present an ontology-driven approach for property graph schema optimization. We conduct empirical evaluations with two real-world knowledge graphs from the medical and financial domains. The results show that the schemas produced by the optimization algorithms achieve up to two orders of magnitude speedup compared to the baseline approach.
... So as not to pick on other researchers, the first author lists only systems that his own group has developed that fall into the same pitfall above, e.g., AQWA [7], LocationSpark [52], Tornado/SWARM [16,33], SP-GiST [10], and GRFusion [23,24], which extend Hadoop [47], Spark [58], Storm [1], PostgreSQL [50], and VoltDB [51], respectively, where the latter systems were originally optimized for non-location data. Many other researchers and industry efforts follow the same paths, with some notable successes, e.g., Oracle Spatial [5], SpatialHadoop [19], and GeoSpark [57]. ...
Preprint
Full-text available
Due to the ubiquity of mobile phones and location-detection devices, location data is being generated in very large volumes. Queries and operations that are performed on location data warrant the use of database systems. Despite that, location data is being supported in data systems as an afterthought. Typically, relational or NoSQL data systems that are mostly designed with non-location data in mind get extended with spatial or spatiotemporal indexes, some query operators, and higher-level syntactic sugar in order to support location data. The ubiquity of location data and location data services calls for systems that are solely designed and optimized for the efficient support of location data. This paper envisions designing intelligent location+X data systems, ILX for short, where location is treated as a first-class citizen type. ILX is tailored with location data as the main data type (location-first). Because location data is typically augmented with other data types X, e.g., graphs, text data, click streams, annotations, etc., ILX needs to be extensible to support other data types X along with location. This paper envisions the main features that ILX should support and outlines some research challenges related to realizing and supporting ILX.
... Table 1 shows the response times for all scenarios. The authors in [6] proposed the DualFetchQL system, which provides a platform to integrate and access data from MySQL (a relational database) and MongoDB (a NoSQL database). A query syntax named aggregate query is devised to access data from the two different sources; it determines by itself in which of the underlying databases the data can be found. ...
Article
Full-text available
Relational databases hold a leading position in the database industry and are better known as its workhorses. They are one of the most popular and commonly used media for recording, manipulating, and retrieving data, with applications spanning varied requirements including data mining, business intelligence, and analytical querying. However, new requirements from database developers emerged with the arrival of Web 3.0 and the data explosion of big data and social networks. These requirements focus more on relationships in data, making graph-structured data important. Graph databases provide a flexible structure to address graph data. However, various domains give rise to graph+relational queries, which has led to the development of hybrid databases. A hybrid model design integrates two popular database models in a single framework in order to eliminate the drawbacks of each system. This paper reviews prior research on the development of hybrid database models and proposes a hybrid system.
Article
There are two types of high-performance graph processing engines: low- and high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide optimized data structures and computation models but require users to write low-level imperative code, hence ensuring that efficiency is the burden of the user. In high-level engines, users write in query languages like datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines. We present EmptyHeaded, a high-level engine that supports a rich datalog-like query language and achieves performance comparable to that of low-level engines. At the core of EmptyHeaded’s design is a new class of join algorithms that satisfy strong theoretical guarantees, but have thus far not achieved performance comparable to that of specialized graph processing engines. To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and execution engine that leverage single-instruction multiple data (SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level approaches by up to three orders of magnitude on graph pattern queries, PageRank, and Single-Source Shortest Paths (SSSP) and is an order of magnitude faster than many low-level baselines. We validate that EmptyHeaded competes with the best-of-breed low-level engine (Galois), achieving comparable performance on PageRank and at most 3× worse performance on SSSP. Finally, we show that the EmptyHeaded design can easily be extended to accommodate a standard resource description framework (RDF) workload, the LUBM benchmark. On the LUBM benchmark, we show that EmptyHeaded can compete with and sometimes outperform two high-level, but specialized RDF baselines (TripleBit and RDF-3X), while outperforming MonetDB by up to three orders of magnitude and LogicBlox by up to two orders of magnitude.
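EmptyHeaded's join engine centers on worst-case optimal joins. A minimal generic instance of that class of algorithms, on the classic triangle query Q(a,b,c) :- R(a,b), S(b,c), T(a,c), is sketched below; this is the textbook attribute-at-a-time form, not EmptyHeaded's trie layouts or SIMD set intersections:

```python
# Generic worst-case optimal join on the triangle query (illustrative only).

def triangles(R, S, T):
    Ra, Sb, Ta = {}, {}, {}
    for a, b in R: Ra.setdefault(a, set()).add(b)   # a -> {b : (a,b) in R}
    for b, c in S: Sb.setdefault(b, set()).add(c)   # b -> {c : (b,c) in S}
    for a, c in T: Ta.setdefault(a, set()).add(c)   # a -> {c : (a,c) in T}
    out = []
    for a in Ra.keys() & Ta.keys():        # bind a: must appear in R and T
        for b in Ra[a] & Sb.keys():        # bind b via set intersection
            for c in Sb[b] & Ta[a]:        # bind c via set intersection
                out.append((a, b, c))
    return out

E = [(0, 1), (1, 2), (0, 2)]    # a triangle 0-1-2 as directed edges
print(triangles(E, E, E))        # [(0, 1, 2)]
```

The attribute-at-a-time intersections are what give the worst-case optimality guarantee; EmptyHeaded's contribution is making exactly this pattern fast with its optimizer and SIMD execution.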
Article
Analyzing interconnection structures among underlying entities or objects in a dataset through the use of graph analytics has been shown to provide tremendous value in many application domains. However, graphs are not the primary representation choice for storing most data today, and in order to have access to these analyses, users are forced to extract data from their data stores, construct the requisite graphs, and then load them into some graph engine in order to execute their graph analysis task. Moreover, these graphs can be significantly larger than the initial input stored in the database, making it infeasible to construct or analyze such graphs in memory. In this paper we address both of these challenges by building a system that enables users to declaratively specify graph extraction tasks over a relational database schema and then execute graph algorithms on the extracted graphs. We propose a declarative domain-specific language for this purpose, and pair it up with a novel condensed, in-memory representation that significantly reduces the memory footprint of these graphs, permitting analysis of larger-than-memory graphs. We present a general algorithm for creating this condensed representation for a large class of graph extraction queries against arbitrary schemas. We observe that the condensed representation suffers from a duplication issue, that results in inaccuracies for most graph algorithms. We then present a suite of in-memory representations that handle this duplication in different ways and allow trading off the memory required and the computational cost for executing different graph algorithms. We introduce novel deduplication algorithms for removing this duplication in the graph, which are of independent interest for graph compression, and provide a comprehensive experimental evaluation over several real-world and synthetic datasets illustrating these trade-offs.
Article
Big data processing is driven by new types of in-memory database systems. In this article, we apply performance modeling to efficiently optimize workload placement for such systems. In particular, we propose novel response time approximations for in-memory databases based on fork-join queuing models and contention probabilities to model variable threading levels and per-class memory occupation under analytical workloads. We combine these approximations with a nonlinear optimization methodology that seeks optimal load dispatching probabilities in order to minimize memory swapping and resource utilization. We compare our approach with state-of-the-art response time approximations using real data from an SAP HANA in-memory system and show that our models markedly improve accuracy over existing approaches, at similar computational costs.
Conference Paper
Graph data is prevalent in many domains, but it has usually required specialized engines to analyze. This design is onerous for users and precludes optimization across complete workflows. We present GraphFrames, an integrated system that lets users combine graph algorithms, pattern matching and relational queries, and optimizes work across them. GraphFrames generalize the ideas in previous graph-on-RDBMS systems, such as GraphX and Vertexica, by letting the system materialize multiple views of the graph (not just the specific triplet views in these systems) and executing both iterative algorithms and pattern matching using joins. To make applications easy to write, GraphFrames provide a concise, declarative API based on the "data frame" concept in R that can be used for both interactive queries and standalone programs. Under this API, GraphFrames use a graph-aware join optimization algorithm across the whole computation that can select from the available views. We implement GraphFrames over Spark SQL, enabling parallel execution on Spark and integration with custom code. We find that GraphFrames make it easy to express end-to-end workflows and match or exceed the performance of standalone tools, while enabling optimizations across workflow steps that cannot occur in current systems. In addition, we show that GraphFrames' view abstraction makes it easy to further speed up interactive queries by registering the appropriate view, and that the combination of graph and relational data allows for other optimizations, such as attribute-aware partitioning.
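GraphFrames ships as a Spark package, and the user-facing API below (GraphFrame construction, motif finding via find, built-in algorithms) follows its documentation; the session setup and toy data are illustrative:

```python
# Requires a Spark installation with the graphframes package on the classpath.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("gf-demo").getOrCreate()
v = spark.createDataFrame([("a", 34), ("b", 36), ("c", 30)], ["id", "age"])
e = spark.createDataFrame([("a", "b", "friend"), ("b", "c", "follow")],
                          ["src", "dst", "relationship"])
g = GraphFrame(v, e)

# Pattern matching plus a relational filter in one optimized workflow:
paths = g.find("(x)-[e1]->(y); (y)-[e2]->(z)").filter("x.age > 30")
paths.show()

# A built-in iterative algorithm on the same graph:
g.pageRank(resetProbability=0.15, maxIter=5).vertices.show()
```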
Article
The nested relational model provides a better way to represent complex objects than the (flat) relational model, by allowing relations to have relation-valued attributes. A recursive algebra for nested relations that allows tuples at all levels of nesting in a nested relation to be accessed and modified without any special navigational operators and without having to flatten the nested relation has been developed. In this algebra, the operators of the nested relational algebra are extended with recursive definitions so that they can be applied not only to relations but also to subrelations of a relation. In this paper, we show that queries are more efficient and succinct when expressed in the recursive algebra than in languages that require restructuring in order to access subrelations of relations. We also show that most of the query optimization techniques that have been developed for the relational algebra can be easily extended for the recursive algebra and that queries are more easily optimizable when expressed in the recursive algebra than when they are expressed in languages like the non-recursive algebra.
Conference Paper
An in-memory indexing tree is a critical component of many databases. Modern many-core processors, such as GPUs, are offering tremendous amounts of computing power making them an attractive choice for accelerating indexing. However, the memory available to the accelerating co-processor is rather limited and expensive in comparison to the memory available to the CPU. This drawback is a barrier to exploit the computing power of co-processors for arbitrarily large index trees. In this paper, we propose a novel design for a B+-tree based on the heterogeneous computing platform and the hybrid memory architecture found in GPUs. We propose a hybrid CPU-GPU B+-tree, "HB+-tree," which targets high search throughput use cases. Unique to our design is the joint and simultaneous use of computing and memory resources of CPU-GPU systems. Our experiments show that our HB+-tree can perform up to 240 million index queries per second, which is 2.4X higher than our CPU-optimized solution.
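The operation the paper accelerates is the standard B+-tree search: descend inner nodes by binary search on separator keys, then probe the leaf. A single-threaded CPU baseline of that descent (illustrative; the HB+-tree's contribution is splitting this work across GPU and CPU memory) looks like:

```python
# Plain B+-tree point lookup: binary-search the routing keys at each inner
# node, then the leaf keys. Not the paper's hybrid CPU-GPU layout.
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys, self.children, self.values = keys, children, values

def search(node, key):
    while node.children is not None:                  # inner node: route down
        node = node.children[bisect.bisect_right(node.keys, key)]
    i = bisect.bisect_left(node.keys, key)            # leaf: exact match?
    return node.values[i] if i < len(node.keys) and node.keys[i] == key else None

leaf1 = Node([1, 3], values=["a", "b"])
leaf2 = Node([5, 7], values=["c", "d"])
root = Node([5], children=[leaf1, leaf2])
print(search(root, 7))   # 'd'
```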
Conference Paper
Multicore in-memory databases often rely on traditional concurrency control schemes such as two-phase-locking (2PL) or optimistic concurrency control (OCC). Unfortunately, when the workload exhibits a non-trivial amount of contention, both 2PL and OCC sacrifice much parallel execution opportunity. In this paper, we describe a new concurrency control scheme, interleaving constrained concurrency control (IC3), which provides serializability while allowing for parallel execution of certain conflicting transactions. IC3 combines the static analysis of the transaction workload with runtime techniques that track and enforce dependencies among concurrent transactions. The use of static analysis simplifies IC3's runtime design, allowing it to scale to many cores. Evaluations on a 64-core machine using the TPC-C benchmark show that IC3 outperforms traditional concurrency control schemes under contention. It achieves a throughput of 434K transactions/sec on the TPC-C benchmark configured with only one warehouse. It also scales better than several recent concurrency control schemes that also target contended workloads.
Conference Paper
A variety of applications spanning various domains, e.g., social networks, transportation, and bioinformatics, have graphs as first-class citizens. These applications share a vital operation, namely, finding the shortest path between two nodes. In many scenarios, users are interested in filtering the graph before finding the shortest path. For example, in social networks, one may need to compute the shortest path between two persons on a sub-graph containing only family relationships. This paper focuses on dynamic graphs with labeled edges, where the target is to find a shortest path after filtering some edges based on user-specified query labels. This problem is termed the Edge-Constrained Shortest Path query (or ECSP, for short). This paper introduces Edge-Disjoint Partitioning (EDP, for short), a new technique for efficiently answering ECSP queries over dynamic graphs. EDP has two main components: a dynamic index that is based on graph partitioning, and a traversal algorithm that exploits the regular patterns of the answers of ECSP queries. The main idea of EDP is to partition the graph based on the labels of the edges. On demand, EDP computes specific sub-paths within each partition and updates its index. The computed sub-paths act as pre-computations that can be leveraged by future queries. To answer an ECSP query, EDP connects sub-paths from different partitions using its efficient traversal algorithm. EDP can dynamically handle various types of graph updates, e.g., label, edge, and node updates. The index entries that are potentially affected by graph updates are invalidated and re-computed on demand. EDP is evaluated using real graph datasets from various domains. Experimental results demonstrate that EDP can achieve query performance gains of up to four orders of magnitude in comparison to state of the art techniques.
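The query EDP answers can be stated precisely with its simplest baseline: Dijkstra's algorithm restricted to user-specified edge labels. EDP's partitioned index and cached sub-paths exist to beat exactly this kind of full traversal; the sketch below shows only the baseline semantics:

```python
# Edge-Constrained Shortest Path baseline: Dijkstra over allowed labels only.
import heapq

def ecsp(adj, src, dst, allowed):
    """adj: {u: [(v, weight, label), ...]}. Shortest-path length from src to
    dst using only edges whose label is in 'allowed'; None if unreachable."""
    dist, heap = {src: 0}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue                        # stale heap entry
        for v, w, lbl in adj.get(u, []):
            if lbl in allowed and d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return None

adj = {0: [(1, 1, "family"), (2, 1, "friend")], 1: [(2, 1, "family")]}
print(ecsp(adj, 0, 2, {"family"}))   # 2 -- the direct 'friend' edge is filtered
```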
Conference Paper
By maintaining the data in main memory, in-memory databases dramatically reduce the I/O cost of transaction processing. However, for recovery purposes, in-memory systems still need to flush the log to disk, which incurs a substantial number of I/Os. Recently, command logging has been proposed to replace the traditional data log (e.g., ARIES logging) in in-memory databases. Instead of recording how the tuples are updated, command logging only tracks the transactions that are being executed, thereby effectively reducing the size of the log and improving the performance. However, when a failure occurs, all the transactions in the log after the last checkpoint must be redone sequentially and this significantly increases the cost of recovery. In this paper, we first extend the command logging technique to a distributed system, where all the nodes can perform their recovery in parallel. We show that in a distributed system, the only bottleneck of recovery caused by command logging is the synchronization process that attempts to resolve the data dependency among the transactions. We then propose an adaptive logging approach by combining data logging and command logging. The percentage of data logging versus command logging becomes a tuning knob between the performance of transaction processing and recovery to meet different OLTP requirements, and a model is proposed to guide such tuning. Our experimental study compares the performance of our proposed adaptive logging, ARIES-style data logging and command logging on top of H-Store. The results show that adaptive logging can achieve a 10x boost for recovery and a transaction throughput that is comparable to that of command logging.
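The two record kinds being mixed are easy to picture; the sketch below is a loose illustration of the tuning knob only (the paper's model for choosing the ratio and resolving recovery dependencies is far more involved than a coin flip):

```python
# Hedged sketch: command-log records store the invocation; data-log records
# store ARIES-style after-images. The p_data knob trades logging cost for
# recovery cost, echoing the paper's tuning knob in a much-simplified form.
import random

def log_transaction(log, txn_id, proc_name, args, touched_tuples, p_data=0.3):
    if random.random() < p_data:
        log.append(("DATA", txn_id, touched_tuples))       # fast redo, large record
    else:
        log.append(("COMMAND", txn_id, proc_name, args))   # tiny, re-executed on recovery

log = []
log_transaction(log, 1, "TransferMoney", (101, 202, 50.0),
                [("acct", 101, {"bal": 950.0}), ("acct", 202, {"bal": 1050.0})])
print(log)
```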