Table 1: Comparison of graph database systems

Source publication
Chapter
Full-text available
Many big data applications in business and science require the management and analysis of huge amounts of graph data. Suitable systems to manage and analyze such graph data should meet a number of challenging requirements, including support for an expressive graph data model with heterogeneous vertices and edges, powerful query and graph mining c...

Contexts in source publication

Context 1
... database systems are based on a graph data model representing data by graph structures and providing graph-based operators such as neighborhood traversal and pattern matching [6]. Table 1 provides an overview of recent graph database systems, including supported data models, their application scope and the used storage approaches. The selection makes no claim to completeness but shows representatives from current research projects and commercial systems with diverse characteristics. ...
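To make the two operator classes concrete, here is a minimal, hypothetical sketch of neighborhood traversal and single-edge pattern matching over a toy in-memory property graph; the class and method names are illustrative and not taken from any of the surveyed systems.

```python
# A minimal sketch of the two graph operators named above, on a toy
# in-memory property graph. All names are illustrative.
from collections import defaultdict

class PropertyGraph:
    def __init__(self):
        self.vertices = {}                  # id -> {"label": ..., "props": {...}}
        self.out_edges = defaultdict(list)  # id -> [(edge_label, target_id)]

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = {"label": label, "props": props}

    def add_edge(self, src, label, dst):
        self.out_edges[src].append((label, dst))

    def neighborhood(self, vid, edge_label=None):
        """Neighborhood traversal: all direct successors, optionally filtered."""
        return [dst for lbl, dst in self.out_edges[vid]
                if edge_label is None or lbl == edge_label]

    def match_path(self, src_label, edge_label, dst_label):
        """Pattern matching for the simple pattern (a)-[e]->(b)."""
        for src, edges in self.out_edges.items():
            if self.vertices[src]["label"] != src_label:
                continue
            for lbl, dst in edges:
                if lbl == edge_label and self.vertices[dst]["label"] == dst_label:
                    yield src, dst

g = PropertyGraph()
g.add_vertex(1, "Person", name="Alice")
g.add_vertex(2, "Person", name="Bob")
g.add_vertex(3, "Company", name="Acme")
g.add_edge(1, "knows", 2)
g.add_edge(1, "worksAt", 3)
print(g.neighborhood(1))                                    # [2, 3]
print(list(g.match_path("Person", "worksAt", "Company")))   # [(1, 3)]
```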
Context 2
... only system capable of running custom graph processing algorithms within the database is Blazegraph, via its gather-apply-scatter (see Sect. 3) API. Additionally, the current version of TinkerPop includes the virtual integration of graph processing systems in graph databases, i.e., from the user perspective graph processing is part of the database system, but data is actually moved to an external system. However, as indicated by a circle in the analytics column of Table 1, we could identify only two systems currently implementing this functionality. ...
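As a rough illustration of the gather-apply-scatter paradigm mentioned above (a generic sketch, not Blazegraph's actual API), the following computes connected-component labels by repeatedly gathering the minimum neighbor label, applying it, and scattering activation to neighbors:

```python
# A system-agnostic sketch of gather-apply-scatter (GAS) iteration.
def gas_min_label(vertices, edges):
    """Connected-component labels via gather/apply/scatter rounds."""
    value = {v: v for v in vertices}         # initial label = own id
    neighbors = {v: set() for v in vertices}
    for u, w in edges:                       # undirected graph
        neighbors[u].add(w)
        neighbors[w].add(u)
    active = set(vertices)
    while active:
        next_active = set()
        for v in active:
            gathered = min((value[n] for n in neighbors[v]),
                           default=value[v])             # gather
            if gathered < value[v]:
                value[v] = gathered                      # apply
                next_active |= neighbors[v]              # scatter
        active = next_active
    return value

print(gas_min_label([1, 2, 3, 4], [(1, 2), (3, 4)]))
# {1: 1, 2: 1, 3: 3, 4: 3}
```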
Context 3
... Fig. 1d shows a hypergraph with a ternary hyperedge. Of the graph databases in Table 1, only HypergraphDB supports hypergraphs by default. A graph data model supporting edges not only between vertices but also between graphs is the hypernode model [86]. ...
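For illustration, a hypergraph can be represented by letting each hyperedge connect an arbitrary vertex set; this toy sketch assumes nothing about HypergraphDB's actual interface.

```python
# A toy hypergraph: a hyperedge connects any number of vertices.
class Hypergraph:
    def __init__(self):
        self.vertices = set()
        self.hyperedges = []   # each hyperedge is a frozenset of vertex ids

    def add_hyperedge(self, *vertex_ids):
        self.vertices.update(vertex_ids)
        self.hyperedges.append(frozenset(vertex_ids))

    def incident(self, vid):
        """All hyperedges containing the given vertex."""
        return [e for e in self.hyperedges if vid in e]

h = Hypergraph()
h.add_hyperedge("author1", "author2", "paper7")   # a ternary hyperedge
print(h.incident("paper7"))
# [frozenset({'author1', 'author2', 'paper7'})]
```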
Context 4
... is most popular in the context of the semantic web, where its major strengths are standardization, the availability of web knowledge bases to flexibly enrich user databases and the resulting reasoning capabilities over linked RDF data [111]. Kaoudi and Manolescu [58] comprehensively survey recent approaches to manage large RDF graphs and consider additional systems not listed in Table 1. ...
Context 5
... consequence, every PGM edge is expressed by 3 + m triples, where m is the number of properties. Two of the graph databases of Table 1 store the PGM using RDF, but both use alternative, non-standard ways of reification. Stardog uses n-quads [22] for PGM edge reification. ...
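The 3 + m count can be made concrete with a generic (again, non-standard and purely illustrative) reification scheme in which an edge becomes its own RDF node carrying one source, one target and one label triple, plus one triple per property:

```python
# Illustrative edge reification producing exactly 3 + m triples.
def reify_edge(edge_id, src, label, dst, props):
    triples = [
        (edge_id, ":source", src),    # 1st triple: edge -> source vertex
        (edge_id, ":target", dst),    # 2nd triple: edge -> target vertex
        (edge_id, ":label", label),   # 3rd triple: edge label
    ]
    # plus one triple per edge property (m triples)
    triples += [(edge_id, key, value) for key, value in props.items()]
    return triples

# An edge with m = 2 properties yields 3 + 2 = 5 triples.
for t in reify_edge("_:e1", ":alice", "knows", ":bob",
                    {":since": "2014", ":weight": "0.8"}):
    print(t)
```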
Context 6
... In the initialization step, a sequential connected component algorithm is executed to find all local connected components. The locally computed component label for each boundary vertex is then sent to its corresponding internal vertex. ...
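A compact sketch of this per-partition step, with invented vertex names: a sequential BFS-based connected component pass, after which the labels of boundary vertices are collected as messages for their internal counterparts on other partitions.

```python
# Sequential local connected components plus boundary-label collection.
from collections import deque, defaultdict

def local_components(vertices, edges):
    adj = defaultdict(set)
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    label = {}
    for start in vertices:
        if start in label:
            continue
        label[start] = start          # component label = BFS start vertex id
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for n in adj[v]:
                if n not in label:
                    label[n] = label[start]
                    queue.append(n)
    return label

# Partition-local graph; 'b1' and 'b2' stand for boundary vertices (copies
# of vertices owned by other partitions -- an assumption for illustration).
labels = local_components(["a", "b1", "c", "b2"], [("a", "b1"), ("c", "b2")])
messages = {v: lbl for v, lbl in labels.items() if v.startswith("b")}
print(messages)   # labels to send to the corresponding internal vertices
```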

Similar publications

Article
Full-text available
Recent demands for storing and querying big data have revealed various shortcomings of traditional relational database systems. This, in turn, has led to the emergence of a new kind of complementary nonrelational data store, named NoSQL. This survey mainly aims at elucidating the design decisions of NoSQL stores with regard to the four nonorthog...

Citations

... GDBs have been researched in both academia and industry [10,11,75,90,100,124,133], in terms of query languages [7,8,49,208], database management [49, 120,156,172,175], compression and data models [23,35,40,43,146,147,162], execution in novel environments such as the serverless setting [71,150,219], and others [77]. Many graph databases exist [6, 12-15, 46, 55, 56, 59, 65, 73, 78, 82, 87, 89, 125, 126, 151, 154, 155, 164-168, 178, 182, 183, 202, 209, 211, 214, 217, 218, 233, 238, 240]. ...
Preprint
Full-text available
Graph databases (GDBs) are crucial in academic and industry applications. The key challenges in developing GDBs are achieving high performance, scalability, programmability, and portability. To tackle these challenges, we harness established practices from the HPC landscape to build a system that outperforms all past GDBs presented in the literature by orders of magnitude, for both OLTP and OLAP workloads. For this, we first identify and crystallize performance-critical building blocks in the GDB design, and abstract them into a portable and programmable API specification, called the Graph Database Interface (GDI), inspired by the best practices of MPI. We then use GDI to design a GDB for distributed-memory RDMA architectures. Our implementation harnesses one-sided RDMA communication and collective operations, and it offers architecture-independent theoretical performance guarantees. The resulting design achieves extreme scales of more than a hundred thousand cores. Our work will facilitate the development of next-generation extreme-scale graph databases.
... Another prominent challenge when dealing with big data is storage. Junghanns et al. (2017) (as cited in Alabdullah et al., 2018) claimed that traditional relational databases are outdated and incapable of storing and processing the data generated by modern business applications. Furthermore, the study by Alabdullah et al. (2018) reported that everyday problems like data recording, data storage costs, and synchronization issues drive data scientists to use NoSQL alternatives. ...
... (2) How to organize historical graph data? It is observed that current graph database systems mostly focus on OLTP-like CRUD operations (create, read, update, delete) for vertices and edges as well as on queries over only small portions of a graph [24]. Thus, it is vital to support fine-grained time management that can record the evolution of temporal graphs by maintaining only the changed graph objects. ...
Preprint
Real-world graphs are often dynamic and evolve over time. To trace the evolving properties of graphs, it is necessary to maintain every change of both vertices and edges in graph databases with support for temporal features. Existing works either maintain all changes in a single graph or periodically materialize snapshots to maintain the historical states of each vertex and edge and process queries over the proper snapshots. The former approach suffers from poor query performance due to the ever-growing graph size, while the latter incurs prohibitively high storage overheads due to large redundant copies of graph data across snapshots. In this paper, we propose a hybrid data storage engine, based on the MVCC mechanism, that manages current and historical data separately and keeps the current graph as small as possible. In our design, the changes to each vertex or edge are stored once. To further reduce the storage overhead, we store only the changes as opposed to complete snapshots. To boost query performance, we place a few anchors as snapshots to avoid deep historical version traversals. On top of the storage engine, a temporal query engine reconstructs subgraphs as needed on the fly. Our approach can therefore provide fast queries over subgraphs at a past time point or range with small storage overheads. To provide native support for temporal features, we integrate our approach into Memgraph and call the extended database system TGDB (Temporal Graph Database). Extensive experiments on four real and synthetic datasets show that TGDB performs better in terms of both storage and performance than state-of-the-art methods and introduces almost no performance overhead for the temporal features.
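A much-simplified sketch of the storage idea described in this abstract, per-object deltas plus occasional anchor snapshots, is shown below; the class is hypothetical, and TGDB's actual MVCC-based engine is far more involved.

```python
# Per-object deltas with periodic anchor snapshots for fast historical reads.
import bisect

class VersionedVertex:
    """Each change is stored once as a delta; every `anchor_every` changes,
    a full snapshot (anchor) is kept so historical reads replay few deltas."""
    def __init__(self, anchor_every=3):
        self.deltas = []          # (time, key, value), ascending by time
        self.anchors = [(0, {})]  # (time, full property dict)
        self.anchor_every = anchor_every

    def set_property(self, time, key, value):
        self.deltas.append((time, key, value))
        if len(self.deltas) % self.anchor_every == 0:
            self.anchors.append((time, self.properties_at(time)))

    def properties_at(self, time):
        """Reconstruct the property map as of `time` on the fly."""
        i = bisect.bisect_right([t for t, _ in self.anchors], time) - 1
        anchor_time, props = self.anchors[i]
        props = dict(props)
        for t, key, value in self.deltas:   # replay deltas after the anchor
            if anchor_time < t <= time:
                props[key] = value
        return props

v = VersionedVertex()
v.set_property(1, "status", "new")
v.set_property(5, "status", "active")
v.set_property(9, "owner", "alice")    # third change triggers an anchor
print(v.properties_at(4))   # {'status': 'new'}
print(v.properties_at(10))  # {'status': 'active', 'owner': 'alice'}
```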
... A graph is a data structure that represents multiple relationships through vertices and edges [1,2]. With the advancements in big data and artificial intelligence technology, graphs are widely used to process and analyze various relationships between objects [3,4]. ...
Article
Full-text available
Studies on the real-time detection of connected components in graph streams have been carried out. Existing connected component detection methods cannot process connected components incrementally, and their performance deteriorates due to frequent data transmission when a GPU is used. In this paper, we propose a new incremental processing method to solve the problems found in existing methods for detecting connected components on GPUs. The proposed method minimizes the amount of data sent to the GPU by determining the subgraph affected by a graph stream update and detecting the part to be recalculated. We consider the number of vertices to quickly determine the connected components of a graph stream on the GPU. An asynchronous execution method is used to shorten the transfer time between the CPU and the GPU as the graph stream changes in real time. To show that the proposed method provides fast incremental connected component detection on the GPU, we evaluated its performance using various datasets.
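The incremental idea can be sketched with a union-find structure: an edge insertion only merges two component labels, so most updates avoid a full recomputation (and, in the paper's setting, a CPU-to-GPU transfer). This is a generic sketch, not the authors' GPU method; edge deletions would additionally require recomputing the affected component.

```python
# Incremental connected components for edge insertions via union-find.
class IncrementalCC:
    def __init__(self):
        self.parent = {}

    def find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]  # path halving
            v = self.parent[v]
        return v

    def insert_edge(self, u, w):
        ru, rw = self.find(u), self.find(w)
        if ru != rw:
            self.parent[rw] = ru   # merge: only the affected labels change

cc = IncrementalCC()
cc.insert_edge(1, 2)
cc.insert_edge(3, 4)
cc.insert_edge(2, 3)
print(cc.find(4) == cc.find(1))  # True: all four vertices now connected
```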
... Indeed, graph analytics handles many emerging large data management scenarios in several application domains such as social media, astronomy, computational biology, telecommunications, protein networks, and many more [Bat+15; Sah+20]. Thus, large graph analytics has become one of the most effective Big Data (BD) tasks for getting insights from huge volumes of interconnected data [Jun+17]. One recent example of the importance of big graph analytics is the timely Graphs4Covid-19 initiative, which aims at alleviating the global COVID-19 pandemic. ...
... These systems are mostly native, in the sense that they process a graph stored using dedicated graph data structures that optimize graph query operations such as graph pattern matching and graph traversals. However, native graph database management systems do not show good scalability for large query workloads [RWE15; Jun+17]. Indeed, most graph databases are centralized and require representing the graph in main memory to maintain nodes and references to their adjacent nodes. On the other side, distributed graph computing engines (e.g., Apache Giraph and Spark GraphX) can scale, but they are dedicated only to iterative graph analytics via implementing graph algorithms (e.g., PageRank, Triangle Counting, Connected Components) [Bat+15; Jun+17]. ...
... However, these systems are not dedicated to large declarative graph query processing. ...
Thesis
Full-text available
The thesis investigates how to enable prescriptive analytics via ranking criteria. We designed a PPA framework (called "Bench-Ranking") that employs several Single-Dimensional (SD) and Multi-Dimensional (MD) ranking criteria for ranking system performance across several experimental dimensions. Bench-Ranking provides an accurate yet simple way to support practitioners in their evaluation tasks, even in the presence of trade-offs between dimensions. Finally, the thesis provides evaluation metrics for assessing the efficiency of the proposed ranking criteria.
... Graph databases have a long history of development and focus in both academia and in the industry, and there has been significant work on them [9,10,58,69,80,91,100]. A lot of research has been dedicated to graph query languages [8,40], graph database management [40,89,115,128,132], compression in graph databases and graph data models [19,27,31,34,107,109,120], execution in novel environments such as the serverless setting [54,111,158], and others. Many graph databases exist [1-3, 7, 11-14, 37, 45, 46, 48, 56, 61, 62, 64, 67, 92, 93, 112-114, 122-126, 135, 136, 138, 150, 153, 154, 157, 184, 192, 193]. ...
Preprint
Graph databases (GDBs) enable processing and analysis of unstructured, complex, rich, and usually vast graph datasets. Despite the large significance of GDBs in both academia and industry, little effort has been made to integrate them with the predictive power of graph neural networks (GNNs). In this work, we show how to seamlessly combine nearly any GNN model with the computational capabilities of GDBs. For this, we observe that the majority of these systems are based on, or support, a graph data model called the Labeled Property Graph (LPG), where vertices and edges can have arbitrarily complex sets of labels and properties. We then develop LPG2vec, an encoder that transforms an arbitrary LPG dataset into a representation that can be directly used with a broad class of GNNs, including convolutional, attentional, message-passing, and even higher-order or spectral models. In our evaluation, we show that the rich information represented as LPG labels and properties is properly preserved by LPG2vec, and that it increases the accuracy of predictions by up to 34% compared to graphs with no LPG labels/properties, regardless of the targeted learning task or the used GNN model. In general, LPG2vec enables combining the predictive power of the most powerful GNNs with the full scope of information encoded in the LPG model, paving the way for neural graph databases, a class of systems where the vast complexity of maintained data will benefit from modern and future graph machine learning methods.
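A toy version of the encoding idea described in this abstract: labels become one-hot indicators and numeric properties become feature slots, yielding vectors a standard GNN can consume. The function below is a hypothetical simplification, not the actual LPG2vec encoder.

```python
# Encode an LPG vertex as a flat feature vector for a GNN.
def encode_vertex(vertex, all_labels, numeric_keys):
    # One-hot indicator per known label.
    label_part = [1.0 if lbl in vertex["labels"] else 0.0 for lbl in all_labels]
    # One slot per known numeric property (0.0 if absent).
    prop_part = [float(vertex["props"].get(k, 0.0)) for k in numeric_keys]
    return label_part + prop_part

v = {"labels": {"Person", "Customer"}, "props": {"age": 42}}
print(encode_vertex(v, ["Person", "Customer", "Company"], ["age"]))
# [1.0, 1.0, 0.0, 42.0]
```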
... This concept is mainly used to simplify the data, for example by replacing a substructure of the graph. Moreover, this concept allows the abstraction of detailed structure to provide data, unstructured graphs, and relevant attributes for understanding the data (Junghanns et al. 2017; Thompson and Langley 1991). There are five types of data mining: classification analysis, association rule learning, outlier detection, cluster analysis, and regression analysis (Kesavaraj et al. 2013). ...
Article
Full-text available
In recent years, graph-based data mining (GDM) has become a widely studied research area due to numerous applications in a broad range of fields, including software bug localization, computational biology, computer networking, and keyword searching. Moreover, graph data are subject to uncertainty because of the incompleteness and vagueness of the data. Mining uncertain graphs is more challenging and semantically different from mining exact data. The main problems of GDM are mining uncertain graph data and subgraph pattern frequency. This paper discusses different techniques related to GDM, their complexities, and different graph sizes, and also investigates the datasets used for GDM and GDM techniques such as clustering analysis and anomaly detection. GDM is introduced to improve the performance of online learning systems. Additionally, the study covers the algorithms used for GDM, the datasets, and their advantages and disadvantages. In the end, future directions to enrich online learning based on the results of GDM are discussed. Performance metrics of different techniques such as accuracy, precision, recall, F-measure, and runtime are observed. Finally, the survey concludes with a discussion of the overall performance of graph-based data mining.
... There is a large spectrum of analysis forms for graph data, ranging from graph queries to find certain patterns (e.g., biological pathways), over graph mining (e.g., to rank websites or detect communities in social graphs), to machine learning on graph data, e.g., to predict new relations. Graphs are often large and heterogeneous, with millions or billions of vertices and edges of different types, making the efficient implementation and execution of graph algorithms challenging [42,73]. Furthermore, the structure and contents of graphs and networks mostly change over time, making it necessary to continuously evolve the graph data and to support temporal graph analysis instead of being limited to the analysis of static graph data snapshots. ...
... Two major categories of systems focus on the management and analysis of graph data: graph database systems and distributed graph processing systems [42]. A closer look at both categories and their strengths and weaknesses is given in Sect. 2. Graph database systems are typically less suited for high-volume data analysis and graph mining [31,53,75], as they often do not support distributed processing on partitioned graphs, which limits the maximum graph size and graph analysis to the resources of a single machine. ...
... By contrast, distributed graph processing systems support high scalability and parallel graph processing but typically lack an expressive graph data model and declarative query support [13,42]. In particular, the latter makes it difficult for users to formulate complex analytical tasks, as this requires profound programming and system knowledge. ...
Article
Full-text available
Temporal property graphs are graphs whose structure and properties change over time. Temporal graph datasets tend to be large due to stored historical information, calling for scalable analysis capabilities. We give a complete overview of Gradoop, a graph dataflow system for scalable, distributed analytics of temporal property graphs, which has been continuously developed since 2015. Its graph model TPGM allows bitemporal modeling not only of vertices and edges but also of graph collections. A declarative analytical language called GrALa allows analysts to flexibly define analytical graph workflows by composing different operators that support temporal graph analysis. Built on a distributed dataflow system, large temporal graphs can be processed on a shared-nothing cluster. We present the system architecture of Gradoop and its data model TPGM with composable temporal graph operators, such as snapshot, difference, pattern matching and graph grouping, together with several implementation details. We evaluate the performance and scalability of selected operators and a composed workflow for synthetic and real-world temporal graphs with up to 283 M vertices and 1.8 B edges, and a graph lifetime of about 8 years with up to 20 M new edges per year. We also reflect on lessons learned from the Gradoop effort.
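As a minimal illustration of one of the named operators, a snapshot can be computed by keeping only elements whose validity interval contains the query time; this sketch uses plain Python lists, whereas Gradoop's operators work on distributed datasets.

```python
# A temporal 'snapshot' operator: keep elements valid at time t.
def snapshot(vertices, edges, t):
    vs = {v["id"] for v in vertices if v["from"] <= t < v["to"]}
    es = [e for e in edges
          if e["from"] <= t < e["to"] and e["src"] in vs and e["dst"] in vs]
    return vs, es

vertices = [{"id": 1, "from": 0, "to": 10}, {"id": 2, "from": 3, "to": 7}]
edges = [{"src": 1, "dst": 2, "from": 4, "to": 6}]
print(snapshot(vertices, edges, 5))   # ({1, 2}, [the single edge])
print(snapshot(vertices, edges, 8))   # ({1}, [])
```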
... A graph is a data structure for representing the multiple relationships between objects [1,2]. In many recent applications, dynamic graphs are generated in which the vertices and edges constituting the graph are constantly changing [3][4][5]. ...
Article
Full-text available
Incremental graph processing has been developed to reduce unnecessary redundant calculations on dynamic graphs. In this paper, we propose an incremental dynamic graph processing scheme that uses a cost model to selectively perform incremental or static processing. The cost model predicts the detection cost and the processing cost of the recalculation region based on the past processing history. If the cost model predicts a benefit, incremental query processing is performed; otherwise, static query processing is performed, because the detection and processing costs increase as the graph changes. The proposed incremental scheme reduces the amount of computation by processing only the changed region. Further, it reduces the detection and disk I/O costs of vertices, which are computed by reusing subgraphs from previous results. The processing structure of the proposed scheme stores the data read from the cache and the adjacent vertices, and then performs only memory mapping when processing these graphs. Various performance evaluations demonstrate that the proposed scheme outperforms existing schemes.
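The decision logic of such a cost model can be sketched in a few lines, with invented cost terms standing in for the paper's history-based estimates:

```python
# Cost-model decision: incremental processing pays off only when predicted
# detection plus incremental cost undercuts a full static run.
def choose_strategy(est_detection_cost, est_incremental_cost, est_static_cost):
    if est_detection_cost + est_incremental_cost < est_static_cost:
        return "incremental"
    return "static"

# Hypothetical estimates: a small change region favors incremental
# processing, a large one favors static reprocessing.
print(choose_strategy(est_detection_cost=5, est_incremental_cost=20,
                      est_static_cost=100))   # incremental
print(choose_strategy(est_detection_cost=40, est_incremental_cost=90,
                      est_static_cost=100))   # static
```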
... The property graph query language Cypher is used commercially by different multinational companies and by researchers for data mining. Cypher uses an ASCII-art notation [9] to represent distinct patterns in the data set. Nodes (e: entity) and relationships (r: relationship) are critical for expressing complex questions in Cypher queries [27]. ...
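For illustration, the following runs an ASCII-art-style Cypher pattern through the official Neo4j Python driver; the connection details and the Entity/RELATED_TO schema are assumptions made for this example.

```python
# Sending a Cypher pattern query via the neo4j Python driver.
from neo4j import GraphDatabase

# Connection details are placeholders for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
query = """
MATCH (e:Entity)-[r:RELATED_TO]->(other:Entity)
WHERE e.name = $name
RETURN other.name AS related
"""
with driver.session() as session:
    for record in session.run(query, name="Acme"):
        print(record["related"])
driver.close()
```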
... Junghanns et al. [9] have described the analysis of big graph data and surveyed extensive data systems in three categories: graph database systems, distributed graph processing systems, and distributed graph platforms. Since this was a less expressive model, the Gradoop distributed dataflow approach was introduced with an extended property graph model. ...
Article
Full-text available
Over the last few decades, graphs have become increasingly important in many applications and domains for managing Big data. Big data analysis in a graph database is described as an analysis of exponentially increasing massive interconnected data concerning time. However, analyzing big connected data in social networks and synthetic identity detection is challenging. In previous approaches, fraud detection has been done on the complete graph data, which is a time-consuming process and will create bottlenecks while query execution. To overcome the issue, this paper proposes a new fraud detection technique to unveil synthetic identities involved in the Panama Paper leak dataset (unprecedented leak of 11.5 m data from the database of the world’s fourth-biggest offshore law arm, Mossack Fonseca) using a Node rank-based fraud detection algorithm by integrating distributed data profiling techniques on a minimized graph by minimizing the least influential nodes. The proposed model is verified on the three nodes cluster to improve data scalability, reduce the query execution time by an average of 30–36% and finally reduce the fraud detection time by 18.2%.