Conference Paper

The Graph Database Interface: Scaling Online Transactional and Analytical Graph Workloads to Hundreds of Thousands of Cores

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the authors.

... In recent years, the continuous growth in data volume and the expansion of application scenarios have progressively exposed the limitations of traditional database technologies in handling ultra-large scale data. Traditional databases encounter bottlenecks in data storage, diminished query efficiency, and performance constraints in scenarios that demand high real-time responsiveness, low latency, and high computational efficiency [1]. These limitations become particularly pronounced when processing complex topological structures and relational queries. ...
... where mips i indicates the processing speed of the CPU, cpu i denotes the number of CPU cores, mem i the size of the memory, and disk i the disk capacity. If task t i is scheduled to execute on server v k , the execution time for each task, denoted as t i , is outlined in Eq. (1). ...
... c) Randomly sample a minibatch of transitions (s, a, r, s ) label from the experience replay unit. d) For each sampled transition (s, a, r, s ) label : 1 If s is a terminal state, set TD target = r. 2 Otherwise, the evaluation network Q(s, a, θ) is used to find the action a that produces the maximum Q value as shown in Eq. (9): ...
Article
Full-text available
This paper focuses on the scheduling problem of workflow tasks that exhibit interdependencies. Unlike independent batch tasks, workflows typically consist of multiple subtasks with intrinsic correlations and dependencies. It necessitates the distribution of various computational tasks to appropriate computing node resources in accordance with task dependencies to ensure the smooth completion of the entire workflow. Workflow scheduling must consider an array of factors, including task dependencies, availability of computational resources, and the schedulability of tasks. Therefore, this paper delves into the distributed graph database workflow task scheduling problem and proposes a workflow scheduling methodology based on deep reinforcement learning (DRL). The method optimizes the maximum completion time (makespan) and response time of workflow tasks, aiming to enhance the responsiveness of workflow tasks while ensuring the minimization of the makespan. The experimental results indicate that the Q-learning Deep Reinforcement Learning (Q-DRL) algorithm markedly diminishes the makespan and refines the average response time within distributed graph database environments. In quantifying makespan, Q-DRL achieves mean reductions of 12.4% and 11.9% over established First-fit and Random scheduling strategies, respectively. Additionally, Q-DRL surpasses the performance of both DRL-Cloud and Improved Deep Q-learning Network (IDQN) algorithms, with improvements standing at 4.4% and 2.6%, respectively. With reference to average response time, the Q-DRL approach exhibits a significantly enhanced performance in the scheduling of workflow tasks, decreasing the average by 2.27% and 4.71% when compared to IDQN and DRL-Cloud, respectively. The Q-DRL algorithm also demonstrates a notable increase in the efficiency of system resource utilization, reducing the average idle rate by 5.02% and 9.30% in comparison to IDQN and DRL-Cloud, respectively. These findings support the assertion that Q-DRL not only upholds a lower average idle rate but also effectively curtails the average response time, thereby substantially improving processing efficiency and optimizing resource utilization within distributed graph database systems.
... Finally, the Graph Database Interface (GDI) [170] has recently been proposed to deliver a toolbox for designing a scalable and high-performance data access and transaction layer for general graph databases that can also be used to maintain knowledge graphs. Its RDMA-based implementation, GDI-RMA [171], has been shown to scale to more than a hundred thousand compute cores and to label-and property-rich graphs with more than 500 billion edges. The key mechanisms used for high performance are one-sided non-blocking RDMA communication, hardware-accelerated network atomic operations, and collective communication, a form of group communication that has been tuned over decades by the MPI community [177]. ...
... On the other hand, RDMA is usually more complex to program and maintain. This, however, has been alleviated with efforts such as the one-sided communication within MPI [137], [177] or by GDI [170], [171]. ...
Preprint
Knowledge graphs (KGs) have achieved significant attention in recent years, particularly in the area of the Semantic Web as well as gaining popularity in other application domains such as data mining and search engines. Simultaneously, there has been enormous progress in the development of different types of heterogeneous hardware, impacting the way KGs are processed. The aim of this paper is to provide a systematic literature review of knowledge graph hardware acceleration. For this, we present a classification of the primary areas in knowledge graph technology that harnesses different hardware units for accelerating certain knowledge graph functionalities. We then extensively describe respective works, focusing on how KG related schemes harness modern hardware accelerators. Based on our review, we identify various research gaps and future exploratory directions that are anticipated to be of significant value both for academics and industry practitioners.
... If we take a step back, this is one solution to the problem of how to store data that allows efficient concurrent reads. This is exactly what is solved by graph databases that support horizontal scaling for OLAP transactions [44]. The partitioning solution is commonly referred to as graph sharding [45,46], while simply duplicating copies of the data is another viable option. ...
Preprint
Full-text available
Training graph neural networks on large datasets has long been a challenge. Traditional approaches include efficiently representing the whole graph in-memory, designing parameter efficient and sampling-based models, and graph partitioning in a distributed setup. Separately, graph databases with native graph storage and query engines have been developed, which enable time and resource efficient graph analytics workloads. We show how to directly train a GNN on a graph DB, by retrieving minimal data into memory and sampling using the query engine. Our experiments show resource advantages for single-machine and distributed training. Our approach opens up a new way of scaling GNNs as well as a new application area for graph DBs.
Article
We introduce Graph of Thoughts (GoT): a framework that advances prompting capabilities in large language models (LLMs) beyond those offered by paradigms such as Chain-of-Thought or Tree of Thoughts (ToT). The key idea and primary advantage of GoT is the ability to model the information generated by an LLM as an arbitrary graph, where units of information ("LLM thoughts") are vertices, and edges correspond to dependencies between these vertices. This approach enables combining arbitrary LLM thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts using feedback loops. We illustrate that GoT offers advantages over state of the art on different tasks, for example increasing the quality of sorting by 62% over ToT, while simultaneously reducing costs by >31%. We ensure that GoT is extensible with new thought transformations and thus can be used to spearhead new prompting schemes. This work brings the LLM reasoning closer to human thinking or brain mechanisms such as recurrence, both of which form complex networks
Article
Full-text available
Cellular regulation is inherently complex, and one particular cellular function is often controlled by a cascade of different types of regulatory interactions. For example, the activity of a transcription factor (TF), which regulates the expression level of downstream genes through transcriptional regulation, can be regulated by small molecules through compound–protein interactions. To identify such complex regulatory cascades, traditional relational databases require ineffective additional operations and are computationally expensive. In contrast, graph databases are purposefully developed to execute such deep searches efficiently. Here, we present ERMer (E. coli Regulation Miner), the first cloud platform for mining the regulatory landscape of Escherichia coli based on graph databases. Combining the AWS Neptune graph database, AWS lambda function, and G6 graph visualization engine enables quick search and visualization of complex regulatory cascades/patterns. Users can also interactively navigate the E. coli regulatory landscape through ERMer. Furthermore, a Q&A module is included to showcase the power of graph databases in answering complex biological questions through simple queries. The backend graph model can be easily extended as new data become available. In addition, the framework implemented in ERMer can be easily migrated to other applications or organisms. ERMer is available at https://ermer.biodesign.ac.cn/.
Conference Paper
Full-text available
We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning, that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals, and strikes a careful balance between path query expressivity and evaluation complexity
Article
Full-text available
Data analysis, data management, and big data play a major role in both social and business perspective, in the last decade. Nowadays, the graph database is the hottest and trending research topic. A graph database is preferred to deal with the dynamic and complex relationships in connected data and offer better results. Every data element is represented as a node. For example, in social media site, a person is represented as a node, and its properties name, age, likes, and dislikes, etc and the nodes are connected with the relationships via edges. Use of graph database is expected to be beneficial in business, and social networking sites that generate huge unstructured data as that Big Data requires proper and efficient computational techniques to handle with. This paper reviews the existing graph data computational techniques and the research work, to offer the future research line up in graph database management. © Copyright 2018 Institute of Advanced Engineering and Science. All rights reserved.
Article
Full-text available
A graph database is a database where the data structures for the schema and/or instances are modeled as a (labeled)(directed) graph or generalizations of it, and where querying is expressed by graph-oriented operations and type constructors. In this article we present the basic notions of graph databases, give an historical overview of its main development, and study the main current systems that implement them.
Article
Full-text available
Machine learning on graphs is an important and ubiquitous task with applications ranging from drug design to friendship recommendation in social networks. The primary challenge in this domain is finding a way to represent, or encode, graph structure so that it can be easily exploited by machine learning models. Traditionally, machine learning approaches relied on user-defined heuristics to extract features encoding structural information about a graph (e.g., degree statistics or kernel functions). However, recent years have seen a surge in approaches that automatically learn to encode graph structure into low-dimensional embeddings, using techniques based on deep learning and nonlinear dimensionality reduction. Here we provide a conceptual review of key advancements in this area of representation learning on graphs, including matrix factorization-based methods, random-walk based algorithms, and graph convolutional networks. We review methods to embed individual nodes as well as approaches to embed entire (sub)graphs. In doing so, we develop a unified framework to describe these recent approaches, and we highlight a number of important applications and directions for future work.
Conference Paper
Full-text available
We reduce the cost of communication and synchronization in graph processing by analyzing the fastest way to process graphs: pushing the updates to a shared state or pulling the updates to a private state. We investigate the applicability of this push-pull dichotomy to various algorithms and its impact on complexity, performance, and the amount of used locks, atomics, and reads/writes. We consider 11 graph algorithms, 3 programming models, 2 graph abstractions, and various families of graphs. The conducted analysis illustrates surprising differences between push and pull variants of different algorithms in performance, speed of convergence, and code complexity; the insights are backed up by performance data from hardware counters. We use these findings to illustrate which variant is faster for each algorithm and to develop generic strategies that enable even higher speedups. Our insights can be used to accelerate graph processing engines or libraries on both massively-parallel shared-memory machines as well as distributed-memory systems.
Chapter
Full-text available
Many big data applications in business and science require the management and analysis of huge amounts of graph data. Suitable systems to manage and to analyze such graph data should meet a number of challenging requirements including support for an expressive graph data model with heterogeneous vertices and edges, powerful query and graph mining capabilities, ease of use as well as high performance and scalability. In this chapter, we survey current system approaches for management and analysis of “big graph data”. We discuss graph database systems, distributed graph processing systems such as Google Pregel and its variations, and graph dataflow approaches based on Apache Spark and Flink. We further outline a recent research framework called Gradoop that is build on the so-called Extended Property Graph Data Model with dedicated support for analyzing not only single graphs but also collections of graphs. Finally, we discuss current and future research challenges.
Article
Full-text available
In this paper we introduce LDBC Graphalytics, a new industrial-grade benchmark for graph analysis platforms. It consists of six deterministic algorithms, standard datasets, synthetic dataset generators, and reference output, that enable the objective comparison of graph analysis platforms. Its test harness produces deep metrics that quantify multiple kinds of system scalability, such as horizontal/vertical and weak/strong, and of robustness, such as failures and performance variability. The benchmark comes with open-source software for generating data and monitoring performance. We describe and analyze six implementations of the benchmark (three from the community, three from the industry), providing insights into the strengths and weaknesses of the platforms. Key to our contribution, vendors perform the tuning and benchmarking of their platforms.
Conference Paper
Full-text available
The Linked Data Benchmark Council (LDBC) is now two years underway and has gathered strong industrial participation for its mission to establish benchmarks, and benchmarking practices for evaluating graph data management systems. The LDBC introduced a new choke-point driven methodology for developing benchmark workloads, which combines user input with input from expert systems architects, which we outline. This paper describes the LDBC Social Network Benchmark (SNB), and presents database benchmarking innovation in terms of graph query functionality tested, correlated graph generation techniques, as well as a scalable benchmark driver on a workload with complex graph dependencies. SNB has three query workloads under development: Interactive, Business Intelligence, and Graph Algorithms. We describe the SNB Interactive Workload in detail and illustrate the workload with some early results, as well as the goals for the two other workloads.
Conference Paper
Full-text available
We propose Atomic Active Messages (AAM), a mechanism that accelerates irregular graph computations on both shared- and distributed-memory machines. The key idea behind AAM is that hardware transactional memory (HTM) can be used for simple and efficient processing of irregular structures in highly parallel environments. We illustrate techniques such as coarsening and coalescing that enable hardware transactions to considerably accelerate graph processing. We conduct a detailed performance analysis of AAM on Intel Haswell and IBM Blue Gene/Q and we illustrate various performance tradeoffs between different HTM parameters that impact the efficiency of graph processing. AAM can be used to implement abstractions offered by existing programming models and to improve the performance of irregular graph processing codes such as Graph500 or Galois.
Conference Paper
Full-text available
Real world data offers a lot of possibilities to be represented as graphs. As a result we obtain undirected or directed graphs, multigraphs and hypergraphs, labelled or weighted graphs and their variants. A development of graph modelling brings also new approaches, e.g., considering constraints. Processing graphs in a database way can be done in many different ways. Some graphs can be represented as JSON or XML structures and processed by their native database tools. More generally, a graph database is specified as any storage system that provides index-free adjacency, i.e. an explicit graph structure. Graph database technology contains some technological features inherent to traditional databases, e.g. ACID properties and availability. Use cases of graph databases like Neo4j, OrientDB, InfiniteGraph, FlockDB, AllegroGraph, and others, document that graph databases are becoming a common means for any connected data. In Big Data era, important questions are connected with scalability for large graphs as well as scaling for read/write operations. For example, scaling graph data by distributing it in a network is much more difficult than scaling simpler data models and is still a work in progress. Still a challenge is pattern matching in graphs providing, in principle, an arbitrarily complex identity function. Mining complete frequent patterns from graph databases is also challenging since supporting operations are computationally costly. In this paper, we discuss recent advances and limitations in these areas as well as future directions.
Conference Paper
Full-text available
This paper analyzes the potential of Semantic Web technologies to support innovation in industrial scenarios. The study focuses in particular on RDF stores, the software components dedicated to the storage, representation and retrieval of semantic information. Starting from a literature review, a qualitative analysis is carried out considering a set of these systems. RDF stores are evaluated to find the implementations that are best suited to play the role of backbone in a software architecture sharing information between the software tools adopted by the various stakeholders. This can be achieved if the architecture overcomes the problems deriving from the lack of integration between the involved software applications, providing in this way an integrated view on relevant engineering knowledge.
Article
Full-text available
The Linked Data Benchmark Council (LDBC) is an EU project that aims to develop industry-strength benchmarks for graph and RDF data management systems. It includes the creation of a non-profit LDBC organization, where industry players and academia come together for managing the development of benchmarks as well as auditing and publishing official results. We present an overview of the project including its goals and organization, and describe its process and design methodology for benchmark development. We introduce so-called “choke-point” based benchmark development through which experts identify key technical challenges, and introduce them in the benchmark workload. Finally, we present the status of two benchmarks currently in development, one targeting graph data management systems using a social network data case, and the other targeting RDF systems using a data publishing case.
Conference Paper
Full-text available
We present the Koblenz Network Collection (KONECT), a project to collect network datasets in the areas of web science, network science and related areas, as well as provide tools for their analysis. In the cited areas, a surprisingly large number of very heterogeneous data can be modeled as networks and consequently, a unified representation of networks can be used to gain insight into many kinds of problems. Due to the emergence of the World Wide Web in the last decades many such datasets are now openly available. The KONECT project thus has the goal of collecting many diverse network datasets from the Web, and providing a way for their systematic study. The main parts of KONECT are (1) a collection of over 160 network datasets, consisting of directed, undirected, unipartite, bipartite, weighted, unweighted, signed and temporal networks collected from the Web, (2) a Matlab toolbox for network analysis and (3) a website giving a compact overview the various computed statistics and plots. In this paper, we describe KONECT's taxonomy of networks datasets, give an overview of the datasets included, review the supported statistics and plots, and briefly discuss KONECT's role in the area of web science and network science.
Conference Paper
Full-text available
Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communications albeit their unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly, however, it's scalability and practicability has to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.
Article
Numerous irregular graph datasets, for example social networks or web graphs, may contain even trillions of edges. Often, their structure changes over time and they have domain-specific rich data associated with vertices and edges. Graph database systems such as Neo4j enable storing, processing, and analyzing such large, evolving, and rich datasets. Due to the sheer size and irregularity of such datasets, these systems face unique design challenges. To facilitate the understanding of this emerging domain, we present the first survey and taxonomy of graph database systems. We focus on identifying and analyzing fundamental categories of these systems (e.g., document stores, tuple stores, native graph database systems, or object-oriented systems), the associated graph models (e.g., RDF or Labeled Property Graph), data organization techniques (e.g., storing graph data in indexing structures or dividing data into records), and different aspects of data distribution and query execution (e.g., support for sharding and ACID). 51 graph database systems are presented and compared, including Neo4j, OrientDB, and Virtuoso. We outline graph database queries and relationships with associated domains (NoSQL stores, graph streaming, and dynamic graph algorithms). Finally, we outline future research and engineering challenges related to graph databases.
Article
Most products at ByteDance, e.g., TikTok, Douyin, and Toutiao, naturally generate massive amounts of graph data. To efficiently store, query and update massive graph data is challenging for the broad range of products at ByteDance with various performance requirements. We categorize graph workloads at ByteDance into three types: online analytical, transaction, and serving processing, where each workload has its own characteristics. Existing graph databases have different performance bottlenecks in handling these workloads and none can efficiently handle the scale of graphs at ByteDance. We developed ByteGraph to process these graph workloads with high throughput, low latency and high scalability. There are several key designs in ByteGraph that make it efficient for processing our workloads, including edge-trees to store adjacency lists for high parallelism and low memory usage, adaptive optimizations on thread pools and indexes, and geographic replications to achieve fault tolerance and availability. ByteGraph has been in production use for several years and its performance has shown to be robust for processing a wide range of graph workloads at ByteDance.
Article
Graph transaction processing poses unique challenges such as random data access due to the irregularity of graph structures, low throughput and high abort rate due to the relatively large read/write sets in graph transactions. To address these challenges, we present G-Tran, a remote direct memory access (RDMA)-enabled distributed in-memory graph database with serializable and snapshot isolation support. First, we propose a graph-native data store to achieve good data locality and fast data access for transactional updates and queries. Second, G-Tran adopts a fully decentralized architecture that leverages RDMA to process distributed transactions with the massively parallel processing (MPP) model, which can achieve high performance by utilizing all computing resources. In addition, we propose a new multi-version optimistic concurrency control (MV-OCC) protocol with two optimizations to address the issue of large read/write sets in graph transactions. Extensive experiments show that G-Tran achieves competitive performance compared with other popular graph databases on benchmark workloads.
Article
Lots of learning tasks require dealing with graph data which contains rich relation information among elements. Modeling physics systems, learning molecular fingerprints, predicting protein interface, and classifying diseases demand a model to learn from graph inputs. In other domains such as learning from non-structural data like texts and images, reasoning on extracted structures (like the dependency trees of sentences and the scene graphs of images) is an important research topic which also needs graph reasoning models. Graph neural networks (GNNs) are neural models that capture the dependence of graphs via message passing between the nodes of graphs. In recent years, variants of GNNs such as graph convolutional network (GCN), graph attention network (GAT), graph recurrent network (GRN) have demonstrated ground-breaking performances on many deep learning tasks. In this survey, we propose a general design pipeline for GNN models and discuss the variants of each component, systematically categorize the applications, and propose four open problems for future research.
Article
The specific characteristics of graph workloads make it hard to design a one-size-fits-all graph storage system. Systems that support transactional updates use data structures with poor data locality, which limits the efficiency of analytical workloads or even simple edge scans. Other systems run graph analytics workloads efficiently, but cannot properly support transactions. This paper presents LiveGraph, a graph storage system that outperforms both the best graph transactional systems and the best solutions for real-time graph analytics on fresh data. LiveGraph achieves this by ensuring that adjacency list scans, a key operation in graph workloads, are purely sequential: they never require random accesses even in presence of concurrent transactions. Such pure-sequential operations are enabled by combining a novel graph-aware data structure, the Transactional Edge Log (TEL), with a concurrency control mechanism that leverages TEL's data layout. Our evaluation shows that LiveGraph significantly outperforms state-of-the-art (graph) database solutions on both transactional and real-time analytical workloads.
Article
Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications, where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on the existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this article, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art GNNs into four categories, namely, recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial-temporal GNNs. We further discuss the applications of GNNs across various domains and summarize the open-source codes, benchmark data sets, and model evaluation of GNNs. Finally, we propose potential research directions in this rapidly growing field.
Article
Deep learning has been shown to be successful in a number of domains, ranging from acoustics, images, to natural language processing. However, applying deep learning to the ubiquitous graph data is non-trivial because of the unique characteristics of graphs. Recently, substantial research efforts have been devoted to applying deep learning methods to graphs, resulting in beneficial advances in graph analysis techniques. In this survey, we comprehensively review the different types of deep learning methods on graphs. We divide the existing methods into five categories based on their model architectures and training strategies: graph recurrent neural networks, graph convolutional networks, graph autoencoders, graph reinforcement learning, and graph adversarial methods. We then provide a comprehensive overview of these methods in a systematic manner mainly by following their development history. We also analyze the differences and compositions of different methods. Finally, we briefly outline the applications in which they have been used and discuss potential future research directions.
Conference Paper
The property graph (PG) model is one of the most general graph data model and has been widely adopted in many graph analytics and processing systems. However, existing systems suffer from poor performance in terms of both latency and throughput for processing online analytical workloads on PGs due to their design defects such as expensive interactions with external databases, low parallelism, and high network overheads. In this paper, we propose Grasper, a high performance distributed system for OLAP on property graphs. Grasper adopts RDMA-aware system designs to reduce the network communication cost. We propose a novel query execution model, called Expert Model, which supports adaptive parallelism control at the fine-grained query operation level and allows tailored optimizations for different categories of query operators, thus achieving high parallelism and good load balancing. Experimental results show that Grasper achieves low latency and high throughput on a broad range of online analytical workloads.
Article
The paper describes the present and the future of graph updates in Cypher, the language of the Neo4j property graph database and several other products. Update features include those with clear analogs in relational databases, as well as those that do not correspond to any relational operators. Moreover, unlike SQL, Cypher updates can be arbitrarily intertwined with querying clauses. After presenting the current state of update features, we point out their shortcomings, most notably violations of atomicity and non-deterministic behavior of updates. These have not been previously known in the Cypher community. We then describe the industry-academia collaboration on designing a revised set of Cypher update operations. Based on discovered shortcomings of update features, a number of possible solutions were devised. They were presented to key Cypher users, who were given the opportunity to comment on how update features are used in real life, and on their preferences for proposed fixes. As the result of the consultation, a new set of update operations for Cypher were designed. Those led to a streamlined syntax, and eliminated the unexpected and problematic behavior that original Cypher updates exhibited.
Article
Natural graphs with skewed distributions raise unique challenges to distributed graph computation and partitioning. Existing graph-parallel systems usually use a “one-size-fits-all” design that uniformly processes all vertices, which either suffer from notable load imbalance and high contention for high-degree vertices (e.g., Pregel and GraphLab) or incur high communication cost and memory consumption even for low-degree vertices (e.g., PowerGraph and GraphX). In this article, we argue that skewed distributions in natural graphs also necessitate differentiated processing on high-degree and low-degree vertices. We then introduce PowerLyra, a new distributed graph processing system that embraces the best of both worlds of existing graph-parallel systems. Specifically, PowerLyra uses centralized computation for low-degree vertices to avoid frequent communications and distributes the computation for high-degree vertices to balance workloads. PowerLyra further provides an efficient hybrid graph partitioning algorithm (i.e., hybrid-cut) that combines edge-cut (for low-degree vertices) and vertex-cut (for high-degree vertices) with heuristics. To improve cache locality of inter-node graph accesses, PowerLyra further provides a locality-conscious data layout optimization. PowerLyra is implemented based on the latest GraphLab and can seamlessly support various graph algorithms running in both synchronous and asynchronous execution modes. A detailed evaluation on three clusters using various graph-analytics and MLDM (Machine Learning and Data Mining) applications shows that PowerLyra outperforms PowerGraph by up to 5.53X (from 1.24X) and 3.26X (from 1.49X) for real-world and synthetic graphs, respectively, and is much faster than other systems like GraphX and Giraph, yet with much less memory consumption. A porting of hybrid-cut to GraphX further confirms the efficiency and generality of PowerLyra.
Conference Paper
In this short paper, we provide an early look at the LDBC Social Network Benchmark's Business Intelligence (BI) workload which tests graph data management systems on a graph business analytics workload. Its queries involve complex aggregations and navigations (joins) that touch large data volumes, which is typical in BI workloads, yet they depend heavily on graph functionality such as connectivity tests and path finding. We outline the motivation for this new benchmark, which we derived from many interactions with the graph database industry and its users, and situate it in a scenario of social network analysis. The workload was designed by taking into account technical "chokepoints" identified by database system architects from academia and industry, which we also describe and map to the queries. We present reference implementations in openCypher, PGQL, SPARQL, and SQL, and preliminary results of SNB BI on a number of graph data management systems.
Article
We survey foundational features underlying modern graph query languages. We first discuss two popular graph data models: edge-labelled graphs, where nodes are connected by directed, labelled edges, and property graphs, where nodes and edges can further have attributes. Next we discuss the two most fundamental graph querying functionalities: graph patterns and navigational expressions. We start with graph patterns, in which a graph-structured query is matched against the data. Thereafter, we discuss navigational expressions, in which patterns can be matched recursively against the graph to navigate paths of arbitrary length; we give an overview of what kinds of expressions have been proposed and how they can be combined with graph patterns. We also discuss several semantics under which queries using the previous features can be evaluated, what effects the selection of features and semantics has on complexity, and offer examples of such features in three modern languages that are used to query graphs: SPARQL, Cypher, and Gremlin. We conclude by discussing the importance of formalisation for graph query languages; a summary of what is known about SPARQL, Cypher, and Gremlin in terms of expressivity and complexity; and an outline of possible future directions for the area.
Conference Paper
Many applications detect the emergence or deletion of certain subgraphs in their input graphs continuously. In order to evaluate such continuous subgraph queries, these applications resort to inefficient or highly specialized solutions because existing graph databases are passive systems that only support one-time subgraph queries. We demonstrate Graphflow, a prototype active graph data-base that evaluates general one-time and continuous subgraph queries. Graphflow supports the property graph data model and the Cypher++ query language, which extends Neo4j's declarative Cypher language with subgraph-condition-action triggers. At the core of Graphflow's query processor are two worst-case optimal join algorithms called Generic Join and our new Delta Generic Join algorithm for one-time and continuous subgraph queries, respectively.
Conference Paper
In this paper, we describe our experiences and lessons learned from building a general-purpose in-memory key-value middleware, called HydraDB. HydraDB synthesizes a collection of state-of-the-art techniques, including continuous fault-tolerance, Remote Direct Memory Access (RDMA), as well as awareness for multicore systems, etc, to deliver a high-throughput, low-latency access service in a reliable manner for cluster computing applications. The uniqueness of HydraDB mainly lies in its design commitment to fully exploit the RDMA protocol to comprehensively optimize various aspects of a general-purpose key-value store, including latency-critical operations, read enhancement, and data replications for high-availability service, etc. At the same time, HydraDB strives to efficiently utilize multicore systems to prevent data manipulation on the servers from curbing the potential of RDMA. Many teams in our organization have adopted HydraDB to improve the execution of their cluster computing frameworks, including Hadoop, Spark, Sensemaking analytics, and Call Record Processing. In addition, our performance evaluation with a variety of YCSB workloads also shows that HydraDB can substantially outperform several existing in-memory key-value stores by an order of magnitude. Our detailed performance evaluation further corroborates our design choices.
Article
Many signal processing problems involve data whose underlying structure is non-Euclidean, but may be modeled as a manifold or (combinatorial) graph. For instance, in social networks, the characteristics of users can be modeled as signals on the vertices of the social graph. Sensor networks are graph models of distributed interconnected sensors, whose readings are modelled as time-dependent signals on the vertices. In genetics, gene expression data are modeled as signals defined on the regulatory network. In neuroscience, graph models are used to represent anatomical and functional structures of the brain. Modeling data given as points in a high-dimensional Euclidean space using nearest neighbor graphs is an increasingly popular trend in data science, allowing practitioners access to the intrinsic structure of the data. In computer graphics and vision, 3D objects are modeled as Riemannian manifolds (surfaces) endowed with properties such as color texture. Even more complex examples include networks of operators, e.g., functional correspondences or difference operators in a collection of 3D shapes, or orientations of overlapping cameras in multi-view vision ("structure from motion") problems. The complexity of geometric data and the availability of very large datasets (in the case of social networks, on the scale of billions) suggest the use of machine learning techniques. In particular, deep learning has recently proven to be a powerful tool for problems with large datasets with underlying Euclidean structure. The purpose of this paper is to overview the problems arising in relation to geometric deep learning and present solutions existing today for this class of problems, as well as key difficulties and future research directions.
Conference Paper
Measuring and reporting performance of parallel computers constitutes the basis for scientific advancement of high-performance computing (HPC). Most scientific reports show performance improvements of new techniques and are thus obliged to ensure reproducibility or at least interpretability. Our investigation of a stratified sample of 120 papers across three top conferences in the field shows that the state of the practice is lacking. For example, it is often unclear if reported improvements are deterministic or observed by chance. In addition to distilling best practices from existing work, we propose statistically sound analysis and reporting techniques and simple guidelines for experimental design in parallel computing and codify them in a portable benchmarking library. We aim to improve the standards of reporting research results and initiate a discussion in the HPC field. A wide adoption of our minimal set of rules will lead to better interpretability of performance results and improve the scientific culture in HPC.
Article
In the era of big data, data analytics, business intelligence database management plays a vital role from technical business management and research point of view. Over many decades, database management has been a topic of active research. There are different type of database management system have been proposed over a period of time but Relational Database Management System (RDBMS) is the one which has been most popularly used in academic research as well as industrial setup[1]. In recent years, graph databases regained interest among the researchers for certain obvious reasons. One of the most important reasons for such an interest in a graph database is because of the inherent property of graphs as a graph structure. Graphs are present everywhere in the data structure, which represents the strong connectivity within the data. Most of the graph database models are defined in which data-structure for schema and instances are modeled as graph or generalization of a graph. In such graph database models, data manipulations are expressed by graph-oriented operations and type constructors [9]. Now days, most of the real world applications can be modeled as a graph and one of the best real world examples is social or biological network. This paper gives an overview of the different type of graph databases, applications, and comparison between their models based on some properties.
Article
Graph databases have become a common infrastructure component. Yet existing systems either operate on offline snapshots, provide weak consistency guarantees, or use expensive concurrency control techniques that limit performance. In this paper, we introduce a new distributed graph database, called Weaver, which enables efficient, transactional graph analyses as well as strictly serializable ACID transactions on dynamic graphs. The key insight that allows Weaver to combine strict serializability with horizontal scalability and high performance is a novel request ordering mechanism called refinable timestamps. This technique couples coarse-grained vector timestamps with a fine-grained timeline oracle to pay the overhead of strong consistency only when needed. Experiments show that Weaver enables a Bitcoin blockchain explorer that is 8x faster than Blockchain.info, and achieves 10.9x higher throughput than the Titan graph database on social network workloads and 4x lower latency than GraphLab on offline graph traversal workloads.
Article
Collective operations are among the most important communication operations in shared- and distributed-memory parallel applications. In this paper, we analyze the tradeoffs between energy, memory, and runtime of different algorithms to implement such operations. We show that each existing algorithms have varying behavior and no algorithm exists that is optimal in all three regards. We also show examples where of three different algorithms solving the same problem, each algorithm is best in a different metric. We conclude by posing the challenge to explore the resulting tradeoffs in a more structured manner.
Article
We address efficient processing of SPARQL queries over RDF datasets. The proposed techniques, incorporated into the gStore system, handle, in a uniform and scalable manner, SPARQL queries with wildcards and aggregate operators over dynamic RDF datasets. Our approach is graph based. We store RDF data as a large graph and also represent a SPARQL query as a query graph. Thus, the query answering problem is converted into a subgraph matching problem. To achieve efficient and scalable query processing, we develop an index, together with effective pruning rules and efficient search algorithms. We propose techniques that use this infrastructure to answer aggregation queries. We also propose an effective maintenance algorithm to handle online updates over RDF repositories. Extensive experiments confirm the efficiency and effectiveness of our solutions.
Conference Paper
Large-scale graph-structured computation is central to tasks ranging from targeted advertising to natural language processing and has led to the development of several graph-parallel abstractions including Pregel and GraphLab. However, the natural graphs commonly found in the real-world have highly skewed power-law degree distributions, which challenge the assumptions made by these abstractions, limiting performance and scalability. In this paper, we characterize the challenges of computation on natural graphs in the context of existing graph-parallel abstractions. We then introduce the PowerGraph abstraction which exploits the internal structure of graph programs to address these challenges. Leveraging the PowerGraph abstraction we introduce a new approach to distributed graph placement and representation that exploits the structure of power-law graphs. We provide a detailed analysis and experimental evaluation comparing PowerGraph to two popular graph-parallel systems. Finally, we describe three different implementation strategies for PowerGraph and discuss their relative merits with empirical evaluations on large-scale real-world problems demonstrating order of magnitude gains.
Article
The volume of RDF data continues to grow over the past decade and many known RDF datasets have billions of triples. A grant challenge of managing this huge RDF data is how to access this big RDF data efficiently. A popular approach to addressing the problem is to build a full set of permutations of (S, P, O) indexes. Although this approach has shown to accelerate joins by orders of magnitude, the large space overhead limits the scalability of this approach and makes it heavyweight. In this paper, we present TripleBit, a fast and compact system for storing and accessing RDF data. The design of TripleBit has three salient features. First, the compact design of TripleBit reduces both the size of stored RDF data and the size of its indexes. Second, TripleBit introduces two auxiliary index structures, ID-Chunk bit matrix and ID-Predicate bit matrix, to minimize the cost of index selection during query evaluation. Third, its query processor dynamically generates an optimal execution ordering for join queries, leading to fast query execution and effective reduction on the size of intermediate results. Our experiments show that TripleBit outperforms RDF-3X, MonetDB, BitMat on LUBM, UniProt and BTC 2012 benchmark queries and it offers orders of mangnitude performance improvement for some complex join queries.
Conference Paper
Database benchmarks are an important tool for database researchers and practitioners that ease the process of making informed comparisons between different database hardware, software and configurations. Large scale web services such as social networks are a major and growing database application area, but currently there are few benchmarks that accurately model web service workloads. In this paper we present a new synthetic benchmark called LinkBench. LinkBench is based on traces from production databases that store "social graph" data at Facebook, a major social network. We characterize the data and query workload in many dimensions, and use the insights gained to construct a realistic synthetic benchmark. LinkBench provides a realistic and challenging test for persistent storage of social and web service data, filling a gap in the available tools for researchers, developers and administrators.
Conference Paper
Higher global bandwidth requirement for many applications and lower network cost have motivated the use of the Dragonfly network topology for high performance computing systems. In this paper we present the architecture of the Cray Cascade system, a distributed memory system based on the Dragonfly [1] network topology. We describe the structure of the system, its Dragonfly network and the routing algorithms. We describe a set of advanced features supporting both mainstream high performance computing applications and emerging global address space programing models. We present a combination of performance results from prototype systems and simulation data for large systems. We demonstrate the value of the Dragonfly topology and the benefits obtained through extensive use of adaptive routing.
Article
Graphs appear in numerous applications including cyber security, the Internet, social networks, protein networks, recommendation systems, citation networks, and many more. Graphs with millions or even billions of nodes and edges are common-place. How to store such large graphs efficiently? What are the core operations/queries on those graph? How to answer the graph queries quickly? We propose Gbase, an efficient analysis platform for large graphs. The key novelties lie in (1) our storage and compression scheme for a parallel, distributed settings and (2) the carefully chosen graph operations and their efficient implementations. We designed and implemented an instance of Gbase using Mapreduce/Hadoop. Gbase provides a parallel indexing mechanism for graph operations that both saves storage space, as well as accelerates query responses. We run numerous experiments on real and synthetic graphs, spanning billions of nodes and edges, and we show that our proposed Gbase is indeed fast, scalable, and nimble, with significant savings in space and time.