Conference Paper

Pregel: A system for large-scale graph processing

Abstract

Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.
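The iteration model described in the abstract can be illustrated with a minimal single-process sketch. It is not the paper's API: the function name `run_supersteps`, the dictionary-based graph, and the choice of "propagate the maximum value" as the vertex program are all assumptions made for illustration. It only shows the rule that messages sent in one iteration are delivered in the next, and that computation stops when no messages remain in flight.

```python
from collections import defaultdict

def run_supersteps(graph, values, max_supersteps=50):
    """Propagate the largest vertex value through the graph (hypothetical helper).

    graph:  dict mapping each vertex to a list of its out-neighbours
    values: dict mapping each vertex to a number
    Returns the final values and the number of iterations executed.
    """
    values = dict(values)

    # Iteration 0: every vertex sends its value along its outgoing edges.
    inbox = defaultdict(list)
    for vertex, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[vertex])
    supersteps = 1

    # Later iterations: a vertex only sees messages sent in the previous one.
    while inbox and supersteps < max_supersteps:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best            # modify own state
                for n in graph.get(vertex, []):  # message the neighbours
                    outbox[n].append(best)
            # A vertex that learned nothing new sends nothing.
        inbox = outbox
        supersteps += 1
    return values, supersteps

if __name__ == "__main__":
    g = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    print(run_supersteps(g, {"a": 3, "b": 6, "c": 2, "d": 1}))
```

Because each iteration reads only the previous iteration's messages, the order in which vertices are visited inside an iteration does not affect the result, which is the property that makes the synchronous model easy to reason about.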
... Over the past decade, there has been a proliferation of graph processing systems, ranging from low-level platforms [80,136,176,183] to more recent declarative designs [241]. While users can deploy these systems in a variety of contexts, the largest instances routinely scale to multiple racks of servers contained in vast datacenters like those of Google, Facebook, and Microsoft [225]. ...
... The first part of this dissertation focuses on large-scale graph systems. Many graph processing systems have been proposed [286], including Pregel [183], Giraph [80], GraphX [115], PowerGraph [114], GPS [228], Pregelix [70], GraphChi [152], and Chaos [224]. GraphRex adopts a Datalog-like interface and computation model in order to explore the space of optimizations for large graph queries running on modern datacenter infrastructure. ...
... Graph processing. Systems like Pregel [183] and PowerGraph [114] process structured pointer-based graph datasets that lead to unpredictable memory access to different parts of the input graphs depending on the query and data characteristics. In PowerGraph, for example, every gather-apply-scatter iteration requires a vertex to communicate with its neighbors to exchange local data for the next round. ...
Article
Today’s largest data processing workloads are hosted in cloud data centers. Due to unprecedented data growth and the end of Moore’s Law, these workloads have ballooned to the hyperscale level, encompassing billions to trillions of data items and hundreds to thousands of machines per query. Enabling and expanding with these workloads are highly scalable data center networks that connect up to hundreds of thousands of networked servers. These massive scales fundamentally challenge the designs of both data processing systems and data center networks, and the classic layered designs are no longer sustainable. Rather than optimize these massive layers in silos, we build systems across them with principled network-centric designs. In current networks, we redesign data processing systems with network-awareness to minimize the cost of moving data in the network. In future networks, we propose new interfaces and services that the cloud infrastructure offers to applications and codesign data processing systems to achieve optimal query processing performance. To transform the network to future designs, we facilitate network innovation at scale. This dissertation presents a line of systems work that covers all three directions. It first discusses GraphRex, a network-aware system that combines classic database and systems techniques to push the performance of massive graph queries in current data centers. It then introduces data processing in disaggregated data centers, a promising new cloud proposal. It details TELEPORT, a compute pushdown feature that eliminates data processing performance bottlenecks in disaggregated data centers, and Redy, which provides high-performance caches using remote disaggregated memory. Finally, it presents MimicNet, a fine-grained simulation framework that evaluates network proposals at datacenter scale with machine learning approximation. 
These systems demonstrate that our ideas in network-centric designs achieve orders of magnitude higher efficiency compared to the state of the art at hyperscale.
... First, partitioning strategies [38,6,15,49,43,33] were proposed to partition graph data into a cluster. Second, distributed graph processing engines [6,11,30,12,50,35] emerged to analyze distributed graph data. Finally, parallel algorithms [22,16,28,29] emerged to exploit the distributed environment. ...
... These distributed framework-based graph systems express diverse iterative graph algorithms using a simple programming abstraction and aim for linearly scalable execution rather than optimizing a single iteration. Based on the same philosophy, Pregel [30] presented a "think like a vertex" model that defines what behavior should be performed from the vertex's perspective in each iteration, and GraphLab [27] and Cyclops [5] also used this vertex-centric model. However, much real-world graph data is skewed: a few vertices are connected to a large number of edges, whereas most vertices are connected to a small number of edges. ...
Preprint
Full-text available
Analyzing large graph data is an essential part of many modern applications, such as social networks. Due to its large computational complexity, distributed processing is frequently employed. This requires graph data to be divided across nodes, and the choice of partitioning strategy has a great impact on the execution time of the task. Yet, there is no one-size-fits-all partitioning strategy that performs well on arbitrary graph data and algorithms. The performance of a strategy depends on the characteristics of the graph data and algorithms. Moreover, due to the complexity of graph data and algorithms, manually identifying the best partitioning strategy is also infeasible. In this work, we propose a machine learning-based approach to select the most appropriate partitioning strategy for a given graph and processing algorithm. Our approach enumerates viable partitioning strategies, predicts the execution time of the target algorithm for each, and selects the partitioning strategy with the fastest estimated execution time. Our machine learning model is trained on features extracted from graph data and algorithm pseudo-code. We also propose a method that augments real execution logs of graph tasks to create a large synthetic dataset. Evaluation results show that the strategies selected by our approach lead to 1.46X faster execution time on average compared with the mean execution time of the partitioning strategies and about 0.95X the performance compared to the best partitioning strategy.
... Data transfer has a significant impact on application performance in data-parallel computing frameworks such as MapReduce [1], Pregel [2] and Spark [3]. These computing frameworks all implement a data partitioning model, in which jobs are decomposed into finer-grained tasks, and massive amounts of intermediate data between their computation stages need to be transferred through the network before generating the final results. ...
Preprint
Full-text available
Optimizing data transfers is critical for improving job performance in data-parallel frameworks. In the hybrid data center with both wired and wireless links, reconfigurable wireless links can provide additional bandwidth to speed up job execution. However, it requires the scheduler and transceivers to make joint decisions under coupled constraints. In this work, we identify that the joint job scheduling and bandwidth augmentation problem is a complex mixed integer nonlinear problem, which is not solvable by existing optimization methods. To address this bottleneck, we transform it into an equivalent problem based on the coupling of its heuristic bounds, the revised data transfer representation and non-linear constraints decoupling and reformulation, such that the optimal solution can be efficiently acquired by the Branch and Bound method. Based on the proposed method, the performance of job scheduling with and without bandwidth augmentation is studied. Experiments show that the performance gain depends on multiple factors, especially the data size. Compared with existing solutions, our method can reduce the job completion time by up to 10% in a production-scenario setting.
... Traditional graph processing. Graph analytic frameworks [70,86,129,134,155,182,202,225,226,298,311] target scale but not privacy. Work on social networks has dealt with issues of anonymity [25,120,307,309], but the proposed mechanisms either focus on answering limited differentially private queries [47], on aggregate network estimations that may hide effects of individual malicious nodes [150], or on previous definitions of privacy like k-anonymity [193]. ...
Article
Collecting distributed data from millions of individuals for the purpose of analytics is a common scenario – from Apple collecting typed words and emojis to improve its keyboard suggestions, to Google collecting location data to see how busy restaurants and businesses are. This data is often sensitive, and can be overly revealing about the individuals and communities whose data is being analyzed en masse. Differential privacy has become the gold-standard method to give strong individual privacy guarantees while releasing aggregate statistics about sensitive data. However, the process of computing such statistics can itself be a privacy risk. For instance, a simple approach would be to collect all the raw data at a single central entity, which then computes and releases the statistics. This entity then has to be trusted to not abuse the raw data; in practice, it can be difficult to find an entity with the requisite level of trust. In this thesis, we describe a new approach that uses cryptographic techniques to collect data privately and safely, without placing trust in any party. Although the natural candidates, such as secure multiparty computation (MPC) and fully homomorphic encryption (FHE) do not scale to millions of parties on their own, our key insight is that there are ways to refactor computations in such a way that they can be done using simpler techniques that do scale, such as additively homomorphic encryption. Our solution restructures centralized computations into distributed protocols that can be executed efficiently at scale. The systems we design based on this approach can support billions of participants and can handle a variety of real queries from the literature, including machine learning tasks, Pregel-style graph queries, and queries over large categorical data. 
We automate the distributed refactoring so that analysts can write the query as if the data were centralized without understanding how the rewriting works, and we protect against malicious parties who aim to poison or bias the results.
... Graph partition plays a key role in performance improvement for massive graph processing systems. In a distributed graph system, such as Google Pregel (Malewicz et al. 2010), GraphX (Gonzalez et al. 2014), and GraphLab, the original graph may be too large to fit in memory and has to be partitioned into multiple parts which are processed in parallel by multiple machines. The quality of graph partition is often measured by two important performance criteria. ...
Article
Graph partition is a key component to achieve workload balance and reduce job completion time in parallel graph processing systems. Among the various partition strategies, edge partition has demonstrated more promising performance in power-law graphs than vertex partition and thereby has been more widely adopted as the default partition strategy by existing graph systems. The graph edge partition problem, which is to split the edge set into multiple balanced parts with the objective of minimizing the total number of copied vertices, has been widely studied from the view of optimization and algorithms. In this paper, we study local search algorithms for this problem to further improve the partition results from existing methods. More specifically, we propose two novel concepts, namely adjustable edges and blocks. Based on these, we develop a greedy heuristic as well as an improved search algorithm utilizing the property of max-flow model. To evaluate the performance of our algorithms, we first provide adequate theoretical analysis in terms of approximation quality. We significantly improve the previous known approximation ratio for this problem. Then we conduct extensive experiments on a large number of benchmark datasets and state-of-the-art edge partition strategies. The results show that our proposed local search framework can further improve the quality of graph partition by a wide margin.
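The objective named in this abstract, the total number of copied vertices, can be made concrete with a small sketch. The function names and the list-of-edge-lists representation are assumptions for illustration, and the code only evaluates a given edge partition; it does not implement the local search algorithms the abstract describes.

```python
def total_vertex_copies(edge_parts):
    """Count vertex replicas: each part holds one copy of every vertex its edges touch.

    edge_parts: list of parts, each part a list of (u, v) edges (hypothetical layout).
    """
    return sum(len({x for edge in part for x in edge}) for part in edge_parts)

def replication_factor(edge_parts):
    """Average number of copies per distinct vertex (1.0 means no vertex is copied)."""
    distinct = {x for part in edge_parts for edge in part for x in edge}
    return total_vertex_copies(edge_parts) / len(distinct)

if __name__ == "__main__":
    parts = [[(1, 2), (2, 3)], [(3, 4), (1, 4)]]
    print(total_vertex_copies(parts), replication_factor(parts))
```

A lower replication factor means fewer vertex copies have to be kept in sync across machines, which is why it is commonly used alongside edge balance when comparing edge partitions.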
... Vertex-centric programming [13], [23] has been widely adopted in Graph processing frameworks, for its simplicity, high scalability, and powerful expression ability. It defines a generic function that defines the behavior of a vertex and its neighbors. ...
Preprint
Processing large graphs with a memory-limited GPU requires resolving host-GPU data transfer, which is a key performance bottleneck. Existing GPU-accelerated graph processing frameworks reduce the data transfers by managing the active subgraph transfer at runtime. Some frameworks adopt explicit transfer management approaches based on explicit memory copy with filter or compaction. In contrast, others adopt implicit transfer management approaches based on on-demand access with zero-copy or unified-memory. After intensive analysis, we find that as the active vertices evolve, the performance of the two approaches varies in different workloads. Due to heavy redundant data transfers, high CPU compaction overhead, or low bandwidth utilization, adopting a single approach often results in suboptimal performance. In this work, we propose a hybrid transfer management approach that combines the merits of both approaches at runtime, with the objective of achieving the shortest execution time in each iteration. Based on the hybrid approach, we present HyTGraph, a GPU-accelerated graph processing framework, which is empowered by a set of effective task scheduling optimizations to improve the performance. Our experimental results on real-world and synthesized graphs demonstrate that HyTGraph achieves up to 10.27X speedup over existing GPU-accelerated graph processing systems including Grus, Subway, and EMOGI.
Article
Traditional graph systems mainly use the iteration-based model which iteratively loads graph blocks into memory for analysis so as to reduce random I/Os. However, this iteration-based model limits the efficiency and scalability of running random walk, which is a fundamental technique to analyze large graphs. In this paper, we first propose a state-aware I/O model to improve the I/O efficiency of running random walk, then we develop a block-centric indexing and buffering scheme for managing walk data, and leverage an asynchronous walk updating strategy to improve random walk efficiency. We implement an I/O-efficient graph system, GraphWalker, which handles very large disk-resident graphs efficiently and scales to tens of billions of random walks on a single commodity machine. Experiments show that GraphWalker can achieve more than an order of magnitude speedup when compared with DrunkardMob, which is tailored for random walks based on the classical graph system GraphChi, as well as two state-of-the-art single-machine graph systems, Graphene and GraFSoft. Furthermore, compared with the most recent distributed system KnightKing, GraphWalker still achieves comparable performance with only a single machine, thereby making it a more cost-effective alternative.
Article
Parallelism is often required for performance. In these situations an excess of non-determinism is harmful as it means the program can have several different behaviours or even different results. Even in domains such as high-performance computing where parallelism is crucial for performance, the computed value should be deterministic. Unfortunately, non-determinism in programs also allows dynamic scheduling of tasks, reacting to the first task that succeeds, cancelling tasks that cannot lead to a result, etc. Non-determinism is thus either a desired asset or an undesired property depending on the situation. In practice, it is often necessary to limit non-determinism and to identify precisely the sources of non-determinism in order to control what parts of a program are deterministic or not. This survey takes the perspective of programming languages, and studies how programming models can ensure the determinism of parallel programs. This survey studies not only deterministic languages but also programming models that prevent one particularly demanding source of non-determinism: data races. Our objective is to compare existing solutions to the following questions: How can programming languages help programmers write programs that run in a parallel manner without visible non-determinism? What programming paradigms ensure this kind of property? We study these questions and discuss the merits and limitations of different approaches.
Preprint
Designing flexible graph kernels that can run well on various platforms is a crucial research problem due to the frequent usage of graphs for modeling data and recent architectural advances and variety. In this work, we propose a novel graph processing framework, PGAbB (Parallel Graph Algorithms by Blocks), for modern shared-memory heterogeneous platforms. Our framework implements a block-based programming model. This allows a user to express a graph algorithm using kernels that operate on subgraphs. PGAbB supports graph computations that fit in host DRAM but not in GPU device memory, and provides simple but effective scheduling techniques to schedule computations to all available resources in a heterogeneous architecture. We have demonstrated that one can easily implement a diverse set of graph algorithms in our framework by developing five algorithms. Our experimental results show that PGAbB implementations achieve better or competitive performance compared to hand-optimized implementations. Based on our experiments on five graph algorithms and forty-four graphs, in the median, PGAbB achieves 1.6, 1.6, 5.7, 3.4, 4.5, and 2.4 times better performance than GAPBS, Galois, Ligra, LAGraph, Galois-GPU, and Gunrock graph processing systems, respectively.
Article
We propose a new (theoretical) computational model for the study of massive data processing with limited computational resources. Our model measures the complexity of reading the very large data sets in terms of the data size N and analyzes the computational cost in terms of a parameter k that characterizes the computational power provided by limited local computing resources. We develop new algorithmic techniques that implement algorithms for solving well-known computational problems on the proposed model. In particular, randomized algorithms of running time $O(N + g_1(k))$ and space $O(k^2)$, with very high probability, are developed for the famous graph matching problem on unweighted and weighted graphs (note that the term $O(N)$ in the time complexity of our algorithms is needed just to read the input graph). More specifically, our algorithm for unweighted graphs finds a k-matching (i.e., a matching of k edges) in a general unweighted graph in time $O(N + k^{2.5})$, and our algorithm for weighted graphs finds a maximum weighted k-matching in a general weighted graph in time $O(N + k^3 \log k)$.
Conference Paper
Full-text available
NetworkX is a Python language package for exploration and analysis of networks and network algorithms. The core package provides data structures for representing many types of networks, or graphs, including simple graphs, directed graphs, and graphs with parallel edges and self loops. The nodes in NetworkX graphs can be any (hashable) Python object and edges can contain arbitrary data; this flexibility makes NetworkX ideal for representing networks found in many different scientific fields. In addition to the basic data structures, many graph algorithms are implemented for calculating network properties and structure measures: shortest paths, betweenness centrality, clustering, degree distribution, and many more. NetworkX can read and write various graph formats for easy exchange with existing data, and provides generators for many classic graphs and popular graph models, such as the Erdős-Rényi, Small World, and Barabási-Albert models. The ease-of-use and flexibility of the Python programming language together with connection to the SciPy tools make NetworkX a powerful tool for scientific computations. We discuss some of our recent work studying synchronization of coupled oscillators to demonstrate how NetworkX enables research in the field of computational networks.
Article
Full-text available
This paper presents the Parallel BGL, a generic C++ library for distributed graph computation. Like the sequential Boost Graph Library (BGL) upon which it is based, the Parallel BGL applies the paradigm of generic programming to the domain of graph computations. Emphasizing efficient generic algorithms and the use of concepts to specify the requirements on type parameters, the Parallel BGL also provides flexible supporting data structures such as distributed adjacency lists and external property maps. The generic programming approach simultaneously stresses flexibility and efficiency, resulting in a parallel graph library that can adapt to various data structures and communication models while retaining the efficiency of equivalent hand-coded programs. Performance data for selected algorithms are provided demonstrating the efficiency and scalability of the Parallel BGL.
Book
List of Figures. List of Tables. List of Examples and Remarks. List of Symbols. Foreword Franco P. Preparata. 1. Introduction. 2. Models for Robust Computation. 3. The Write-All Problem: Algorithms. 4. Lower Bounds, Snapshots and Approximation. 5. Fault-Tolerant Simulations. 6. Shared Memory Randomized Algorithms and Distributed Models and Algorithms. Bibliography and References. Author Index. Subject Index.
Article
Contents: Introduction to graph models; Structure and representation; Trees; Spanning trees; Connectivity; Optimal graph traversals; Planarity and Kuratowski's theorem; Drawing graphs and maps; Graph colorings; Measurement and mappings; Analytic graph theory; Special digraph models; Network flows and applications; Graphical enumeration; Algebraic specification of graphs; Non-planar layouts; Appendix.
Article
Given a set of N cities, with every two linked by a road, and the times required to traverse these roads, we wish to determine the path from one given city to another given city which minimizes the travel time. The times are not directly proportional to the distances due to varying quality of roads and varying quantities of traffic. The functional equation technique of dynamic programming, combined with approximation in policy space, yields an iterative algorithm which converges after at most (N-1) iterations.
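The iterative scheme summarized above is usually presented today as repeated edge relaxation. The sketch below is one such formulation under stated assumptions: the function name `shortest_times`, the dictionary of directed road times, and the choice to compute times outward from a source city (rather than toward a destination) are illustrative and not taken from the paper. It assumes non-negative travel times, so at most N-1 passes are needed.

```python
import math

def shortest_times(n, travel_time, source):
    """Minimum travel time from `source` to every city (hypothetical helper).

    n:           number of cities, labelled 0 .. n-1
    travel_time: dict mapping a directed road (u, v) to its traversal time
    """
    best = [math.inf] * n
    best[source] = 0.0
    # Each pass improves the estimates; N-1 passes always suffice.
    for _ in range(n - 1):
        changed = False
        for (u, v), t in travel_time.items():
            if best[u] + t < best[v]:
                best[v] = best[u] + t
                changed = True
        if not changed:  # estimates are stable, so stop early
            break
    return best

if __name__ == "__main__":
    roads = {(0, 1): 4, (0, 2): 1, (2, 1): 2, (1, 3): 5, (2, 3): 10}
    print(shortest_times(4, roads, source=0))
```

The early exit reflects the convergence claim: once a full pass changes no estimate, further passes cannot change anything either.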
Article
This paper examines methods of approximating the optimum checkpoint restart strategy for minimizing application run time on a system exhibiting Poisson single component failures. Two different models will be developed and compared. We will begin with a simplified cost function that yields a first-order model. Then we will derive a more complete cost function and demonstrate a perturbation solution that provides accurate high order approximations to the optimum checkpoint interval. (c) 2004 Elsevier B.V. All rights reserved.
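For orientation, a first-order model of this kind can be sketched as follows. This is the widely cited square-root approximation for the checkpoint interval, not necessarily the exact cost function derived in the paper; the function names and the simplified waste expression are assumptions for illustration, and the approximation is only reasonable when the checkpoint cost is much smaller than the mean time between failures.

```python
import math

def expected_waste(interval, checkpoint_cost, mtbf):
    """Approximate overhead per unit of useful work (simplified, first-order).

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mtbf)      : expected rework, about half an interval per failure
    All arguments use the same time unit.
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)

def first_order_interval(checkpoint_cost, mtbf):
    """Interval that minimizes expected_waste: sqrt(2 * checkpoint_cost * mtbf)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

if __name__ == "__main__":
    tau = first_order_interval(checkpoint_cost=2.0, mtbf=100.0)
    print(tau, expected_waste(tau, 2.0, 100.0))
```

Setting the derivative of the waste expression to zero gives the square-root form directly, which is why checkpointing more often than this interval wastes time on writes and checkpointing less often wastes time on rework.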
Conference Paper
There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain, and reuse. We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source, map-reduce implementation. We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig, that can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and available for general use.