Conference Paper

PEGASUS: A peta-scale graph mining system - Implementation and observations

SCS, Carnegie Mellon Univ., Pittsburgh, PA, USA
DOI: 10.1109/ICDM.2009.14 · Conference: 2009 Ninth IEEE International Conference on Data Mining (ICDM '09)
Source: DBLP

ABSTRACT In this paper, we describe PEGASUS, an open-source peta-scale graph mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, and finding the connected components. As the size of graphs reaches several giga-, tera- or peta-bytes, the need for such a library grows too. To the best of our knowledge, PEGASUS is the first such library, implemented on top of the HADOOP platform, the open source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially repeated matrix-vector multiplications. In this paper we describe a very important primitive for PEGASUS, called GIM-V (generalized iterated matrix-vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time in the number of edges, and (c) more than 5 times faster performance than the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with ≈ 6.7 billion edges.
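The core abstraction behind these results is GIM-V: one iteration computes v'_i = assign(v_i, combineAll_i({combine2(m_ij, v_j) : edge (i, j)})), and different choices of the three operators yield PageRank, connected components, and so on. The following is a minimal single-machine Python sketch of that recursion, not the HADOOP implementation; the data layout (dicts of sparse rows) and all names other than the paper's combine2/combineAll/assign operators are illustrative assumptions.

# One GIM-V iteration over a sparse matrix stored as rows of (j, m_ij) pairs.
# Every node must appear as a key in both `edges` and `v` (empty rows allowed).
def gim_v(edges, v, combine2, combine_all, assign):
    partial = {i: [combine2(m_ij, v[j]) for j, m_ij in row]
               for i, row in edges.items()}
    return {i: assign(v[i], combine_all(i, xs)) for i, xs in partial.items()}

# PageRank as a GIM-V instantiation, following the paper's operator choices:
# combine2(m, v) = c*m*v, combineAll(xs) = (1-c)/n + sum(xs), assign keeps the
# new value. Here m_ij is assumed to be 1/out-degree(j) for each edge j -> i.
def pagerank(edges, n, c=0.85, iters=30):
    v = {i: 1.0 / n for i in edges}
    for _ in range(iters):
        v = gim_v(edges, v,
                  combine2=lambda m, vj: c * m * vj,
                  combine_all=lambda i, xs: (1 - c) / n + sum(xs),
                  assign=lambda old, new: new)
    return v

Connected components follows the same pattern with minimum in place of the sums, which is also how PEGASUS expresses it.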

    • "Since inception , the Hadoop/MapReduce ecosystem has grown considerably in support of related Big Data tasks. However, these distributed frameworks are not suited for all purposes, in many cases can even result in poor performance [31] [59] [85]. Algorithms that make use of multiple iterations, especially those using graph or matrix data representations, are particularly poorly suited for popular Big Data processing systems. "
    ABSTRACT: The vertex-centric programming model is an established computational paradigm recently incorporated into distributed processing frameworks to address challenges in large-scale graph processing. Billion-node graphs that exceed the memory capacity of standard machines are not well-supported by popular Big Data tools like MapReduce, which are notoriously poor-performing for iterative graph algorithms such as PageRank. In response, a new type of framework challenges one to Think Like A Vertex (TLAV) and implements user-defined programs from the perspective of a vertex rather than a graph. Such an approach improves locality, demonstrates linear scalability, and provides a natural way to express and compute many iterative graph algorithms. These frameworks are simple to program and widely applicable, but, like an operating system, are composed of several intricate, interdependent components, of which a thorough understanding is necessary in order to elicit top performance at scale. To this end, the first comprehensive survey of TLAV frameworks is presented. In this survey, the vertex-centric approach to graph processing is overviewed, TLAV frameworks are deconstructed into four main components and respectively analyzed, and TLAV implementations are reviewed and categorized.
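As a concrete illustration of the vertex-centric model this survey describes, here is a hedged Python sketch of a Pregel-style superstep with PageRank as the user-defined vertex program; it mimics the style of TLAV frameworks but is not the API of any particular system, and all names (compute, superstep, the vertex dict layout) are assumptions.

# A vertex program reads the messages received in the previous superstep,
# updates its own value, and emits messages to its out-neighbors.
def compute(vertex, messages, n, c=0.85):
    if messages:  # every superstep after the first
        vertex["value"] = (1 - c) / n + c * sum(messages)
    share = vertex["value"] / max(len(vertex["out"]), 1)
    return {dst: share for dst in vertex["out"]}

# One synchronous superstep: run compute() on every vertex, collect messages.
def superstep(graph, inbox, n):
    outbox = {vid: [] for vid in graph}
    for vid, vertex in graph.items():
        for dst, msg in compute(vertex, inbox.get(vid, []), n).items():
            outbox[dst].append(msg)
    return outbox

The locality the survey mentions comes from compute() touching only its own vertex state and message queue, which is what lets such frameworks partition billion-node graphs across machines.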
    • "Our approach leverages concepts and programming models from graph mining systems described in [11] and [12]. We believe, our implementation has inherited the ability to efficiently process large-scale graphs borrowing distributed computing principles from Pegasus [11] and Pregel [13]. Furthermore, we expect that our SPARQL implementation will inspire extensions to other graph query languages (e.g., Cypher [14], Gremlin[15], etc.) on graph databases such as Neo4j [14], DEX [16], and Titan [17]. "
    ABSTRACT: The Resource Description Framework (RDF) and SPARQL Protocol and RDF Query Language (SPARQL) were introduced about a decade ago to enable flexible schema-free data interchange on the Semantic Web. Today, data scientists use the framework as a scalable graph representation for integrating, querying, exploring and analyzing data sets hosted at different sources. With increasing adoption, the need for graph mining capabilities for the Semantic Web has emerged. We address that need through the implementation of three popular iterative graph mining algorithms (triangle counting, connected component analysis, and PageRank). We implement these algorithms as SPARQL queries, wrapped within Python scripts. We evaluate the performance of our implementation on six real-world data sets and show that graph mining algorithms with a linear-algebra formulation can indeed be run on data represented as RDF graphs through the SPARQL query interface.
    ICDE Workshop on Data Engineering meets the Semantic Web; 04/2015
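In the spirit of that approach, here is a hedged sketch of a graph mining task expressed as a SPARQL query wrapped in a Python script, using the rdflib library; the input file name and the choice of triangle counting are illustrative assumptions, not the authors' code.

# Count triangles in an RDF graph, treating every triple as a directed edge.
from rdflib import Graph

g = Graph()
g.parse("graph.nt", format="nt")  # hypothetical input file

# The FILTER anchors each directed 3-cycle at its smallest vertex so it is
# counted exactly once, and excludes self-loop degeneracies.
query = """
SELECT (COUNT(*) AS ?triangles) WHERE {
    ?a ?p1 ?b .
    ?b ?p2 ?c .
    ?c ?p3 ?a .
    FILTER (STR(?a) < STR(?b) && STR(?a) < STR(?c) && STR(?b) != STR(?c))
}
"""
for row in g.query(query):
    print("triangles:", row.triangles)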
    • "Graph analytics are also a popular application that crunches large amounts of data. There are many systems for processing very large scale graphs, like Pegasus, Pregel and others [24], [15], [25], [26]. Most of them are based on Valiant's Bulk Synchronous Parallel (BSP) model, consisting on processors with fast local memory connected via a computer network [27]. "
    ABSTRACT: Public clouds have democratised access to analytics for virtually any institution in the world. Virtual machines (VMs) can be provisioned on demand to crunch data after it is uploaded into the VMs. While this task is trivial for a few tens of VMs, it becomes increasingly complex and time-consuming when the scale grows to hundreds or thousands of VMs crunching tens or hundreds of TB. Moreover, the elapsed time comes at a price: the cost of provisioning VMs in the cloud and keeping them waiting to load the data. In this paper we present a big data provisioning service that incorporates hierarchical and peer-to-peer data distribution techniques to speed up data loading into the VMs used for data processing. The system dynamically mutates the sources of the data for the VMs to speed up data loading. We tested this solution with 1000 VMs and 100 TB of data, reducing load time by at least 30 percent over current state-of-the-art techniques. This dynamic topology mechanism is tightly coupled with classic declarative machine configuration techniques (the system takes a single high-level declarative configuration file and configures both software and data loading). Together, these two techniques simplify the deployment of big data in the cloud for end users who may not be experts in infrastructure management.
    IEEE Transactions on Cloud Computing 04/2015; 3(2):132-144. DOI: 10.1109/TCC.2014.2360376
