Conference Paper

PEGASUS: A peta-scale graph mining system - Implementation and observations

SCS, Carnegie Mellon Univ., Pittsburgh, PA, USA
DOI: 10.1109/ICDM.2009.14
Conference: Ninth IEEE International Conference on Data Mining (ICDM '09), 2009
Source: DBLP


In this paper, we describe PEGASUS, an open-source peta-scale graph mining library that performs typical graph mining tasks such as computing the diameter of a graph, computing the radius of each node, and finding the connected components. As the size of graphs reaches several giga-, tera-, or peta-bytes, the need for such a library grows as well. To the best of our knowledge, PEGASUS is the first such library, implemented on top of the HADOOP platform, the open-source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially repeated matrix-vector multiplications. In this paper we describe a very important primitive for PEGASUS, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than 5 times faster performance than the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with over 6.7 billion edges.
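The GIM-V primitive described above generalizes matrix-vector multiplication through three user-defined operations: combine2 (combine a matrix element with a vector element), combineAll (aggregate the partial results for a row), and assign (update the old vector value with the aggregate). A minimal single-machine sketch in Python, for intuition only (the actual PEGASUS implementation runs as HADOOP jobs on block-partitioned data; the helper names below are illustrative):

```python
def gimv(M, v, combine2, combine_all, assign):
    """One GIM-V iteration: v'_i = assign(v_i, combineAll_i({combine2(m_ij, v_j)})).

    M is a sparse matrix as a dict of dicts: M[i][j] = m_ij.
    v is the current vector as a list.
    """
    n = len(v)
    new_v = []
    for i in range(n):
        partials = [combine2(m_ij, v[j]) for j, m_ij in M.get(i, {}).items()]
        new_v.append(assign(v[i], combine_all(i, partials)))
    return new_v


def pagerank(M, n, c=0.85, iters=50):
    """PageRank as a GIM-V instantiation: combine2 scales by the damping
    factor, combineAll adds the teleport term (1-c)/n to the sum, and
    assign simply keeps the new value."""
    v = [1.0 / n] * n
    for _ in range(iters):
        v = gimv(
            M, v,
            combine2=lambda m, x: c * m * x,
            combine_all=lambda i, xs: (1 - c) / n + sum(xs),
            assign=lambda old, new: new,
        )
    return v
```

Swapping the three operations yields the other algorithms mentioned in the abstract; for connected components, for instance, the sum is replaced by a minimum over component IDs.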

  • Source
    • "Since inception, the Hadoop/MapReduce ecosystem has grown considerably in support of related Big Data tasks. However, these distributed frameworks are not suited for all purposes, and in many cases can even result in poor performance [31] [59] [85]. Algorithms that make use of multiple iterations, especially those using graph or matrix data representations, are particularly poorly suited for popular Big Data processing systems."
    ABSTRACT: The vertex-centric programming model is an established computational paradigm recently incorporated into distributed processing frameworks to address challenges in large-scale graph processing. Billion-node graphs that exceed the memory capacity of standard machines are not well-supported by popular Big Data tools like MapReduce, which are notoriously poor-performing for iterative graph algorithms such as PageRank. In response, a new type of framework challenges one to Think Like A Vertex (TLAV) and implements user-defined programs from the perspective of a vertex rather than a graph. Such an approach improves locality, demonstrates linear scalability, and provides a natural way to express and compute many iterative graph algorithms. These frameworks are simple to program and widely applicable, but, like an operating system, are composed of several intricate, interdependent components, of which a thorough understanding is necessary in order to elicit top performance at scale. To this end, the first comprehensive survey of TLAV frameworks is presented. In this survey, the vertex-centric approach to graph processing is overviewed, TLAV frameworks are deconstructed into four main components and respectively analyzed, and TLAV implementations are reviewed and categorized.
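As a concrete illustration of the vertex-centric ("Think Like A Vertex") model this abstract describes, here is a minimal single-machine simulation of Pregel-style PageRank: in each superstep a vertex aggregates its incoming messages, updates its rank, and scatters its rank divided by out-degree to its neighbors. This is a sketch of the computational pattern only, not any framework's actual API; all names are illustrative:

```python
from collections import defaultdict


def vertex_pagerank(out_edges, iters=50, damping=0.85):
    """Simulate Pregel-style vertex-centric PageRank in supersteps.

    out_edges: dict mapping each vertex to its list of out-neighbors.
    Each superstep, every vertex sends rank/out_degree messages along
    its out-edges, then recomputes its rank from incoming messages.
    """
    verts = list(out_edges)
    n = len(verts)
    rank = {v: 1.0 / n for v in verts}
    for _ in range(iters):
        inbox = defaultdict(list)
        for v in verts:
            if out_edges[v]:
                share = rank[v] / len(out_edges[v])
                for w in out_edges[v]:
                    inbox[w].append(share)
        rank = {v: (1 - damping) / n + damping * sum(inbox[v]) for v in verts}
    return rank
```

The locality benefit mentioned in the abstract comes from each vertex touching only its own state and messages, which is what lets frameworks partition billion-node graphs across machines.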
  • Source
    • "Other large-scale techniques are visual in a different sense; they present plots of calculated features of the graph instead of depicting their structural information. This is the case of Apolo [7], OPAvion [2], Pegasus [16], and OddBall [3]. There are also techniques [5] that rely on sampling to gain scalability, but this approach assumes that parts of the graph will be absent; parts that are of potential interest. "
    ABSTRACT: Given a planetary-scale graph with millions of nodes and billions of edges, how to reveal macro patterns of interest, like cliques, bi-partite cores, stars, and chains? Furthermore, how to visualize such patterns altogether getting insights from the graph to support wise decision-making? Although there are many algorithmic and visual techniques to analyze graphs, none of the existing approaches is able to present the structural information of graphs at planetary scale. Hence, this paper describes StructMatrix, a methodology aimed at high scalable visual inspection of graph structures with the goal of revealing macro patterns of interest. StructMatrix combines algorithmic structure detection and adjacency matrix visualization to present cardinality, distribution, and relationship features of the structures found in a given graph. We performed experiments in real, planetary-scale graphs with up to millions of nodes and over 10 billion edges. StructMatrix revealed that graphs of high relevance (e.g., Web, Wikipedia and DBLP) have characterizations that reflect the nature of their corresponding domains; our findings have not been seen in the literature so far. We expect that our technique will bring deeper insights into large graph mining, leveraging their use for decision making.
  • Source
    • "Our approach leverages concepts and programming models from graph mining systems described in [11] and [12]. We believe our implementation has inherited the ability to efficiently process large-scale graphs, borrowing distributed computing principles from Pegasus [11] and Pregel [13]. Furthermore, we expect that our SPARQL implementation will inspire extensions to other graph query languages (e.g., Cypher [14], Gremlin [15], etc.) on graph databases such as Neo4j [14], DEX [16], and Titan [17]."
    ABSTRACT: The Resource Description Framework (RDF) and SPARQL Protocol and RDF Query Language (SPARQL) were introduced about a decade ago to enable flexible schema-free data interchange on the Semantic Web. Today, data scientists use the framework as a scalable graph representation for integrating, querying, exploring and analyzing data sets hosted at different sources. With increasing adoption, the need for graph mining capabilities for the Semantic Web has emerged. We address that need through implementation of three popular iterative graph mining algorithms (triangle count, connected component analysis, and PageRank). We implement these algorithms as SPARQL queries, wrapped within Python scripts. We evaluate the performance of our implementation on six real-world data sets and show that graph mining algorithms that have a linear-algebra formulation can indeed be unleashed on data represented as RDF graphs using the SPARQL query interface.
    ICDE Workshop on Data Engineering meets the Semantic Web; 04/2015
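The "linear-algebra formulation" this abstract refers to can be made concrete for triangle counting: in a simple undirected graph with adjacency matrix A, the number of triangles equals trace(A³)/6, because each triangle contributes two closed length-3 walks at each of its three vertices. A small dense-matrix sketch of that identity (illustrative only; the cited work expresses this as SPARQL queries, not Python matrix code):

```python
def triangle_count(adj):
    """Count triangles via the identity #triangles = trace(A^3) / 6.

    adj: symmetric 0/1 adjacency matrix (list of lists) of a simple
    undirected graph with no self-loops.
    """
    n = len(adj)
    # Compute A^2 explicitly, then take trace(A^2 * A) without
    # materializing A^3.
    a2 = [[sum(adj[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    trace_a3 = sum(a2[i][j] * adj[j][i] for i in range(n) for j in range(n))
    return trace_a3 // 6
```

For sparse billion-edge graphs, the same identity is evaluated via sparse matrix products or, as here, via joins over the edge relation, which is exactly what a SPARQL engine executes.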

