Conference Paper

PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations

SCS, Carnegie Mellon Univ., Pittsburgh, PA, USA
DOI: 10.1109/ICDM.2009.14
Conference: Data Mining, 2009. ICDM '09. Ninth IEEE International Conference on
Source: DBLP

ABSTRACT In this paper, we describe PEGASUS, an open source peta-scale graph mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, and finding the connected components. As the size of graphs reaches several giga-, tera- or peta-bytes, the need for such a library grows too. To the best of our knowledge, PEGASUS is the first such library, implemented on top of the HADOOP platform, the open source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially a repeated matrix-vector multiplication. In this paper we describe a very important primitive for PEGASUS, called GIM-V (generalized iterated matrix-vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with ≈ 6.7 billion edges.
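The claim that many graph mining operations reduce to a repeated matrix-vector multiplication can be illustrated with a small single-machine sketch. It is not the PEGASUS or HADOOP implementation; the function names (`gim_v_step`, `iterate`) and the three pluggable operators (`combine2`, `combine_all`, `assign`) are assumptions chosen for illustration, and the abstract itself does not specify them. Swapping the operators turns an ordinary matrix-vector product into connected-components labelling.

```python
from typing import Callable, List, Sequence, Tuple

# One non-zero matrix entry: (row i, column j, value m_ij).
Edge = Tuple[int, int, float]


def gim_v_step(edges: Sequence[Edge], v: List, combine2: Callable,
               combine_all: Callable, assign: Callable) -> List:
    """One generalized matrix-vector multiplication (illustrative sketch)."""
    partial = [[] for _ in v]
    for i, j, m in edges:
        # Ordinary multiplication would compute m * v[j] here.
        partial[i].append(combine2(m, v[j]))
    # Ordinary multiplication would sum the partial results and overwrite v[i].
    return [assign(v[i], combine_all(partial[i])) for i in range(len(v))]


def iterate(edges, v, combine2, combine_all, assign, max_iters=100):
    """Repeat the step until the vector stops changing (or max_iters is hit)."""
    for _ in range(max_iters):
        new_v = gim_v_step(edges, v, combine2, combine_all, assign)
        if new_v == v:
            break
        v = new_v
    return v


# Plain matrix-vector product: multiply, sum, overwrite.
matvec = gim_v_step(
    [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)], [1.0, 2.0],
    combine2=lambda m, vj: m * vj,
    combine_all=sum,
    assign=lambda old, new: new,
)

# Connected components: propagate the minimum node id along edges.
# Undirected toy graph 0-1-2 and 3-4, stored with both edge directions.
undirected = [(0, 1), (1, 2), (3, 4)]
edges = [(a, b, 1.0) for a, b in undirected] + [(b, a, 1.0) for a, b in undirected]
components = iterate(
    edges, list(range(5)),
    combine2=lambda m, vj: vj,
    combine_all=lambda xs: min(xs, default=float("inf")),
    assign=min,
)
print(matvec)      # [4.0, 6.0]
print(components)  # [0, 0, 0, 3, 3]
```

Each step touches every edge once, which is consistent with the linear running time in the number of edges reported above; the distributed version and its optimizations are not modelled here.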

  • ABSTRACT: Scaling up the sparse matrix-vector multiplication kernel on modern Graphics Processing Units (GPUs) has been at the heart of numerous studies in both academia and industry. In this article we present a novel approach to data representation for computing this kernel, particularly targeting sparse matrices representing power-law graphs. Using real data, we show how our representation scheme, coupled with a novel tiling algorithm, can yield significant benefits over the current state-of-the-art GPU and CPU efforts on a number of core data mining algorithms such as PageRank, HITS and Random Walk with Restart.
  • ABSTRACT: As new data and updates are constantly arriving, the results of data mining applications become stale and obsolete over time. Incremental processing is a promising approach to refreshing mining results. It utilizes previously saved states to avoid the expense of re-computation from scratch. In this paper, we propose i2MapReduce, a novel incremental processing extension to MapReduce, the most widely used framework for mining big data. Compared with the state-of-the-art work on Incoop, i2MapReduce (i) performs key-value pair level incremental processing rather than task level re-computation, (ii) supports not only one-step computation but also more sophisticated iterative computation, which is widely used in data mining applications, and (iii) incorporates a set of novel techniques to reduce I/O overhead for accessing preserved fine-grain computation states. We evaluate i2MapReduce using a one-step algorithm and three iterative algorithms with diverse computation characteristics. Experimental results on Amazon EC2 show significant performance improvements of i2MapReduce compared to both plain and iterative MapReduce performing re-computation.
