Conference Paper

PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations

SCS, Carnegie Mellon Univ., Pittsburgh, PA, USA
DOI: 10.1109/ICDM.2009.14 Conference: Data Mining, 2009. ICDM '09. Ninth IEEE International Conference on
Source: IEEE Xplore

ABSTRACT In this paper, we describe PEGASUS, an open source peta-scale graph mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, and finding the connected components. As the size of graphs reaches several giga-, tera-, or peta-bytes, the necessity for such a library grows as well. To the best of our knowledge, PEGASUS is the first such library, implemented on top of the HADOOP platform, the open source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially repeated matrix-vector multiplications. In this paper we describe a very important primitive for PEGASUS, called GIM-V (generalized iterated matrix-vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with ~6.7 billion edges.
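The GIM-V primitive generalizes matrix-vector multiplication through three user-supplied operations (combine2, combineAll, and assign), so that PageRank, connected components, and similar algorithms become instances of one iterated kernel. Below is a minimal single-machine sketch of that idea; PEGASUS itself runs GIM-V as HADOOP MapReduce jobs, and the helper names here are illustrative, not the library's API.

```python
# Single-machine sketch of GIM-V (generalized iterated matrix-vector
# multiplication). PEGASUS distributes this on Hadoop; here we show
# only the abstraction. Function names are ours, not PEGASUS's API.

def gim_v(edges, v, combine2, combine_all, assign, iters):
    """edges: list of (i, j, m_ij); v: dict node -> value."""
    for _ in range(iters):
        partial = {i: [] for i in v}          # combine2 results per target node
        for i, j, m_ij in edges:
            partial[i].append(combine2(m_ij, v[j]))
        v = {i: assign(v[i], combine_all(i, xs))
             for i, xs in partial.items()}
    return v

# Instantiating GIM-V as PageRank (damping c, uniform teleport).
# m_ij is the weight of the edge from j to i; M is column-stochastic.
n, c = 4, 0.85
edges = [(1, 0, 1.0), (2, 1, 0.5), (3, 1, 0.5),
         (0, 2, 1.0), (0, 3, 1.0)]
v0 = {i: 1.0 / n for i in range(n)}
pr = gim_v(edges, v0,
           combine2=lambda m, vj: c * m * vj,
           combine_all=lambda i, xs: (1 - c) / n + sum(xs),
           assign=lambda old, new: new,
           iters=50)
```

Swapping the three lambdas (for example, min instead of sum) turns the same loop into connected-component propagation, which is the point of the generalization.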

  • ABSTRACT: In this work, we propose a fast, robust, and scalable method for retrieving and analyzing recurring patterns of activity induced by a causal process, typically modeled as time series on a graph. We introduce a particular type of multilayer graph as a model for the data, structured to emphasize causal relations between connected nodes and their successive time-series values. Within the data, the patterns of activity are assumed to be dynamic, sparse, or small compared to the size of the network. For some applications they are also expected to recur over time, though each occurrence may differ from an exact copy. The analysis of activity within a social network and within a transportation network illustrates the power and efficiency of the method: relevant information can be extracted, giving insights into the behavior of groups of people in social networks and into traffic-congestion patterns. Moreover, in this era of big data, it is crucial to design tools able to handle large datasets. Our approach scales linearly with the dataset size and is implemented in a parallel manner. By leveraging a state-of-the-art data analytics framework, our implementation can be distributed on clusters of computers and easily handles millions of nodes on a single commodity server.
  • ABSTRACT: Given the growing importance of large-scale graph analytics, there is a need to improve the performance of graph analysis frameworks without compromising productivity. GraphMat is our solution to bridge the gap between a user-friendly graph analytics framework and native, hand-optimized code. GraphMat works by taking vertex programs and mapping them to high-performance sparse matrix operations in the backend, so we get the productivity benefits of a vertex programming framework without sacrificing performance. GraphMat is written in C++, and we have been able to write a diverse set of graph algorithms in this framework with the same effort as in other vertex programming frameworks. GraphMat performs 1.2-7X faster than high-performance frameworks such as GraphLab, CombBLAS, and Galois. It achieves better multicore scalability (13-15X on 24 cores) than other frameworks and is within 1.2X of native, hand-optimized code on a variety of graph algorithms. Since GraphMat's performance depends mainly on a few scalable and well-understood sparse matrix operations, it can naturally benefit from the trend of increasing parallelism in future hardware.
  • ABSTRACT: Current popular systems, Hadoop and Spark, cannot achieve satisfactory performance on iterative big data applications because they overlap computation and communication inefficiently. The pipeline of computing, data movement, and data management plays a key role in current distributed data computing systems. In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running a PageRank workload, and then propose DataMPI-Iteration, an MPI-based library for iterative big data computing with an event-driven pipeline and in-memory shuffle design that better overlaps computation and communication. Our performance evaluation shows DataMPI-Iteration achieves a 9X-21X speedup over Apache Hadoop and a 2X-3X speedup over Apache Spark for PageRank and K-means.
    Journal of Computer Science and Technology, 30(2):283-294, March 2015. DOI: 10.1007/s11390-015-1522-5
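The DataMPI-Iteration abstract's central claim is that iterative frameworks gain by overlapping the shuffle (communication) with computation rather than serializing the two phases. A minimal producer/consumer pipeline illustrates the idea; this is a hedged sketch of the general technique, not DataMPI-Iteration's actual design or API.

```python
# Sketch of overlapping communication with computation: one thread
# "shuffles" partitions (simulated network transfer) while another
# reduces partitions that have already arrived. All names are ours.
import queue
import threading
import time

def shuffle(partitions, out_q):
    for p in partitions:
        time.sleep(0.01)          # simulated network transfer per partition
        out_q.put(p)
    out_q.put(None)               # end-of-stream sentinel

def reduce_stream(in_q):
    total = 0
    while (p := in_q.get()) is not None:
        total += sum(p)           # computation overlaps later transfers
    return total

q = queue.Queue(maxsize=4)        # bounded buffer provides backpressure
t = threading.Thread(target=shuffle, args=([[1, 2], [3, 4], [5]], q))
t.start()
result = reduce_stream(q)         # runs concurrently with shuffle()
t.join()
```

With a blocking shuffle, total time would be transfer time plus reduce time; with the pipeline, the reducer consumes each partition while the next is still in flight, which is the overlap the paper measures at scale.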
