Conference Paper

Intersection Representation of Big Data Networks and Triangle Counting

Chapter
Triangles are an essential part of network analysis, representing metrics such as transitivity and clustering coefficient. Using the correspondence between sparse adjacency matrices and graphs, linear algebraic methods have been developed for triangle counting and enumeration, where the main computational kernel is sparse matrix-matrix multiplication. In this paper, we use an intersection representation of graph data implemented as a sparse matrix, and engineer an algorithm to compute the “k-count” distribution of the triangles of the graph. The main computational task of computing sparse matrix-vector products is carefully crafted by employing compressed vectors as accumulators. Our method avoids redundant work by counting and enumerating each triangle exactly once. We present results from extensive computational experiments on large-scale real-world and synthetic graph instances that demonstrate good scalability of our method. In terms of run-time performance, our algorithm has been found to be orders of magnitude faster than the reference implementations of the miniTri data analytics application [18].
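As a rough illustration of the linear-algebraic approach described in this abstract (not the authors' k-count algorithm), the sketch below counts each triangle exactly once using the strictly lower-triangular part of a SciPy sparse adjacency matrix; the function name and the assumption of a symmetric, binary matrix with zero diagonal are choices of the example.

```python
import scipy.sparse as sp

def count_triangles(A):
    """Count triangles of an undirected graph from its binary adjacency matrix.

    A is assumed to be a symmetric scipy.sparse matrix with a zero diagonal.
    With L the strictly lower-triangular part, a triangle on vertices a > b > c
    contributes exactly one nonzero to L .* (L @ L), so no triangle is counted twice.
    """
    L = sp.tril(A, k=-1).tocsr()
    return int(L.multiply(L @ L).sum())
```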
Article
Full-text available
Counting and enumeration of local topological structures, such as triangles, is an important task for analyzing large real-life networks. For instance, the triangle count of a network is used to compute transitivity, an important property for understanding graph evolution over time. Triangles are also used for various other tasks performed on real-life networks, including community discovery, link prediction, and spam filtering. The task of triangle counting, though simple, has gained wide attention in recent years from the data mining community. This is due to the fact that most existing algorithms for counting triangles do not scale well to very large networks with millions (or even billions) of vertices. To circumvent this limitation, researchers have proposed triangle counting methods that approximate the count or run on distributed clusters. In this paper, we discuss the existing methods of triangle counting, ranging from sequential to parallel, single-machine to distributed, exact to approximate, and off-line to streaming. We also present experimental results of a performance comparison among a set of approximate triangle counting methods built under a unified implementation framework. Finally, we conclude with a discussion of future work in this direction. WIREs Data Mining Knowl Discov 2018, 8:e1226. doi: 10.1002/widm.1226
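To make concrete how a triangle count feeds into transitivity, the survey's motivating metric, here is a minimal sketch; the SciPy-based helper, its name, and the assumption of a symmetric binary adjacency matrix are illustrative, not taken from the article.

```python
import numpy as np
import scipy.sparse as sp

def transitivity(A):
    """Global transitivity = 3 * (#triangles) / (#wedges).

    A is assumed to be a symmetric binary scipy.sparse adjacency matrix with a
    zero diagonal.  A vertex of degree d is the centre of d*(d-1)/2 wedges
    (paths of length two), and every triangle closes three of them.
    """
    deg = np.asarray(A.sum(axis=1)).ravel()   # vertex degrees
    wedges = float((deg * (deg - 1) / 2).sum())
    L = sp.tril(A, k=-1).tocsr()
    triangles = L.multiply(L @ L).sum()       # each triangle counted once
    return 3.0 * triangles / wedges if wedges else 0.0
```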
Article
Full-text available
The rise of graph analytic systems has created a need for ways to measure and compare the capabilities of these systems. Graph analytics present unique scalability difficulties. The machine learning, high performance computing, and visual analytics communities have wrestled with these difficulties for decades and developed methodologies for creating challenges to move these communities forward. The proposed Subgraph Isomorphism Graph Challenge draws upon prior challenges from machine learning, high performance computing, and visual analytics to create a graph challenge that is reflective of many real-world graph analytics processing systems. The Subgraph Isomorphism Graph Challenge is a holistic specification with multiple integrated kernels that can be run together or independently. Each kernel is well defined mathematically and can be implemented in any programming environment. Subgraph isomorphism is amenable to both vertex-centric implementations and array-based implementations (e.g., using the GraphBLAS.org standard). The computations are simple enough that performance predictions can be made based on simple computing hardware models. The surrounding kernels provide the context that allows rigorous definition of both the input and the output of each kernel. Furthermore, since the proposed graph challenge is scalable in both problem size and hardware, it can be used to measure and quantitatively compare a wide range of present-day and future systems. Serial implementations in C++, Python, Python with Pandas, Matlab, Octave, and Julia have been implemented and their single-threaded performance has been measured. Specifications, data, and software are publicly available at GraphChallenge.org.
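One commonly used array-based triangle-counting formulation of the kind such challenges target is a masked sparse product over the lower- and upper-triangular parts of the adjacency matrix. The sketch below is illustrative only, not the challenge's normative specification, and assumes a symmetric binary SciPy matrix with zero diagonal.

```python
import scipy.sparse as sp

def triangle_count_masked(A):
    """Array-based triangle count via a masked sparse product.

    With L and U the strictly lower and upper triangular parts of a symmetric
    binary A, the elementwise product A .* (L @ U) touches every triangle
    exactly twice, so the nonzero sum is halved.
    """
    L = sp.tril(A, k=-1).tocsr()
    U = sp.triu(A, k=1).tocsr()
    C = A.multiply(L @ U)
    return int(C.sum() // 2)
```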
Article
Full-text available
The increasing size of Big Data is often heralded, but how data are transformed and represented is also profoundly important to knowledge discovery, and this is exemplified in Big Graph analytics. Much attention has been placed on the scale of the input graph, but the product of a graph algorithm can be many times larger than the input. This is true for many graph problems, such as listing all triangles in a graph. Enabling scalable graph exploration for Big Graphs requires new approaches to algorithms, architectures, and visual analytics. A brief tutorial is given to aid the argument for thoughtful representation of data in the context of graph analysis. Then a new algebraic method to reduce the arithmetic operations in counting and listing triangles in graphs is introduced. Additionally, a scalable triangle listing algorithm in the MapReduce model is presented, followed by a description of the experiments with that algorithm that led to the largest and fastest triangle listing benchmarks to date. Finally, a method for identifying triangles in new visual graph exploration technologies is proposed.
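To make the distinction between counting and listing concrete, the following sketch enumerates each triangle exactly once by orienting edges along a degree order. It is an illustrative in-memory routine, not the MapReduce algorithm discussed in the article, and the dictionary-of-sets input format is an assumption of the example.

```python
def list_triangles(adj):
    """Yield each triangle of an undirected graph exactly once.

    adj is assumed to map every vertex to the set of its neighbours.  Orienting
    each edge from its lower-ranked to its higher-ranked endpoint, with rank
    given by (degree, vertex id), produces an acyclic orientation, so every
    triangle is emitted from exactly one of its vertices.
    """
    rank = {v: (len(adj[v]), v) for v in adj}
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}
    for v in adj:
        for u in out[v]:
            for w in out[v] & out[u]:
                yield (v, u, w)
```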
Article
NetworkRepository (NR) is the first interactive data repository with a web-based platform for visual interactive analytics. Unlike other data repositories (e.g., the UCI ML Data Repository and SNAP), the network data repository (networkrepository.com) allows users not only to download data, but to interactively analyze and visualize it using our web-based interactive graph analytics platform. Users can analyze, visualize, compare, and explore data in real time along many different dimensions. The aim of NR is to make it easy to discover key insights into the data extremely fast with little effort, while also providing a medium for users to share data, visualizations, and insights. Other key factors that differentiate NR from current data repositories are the number of graph datasets, their size, and their variety. While other data repositories are static, they also lack a means for users to collaboratively discuss a particular dataset, corrections, or challenges with using the data for certain applications. In contrast, NR incorporates many social and collaborative aspects that facilitate scientific research; e.g., users can discuss each graph and post observations and visualizations.
Chapter
The Edge Clique Cover (ECC) problem is concerned with covering the edges of a graph with the minimum number of cliques, which is an NP-hard problem. This problem has many real-life applications, such as in computational biology, food science, and the efficient representation of pairwise information. In this work we propose using a compact representation of network data based on sparse matrix data structures. Building upon an existing ECC heuristic due to Kellerman, we add vertices during the clique-growing step of the algorithm in judiciously chosen degree-based orders. On a set of standard benchmark instances, our ordered approach produced smaller clique covers than unordered processing.
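The following is a minimal greedy sketch of covering edges with cliques grown in non-increasing degree order. It only illustrates the general idea; it is not the Kellerman-based heuristic or the sparse-matrix representation used in the chapter, and the dictionary-of-sets input format is an assumption.

```python
def greedy_edge_clique_cover(adj):
    """Cover every edge of an undirected graph with cliques (greedy heuristic).

    adj is assumed to map every vertex to the set of its neighbours.  Each
    uncovered edge seeds a clique that is grown by adding common neighbours in
    non-increasing degree order; all edges inside the clique are then marked
    as covered.
    """
    covered = set()
    cover = []
    for u in sorted(adj, key=lambda v: len(adj[v]), reverse=True):
        for v in adj[u]:
            if (u, v) in covered:
                continue
            clique = {u, v}
            for w in sorted(adj[u] & adj[v], key=lambda x: len(adj[x]), reverse=True):
                if all(w in adj[c] for c in clique):
                    clique.add(w)
            cover.append(clique)
            for a in clique:
                for b in clique:
                    if a != b:
                        covered.add((a, b))
    return cover
```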
Conference Paper
We describe the main design features of DSJM (Determine Sparse Jacobian Matrices), a software toolkit written in standard C++ that enables direct determination of sparse Jacobian matrices. Our design exploits the recently proposed unifying "pattern graph" framework and employs cache-friendly, array-based sparse data structures. DSJM implements a greedy grouping (coloring) algorithm and several ordering heuristics. In our numerical testing on a suite of large-scale test instances, DSJM consistently produced better timings and partitions than similar software.
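To illustrate the kind of grouping (coloring) computation involved, here is a simplified sketch that greedily partitions Jacobian columns into structurally orthogonal groups. It is not DSJM's pattern-graph implementation, and the dictionary-based sparsity-pattern input is an assumption of the example.

```python
def greedy_column_grouping(pattern):
    """Greedily partition columns into structurally orthogonal groups.

    pattern is assumed to map each column index to the set of row indices of
    its nonzeros.  Columns placed in the same group have pairwise disjoint row
    sets, so one (finite-difference) evaluation per group recovers all of
    their entries.
    """
    groups = []   # each entry: (set of columns, union of their row indices)
    color = {}
    for j in sorted(pattern, key=lambda c: len(pattern[c]), reverse=True):
        for g, (cols, rows) in enumerate(groups):
            if rows.isdisjoint(pattern[j]):
                cols.add(j)
                rows |= pattern[j]
                color[j] = g
                break
        else:
            groups.append(({j}, set(pattern[j])))
            color[j] = len(groups) - 1
    return color
```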
SNAP Datasets: Stanford large network dataset collection
  • J Leskovec
  • A Krevl
A translation of "Sur deux propriétés des classes d'ensembles" by
  • E Szpilrajn-Marczewski