Chapter

Graph Algorithms in the Language of Linear Algebra

... Closeness centrality (clo) measures the average shortest distance from a vertex v to every other vertex. Specifically, it is the inverse of the average shortest-path distance between the vertex and all other vertices in the network (38,39): clo(v) = 1 / (average distance from v to all other vertices). ...
... This score is normalized by the total number of shortest paths between each pair of nodes in the graph. The target node has a high betweenness centrality if it appears on many shortest paths (38,39). ...
... In graph theory, eigenvector centrality (eig) is a measure of the influence of a node in a network. It assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node than equal connections to low-scoring nodes (38,39). ...
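The three measures quoted above can be sketched numerically. Below is a minimal illustration in Python (numpy and scipy are chosen here for convenience; the cited works do not prescribe a library), computing closeness and eigenvector centrality on a 4-vertex path graph:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

# 4-node path graph 0-1-2-3 as an adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]

# Closeness: inverse of the average shortest distance to all other vertices.
D = shortest_path(A, unweighted=True)
closeness = (n - 1) / D.sum(axis=1)

# Eigenvector centrality: power iteration (shifted by I so the dominant
# eigenvalue of this bipartite graph is unique and the iteration converges).
x = np.ones(n)
for _ in range(200):
    x = (A + np.eye(n)) @ x
    x /= np.linalg.norm(x)

print(closeness)  # middle vertices are closer on average
print(x)          # middle vertices also score highest
```

On the path graph the two endpoint vertices get closeness 0.5 and the middle vertices 0.75, and the eigenvector scores order the vertices the same way, which matches the intuition in the snippets above.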
Article
Full-text available
The gut microbiota plays a crucial role in infant health, with its development during the first 1,000 days influencing health outcomes. Understanding the relationships within the microbiota is essential to linking its maturation process to these outcomes. Several network-based methods have been developed to analyze the developing patterns of infant microbiota, but evaluating the reliability and effectiveness of these approaches remains a challenge. In this study, we created a test data pool using public infant microbiome data sets to assess the performance of four different network-based methods, employing repeated sampling strategies. We found that our proposed Probability-Based Co-Detection Model (PBCDM) demonstrated the best stability and robustness, particularly in network attributes such as node counts, average links per node, and the positive-to-negative link (P/N) ratios. Using the PBCDM, we constructed microbial co-existence networks for infants at various ages, identifying core genera networks through a novel network shearing method. Analysis revealed that core genera were more similar between adjacent age ranges, with increasing competitive relationships among microbiota as the infant microbiome matured. In conclusion, the PBCDM-based networks reflect known features of infant microbiota and offer a promising approach for investigating microbial relationships. This methodology could also be applied to future studies of genomic, metabolic, and proteomic data. IMPORTANCE As a research method and strategy, network analysis holds great potential for mining the relationships of bacteria. However, consistency and solid workflows to construct and evaluate the process of network analysis are lacking. Here, we provide a solid workflow to evaluate the performance of different microbial networks, and a novel probability-based co-existence network construction method used to decipher infant microbiota relationships. 
In addition, a network shearing strategy based on percolation theory is applied to find the core genera and connections in microbial networks at different age ranges. Both the PBCDM method and the network shearing workflow hold potential for mining microbiota relationships, and possibly for future analyses of genomic, metabolomic, and proteomic data.
... This paper presents the graph analytics ecosystem and various applications where the graph-linear algebra combination can be a real change maker. It introduces GraphBLAS [5], the standard specification for expressing graph algorithms in the language of linear algebra, which also defines the basic building blocks for doing so. It aims to present an avenue for further pursuit, in which a number of problems in applied domains amenable to graph formulations can be given efficient and parallelizable implementations using GraphBLAS. ...
... The equivalence, from an algorithmic perspective, underlies algebraic formulations of graph problems and enables the transition from the vertex-edge view to matrix-vector operations. A graph with substantially fewer than n² edges can be represented by a sparse adjacency or weight matrix [5]. The matrix carries all the information of the vertex-edge representation, and the two forms are interchangeable. ...
... The adjacency matrix can be generalized to support more complex interpretations by populating it with elements of a generalized algebraic structure, the semiring. A semiring (S, ⊕, ⊗) is a set S of elements with two binary operations, called addition ⊕ and multiplication ⊗, such that [5]: ...
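As a small illustration of why the semiring generalization matters, the sketch below swaps (+, ×) for the tropical (min, +) semiring, turning matrix-vector multiplication into shortest-path relaxation (plain numpy is used here; a GraphBLAS implementation would express the same idea natively):

```python
import numpy as np

INF = np.inf

# Weighted digraph; W[i, j] is the weight of edge i -> j, INF if absent.
W = np.array([
    [0,   2,   INF, INF],
    [INF, 0,   1,   7],
    [INF, INF, 0,   3],
    [INF, INF, INF, 0],
])

def minplus_vecmat(d, W):
    """One (min, +) semiring step: d'[j] = min_i (d[i] + W[i, j]).
    Replacing (+, *) with (min, +) turns mat-vec into edge relaxation."""
    return np.min(d[:, None] + W, axis=0)

# Bellman-Ford as repeated semiring mat-vec from source vertex 0.
d = np.full(4, INF)
d[0] = 0
for _ in range(3):          # n - 1 relaxation rounds suffice
    d = minplus_vecmat(d, W)
print(d)  # shortest distances from vertex 0: [0. 2. 3. 6.]
```

The same code with a (max, min) semiring would compute bottleneck paths; only the two operations change, which is exactly the flexibility the snippet describes.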
Conference Paper
Linear algebra is the backbone of modern applied data analytics. Not only does it provide the theoretical substratum for the development of analytical algorithms, it also lends technological support for extremely efficient implementations. The linear algebra ecosystem consists of remarkably fast hardware support with equally efficient software solutions leveraging the hardware capacity. Graphs are a ubiquitous modeling framework for the representation and analysis of connected data. A flipside from a practical perspective is that conventional graph algorithms are challenging: they are hard to parallelize and often rely on non-linear data structures that lack hardware support. In this context, this paper presents an invitation to an alternate way of looking at graph processing in terms of algebraic constructs, formulations, and technology. The paper presents the theoretical underpinnings of this view and some interesting progress at the research frontier in this direction. Above all, the intended objective is to present a compelling invitation to the academic and research community to adopt this approach in their academic and research activities.
... A sparse n × n matrix is called hyper-sparse if nnz = o(n) [7]. For a family of hyper-sparse n × n matrices, the CSC format becomes inefficient in the same way as the dense format is inefficient for matrices with o(n²) nonzero elements. ...
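The overhead described in this snippet is easy to see in code. A small sketch (scipy formats are used for illustration; the doubly compressed remedy proposed in the cited work is not shown):

```python
import numpy as np
from scipy.sparse import csc_matrix, coo_matrix

# A "hyper-sparse" matrix: n = 100_000 but only ~100 nonzeros (nnz = o(n)).
n, nnz = 100_000, 100
rng = np.random.default_rng(0)
rows = rng.integers(0, n, nnz)
cols = rng.integers(0, n, nnz)
vals = np.ones(nnz)

A_csc = csc_matrix((vals, (rows, cols)), shape=(n, n))
A_coo = coo_matrix((vals, (rows, cols)), shape=(n, n))

# CSC must store a column-pointer array of length n + 1 regardless of nnz,
# so its index overhead is dominated by n, not by the nonzero count.
print(len(A_csc.indptr))                # n + 1 column pointers for ~100 nonzeros
print(len(A_coo.row) + len(A_coo.col))  # COO stores only 2 * nnz indices
```

For this matrix the CSC pointer array alone is three orders of magnitude larger than the nonzero data, which is the inefficiency the snippet refers to.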
... The operation map += cfm from Table 3 will be used throughout the paper exclusively for the cases discussed in Subsection 3.2, i.e., pointer changes. Linear algebraic operations [7,18,19]. For the remaining operations (map · sv, sv1 · sv2^T, map · cfm, cfm^T · sv, cfm · sv) the required estimates can be obtained trivially. ...
Article
Full-text available
In the paper, we demonstrate that modern priority queues can be expressed in terms of linear algebraic operations. Specifically, we showcase one of the asymptotically fastest known priority queues, the Fibonacci heap. By employing our approach, we prove that for the Dijkstra, Prim, Brandes, and greedy maximal independent set algorithms, the theoretical complexity remains the same in both the combinatorial and the linear-algebraic setting.
... Expressing graph algorithms using linear algebra exposes natural parallelism, allowing users to capitalize on optimized sparse linear algebra routines for multi-threaded, GPU, and multi-node execution. This, together with the fact that most large graphs are sparse, positions sparse matrix algebra as a viable abstraction for graph computations [37,9]. ...
Preprint
Full-text available
The standardization of an interface for dense linear algebra operations in the BLAS standard has enabled interoperability between different linear algebra libraries, thereby boosting the success of scientific computing, in particular in scientific HPC. Despite numerous efforts in the past, the community has not yet agreed on a standardization for sparse linear algebra operations, for numerous reasons. One is the fact that sparse linear algebra objects allow for many different storage formats, and different hardware may favor different storage formats. This makes the definition of a FORTRAN-style all-circumventing interface extremely challenging. Another reason is that, as opposed to dense linear algebra functionality, in sparse linear algebra the size of the sparse data structure holding the operation result is not always known prior to the computation. Furthermore, as opposed to the standardization effort for dense linear algebra, we are late in the technology readiness cycle, and many production-ready software libraries using sparse linear algebra routines have implemented and committed to their own sparse BLAS interface. At the same time, there exists a demand for standardization that would improve interoperability and sustainability and allow for easier integration of building blocks. In an inclusive, cross-institutional effort involving numerous academic institutions, US National Labs, and industry, we spent two years designing a hardware-portable interface for basic sparse linear algebra functionality that serves user needs and is compatible with the different interfaces currently used by different vendors. In this paper, we present a C++ API for sparse linear algebra functionality, discuss the design choices, and detail how software developers preserve a lot of freedom in terms of how to implement functionality behind this API.
... As previously discussed, constraining problems to specific graph classes, such as tree-graphs or path-graphs, often simplifies complex algorithms (cf. [57,77,98,142,181]). Tasks that are computationally challenging for general graphs become more manageable when applied to these simpler structures, leading to more efficient solutions and improved computational performance across fields like computer science, biology, and network analysis. ...
Preprint
Full-text available
One of the most powerful tools in graph theory is the classification of graphs into distinct classes based on shared properties or structural features. Over time, many graph classes have been introduced, each aimed at capturing specific behaviors or characteristics of a graph. Neutrosophic Set Theory, a method for handling uncertainty, extends fuzzy logic by incorporating degrees of truth, indeterminacy, and falsity. Building on this framework, Neutrosophic Graphs [9, 96, 156] have emerged as significant generalizations of fuzzy graphs. In this paper, we extend several classes of fuzzy graphs to Neutrosophic graphs and analyze their properties.
... Multiplying two sparse matrices (SpGEMM) is a widely utilized computational operation across various domains such as graph algorithms [6,29], clustering [22,39], bioinformatics [14,24,26,32], algebraic multigrid solvers [12], and randomized sketching [33]. Distributed-memory parallel algorithms for SpGEMM have primarily emphasized sparsity-oblivious approaches involving 2D and 3D partitioning strategies [8,35]. ...
Preprint
Multiplying two sparse matrices (SpGEMM) is a common computational primitive used in many areas including graph algorithms, bioinformatics, algebraic multigrid solvers, and randomized sketching. Distributed-memory parallel algorithms for SpGEMM have mainly focused on sparsity-oblivious approaches that use 2D and 3D partitioning. Sparsity-aware 1D algorithms can theoretically reduce communication by not fetching nonzeros of the sparse matrices that do not participate in the multiplication. Here, we present a distributed-memory 1D SpGEMM algorithm and implementation. It uses MPI RDMA operations to mitigate the cost of packing/unpacking submatrices for communication, and it uses a block fetching strategy to avoid excessive fine-grained messaging. Our results show that our 1D implementation outperforms state-of-the-art 2D and 3D implementations within CombBLAS for many configurations, inputs, and use cases, while remaining conceptually simpler.
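A minimal SpGEMM sketch with scipy (for illustration only; the distributed 1D/2D/3D algorithms discussed above are far beyond this), showing the sparse-times-sparse primitive whose output pattern is not known in advance:

```python
import numpy as np
from scipy.sparse import csr_matrix

# SpGEMM: multiply two sparse matrices, producing a sparse result whose
# sparsity pattern must be discovered during the computation.
A = csr_matrix(np.array([[0, 1, 0],
                         [2, 0, 0],
                         [0, 0, 3]]))
B = csr_matrix(np.array([[0, 0, 4],
                         [5, 0, 0],
                         [0, 6, 0]]))

C = A @ B  # sparse * sparse -> sparse
print(C.toarray())
# row 0 picks up B's row 1; row 1 is 2 * B's row 0; row 2 is 3 * B's row 2
```

Each nonzero A[i, k] pulls in row k of B, which is exactly the access pattern a sparsity-aware 1D algorithm exploits by fetching only the rows that actually participate.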
... The ubiquitous matrix multiplication operation is an essential component of many problems arising in combinatorial and scientific computing. It is considered an intermediate step in a myriad of scientific, graph, and engineering applications including computer graphics, network theory, algebraic multigrid solvers, triangle counting, multisource breadth-first searching, shortest path problems, colored intersecting, subgraph matching, and quantized neural networks [1][2][3][4][5][6]. The data structures for storing large matrices optimize the memory used for storage and the performance of the multiplication operation. ...
Article
Full-text available
Matrix–matrix multiplication is of singular importance in linear algebra operations with a multitude of applications in scientific and engineering computing. Data structures for storing matrix elements are designed to minimize overhead information as well as to optimize the operation count. In this study, we utilize the notion of the compact diagonal storage method (CDM), which builds upon the previously developed diagonal storage—an orientation-independent uniform scheme to store the nonzero elements of a range of matrices. This study exploits both these storage schemes and presents efficient GPU-accelerated parallel implementations of matrix multiplication when the input matrices are banded and/or structured sparse. We exploit the data layouts in the diagonal storage schemes to expose a substantial amount of fine-grained parallelism and effectively utilize the GPU shared memory to improve the locality of data access for numerical calculations. Results from an extensive set of numerical experiments with the aforementioned types of matrices demonstrate orders-of-magnitude speedups compared with the sequential performance.
... A semiring (S, ⊕, ⊗) is a set S of elements with two binary operations, called addition ⊕ and multiplication ⊗, such that [15]: ...
... In this way, one can use cache-efficient, static data structures, such as the compressed-sparse-row format, see e.g. [29], for representing the auxiliary graph. ...
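The compressed-sparse-row layout mentioned here can be made concrete in a few lines (scipy is used for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# CSR stores a graph's adjacency structure as flat, static arrays.
A = csr_matrix(np.array([[0, 1, 1, 0],
                         [0, 0, 1, 0],
                         [1, 0, 0, 1],
                         [0, 0, 0, 0]]))

print(A.indptr)   # row offsets:   [0 2 3 5 5]
print(A.indices)  # column indices: [1 2 2 0 3]

# The neighbours of vertex v are a contiguous slice of `indices`, which is
# the cache-efficient, static access pattern the passage above refers to.
v = 2
print(A.indices[A.indptr[v]:A.indptr[v + 1]])  # [0 3]
```

Because the neighbour lists are contiguous and immutable, iterating over a vertex's edges is a linear scan, with none of the pointer chasing of adjacency-list structures.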
Article
Full-text available
The maximum-cut problem is one of the fundamental problems in combinatorial optimization. With the advent of quantum computers, both the maximum-cut and the equivalent quadratic unconstrained binary optimization problem have experienced much interest in recent years. This article aims to advance the state of the art in the exact solution of both problems—by using mathematical programming techniques. The main focus lies on sparse problem instances, although also dense ones can be solved. We enhance several algorithmic components such as reduction techniques and cutting-plane separation algorithms, and combine them in an exact branch-and-cut solver. Furthermore, we provide a parallel implementation. The new solver is shown to significantly outperform existing state-of-the-art software for sparse maximum-cut and quadratic unconstrained binary optimization instances. Furthermore, we improve the best known bounds for several instances from the 7th DIMACS Challenge and the QPLIB, and solve some of them (for the first time) to optimality.
... This idea can be implemented and improved as follows. The distances in G and H can be found using an LA form of breadth-first search; see, for example, page 33 of [21]. For any ...
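The LA form of breadth-first search referred to above can be sketched as repeated matrix-vector products over a Boolean semiring, masking out visited vertices (numpy/scipy are used here for illustration; the toy graph is made up):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy undirected graph: edges 0-1, 0-2, 1-3, 2-3, 3-4.
A = csr_matrix(np.array([[0, 1, 1, 0, 0],
                         [1, 0, 0, 1, 0],
                         [1, 0, 0, 1, 0],
                         [0, 1, 1, 0, 1],
                         [0, 0, 0, 1, 0]]))

def la_bfs(A, source):
    """Hop distances via frontier mat-vecs; -1 marks unvisited."""
    n = A.shape[0]
    dist = np.full(n, -1)
    dist[source] = 0
    frontier = np.zeros(n)
    frontier[source] = 1
    level = 0
    while frontier.any():
        level += 1
        nxt = (A.T @ frontier) > 0   # Boolean "or.and" mat-vec
        nxt &= dist == -1            # mask: keep unvisited vertices only
        dist[nxt] = level
        frontier = nxt.astype(float)
    return dist

print(la_bfs(A, 0))  # [0 1 1 2 3]
```

Each iteration performs one sparse mat-vec, so the whole search is expressed without any explicit queue, which is what makes the formulation parallelizable.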
Article
Full-text available
For a given simple data graph G and a simple query graph H, the subgraph matching problem is to find all the subgraphs of G, each isomorphic to H. There are many combinatorial algorithms for it and its counting version, predominantly based on backtracking with several pruning techniques. Much less is known about linear algebraic (LA, for short), i.e., adjacency-matrix-algebra, algorithms for this problem. Revisiting and updating old ideas of J. Nešetřil and S. Poljak, which reduce the general case to the case of clique-queries, we present the first LA algorithm for the subgraph matching/counting problem. For the k-clique matching/counting problem, we present static and dynamic LA algorithms, which may be of independent interest. For the k-clique counting problem, we also provide results of computational experiments of our solver on some large graphs and several values of k, which improve upon the results of several recent solvers.
... Next, the data can be taken as a weighted graph in which the vertices are points and the weight of an edge between two vertices is the mutual reachability distance. The minimum spanning tree is then built via Prim's algorithm [19]. The third step is to condense the cluster hierarchy tree into a smaller tree with some data attached to each tree node, using the parameters minimum cluster size (Min cluster size) and minimum samples (Min samples). ...
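The MST step of this pipeline can be sketched with scipy (an illustrative stand-in; scipy's routine is not necessarily Prim's, but on a connected graph with distinct edge weights every MST algorithm returns the same tree):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Symmetric mutual-reachability-style distances; 0 entries mean "no edge".
W = np.array([[0, 2, 6, 0],
              [2, 0, 3, 5],
              [6, 3, 0, 4],
              [0, 5, 4, 0]], dtype=float)

T = minimum_spanning_tree(W)
print(T.toarray())  # edges kept: 0-1 (2), 1-2 (3), 2-3 (4)
print(T.sum())      # total weight 9.0
```

The heavier edges 0-2 (6) and 1-3 (5) are discarded, leaving the n - 1 edges the clustering step then condenses.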
... Similarly, forward substitution corresponds to the forward mode of automatic differentiation. As is well documented in the preface to the book Graph Algorithms in the Language of Linear Algebra [7], there are many known benefits to formulating a graph algorithm mathematically in linear-algebraic terms. One of them is reduced cognitive complexity. ...
Preprint
Full-text available
We present a linear algebra formulation of backpropagation which allows the calculation of gradients by using a generically written ``backslash'' or Gaussian elimination on triangular systems of equations. Generally the matrix elements are operators. This paper has three contributions: 1. It is of intellectual value to replace traditional treatments of automatic differentiation with a (left acting) operator theoretic, graph-based approach. 2. Operators can readily be placed in matrices as a software implementation option in programming languages such as Julia. 3. We introduce a novel notation, the ``transpose dot'' operator {}^{T_\bullet}, that allows the reversal of operators. We demonstrate the elegance of the operators approach in a suitable programming language consisting of generic linear algebra operators, such as Julia \cite{bezanson2017julia}, and show that it is possible to realize this abstraction in code. Our implementation shows how generic linear algebra can allow operators as elements of matrices, and without rewriting any code, the software carries through to completion, giving the correct answer.
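A hedged sketch of the central idea, reduced to a scalar chain rule (the operator-valued general case from the paper is not attempted here; numpy/scipy stand in for Julia's backslash):

```python
import numpy as np
from scipy.linalg import solve_triangular

# Scalar chain y_i = a_i * y_{i-1}: the backpropagated gradients
# g_i = dy_N/dy_i satisfy g_N = 1 and g_i = a_{i+1} * g_{i+1}, i.e. an
# upper-triangular system (I - A^T) g = e_N, where A holds the local
# derivatives a_i on its subdiagonal. "Backslash" on that triangular
# system is exactly reverse-mode differentiation.
a = np.array([2.0, 3.0, 5.0])   # local derivatives a_1..a_3
N = len(a) + 1                  # variables y_0..y_3

A = np.zeros((N, N))
for i, ai in enumerate(a, start=1):
    A[i, i - 1] = ai            # y_i depends on y_{i-1}

e_last = np.zeros(N)
e_last[-1] = 1.0
g = solve_triangular(np.eye(N) - A.T, e_last, lower=False)

print(g)  # [30. 15.  5.  1.] -- the cumulative products backprop computes
```

Back substitution on the upper-triangular system visits the variables in reverse order, mirroring the reverse sweep of backpropagation, which is the correspondence the abstract describes.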
... An important design consideration for LIGHTNE 2.0 is to embed very large graphs on a single machine. Although the CSR format is normally regarded as a good compressed graph representation [48], we need to further compress this data structure and reduce memory usage. Our approach builds on state-of-the-art parallel graph compression techniques, which enable both fast parallel graph encoding and decoding. ...
Preprint
We propose LIGHTNE 2.0, a cost-effective, scalable, automated, and high-quality network embedding system that scales to graphs with hundreds of billions of edges on a single machine. In contrast to the mainstream belief that distributed architecture and GPUs are needed for large-scale network embedding with good quality, we prove that we can achieve higher quality, better scalability, lower cost, and faster runtime with a shared-memory, CPU-only architecture. LIGHTNE 2.0 combines two theoretically grounded embedding methods, NetSMF and ProNE. We introduce the following techniques to network embedding for the first time: (1) a newly proposed downsampling method to reduce the sample complexity of NetSMF while preserving its theoretical advantages; (2) a high-performance parallel graph processing stack, GBBS, to achieve high memory efficiency and scalability; (3) a sparse parallel hash table to aggregate and maintain the matrix sparsifier in memory; (4) a fast randomized singular value decomposition (SVD) enhanced by power iteration and fast orthonormalization to improve vanilla randomized SVD in terms of both efficiency and effectiveness; (5) Intel MKL for the proposed fast randomized SVD and spectral propagation; and (6) a fast and lightweight AutoML library, FLAML, for automated hyperparameter tuning. Experimental results show that LIGHTNE 2.0 can be up to 84X faster than GraphVite, 30X faster than PBG, and 9X faster than NetSMF while delivering better performance. LIGHTNE 2.0 can embed a very large graph with 1.7 billion nodes and 124 billion edges in half an hour on a CPU server, while other baselines cannot handle graphs of this scale.
... The well-known, so-called school method for matrix multiplication employs the dot (or inner) product of rows of the leftmost multiplier and columns of the rightmost matrix. Many tutorials on linear algebra describe an alternative way to perform GEMM via the outer (or tensor) product [6]. The complexity of both approaches for square matrices of size N × N is O(N³). ...
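The two orderings of GEMM described above can be sketched side by side in numpy:

```python
import numpy as np

A = np.arange(1, 7).reshape(2, 3).astype(float)  # 2x3
B = np.arange(1, 7).reshape(3, 2).astype(float)  # 3x2

# School method: C[i, j] is the dot product of row i of A and column j of B.
C_inner = np.array([[A[i] @ B[:, j] for j in range(2)] for i in range(2)])

# Outer-product method: C is a sum of rank-1 updates, one per column of A
# paired with the matching row of B.
C_outer = sum(np.outer(A[:, k], B[k, :]) for k in range(3))

print(np.array_equal(C_inner, C_outer))  # True -- same O(N^3) work, reordered
```

The outer-product ordering is the one exploited in the paper: each rank-1 update touches whole rows and columns at once, which maps naturally onto SIMD lanes.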
Article
Full-text available
Matrix multiplication is an important operation for many engineering applications. Sometimes new features that include matrix multiplication should be added to existing and even out-of-date embedded platforms. In this paper, an unusual problem is considered: how to implement matrix multiplication of 32-bit signed integers and fixed-point numbers on DSP having SIMD instructions for 16-bit integers only. For examined tasks, matrix size may vary from several tens to two hundred. The proposed mathematical approach for dense rectangular matrix multiplication of 32-bit numbers comprises decomposition of 32-bit matrices to matrices of 16-bit numbers, four matrix multiplications of 16-bit unsigned integers via outer product, and correction of outcome for signed integers and fixed point numbers. Several tricks for performance optimization are analyzed. In addition, ways for block-wise and parallel implementations are described. An implementation of the proposed method by means of 16-bit vector instructions is faster than matrix multiplication using 32-bit scalar instructions and demonstrates performance close to a theoretically achievable limit. The described technique can be generalized for matrix multiplication of n-bit integers and fixed point numbers via handling with matrices of n/2-bit integers. In conclusion, recommendations for practitioners who work on implementation of matrix multiplication for various DSP are presented.
... On the other hand, devising parallel algorithms built upon the notion of eliminants [Abdali 1994] to provide an efficient computation of the full path provenance (understood as the asteration of a matrix over a star semiring) seems quite promising, and could lead to efficient solutions for computing the provenance over large transportation networks. Existing solutions in the spirit of GraphBLAS [Kepner et al. 2011], a framework for expressing efficient parallel algorithms over large sparse graphs in the language of linear algebra (notably involving semirings to describe the operations to be performed), are natural candidates for designing an efficient parallel implementation of the full path provenance in general semirings. Nevertheless, priority queue management is not trivially expressible in linear algebra operators, and is also not easily amenable to parallelism. ...
Thesis
The growing amount of data collected by sensors or generated by human interaction has led to an increasing use of graph databases, an efficient model for representing intricate data. Techniques to keep track of the history of computations applied to the data inside classical relational database systems are also topical because of their application to enforcing Data Protection Regulations (e.g., GDPR). Our research work mixes the two by considering a semiring-based provenance model for navigational queries over graph databases. We first present a comprehensive survey on semiring theory and its applications in different fields of computer science, geared towards its relevance for our context. From the richness of the literature, we notably obtain a lower bound for the complexity of the full provenance computation in our setting. In a second part, we focus on the model itself by introducing a toolkit of provenance-aware algorithms, each targeting specific properties of the semiring in use. We notably introduce a new method based on lattice theory permitting an efficient provenance computation for complex graph queries. We propose an open-source implementation of the above-mentioned algorithms, and we conduct an experimental study over real transportation networks of large size, witnessing the efficiency of our approach in practical scenarios. We finally consider how this framework is positioned compared to other provenance models, such as the semiring-based Datalog provenance model. We make explicit how the methods we applied to graph databases can be extended to Datalog queries, and we show how they can be seen as an extension of the semi-naïve evaluation strategy. To leverage this fact, we extend the capabilities of Soufflé, a state-of-the-art Datalog solver, to design an efficient provenance-aware Datalog evaluator.
Experimental results based on our open-source implementation show that this approach stays competitive with dedicated graph solutions, despite being more general. Finally, we discuss research ideas for improving the model and state open questions raised by our work.
... Several centrality metrics have been proposed; the most popular and well-known are described below [13,28,33,58]. In this study, centrality analysis, which has a wide application area from communication networks to social networks, is used to identify the importance of the edges and vertices of a graph. ...
Article
Full-text available
Face recognition remains critical and up-to-date due to its undeniable contribution to security. Many descriptors, the most vital features used for face discrimination, have been proposed, and more continue to appear. This article presents a novel and highly discriminative identifier that can maintain high recognition performance even under high noise, varying illumination, and expression exposure. By converting the image into a graph, the feature set is extracted from the resulting graph rather than making inferences directly on the image pixels, as done conventionally. The adjacency matrix is created at the outset by considering the pixels' adjacencies and their intensity values. Subsequently, the weighted directed graph, having vertices and edges denoting the pixels and the adjacencies between them, is formed. The weights of the edges state the intensity differences between the adjacent pixels. Ultimately, information extraction is performed, which indicates the importance of each vertex in the graph, expresses the importance of the pixels in the entire image, and forms the feature set of the face image. As evidenced by the extensive simulations performed, the proposed graph-based identifier shows remarkable and competitive performance in recognition accuracy, even under extreme conditions such as high noise, variable expression, and illumination, compared with state-of-the-art face recognition methods.
... The GraphBLAS standard provides significant performance and compression capabilities which improve the feasibility of analyzing these volumes of data [9]-[23]. Specifically, the GraphBLAS is ideally suited for both constructing and analyzing anonymized hypersparse traffic matrices. ...
Preprint
Full-text available
Internet analysis is a major challenge due to the volume and rate of network traffic. In lieu of analyzing traffic as raw packets, network analysts often rely on compressed network flows (netflows) that contain the start time, stop time, source, destination, and number of packets in each direction. However, many traffic analyses benefit from temporal aggregation of multiple simultaneous netflows, which can be computationally challenging. To alleviate this concern, a novel netflow compression and resampling method has been developed leveraging GraphBLAS hypersparse traffic matrices that preserve anonymization while enabling subrange analysis. Standard multitemporal spatial analyses are then performed on each subrange to generate detailed statistical aggregates of the source packets, source fan-out, unique links, destination fan-in, and destination packets of each subrange, which can then be used for background modeling and anomaly detection. A simple file format based on GraphBLAS sparse matrices is developed for storing these statistical aggregates. This method is scale-tested on the MIT SuperCloud using a 50 trillion packet netflow corpus from several hundred sites collected over several months. The resulting compression achieved is significant (<0.1 bit per packet), enabling extremely large netflow analyses to be stored and transported. The single-node parallel performance is analyzed in terms of both processors and threads, showing that a single node can perform hundreds of simultaneous analyses at over a million packets/sec (roughly equivalent to a 10 Gigabit link).
Article
Sparse matrix-vector semiring computation is a key operation in sparse matrix computations, with performance strongly dependent on both program design and the features of the sparse matrices. Given the diversity of sparse matrices, designing a tailored program for each matrix is challenging. To address this, we propose SRSparse, a program generator that creates tailored programs by automatically combining program design methods to fit specific input matrices. It provides two components: the problem definition configuration, which declares the computation, and the scheduling language, which can be leveraged by an auto-tuner to specify the program designs. The two are lowered to the intermediate representations of SRSparse, the Format IR and Kernel IR, which respectively generate the format conversion routine and the kernel code. We evaluate SRSparse on four representative sparse kernels and three format conversion routines. For sparse kernels, SRSparse achieves median speedups over handwritten programs: COO (3.50×), CSR-Adaptive (5.36×), CSR5 (2.06×), ELL (1.63×), Gunrock (1.57×), and GraphBLAST (1.96×); over an auto-tuner: AlphaSparse (1.16×); and over a compiler: TACO (1.71×). For format conversion routines, SRSparse achieves median speedups over handwritten implementations: Intel MKL (7.60×), SPARSKIT (2.61×), CUSP (2.77×), and Ginkgo (1.74×); and over a compiler: TACO (4.04×).
Article
Graphs are ubiquitous in various real-world applications, and many graph processing systems have been developed. Recently, hardware accelerators have been exploited to speed up graph systems. However, such hardware-specific systems are hard to migrate across different hardware backends. In this paper, we propose the first tensor-based graph processing framework, Tgraph, which can be smoothly deployed and run on any powerful hardware accelerator (uniformly called XPU) that supports Tensor Computation Runtimes (TCRs). TCRs, which are deep learning frameworks along with their runtimes and compilers, provide tensor-based interfaces so users can easily utilize specialized hardware accelerators without delving into complex low-level programming details. However, building an efficient tensor-based graph processing framework is non-trivial. Thus, we make the following efforts: (1) propose a tensor-centric computation model for users to implement graph algorithms with easy-to-use programming interfaces; (2) provide a set of graph operators implemented with tensors to shield the computation model from the detailed tensor operators, so that Tgraph can be easily migrated and deployed across different TCRs; (3) design a tensor-based graph compression and computation strategy and an out-of-XPU-memory computation strategy to handle large graphs. We conduct extensive experiments on multiple graph algorithms (BFS, WCC, SSSP, etc.), which validate that Tgraph not only outperforms seven state-of-the-art graph systems, but also can be smoothly deployed and run on multiple DL frameworks (PyTorch and TensorFlow) and hardware backends (Nvidia GPU, AMD GPU, and Apple MPS).
Article
Triangle centrality is introduced for finding important vertices in a graph based on the concentration of triangles surrounding each vertex. It has the distinct feature of allowing a vertex to be central if it is in many triangles or none at all. Given a simple, undirected graph $G=(V,E)$, with $n=|V|$ vertices and $m=|E|$ edges, let $\triangle(v)$ and $\triangle(G)$ denote the respective triangle counts of $v$ and $G$. Let $N(v)$ be the neighborhood set of $v$. Respectively, $N_{\triangle}(v)$ and $N_{\triangle}[v]=\{v\}\cup N_{\triangle}(v)$ denote the set of neighbors that are in triangles with $v$ and the closed set including $v$. Then the triangle centrality for a vertex $v$ is
$$TC(v) = \frac{\frac{1}{3}\sum_{u\in N_{\triangle}[v]} \triangle(u) + \sum_{w\in N(v)\setminus N_{\triangle}(v)} \triangle(w)}{\triangle(G)}.$$
We show experimentally that triangle centrality is broadly applicable to many different types of networks. Our empirical results demonstrate that 30% of the time triangle centrality identified central vertices that differed from those found by five well-known centrality measures, which suggests novelty without being overly specialized. It is also asymptotically faster to compute on sparse graphs than all but the most trivial of these other measures. We introduce optimal algorithms that compute triangle centrality in $O(m\bar{\delta})$ time and $O(m+n)$ space, where $\bar{\delta}\leq O(\sqrt{m})$ is the average degeneracy introduced by Burkhardt, Faber, and Harris (2020). In practical applications $\bar{\delta}$ is much smaller than $\sqrt{m}$, so triangle centrality can be computed in nearly linear time. On a Concurrent Read Exclusive Write (CREW) Parallel Random Access Memory (PRAM) machine, we give a near work-optimal parallel algorithm that takes $O(\log n)$ time using $O(m\sqrt{m})$ CREW PRAM processors. In MapReduce, we show it takes four rounds using $O(m\sqrt{m})$ communication bits, and is therefore optimal. We also derive a linear-algebraic formulation of triangle centrality which can be computed in $O(m\bar{\delta})$ time on sparse graphs.
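The definition above can be transcribed directly into a short linear-algebraic sketch. This naive version works on a dense adjacency matrix in cubic time, unlike the paper's optimal algorithms, and is only meant to make the formula concrete:

```python
import numpy as np

# Direct NumPy transcription of the triangle-centrality definition (a naive
# O(n^3) sketch, not the paper's O(m * avg-degeneracy) algorithms).
def triangle_centrality(A):
    """A: symmetric 0/1 adjacency matrix with zero diagonal."""
    A2 = A @ A
    tri = np.diag(A @ A2) // 2        # triangle(v): closed 3-walks / 2
    tri_G = tri.sum() // 3            # triangle(G): each triangle counted 3x
    n = A.shape[0]
    tc = np.zeros(n)
    for v in range(n):
        nbrs = np.flatnonzero(A[v])
        # u shares a triangle with v iff u is adjacent and has a common neighbor
        in_tri = nbrs[A2[v, nbrs] > 0]
        closed = np.append(in_tri, v)           # N_triangle[v]
        outside = np.setdiff1d(nbrs, in_tri)    # N(v) \ N_triangle(v)
        tc[v] = (tri[closed].sum() / 3 + tri[outside].sum()) / tri_G
    return tc
```

On a triangle {0,1,2} with a pendant path 2–3–4, vertex 3 (in no triangle, but adjacent to a triangle vertex) scores as high as the triangle vertices, illustrating the "many triangles or none at all" feature.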
Article
Sparse Triangular Solve (SpTRSV) has long been an essential kernel in the field of scientific computing. Due to its low computational intensity and internal data dependencies, SpTRSV is hard to implement and optimize on GPUs. Based on our experimental observations, existing implementations on GPUs fail to achieve optimal performance due to their sub-optimal parallelism setups and code implementations, and their lack of consideration of the irregular data distribution. Moreover, their algorithm design lacks adaptability to different input matrices, which may involve substantial manual effort in algorithm redesign and parameter tuning for performance consistency. In this work, we propose AG-SpTRSV, an automatic framework to optimize SpTRSV on GPUs, which provides high performance on various matrices while eliminating the costs of manual design. AG-SpTRSV abstracts the procedure of optimizing an SpTRSV kernel as a scheme and constructs a comprehensive optimization space based on it. By defining a unified code template and preparing code variants, AG-SpTRSV enables fine-grained dynamic parallelism and adaptive code optimizations to handle various tasks. Through computation graph transformation and multi-hierarchy heuristic scheduling, AG-SpTRSV generates schemes for task partitioning and mapping, which effectively address the issues of irregular data distribution and internal data dependencies. AG-SpTRSV searches for the best scheme to optimize the kernel for the specific input matrix. A learned lightweight performance model is also introduced to reduce search costs and provide an efficient end-to-end solution. Experimental results with the SuiteSparse Matrix Collection on NVIDIA Tesla A100 and RTX 3080 Ti show that AG-SpTRSV outperforms state-of-the-art implementations with geometric average speedups of 2.12× to 3.99×. With the performance model enabled, AG-SpTRSV provides an efficient end-to-end solution, with preprocessing times ranging from 3.4 to 245 times the execution time.
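For context, a baseline sequential lower-triangular solve over CSR makes the internal data dependencies explicit: row i cannot be solved until every earlier row it references is done. This sketch (not AG-SpTRSV's code) is the kernel that such frameworks partition and schedule:

```python
# Baseline sequential forward substitution on CSR (solve L x = b for lower-
# triangular L). The x[j] reads below are the row-to-row dependencies that
# make SpTRSV hard to parallelize on GPUs.
def sptrsv_lower(row_ptr, col_idx, vals, b):
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        diag = None
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j == i:
                diag = vals[k]          # diagonal entry of row i
            else:
                s -= vals[k] * x[j]     # depends on already-solved rows j < i
        x[i] = s / diag
    return x
```

Level-set and scheme-based methods group rows with no mutual dependencies so they can be solved concurrently, which is the optimization space AG-SpTRSV searches.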
Chapter
Sparse matrix-vector multiplication (SpMV) is extensively used in scientific computing and often accounts for a significant portion of the overall computational overhead. Therefore, improving the performance of SpMV is crucial. However, sparse matrices exhibit a sporadic and irregular distribution of non-zero elements, resulting in workload imbalance among threads and challenges in vectorization. To address these issues, numerous efforts have focused on optimizing SpMV based on the hardware characteristics of computing platforms. In this paper, we present an optimization of CSR-based SpMV, since the CSR format is the most widely used and supported by various high-performance sparse computing libraries, on a novel MIMD computing platform, Pezy-SC3s. Based on the hardware characteristics of Pezy-SC3s, we tackle poor data locality, workload imbalance, and vectorization challenges in CSR-based SpMV by employing matrix chunking, applying Atomic Cache for workload scheduling, and utilizing SIMD instructions when performing SpMV. As the first study to investigate SpMV optimization on Pezy-SC3s, we evaluate the performance of our work by comparing it with CSR-based SpMV and the SpMV provided by Nvidia's cuSPARSE. Through experiments conducted on 2092 matrices obtained from SuiteSparse, we demonstrate that our optimization achieves a maximum speedup of 17.63× and an average of 1.56× over CSR-based SpMV, and an average bandwidth utilization of 35.22% for large-scale matrices ($nnz \ge 10^{6}$), compared with 36.17% obtained using cuSPARSE. These results demonstrate that our optimization effectively harnesses the hardware resources of Pezy-SC3s, leading to improved performance of CSR-based SpMV.
Preprint
Full-text available
Sparse generalized matrix-matrix multiplication (SpGEMM) is a fundamental operation for real-world network analysis. With the increasing size of real-world networks, the single-machine-based SpGEMM approach cannot perform SpGEMM on large-scale networks that exceed the size of main memory (i.e., it is not scalable). Although the distributed-system-based approach could handle large-scale SpGEMM across multiple machines, it suffers from severe inter-machine communication overhead to aggregate the results of multiple machines (i.e., it is not efficient). To address this dilemma, in this paper, we propose a novel storage-based SpGEMM approach (SAGE) that stores given networks in storage (e.g., SSD) and loads only the necessary parts of the networks into main memory when they are required for processing, via a 3-layer architecture. Furthermore, we point out three challenges that could degrade the overall performance of SAGE and propose three effective strategies to address them: (1) block-based workload allocation for balancing workloads across threads, (2) in-memory partial aggregation for reducing the amount of unnecessarily generated storage-memory I/Os, and (3) distribution-aware memory allocation for preventing unexpected buffer overflows in main memory. Via extensive evaluation, we verify the superiority of SAGE over existing SpGEMM methods in terms of scalability and efficiency.
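The in-memory kernel that storage-based and distributed SpGEMM systems build on is row-wise (Gustavson) multiplication with a hash accumulator. A minimal sketch, assuming dict-of-dicts sparse matrices (a toy illustration, not SAGE's storage-based implementation):

```python
# Row-wise (Gustavson) SpGEMM sketch: C = A @ B, where a sparse matrix is a
# dict mapping row index -> {column index: value}. Out-of-core systems
# partition exactly this computation into blocks that fit in memory.
def spgemm(A, B):
    C = {}
    for i, row in A.items():
        acc = {}                              # hash accumulator for row i of C
        for k, a_ik in row.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C
```

Each output row is produced independently, which is what makes block-based workload allocation across threads natural for this kernel.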
Article
We study large-scale network embedding with the goal of generating high-quality embeddings for networks with more than 1 billion vertices and 100 billion edges. Recent attempts LightNE and NetSMF propose to sparsify and factorize the (dense) NetMF matrix for embedding large networks, where NetMF is a theoretically grounded network embedding method. However, there is a trade-off between their embeddings' quality and scalability due to their expensive memory requirements, making the embeddings less effective under real-world memory constraints. Therefore, we present the SketchNE model, a scalable, effective, and memory-efficient network embedding solution developed for a single machine with CPU only. The main idea of SketchNE is to avoid the explicit construction and factorization of the NetMF matrix, either sparsely or densely, when producing the embeddings, through the proposed sparse-sign randomized single-pass SVD algorithm. We conduct extensive experiments on nine datasets of various sizes for vertex classification and link prediction, demonstrating the consistent outperformance of SketchNE over state-of-the-art baselines in terms of both effectiveness and efficiency. SketchNE costs only 1.0 hour to embed the Hyperlink2012 network with 3.5 billion vertices and 225 billion edges on a CPU-only single machine, with embedding superiority (e.g., a 282% relative HITS@10 gain over LightNE).
Article
We propose LightNE 2.0, a cost-effective, scalable, automated, and high-quality network embedding system that scales to graphs with hundreds of billions of edges on a single machine. In contrast to the mainstream belief that distributed architecture and GPUs are needed for large-scale network embedding with good quality, we prove that we can achieve higher quality, better scalability, lower cost, and faster runtime with a shared-memory, CPU-only architecture. LightNE 2.0 combines two theoretically grounded embedding methods, NetSMF and ProNE. We introduce the following techniques to network embedding for the first time: (1) a newly proposed downsampling method to reduce the sample complexity of NetSMF while preserving its theoretical advantages; (2) a high-performance parallel graph processing stack, GBBS, to achieve high memory efficiency and scalability; (3) a sparse parallel hash table to aggregate and maintain the matrix sparsifier in memory; (4) a fast randomized singular value decomposition (SVD) enhanced by power iteration and fast orthonormalization to improve vanilla randomized SVD in terms of both efficiency and effectiveness; (5) Intel MKL for the proposed fast randomized SVD and spectral propagation; and (6) a fast and lightweight AutoML library, FLAML, for automated hyperparameter tuning. Experimental results show that LightNE 2.0 can be up to 84× faster than GraphVite, 30× faster than PBG, and 9× faster than NetSMF while delivering better performance. LightNE 2.0 can embed a very large graph with 1.7 billion nodes and 124 billion edges in half an hour on a CPU server, while other baselines cannot handle graphs of this scale.
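Technique (4), randomized SVD with power iteration and re-orthonormalization, can be sketched in a few lines of NumPy. This is the textbook Halko-style scheme with assumed parameter names, not LightNE 2.0's tuned implementation:

```python
import numpy as np

# Minimal randomized SVD with power iteration and re-orthonormalization
# (a sketch of the idea behind item (4); parameter names are illustrative).
def randomized_svd(A, rank, n_iter=4, oversample=10, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # random range finder: project A onto an oversampled random subspace
    Q = np.linalg.qr(A @ rng.standard_normal((n, rank + oversample)))[0]
    for _ in range(n_iter):
        # power iteration sharpens the range estimate; the QR step
        # re-orthonormalizes to prevent numerical drift
        Q = np.linalg.qr(A @ (A.T @ Q))[0]
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank]
```

The expensive factorization is performed on the small projected matrix `Q.T @ A`, which is why this scales to matrices far too large for a direct SVD.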
Article
SuiteSparse:GraphBLAS is a full parallel implementation of the GraphBLAS standard, which defines a set of sparse matrix operations on an extended algebra of semirings using an almost unlimited variety of operators and types. When applied to sparse adjacency matrices, these algebraic operations are equivalent to computations on graphs. A description of the parallel implementation of SuiteSparse:GraphBLAS is given, including its novel parallel algorithms for sparse matrix multiply, addition, element-wise multiply, submatrix extraction and assignment, and the GraphBLAS mask/accumulator operation. Its performance is illustrated by solving the graph problems in the GAP Benchmark and by comparing it with other sparse matrix libraries.
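The mask/accumulator operation mentioned above can be illustrated with a toy dense matrix-vector multiply. This sketch mimics the semantics of a masked, accumulated `mxv` (y⟨mask⟩ += A ⊕.⊗ x) but is in no way SuiteSparse:GraphBLAS code:

```python
# Toy illustration of GraphBLAS-style masked matrix-vector multiply with an
# accumulator: y<mask> += A (+.*) x. Plain dense lists stand in for GrB objects.
def masked_mxv(y, mask, A, x, accum=lambda old, new: old + new):
    n = len(y)
    for i in range(n):
        if not mask[i]:
            continue                    # the mask suppresses writes to y[i]
        t = sum(A[i][j] * x[j] for j in range(len(x)))
        y[i] = accum(y[i], t)           # accumulator merges new with old value
    return y
```

In graph terms, the mask restricts the computation to a subset of vertices (e.g., the unvisited ones in BFS), which is how GraphBLAS avoids redundant work without changing the algebraic formulation.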
Article
A Join-Project operation is a join operation followed by a duplicate-eliminating projection operation. It is used in a large variety of applications, including entity matching, set analytics, and graph analytics. Previous work proposes a hybrid design that exploits the classical solution (i.e., join and deduplication) and MM (matrix multiplication) to process the sparse and the dense portions of the input data, respectively. However, we observe three problems in the state-of-the-art solution: 1) The outputs of the sparse and dense portions overlap, requiring an extra deduplication step; 2) Its table-to-matrix transformation makes an over-simplified assumption about the attribute values; and 3) There is a mismatch between the MM employed in BLAS packages and the characteristics of the Join-Project operation. In this paper, we propose DIM³, an optimized algorithm for the Join-Project operation. To address 1), we propose an intersection-free partition method to completely remove the final deduplication step. For 2), we develop an optimized design for mapping attribute values to natural numbers. For 3), we propose the DenseEC and SparseBMM algorithms to exploit the structure of Join-Project for better efficiency. Moreover, we extend DIM³ to consider partial result caching and to support Join-op queries, including Join-Aggregate and MJP (Multi-way Joins with Projection). Experimental results using both real-world and synthetic data sets show that DIM³ outperforms previous Join-Project solutions by a factor of 2.3× to 18×. Compared to RDBMSs, DIM³ achieves orders of magnitude speedups.
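The MM view of Join-Project can be made concrete with a small sketch: once attribute values are mapped to row/column indices, the distinct output pairs are exactly the nonzeros of a matrix product (a toy illustration, not DIM³'s optimized kernels):

```python
import numpy as np

# Sketch of Join-Project as matrix multiplication: the distinct (a, c) pairs
# in pi_{a,c}(R(a,b) join S(b,c)) are the nonzeros of M_R @ M_S, where the
# tables are encoded as 0/1 matrices over their (already index-mapped) values.
def join_project(R, S, na, nb, nc):
    """R: list of (a, b) tuples, S: list of (b, c) tuples; returns a set of (a, c)."""
    MR = np.zeros((na, nb), dtype=np.int64)
    MS = np.zeros((nb, nc), dtype=np.int64)
    for a, b in R:
        MR[a, b] = 1
    for b, c in S:
        MS[b, c] = 1
    P = MR @ MS                         # P[a, c] counts the join witnesses b
    return {(int(a), int(c)) for a, c in zip(*np.nonzero(P))}
```

Note that the product deduplicates for free: a pair joined through several b values just gets a larger count, but still contributes one nonzero.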
Article
GraphBLAS is a recent standard that allows the expression of graph algorithms in the language of linear algebra and enables automatic code parallelization and optimization. GraphBLAS operations are memory bound and may benefit from data locality optimizations enabled by nonblocking execution. However, nonblocking execution remains under-evaluated. In this article, we present a novel design and implementation that investigates nonblocking execution in GraphBLAS. Lazy evaluation enables runtime optimizations that improve data locality, and dynamic data dependence analysis identifies operations that may reuse data in cache. The nonblocking execution of an arbitrary number of operations results in dynamic parallelism, and the performance of the nonblocking execution depends on two parameters, which are automatically determined at run-time based on a proposed analytic model. The evaluation confirms the importance of nonblocking execution for various matrices of three algorithms, by showing up to 4.11× speedup over blocking execution as a result of better cache utilization. The proposed analytic model makes the nonblocking execution reach up to 5.13× speedup over the blocking execution. The fully automatic performance is very close to that obtained by using the best manual configuration for both small and large matrices. Finally, the evaluation includes a comparison with other state-of-the-art frameworks for numerical linear algebra programming that employ parallel execution and similar optimizations to those discussed in this work, and the presented nonblocking execution reaches up to 16.1× speedup over the state-of-the-art.
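The lazy-evaluation idea behind nonblocking execution can be illustrated with a toy vector type that records element-wise operations and fuses them into a single pass at materialization, touching each element once. This is a sketch of the concept only, not the article's GraphBLAS runtime:

```python
# Toy sketch of lazy evaluation with operation fusion: each map() call is
# deferred, and materialize() applies the whole pipeline in one pass over the
# data, so each element is loaded into cache once instead of once per operation.
class LazyVector:
    def __init__(self, data):
        self.data = data
        self.ops = []                   # deferred element-wise operations

    def map(self, f):
        self.ops.append(f)              # record the operation, do no work yet
        return self

    def materialize(self):
        out = []
        for v in self.data:             # single fused pass: better data locality
            for f in self.ops:
                v = f(v)
            out.append(v)
        self.data, self.ops = out, []
        return out
```

A blocking runtime would instead sweep the whole vector once per operation, which is exactly the extra memory traffic that nonblocking execution avoids.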
Article
We present parallel algorithms and data structures for three fundamental operations in Numerical Linear Algebra: (i) Gaussian and CountSketch random projections and their combination, (ii) computation of the Gram matrix, and (iii) computation of the squared row norms of the product of two matrices, with a special focus on “tall-and-skinny” matrices, which arise in many applications. We provide a detailed analysis of the ubiquitous CountSketch transform and its combination with Gaussian random projections, accounting for memory requirements, computational complexity, and workload balancing. We also demonstrate how these results can be applied to column subset selection, least squares regression, and leverage scores computation. These tools have been implemented in pylspack, a publicly available Python package whose core is written in C++ and parallelized with OpenMP, and which is compatible with standard matrix data structures of SciPy and NumPy. Extensive numerical experiments indicate that the proposed algorithms scale well and significantly outperform existing libraries for tall-and-skinny matrices.
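The CountSketch transform admits a very short reference implementation: each row of A is hashed to one of s buckets and multiplied by a random sign, so applying the sketch costs a single pass over the nonzeros of A. A NumPy sketch for intuition (not pylspack's parallel C++ kernel; names are illustrative):

```python
import numpy as np

# Reference sketch of the CountSketch transform S @ A for a tall matrix A:
# row i is added into bucket h(i) with sign sigma(i), one pass over A total.
def countsketch(A, s, seed=0):
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    bucket = rng.integers(0, s, size=m)      # hash h: [m] -> [s]
    sign = rng.choice([-1.0, 1.0], size=m)   # random sign sigma: [m] -> {-1, +1}
    SA = np.zeros((s, A.shape[1]))
    for i in range(m):
        SA[bucket[i]] += sign[i] * A[i]
    return SA
```

For tall-and-skinny A (m much larger than the number of columns), the sketch compresses m rows down to s while approximately preserving column-space geometry, which is what makes it useful for least squares and leverage-score computations.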