Oded Green

Oded Green
NVIDIA | Nvidia

PhD

About

48
Publications
32,001
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
824
Citations
Additional affiliations
July 2018 - present
NVIDIA
Position
  • Developer
July 2015 - present
Georgia Institute of Technology
Position
  • Researcher
April 2014 - April 2015
ArrayFire
Position
  • Chief Operating Officer and Research Scientist

Publications

Publications (48)
Conference Paper
Full-text available
Sparse data computations are ubiquitous in science and engineering. Unlike their dense data counterparts, sparse data computations have less locality and more irregularity in their execution, making them significantly more challenging to parallelize and optimize. Many of the existing formats for sparse data representations on parallel architectures...
Preprint
Full-text available
Hash tables are ubiquitous and used in a wide range of applications for efficient probing of large and unsorted data. If designed properly, hash-tables can enable efficients look ups in a constant number of operations or commonly referred to as O(1) operations. As data sizes continue to grow and data becomes less structured (as is common for big-da...
Preprint
Full-text available
Breadth-First Search (BFS) is a building block used in a wide array of graph analytics and is used in various network analysis domains: social, road, transportation, communication, and much more. Over the last two decades, network sizes have continued to grow. The popularity of BFS has brought with it a need for significantly faster traversals. Thu...
Conference Paper
Full-text available
Breadth-First Search (BFS) traversals appear in a wide range of applications and domains. BFS traversals determine the distance between key vertices and the remaining vertices in the network. The distance between the vertices often called the number of hops, is the shortest path between a root and the remaining vertices in the graph. Given its appl...
Conference Paper
Full-text available
Effective scheduling and load balancing of applications on massively multi-threading systems remains challenging despite decades of research, especially for irregular and data dependent problems where the execution control path is unknown until run-time. One of the most widely used load-balancing schemes used for data dependent problems is a parall...
Article
Full-text available
Network analysis defines a number of centrality measures to identify the most central nodes in a network. Fast computation of those measures is a major challenge in algorithmic network analysis. Aside from closeness and betweenness, Katz centrality is one of the established centrality measures. In this paper, we consider the problem of computing ra...
Conference Paper
Full-text available
The transitive closure of a graph is a new graph where every vertex is directly connected to all vertices to which it had a path in the original graph. Transitive closures are useful for reachability and relationship querying. Finding the transitive closure can be computationally expensive and requires a large memory footprint as the output is typi...
Article
Full-text available
In this article, we introduce HashGraph, a new scalable approach for building hash tables that uses concepts taken from sparse graph representations—hence, the name HashGraph. HashGraph introduces a new way to deal with hash-collisions that does not use “open-addressing” or “separate-chaining,” yet it has the benefits of both these approaches. Hash...
Preprint
Full-text available
Hash tables are used in a plethora of applications, including database operations, DNA sequencing, string searching, and many more. As such, there are many parallelized hash tables targeting multicore, distributed, and accelerator-based systems. We present in this work a multi-GPU hash table implementation that can process keys at a throughput comp...
Conference Paper
Full-text available
For the problem of computing the connected components of a graph, this paper considers the design of algorithms that are resilient to transient hardware faults, like bit flips. More specifically, it applies the technique of self-stabilization. A system is self-stabilizing if, when starting from a valid or invalid state, it is guaranteed to reach a...
Preprint
Full-text available
Graph processing is typically considered to be a memory-bound rather than compute-bound problem. One common line of thought is that more available memory bandwidth corresponds to better graph processing performance. However, in this work we demonstrate that the key factor in the utilization of the memory system for graph algorithms is not necessari...
Preprint
Full-text available
Graph processing is typically considered to be a memory-bound rather than compute-bound problem. One common line of thought is that more available memory bandwidth corresponds to better graph processing performance. However, in this work we demonstrate that the key factor in the utilization of the memory system for graph algorithms is not necessari...
Conference Paper
Full-text available
Counting common neighbors between all vertex pairs in a graph is a fundamental operation, with uses in similarity measures, link prediction, graph compression, community detection, and more. Current shared-memory approaches either rely on set intersections or are not readily parallelizable. We introduce a new efficient and parallelizable algorithm...
Conference Paper
Full-text available
The k-core of a graph is a metric used in a wide range of applications, including social networks analytics, visualization, and graph coloring. Finding the maximal k-core of a graph can be be done in near linear time. The low computational requirements for finding the maximal k-core makes effective parallelization challenging, especially for the it...
Conference Paper
Full-text available
Merging and sorting algorithms are the backbone of many modern computer applications. As such, efficient implementations are desired. Recent architectural advancements in CPUs (Central Processing Units), such as wider and more powerful vector instructions, allow for algorithmic improvements. This paper presents a new approach to merge sort using ve...
Conference Paper
Full-text available
Triangle counting is a building block for numerous graph applications and given the fact that graphs continue to grow in size, its scalability is important. As such, numerous algorithms have been designed for triangle counting-some of which are compute-bound rather than memory bound. Even for compute-bound algorithms, one of the key challenges is t...
Conference Paper
Full-text available
The Betweenness Centrality of a vertex is an important metric used for determining how "central" a vertex is in a graph based on the number of shortest paths going through that vertex. Computing the betweenness centrality of a graph is computationally expensive, O(V ·(V +E)). This has led to the development of several important optimizations includ...
Conference Paper
Full-text available
List intersections are ubiquitous and can be found in wide range of applications, including triangle counting and finding the maximal k-truss, both of which are part of the HPEC Static Graph Challenge. For many graph based problems it is necessary to find intersections for a very large number of lists-these lists tend to vary greatly in size and ar...
Preprint
Full-text available
Network analysis defines a number of centrality measures to identify the most central nodes in a network. Fast computation of those measures is a major challenge in algorithmic network analysis. Aside from closeness and betweenness, Katz centrality is one of the established centrality measures. In this paper, we consider the problem of computing ra...
Conference Paper
Full-text available
Triangle counting is an important building block for finding key players in a graph. It is an integral part of the popular clustering coefficient analytic and can be used for pattern matching in social networks. A triangle, which is also a 3-clique, represents a strong connection between three players that are all connected. While counting triangle...
Article
Full-text available
Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm for rearranging the values within a filtering windo...
Conference Paper
Full-text available
The k-truss of a graph is a subgraph such that each edge is tightly connected to the remaining elements in the k-truss. The k-truss of a graph can also represent an important community in the graph. Finding the k-truss of a graph can be done in a polynomial amount of time, in contrast finding other subgraphs such as cliques. While there are numerou...
Article
Full-text available
Pairwise association measure is an important operation in data analytics. Kendall's tau coefficient is one widely used correlation coefficient identifying non-linear relationships between ordinal variables. In this paper, we investigated a parallel algorithm accelerating all-pairs Kendall's tau coefficient computation via single instruction multipl...
Conference Paper
Full-text available
cuSTINGER, a new graph data structure targeting NVIDIA GPUs is designed for streaming graphs that evolve over time. cuSTINGER enables algorithm designers greater productivity and efficiency for implementing GPU-based an-alytics, relieving programmers of managing memory and data placement. In comparison with static graph data structures, which may r...
Article
Full-text available
We present an efficient distributed memory parallel algorithm for computing connected components in undirected graphs based on Shiloach-Vishkin's PRAM approach. We discuss multiple optimization techniques that reduce communication volume as well as balance the load to improve the performance of the algorithm in practice. We also note that the effic...
Conference Paper
Full-text available
We present a new fault-tolerant algorithm for the problem of computing the connected components of a graph. Our algorithm derives from a highly parallel but non-resilient algorithm, which is based on the technique of label propagation (LP). To make the (LP) algorithm resilient to transient soft faults, we apply an algorithmic design principle that...
Research
Full-text available
This paper quantifies the impact of branches and branch mispredictions on the single-core performance of certain graph problems, specifically for computing connected components. We show that branch mispredictions are costly and can reduce performance by as much as 30%-50%. This insight suggests that one should seek graph algorithms and implementati...
Conference Paper
Full-text available
Triangle counting in a graph is a building block for clustering coefficients which is a widely used social network analytic for finding key players in a network based on their local connectivity. In this paper we show the first scalable GPU implementation for triangle counting. Our approach uses a new list intersection algorithm called Intersect Pa...
Conference Paper
Full-text available
Merging is a building block for many computational domains. In this work we consider the relationship between merging, branch predictors, and input data dependency. Branch predictors are ubiquitous in modern processors as they are useful for many high performance computing applications. While it is well known that the performance and the branch pre...
Conference Paper
Full-text available
Betweenness centrality is a graph analytic that states the importance of a vertex based on the number of shortest paths that it is on. As such, betweenness centrality is a building block for graph analysis tools and is used by many applications, including finding bottlenecks in communication networks and community detection. Computing betweenness c...
Conference Paper
Full-text available
This paper quantifies the impact of branches and branch mispredictions on the single-core performance of certain graph problems, specifically for computing connected components. We show that branch mispredictions are costly and can reduce performance by as much as 30%-50%. This insight suggests that one should seek graph algorithms and implementati...
Article
Full-text available
Technical Report
Full-text available
Merging two sorted arrays is a prominent building block for sorting and other functions. Its efficient parallelization requires balancing the load among compute cores, minimizing the extra work brought about by parallelization, and minimizing inter-thread synchronization requirements. Efficient use of memory is also important. We present a novel, v...
Conference Paper
Full-text available
Clustering coefficients is a building block in network sciences that offers insights on how tightly bound vertices are in a network. Effective and scalable parallelization of clustering coefficients requires load balancing amongst the cores. This property is not easy to achieve since many real world networks are scale free, which leads to some vert...
Conference Paper
Full-text available
Social networks, communication networks, busi-ness intelligence databases, and large scientific data sources now contain hundreds of millions elements with billions of relationships. The relationships in these massive datasets are changing at ever-faster rates. Through representing these datasets as dynamic and semantic graphs of vertices and edges...
Conference Paper
Full-text available
The estimated covariance matrix is a building block for many algorithms, including signal and image processing. The Covariance Method is an estimator for the covariance matrix, favored both as an estimator and in view of the convenient properties of the matrix that it produces. However, the considerable computational requirements limit its use. We...
Conference Paper
Full-text available
Clustering coefficients, also called triangle counting, is a widely-used graph analytic for measuring the closeness in which vertices cluster together. Intuitively, clustering coefficients can be thought of as the ratio of common friends versus all possible connections a person might have in a social network. The best known time complexity for comp...
Conference Paper
Full-text available
This paper reports on methods and results of an applied research project by a team consisting of SAIC and four universities to develop, integrate, and evaluate new approaches to detect the weak signals characteristic of insider threats on organizations' information systems. Our system combines structural and semantic information from a real corpora...
Conference Paper
Full-text available
Computation of a signal's estimated covariance matrix is an important building block in signal processing, e.g., for spectral estimation. Each matrix element is a sum of products of elements in the input matrix taken over a sliding window. Any given product contributes to multiple output elements, thereby complicating parallelization. We present a...
Conference Paper
Full-text available
We consider many-core processors with task-oriented programming, whereby scheduling constraints among tasks are decided offline, and are then enforced by the runtime system. Here, exposing and beneficially exploiting fine grain data and control parallelism is increasingly important. Therefore, high expressive power for stating such constraints/dire...
Conference Paper
Full-text available
Analysis of social networks is challenging due to the rapid changes of its members and their relationships. For many cases it impractical to recompute the metric of interest, therefore, streaming algorithms are used to reduce the total runtime following modifications to the graph. Centrality is often used for determining the relative importance of...
Conference Paper
Full-text available
Graphics Processing Units (GPUs) have become ideal candidates for the development of fine-grain parallel algorithms as the number of processing elements per GPU increases. In addition to the increase in cores per system, new memory hierarchies and increased bandwidth have been developed that allow for significant performance improvement when comput...
Conference Paper
Full-text available
Merging two sorted arrays is a prominent building block for sorting and other functions. Its efficient parallelization requires balancing the load among compute cores, minimizing the extra work brought about by parallelization, and minimizing inter-thread synchronization requirements. Efficient use of memory is also important. We present a novel ap...

Network

Cited By

Projects

Projects (5)
Project
Algorithm & Architecture Codesign
Project
Accelerating graph applications by developing more efficient and scalable algorithms/approaches for shared and distributed-memory parallel platforms.