# Oded Green | NVIDIA

PhD

## About

- Publications: 48
- Reads: 32,001
- Citations: 824

Additional affiliations

July 2018 - present

July 2015 - present

April 2014 - April 2015

**ArrayFire**

Position

- Chief Operating Officer and Research Scientist

## Publications

Publications (48)

Sparse data computations are ubiquitous in science and engineering. Unlike their dense data counterparts, sparse data computations have less locality and more irregularity in their execution, making them significantly more challenging to parallelize and optimize. Many of the existing formats for sparse data representations on parallel architectures...

Hash tables are ubiquitous and used in a wide range of applications for efficient probing of large and unsorted data. If designed properly, hash tables can enable efficient lookups in a constant number of operations, commonly referred to as O(1) operations. As data sizes continue to grow and data becomes less structured (as is common for big-da...

Breadth-First Search (BFS) is a building block used in a wide array of graph analytics and is used in various network analysis domains: social, road, transportation, communication, and much more. Over the last two decades, network sizes have continued to grow. The popularity of BFS has brought with it a need for significantly faster traversals. Thu...

Breadth-First Search (BFS) traversals appear in a wide range of applications and domains. BFS traversals determine the distance between key vertices and the remaining vertices in the network. The distance between the vertices, often called the number of hops, is the length of the shortest path between a root and the remaining vertices in the graph. Given its appl...
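As a rough illustration of the hop distances this abstract describes, the following is a minimal sequential sketch, not the implementation from the publication. The function name `bfs_hops`, the adjacency-dictionary layout, and the example graph are assumptions made for this example.

```python
from collections import deque


def bfs_hops(adj, root):
    """Return {vertex: hop count from root} for every vertex reachable from root.

    `adj` is a hypothetical adjacency dictionary: vertex -> iterable of neighbours.
    """
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:          # first visit gives the shortest hop count
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Small undirected example graph (illustrative data only).
example_graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
hops = bfs_hops(example_graph, 0)
```

Each vertex is assigned a distance the first time it is reached, which is what makes the hop count equal to the shortest-path length in an unweighted graph.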

Effective scheduling and load balancing of applications on massively multi-threaded systems remains challenging despite decades of research, especially for irregular and data-dependent problems where the execution control path is unknown until run-time. One of the most widely used load-balancing schemes for data-dependent problems is a parall...

Network analysis defines a number of centrality measures to identify the most central nodes in a network. Fast computation of those measures is a major challenge in algorithmic network analysis. Aside from closeness and betweenness, Katz centrality is one of the established centrality measures. In this paper, we consider the problem of computing ra...

The transitive closure of a graph is a new graph where every vertex is directly connected to all vertices to which it had a path in the original graph. Transitive closures are useful for reachability and relationship querying. Finding the transitive closure can be computationally expensive and requires a large memory footprint as the output is typi...
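The definition above can be made concrete with a small sketch that computes, for each vertex, the set of vertices it can reach. This is a plain per-vertex traversal written for illustration, under the assumption of an adjacency-dictionary input; it is not the algorithm proposed in the publication, and the name `transitive_closure` is chosen for this example.

```python
def transitive_closure(adj):
    """Map each vertex to the set of vertices reachable from it by a path.

    `adj` is a hypothetical adjacency dictionary: vertex -> iterable of successors.
    """
    closure = {}
    for source in adj:
        seen = set()
        stack = list(adj[source])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj.get(v, ()))
        closure[source] = seen
    return closure


# Illustrative directed graph: d -> a -> b -> c
dag = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
closure = transitive_closure(dag)
```

The output holds one reachability set per vertex, which hints at why the memory footprint mentioned in the abstract can become large.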

In this article, we introduce HashGraph, a new scalable approach for building hash tables that uses concepts taken from sparse graph representations—hence, the name HashGraph. HashGraph introduces a new way to deal with hash-collisions that does not use “open-addressing” or “separate-chaining,” yet it has the benefits of both these approaches. Hash...

Hash tables are used in a plethora of applications, including database operations, DNA sequencing, string searching, and many more. As such, there are many parallelized hash tables targeting multicore, distributed, and accelerator-based systems. We present in this work a multi-GPU hash table implementation that can process keys at a throughput comp...

For the problem of computing the connected components of a graph, this paper considers the design of algorithms that are resilient to transient hardware faults, like bit flips. More specifically, it applies the technique of self-stabilization. A system is self-stabilizing if, when starting from a valid or invalid state, it is guaranteed to reach a...

Graph processing is typically considered to be a memory-bound rather than compute-bound problem. One common line of thought is that more available memory bandwidth corresponds to better graph processing performance. However, in this work we demonstrate that the key factor in the utilization of the memory system for graph algorithms is not necessari...

Counting common neighbors between all vertex pairs in a graph is a fundamental operation, with uses in similarity measures, link prediction, graph compression, community detection, and more. Current shared-memory approaches either rely on set intersections or are not readily parallelizable. We introduce a new efficient and parallelizable algorithm...

The k-core of a graph is a metric used in a wide range of applications, including social network analytics, visualization, and graph coloring. Finding the maximal k-core of a graph can be done in near-linear time. The low computational requirements for finding the maximal k-core make effective parallelization challenging, especially for the it...
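For readers unfamiliar with k-cores, a common sequential way to extract the k-core for a given k is to repeatedly peel away vertices whose degree falls below k. The sketch below illustrates that peeling idea only; it is not the parallel approach of the publication, and `k_core` and the example graph are assumptions for this example.

```python
def k_core(adj, k):
    """Return the vertex set of the k-core of an undirected graph.

    `adj` is a hypothetical adjacency dictionary: vertex -> set of neighbours.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    worklist = [v for v, d in degree.items() if d < k]
    while worklist:
        v = worklist.pop()
        if v in removed:
            continue
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                degree[u] -= 1            # u lost a neighbour
                if degree[u] < k:
                    worklist.append(u)    # u can no longer be in the k-core
    return set(adj) - removed


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2 (illustrative data).
triangle_with_tail = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
two_core = k_core(triangle_with_tail, 2)
```

Removing one vertex can push its neighbours below the threshold, so the peeling proceeds in dependent rounds.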

Merging and sorting algorithms are the backbone of many modern computer applications. As such, efficient implementations are desired. Recent architectural advancements in CPUs (Central Processing Units), such as wider and more powerful vector instructions, allow for algorithmic improvements. This paper presents a new approach to merge sort using ve...

Triangle counting is a building block for numerous graph applications, and given that graphs continue to grow in size, its scalability is important. As such, numerous algorithms have been designed for triangle counting, some of which are compute-bound rather than memory-bound. Even for compute-bound algorithms, one of the key challenges is t...

The Betweenness Centrality of a vertex is an important metric used for determining how "central" a vertex is in a graph based on the number of shortest paths going through that vertex. Computing the betweenness centrality of a graph is computationally expensive, O(V · (V + E)). This has led to the development of several important optimizations includ...

List intersections are ubiquitous and can be found in a wide range of applications, including triangle counting and finding the maximal k-truss, both of which are part of the HPEC Static Graph Challenge. For many graph-based problems it is necessary to find intersections for a very large number of lists; these lists tend to vary greatly in size and ar...

Triangle counting is an important building block for finding key players in a graph. It is an integral part of the popular clustering coefficient analytic and can be used for pattern matching in social networks. A triangle, which is also a 3-clique, represents a strong connection between three players that are all connected. While counting triangle...
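To make the notion of a triangle (3-clique) concrete, here is a minimal sketch that counts triangles by intersecting the neighbour sets of each edge's endpoints. It is a simple sequential illustration under assumed inputs, not the method of the publication; `count_triangles` is a name chosen for this example.

```python
def count_triangles(adj):
    """Count triangles in an undirected graph.

    `adj` is a hypothetical adjacency dictionary: vertex -> set of neighbours,
    with comparable vertex identifiers.
    """
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Common neighbours w with w > v, so each triangle
                # u < v < w is counted exactly once.
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total


# Complete graph on four vertices (illustrative data only).
k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
triangles_in_k4 = count_triangles(k4)
```

Ordering the three vertices avoids counting the same triangle once per edge or per permutation.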

Median filtering is a smoothing technique for noise removal in images. While there are various implementations of median filtering for a single-core CPU, there are few implementations for accelerators and multi-core systems. Many parallel implementations of median filtering use a sorting algorithm for rearranging the values within a filtering windo...

The k-truss of a graph is a subgraph such that each edge is tightly connected to the remaining elements in the k-truss. The k-truss of a graph can also represent an important community in the graph. Finding the k-truss of a graph can be done in a polynomial amount of time, in contrast to finding other subgraphs such as cliques. While there are numerou...

Pairwise association measure is an important operation in data analytics. Kendall's tau coefficient is one widely used correlation coefficient identifying non-linear relationships between ordinal variables. In this paper, we investigated a parallel algorithm accelerating all-pairs Kendall's tau coefficient computation via single instruction multipl...

cuSTINGER, a new graph data structure targeting NVIDIA GPUs, is designed for streaming graphs that evolve over time. cuSTINGER gives algorithm designers greater productivity and efficiency for implementing GPU-based analytics, relieving programmers of managing memory and data placement. In comparison with static graph data structures, which may r...

We present an efficient distributed memory parallel algorithm for computing connected components in undirected graphs based on Shiloach-Vishkin's PRAM approach. We discuss multiple optimization techniques that reduce communication volume as well as balance the load to improve the performance of the algorithm in practice. We also note that the effic...

We present a new fault-tolerant algorithm for the problem of computing the connected components of a graph. Our algorithm derives from a highly parallel but non-resilient algorithm, which is based on the technique of label propagation (LP). To make the (LP) algorithm resilient to transient soft faults, we apply an algorithmic design principle that...
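The label propagation (LP) technique mentioned above can be sketched as follows. This is a minimal sequential version of the basic, non-resilient idea only: every vertex repeatedly adopts the smallest label among itself and its neighbours until nothing changes. It does not include the fault-tolerance mechanism of the publication, and `label_propagation_cc` and the example graph are assumptions for this example.

```python
def label_propagation_cc(adj):
    """Label each vertex with the smallest vertex id in its connected component.

    `adj` is a hypothetical adjacency dictionary for an undirected graph:
    vertex -> iterable of neighbours, with comparable vertex identifiers.
    """
    label = {v: v for v in adj}
    changed = True
    while changed:
        changed = False
        for v in adj:
            best = min([label[v]] + [label[u] for u in adj[v]])
            if best < label[v]:
                label[v] = best
                changed = True
    return label


# Two components and one isolated vertex (illustrative data only).
small_graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3], 5: []}
labels = label_propagation_cc(small_graph)
```

Vertices that end with the same label belong to the same component.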

This paper quantifies the impact of branches and branch mispredictions on the single-core performance of certain graph problems, specifically for computing connected components. We show that branch mispredictions are costly and can reduce performance by as much as 30%-50%. This insight suggests that one should seek graph algorithms and implementati...

Triangle counting in a graph is a building block for clustering coefficients, which is a widely used social network analytic for finding key players in a network based on their local connectivity. In this paper we show the first scalable GPU implementation for triangle counting. Our approach uses a new list intersection algorithm called Intersect Pa...

Merging is a building block for many computational domains. In this work we consider the relationship between merging, branch predictors, and input data dependency. Branch predictors are ubiquitous in modern processors as they are useful for many high performance computing applications. While it is well known that the performance and the branch pre...

Betweenness centrality is a graph analytic that states the importance of a vertex based on the number of shortest paths that it is on. As such, betweenness centrality is a building block for graph analysis tools and is used by many applications, including finding bottlenecks in communication networks and community detection. Computing betweenness c...

Merging two sorted arrays is a prominent building block for sorting and other functions. Its efficient parallelization requires balancing the load among compute cores, minimizing the extra work brought about by parallelization, and minimizing inter-thread synchronization requirements. Efficient use of memory is also important. We present a novel, v...

Clustering coefficients is a building block in network sciences that offers insights on how tightly bound vertices are in a network. Effective and scalable parallelization of clustering coefficients requires load balancing amongst the cores. This property is not easy to achieve since many real world networks are scale free, which leads to some vert...

Social networks, communication networks, business intelligence databases, and large scientific data sources now contain hundreds of millions of elements with billions of relationships. The relationships in these massive datasets are changing at ever-faster rates. Through representing these datasets as dynamic and semantic graphs of vertices and edges...

The estimated covariance matrix is a building block for many algorithms, including signal and image processing. The Covariance Method is an estimator for the covariance matrix, favored both as an estimator and in view of the convenient properties of the matrix that it produces. However, the considerable computational requirements limit its use. We...

Clustering coefficients, also called triangle counting, is a widely-used graph analytic for measuring the closeness in which vertices cluster together. Intuitively, clustering coefficients can be thought of as the ratio of common friends versus all possible connections a person might have in a social network. The best known time complexity for comp...

This paper reports on methods and results of an applied research project by a team consisting of SAIC and four universities to develop, integrate, and evaluate new approaches to detect the weak signals characteristic of insider threats on organizations' information systems. Our system combines structural and semantic information from a real corpora...

Computation of a signal's estimated covariance matrix is an important building block in signal processing, e.g., for spectral estimation. Each matrix element is a sum of products of elements in the input matrix taken over a sliding window. Any given product contributes to multiple output elements, thereby complicating parallelization. We present a...

We consider many-core processors with task-oriented programming, whereby scheduling constraints among tasks are decided offline, and are then enforced by the runtime system. Here, exposing and beneficially exploiting fine grain data and control parallelism is increasingly important. Therefore, high expressive power for stating such constraints/dire...

Analysis of social networks is challenging due to the rapid changes of their members and their relationships. In many cases it is impractical to recompute the metric of interest; therefore, streaming algorithms are used to reduce the total runtime following modifications to the graph. Centrality is often used for determining the relative importance of...

Graphics Processing Units (GPUs) have become ideal candidates for the development of fine-grain parallel algorithms as the number of processing elements per GPU increases. In addition to the increase in cores per system, new memory hierarchies and increased bandwidth have been developed that allow for significant performance improvement when comput...

Merging two sorted arrays is a prominent building block for sorting and other functions. Its efficient parallelization requires balancing the load among compute cores, minimizing the extra work brought about by parallelization, and minimizing inter-thread synchronization requirements. Efficient use of memory is also important. We present a novel ap...

## Projects

Projects (5)

Accelerating graph applications by developing more efficient and scalable algorithms/approaches for shared and distributed-memory parallel platforms.