# Yogish Sabharwal

IBM · High Performance Computing, India

Ph.D., IIT Delhi

## About

Publications: 98

Reads: 20,142

Citations: 1,438

## Publications

Subgraph similarity search is a fundamental operator in graph analysis. In this framework, given a query graph and a graph database, the goal is to identify subgraphs of the database graphs that are structurally similar to the query. Subgraph edit distance (SED) is one of the most expressive measures for subgraph similarity. In this work, we study...

In conventional public clouds, designing a suitable initial cluster for a given application workload is important in reducing the computational footprint at runtime. In edge or on-premise clouds, cold-start rightsizing of the cluster at the time of installation is crucial in avoiding recurrent capital expenditure. In both these cases, rights...

We present distributed algorithms for training dynamic Graph Neural Networks (GNN) on large scale graphs spanning multi-node, multi-GPU systems. To the best of our knowledge, this is the first scaling study on dynamic GNN. We devise mechanisms for reducing the GPU memory usage and identify two execution time bottlenecks: CPU-GPU data transfer; and...

We study the problem of maximizing the throughput of jobs wherein each job consists of multiple tasks. Consider a system offering a capacity of one unit. We are given a set of jobs, each consisting of a sequence of r tasks. Each task is associated with a demand and an interval where it should be scheduled. Each job has a profit associated with it....

Data compression is used in a wide variety of tasks, including compression of databases, large learning models, videos, images, etc. The cost of decompressing (decoding) data can be prohibitive for certain real-time applications. In many scenarios, it is acceptable to sacrifice (to some extent) on compression in the interest of fast decoding. In th...

The increased use of deep learning (DL) in academia, government and industry has, in turn, led to the popularity of on-premise and cloud-hosted deep learning platforms, whose goals are to enable organizations to utilize expensive resources effectively, and to share said resources among multiple teams in a fair and effective manner. In this paper, we e...

BERT has emerged as a popular model for natural language understanding. Given its compute-intensive nature, even for inference, many recent studies have considered optimization of two important performance characteristics: model size and inference time. We consider classification tasks and propose a novel method, called PoWER-BERT, for improving th...

The Tucker decomposition generalizes singular value decomposition (SVD) to high dimensional tensors. It factorizes a given N-dimensional tensor as the product of a small core tensor and a set of N factor matrices. Non-negative Tucker Decomposition (NTD) is a variant that imposes the constraint that the entries of the core and the factor matrices mu...

In this paper, we study a class of set cover problems that satisfy a special property which we call the small neighborhood cover property. This class encompasses several well-studied problems including vertex cover, interval cover, bag interval cover and tree cover. We design unified sequential, parallel and distributed algorithms that can handle a...

The Tucker decomposition generalizes the notion of Singular Value Decomposition (SVD) to tensors, the higher dimensional analogues of matrices. We study the problem of constructing the Tucker decomposition of sparse tensors on distributed memory systems via the HOOI procedure, a popular iterative method. The scheme used for distributing the input t...

We consider the problem of scheduling a set of jobs on a system that offers a certain resource, wherein the amount of resource offered varies over time. For each job, the input specifies a set of possible scheduling instances, where each instance is given by starting time, ending time, profit and resource requirement. A feasible solution selects a su...

Deep Neural Networks (DNNs) have achieved impressive accuracy in many application domains including image classification. Training of DNNs is an extremely compute-intensive process and is solved using variants of the stochastic gradient descent (SGD) algorithm. A lot of recent research has focussed on improving the performance of DNN training....

The Tucker decomposition expresses a given tensor as the product of a small core tensor and a set of factor matrices. Apart from providing data compression, the construction is useful in performing analysis such as principal component analysis (PCA) and finds applications in diverse domains such as signal processing, computer vision and text analyti...

We consider the replica placement problem: given a graph and a set of clients, place replicas on a minimum set of nodes of the graph to serve all the clients; each client is associated with a request and maximum distance that it can travel to get served; there is a maximum limit (capacity) on the amount of request a replica can serve. The problem f...

We consider the replica placement problem: given a graph with clients and nodes, place replicas on a minimum set of nodes to serve all the clients; each client is associated with a request and maximum distance that it can travel to get served and there is a maximum limit (capacity) on the amount of request a replica can serve. The problem falls und...

We consider the single-source shortest path (SSSP) problem: given an undirected graph with integer edge weights and a source vertex $v$, find the shortest paths from $v$ to all other vertices. In this paper, we introduce a novel parallel algorithm, derived from the Bellman-Ford and Delta-stepping algorithms. We employ various pruning techniques,...
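As a point of reference for the SSSP problem statement above, the following is a minimal sketch of the textbook sequential Bellman-Ford algorithm, one of the two algorithms the abstract names as a starting point. It is not the parallel algorithm introduced in the paper. The function name `bellman_ford`, the `(u, v, w)` edge-list format, and the assumption of non-negative weights are illustrative choices, not taken from the source.

```python
from math import inf


def bellman_ford(num_vertices, edges, source):
    """Textbook Bellman-Ford on an undirected graph (illustrative sketch).

    edges: list of (u, v, w) tuples with non-negative integer weights (assumed).
    Returns a list of shortest distances from `source`; unreachable vertices get inf.
    """
    dist = [inf] * num_vertices
    dist[source] = 0
    # At most num_vertices - 1 rounds are needed when there are no negative cycles.
    for _ in range(num_vertices - 1):
        changed = False
        for u, v, w in edges:
            # Each undirected edge is relaxed in both directions.
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
            if dist[v] + w < dist[u]:
                dist[u] = dist[v] + w
                changed = True
        if not changed:
            break  # Early exit: distances have converged.
    return dist
```

For example, `bellman_ford(4, [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5)], 0)` computes the distances from vertex 0 on a four-vertex graph.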

We consider a variant of the knapsack problem, where items are available with different possible weights. Using a separate budget for these item improvements, the question is: Which items should be improved to which degree such that the resulting classic knapsack problem yields maximum profit? We present a detailed analysis for several cases of imp...

We present a framework for computing with input data specified by intervals, representing uncertainty in the values of the input parameters. To compute a solution, the algorithm can query the input parameters that yield more refined estimates in the form of sub-intervals and the objective is to minimize the number of queries. The previous approache...

The problem of counting occurrences of query graphs in a large data graph, known as subgraph counting, is fundamental to several domains such as genomics and social network analysis. Many important special cases (e.g. triangle counting) have received significant attention. Color coding is a very general and powerful algorithmic technique for subgra...

In the classical k-median problem, we are given a metric space and want to open k centers so as to minimize the sum (over all the vertices) of the distance of each vertex to its nearest open center. In this paper we present the first constant-factor approximation algorithms for two natural generalizations of this problem that handle matroid or knap...
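The classical k-median objective described above can be made concrete with a small sketch. It only evaluates the objective and exhaustively searches tiny instances; it is not the constant-factor approximation algorithm from the paper and does not handle the matroid or knapsack generalizations. The names `k_median_cost` and `brute_force_k_median`, and the distance-matrix input format, are assumptions made for illustration.

```python
from itertools import combinations


def k_median_cost(dist, centers):
    """Sum, over all vertices, of the distance to the nearest open center.

    dist: square matrix where dist[v][c] is the metric distance between v and c.
    centers: non-empty collection of vertex indices that are opened.
    """
    if not centers:
        raise ValueError("at least one center must be open")
    return sum(min(dist[v][c] for c in centers) for v in range(len(dist)))


def brute_force_k_median(dist, k):
    """Try every set of exactly k centers; only feasible for very small inputs."""
    n = len(dist)
    return min(
        (k_median_cost(dist, combo), combo) for combo in combinations(range(n), k)
    )
```

On four points placed at 0, 1, 10 and 11 on a line, opening one center in each pair gives the smallest total distance, which `brute_force_k_median` finds by enumeration.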

A computer-implemented method of load balancing includes calculating an expected cost set associated with an application-specific task of an application executing on a processing resource in a cloud computing environment, and communicating the expected cost set from the processing resource to a cloud management system. Resource mapping of applicat...

A non-transitory computer-implemented method of load balancing includes calculating an expected cost set associated with an application-specific task of an application executing on a processing resource in a cloud computing environment, and communicating the expected cost set from the processing resource to a cloud management system. Resource mappi...

Facility location and data placement problems have been widely studied. Consider the following problem. We are given a set of facilities $F$ and a set of clients $D$ in a metric space. There are two types of objects. A client may have demand for each of the object-types. A facility can be opened for one or both types depending on its storage capacity...

In the single-source shortest path (SSSP) problem, we have to find the shortest paths from a source vertex v to all other vertices in a graph. In this paper, we introduce a novel parallel algorithm, derived from the Bellman-Ford and Delta-stepping algorithms. We employ various pruning techniques, such as edge classification and direction-optimizati...

We consider the weight-reducible knapsack problem, where we are given a limited budget that can be used to decrease item weights, and we would like to optimize the knapsack objective value using such weight improvements.
We develop a pseudo-polynomial algorithm for the problem, as well as a polynomial-time 3-approximation algorithm based on solving...

Dynamic Programming (DP) is an efficient technique to solve combinatorial search and optimization problems. There have been many research efforts towards parallelizing dynamic programs. In this paper, we study the parallelization of the Polynomial Time Approximation Scheme (PTAS) DP for the classical bin-packing problem. This problem is challenging...

In this paper, we study a class of set cover problems that satisfy a special property which we call the *small neighborhood cover* property. This class encompasses several well-studied problems including vertex cover, interval cover, bag interval cover and tree cover. We design unified distributed and parallel algorithms that can handle any set...

Numerical weather prediction (NWP) models use mathematical models of the atmosphere to predict the weather. Ongoing efforts in the weather and climate community continuously try to improve the fidelity of weather models by employing higher order numerical methods suitable for solving model equations at high resolutions. In realistic weather forecas...

Accurate and timely flood forecasts are becoming highly essential due to the increased incidence of flood related disasters over the last few years. Such forecasts require a high resolution integrated flood modeling approach. In this paper, we present an integrated flood forecasting system with an automated workflow over the weather modeling, surfa...

In this paper we present the generalization of the relaxed Multi-Organization Scheduling Problem (α MOSP). In our generalized problem, we are given a set of organizations; each organization is comprised of a set of machines. We are interested in minimizing the global makespan while allowing a constant factor, αO, degradation in the local objective...

A method of generating an image includes the step of obtaining captured data characterizing an object. The method also includes the step of reconstructing a spatio-temporal image of the object based on the captured data, the spatio-temporal image comprising a plurality of spatial images in respective time intervals, with at least a given one of the...

Energy storage technologies that are connected to medium- or low-voltage distribution systems are referred to as Distributed Energy Storage (DES). DES are becoming more common as the storage technologies are becoming cheaper. Energy stored on the distribution system, whether it is generated by Distributed Generation (DG) or central generation units...

This paper considers the problem of maximizing the throughput of jobs wherein each job consists of multiple tasks. Consider a system offering a uniform capacity of a resource (say unit bandwidth). We are given a set of jobs, each consisting of a sequence of at most r tasks. Each task is associated with a window (specified by a release time and a de...

In this paper we study the unsplittable flow problem (UFP) on tree networks in a distributed setting. We have a set of processors (or agents) and a set of tree networks defined over some vertex set. Each processor can access a subset of the tree networks. Each edge in each of the tree networks is associated with a capacity. Each processor has a dem...

Accurate and timely prediction of weather phenomena, such as hurricanes and flash floods, requires high-fidelity compute-intensive simulations of multiple finer regions of interest within a coarse simulation domain. Current weather applications execute these nested simulations sequentially using all the available processors, which is sub-optimal due...

This paper explores the performance and optimization of the IBM Blue Gene/Q (BG/Q) five-dimensional torus network on up to 16K nodes. The BG/Q hardware supports multiple dynamic routing algorithms and different traffic patterns may require different algorithms to achieve best performance. Between 85% and 95% of peak network performance is achieved f...

In this paper, we describe the challenges involved in designing a family of highly-efficient Breadth-First Search (BFS) algorithms and in optimizing these algorithms on the latest two generations of Blue Gene machines, Blue Gene/P and Blue Gene/Q. With our recent winning Graph 500 submissions in November 2010, June 2011, and November 2011, we have...

In this paper, we consider the problem of choosing a minimum cost set of
resources for executing a specified set of jobs. Each input job is an interval,
determined by its start-time and end-time. Each resource is also an interval
determined by its start-time and end-time; moreover, every resource has a
capacity and a cost associated with it. We con...

The PERCS system was designed by IBM in response to a DARPA challenge that
called for a high-productivity high-performance computing system. The IBM PERCS
architecture is a two level direct network having low diameter and high
bisection bandwidth. Mapping and routing strategies play an important role in
the performance of applications on such a top...

In this paper we consider the problem of finding the *densest* subset subject to *co-matroid constraints*. We are given a *monotone supermodular* set function $f$ defined over a universe $U$, and the density of a subset $S$ is defined to be $f(S)/|S|$. This generalizes the concept of graph density. Co-matroid constraints are the fol...

We have a set of processors (or agents) and a set of graph networks defined over some vertex set. Each processor can access a subset of the graph networks. Each processor has a demand specified as a pair of vertices ‹u, v›, along with a profit; the processor wishes to send data between u and v. Towards that goal, the processor needs to select a gra...

Collective communication over a group of processors is an integral and time-consuming component in many HPC applications. Many modern-day supercomputers are based on torus interconnects. On such systems, for an irregular communicator comprising a subset of processors, the algorithms developed so far are not contention free in general and hence n...

We consider the problem of scheduling jobs that require multiple resources such as memory, bandwidth and processors. For each job, the input specifies start time, finish time and profit; the input also specifies the job's requirement for each resource. Each resource has a fixed capacity (called bandwidth). A feasible solution is a subset of jobs su...

In the classical k-median problem, we are given a metric space and would like to open k centers so as to minimize the sum (over all the vertices) of the distance of each vertex to its nearest open center. In this paper, we consider the following generalization of the problem: instead of opening at most k centers, what if each center belongs to one...

We consider the problem of allocating resources to satisfy demand requirements varying over time. The input specifies a demand for each timeslot. Each resource is specified by a start-time, end-time, an associated cost and a capacity. A feasible solution is a multiset of resources such that at any point of time, the sum of the capacities offered by...

Modern power grids are continuously monitored by trained system operators equipped with sophisticated monitoring and control systems. Despite such precautionary measures, large blackouts that affect more than a million consumers occur quite frequently. To prevent such blackouts, it is important to perform high-order contingency analysis in real...

We present a framework for computing with input data specified by intervals, representing uncertainty in the values of the input parameters. To compute a solution, the algorithm can query the input parameters that yield more refined estimates in the form of sub-intervals and the objective is to minimize the number of queries. The previous approaches ad...

We consider the problem of allocating resources for completing a collection of jobs. Each resource is specified by a start-time, finish-time and the capacity of resource available and has an associated cost, and each job is specified by a start-time, finish-time and the amount of the resource required (demand) during this interval. A feasible solut...

Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting (DTC) problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. The problem has applications in monitoring,...

Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting problem
is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. The problem has applications in monitoring, globa...

In this paper we study the problem of clustering entities that are described by two types of data: attribute data and relationship
data. While attribute data describe the inherent characteristics of the entities, relationship data represent associations
among them. Attribute data can be mapped to the Euclidean space, whereas that is not always poss...

We consider the problem of scheduling a set of resources over time. Each resource is specified by a set of time intervals
(and the associated amount of resource available), and we can choose to schedule it in one of these intervals. The goal is
to maximize the number of demands satisfied, where each demand is an interval with a starting and ending...

The slow progress in memory access latencies in comparison to CPU speeds has resulted in memory accesses dominating code performance. While architectural enhancements have benefited applications with data locality and sequential access, random memory access still remains a cause for concern. Several benchmarks have been proposed to evaluate the ran...

Matrix transpose is a fundamental matrix operation that arises in many scientific and engineering applications. Communication
is the main bottleneck in performing matrix transpose on most multi-processor systems. In this paper, we focus on torus interconnection
networks and propose application-level routing techniques that improve load balancing, r...

Existing algorithms for global snapshots in distributed systems are not scalable when the underlying topology is complete. There are primarily two classes of existing algorithms for computing a global snapshot. Algorithms in the first class use control messages of size O(1) but require O(N) space and O(N) messages per processor in a network with N...

We study the problem of anonymizing tables containing personal information before releasing them for public use. One of the formulations considered in this context is the $k$-anonymization problem: given a table, suppress a minimum number of cells so that in the transformed table, each row is identical to at least $k-1$ other rows. The problem is kn...
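The $k$-anonymity condition stated above (every row identical to at least $k-1$ other rows after suppression) is easy to check directly. The sketch below only verifies the condition for a given set of suppressed cells; it does not minimize the number of suppressions, which is the optimization problem the abstract studies. The names `SUPPRESSED`, `suppress` and `is_k_anonymous`, and the list-of-lists table format, are hypothetical choices for illustration.

```python
from collections import Counter

SUPPRESSED = "*"  # Placeholder symbol for a suppressed cell (an assumed convention).


def suppress(table, cells):
    """Return a copy of `table` with the given (row, column) cells suppressed."""
    out = [list(row) for row in table]
    for r, c in cells:
        out[r][c] = SUPPRESSED
    return out


def is_k_anonymous(table, k):
    """True if every row occurs at least k times, i.e. matches k - 1 other rows."""
    counts = Counter(tuple(row) for row in table)
    return all(count >= k for count in counts.values())
```

For instance, a table whose first two rows differ only in their second column becomes 2-anonymous once those two cells are suppressed, provided the remaining rows already occur in pairs.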

We present a general approach for designing approximation algorithms for a fundamental class of geometric clustering problems in arbitrary dimensions. More specifically, our approach leads to simple randomized algorithms for the k-means, k-median and discrete k-means problems that yield (1 + ε) approximations with probability ≥ 1/2 and running time...

We consider the problem of scheduling jobs on a pool of machines. Each job requires multiple machines on which it executes in parallel. For each job, the input specifies release time, deadline, processing time, profit and the number of machines required. The total number of machines may be different at different points of time. A feasible solution...

Collectives are an important and frequently used component of MPI. Bucket algorithms, also known as "large vector" algorithms, were introduced in the early 1990s and have since evolved as a well-known paradigm for large MPI collectives. Many modern-day supercomputers such as the IBM Blue Gene and Cray XT are based on torus interconnects that offer a...

The maximum independent set problem (MaxIS) on general graphs is known to be NP-hard to approximate within a factor of $n^{1-\epsilon}$, for any $\epsilon > 0$. However, there are many “easy” classes of graphs on which the problem can be solved in polynomial time. In this context, an interesting question is that of computing the maximum independent set in a graph that ca...

Consider a scenario where we need to schedule a set of jobs on a system offering some resource (such as electrical power or communication bandwidth), which we shall refer to as bandwidth. Each job consists of a set (or bag) of job instances. For each job instance, the input specifies the start time, finish time, bandwidth requirement and profit. Th...

We study the distributed trigger counting problem. In this problem, we have a distributed system having n processors. These processors receive triggers from an external source. The goal is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. The problem has applicati...

We consider the problem of constructing decision trees for entity identification from a given table. The input is a table
containing information about a set of entities over a fixed set of attributes. The goal is to construct a decision tree that
identifies each entity unambiguously by testing the attribute values such that the average number of te...

In this paper we examine the key elements determining the performance of the HPC Challenge RandomAccess benchmark on next generation supercomputers. We find that the performance of this benchmark is closely related to the bisection bandwidth of the underlying communication network, performance of integer divide operation and details of benchmark...

In this paper, we present a comprehensive theoretical analysis of the sampling technique for the association rule mining problem. Most of the previous works have concentrated only on the empirical evaluation of the effectiveness of sampling for the step of finding frequent itemsets. To the best of our knowledge, a theoretical framework to analyze...

Fast Fourier Transform is a class of efficient algorithms used to compute Discrete Fourier Transforms, widely used in many scientific and technical applications. In this paper, we analyze the bottlenecks in the parallel FFT algorithm and describe optimizations carried out for the algorithm on the Blue Gene/L Supercomputer. There were three avenues...

The unique architecture of the heterogeneous multicore Cell processor offers great potential for high performance computing. It offers features such as high memory bandwidth using DMA, user-managed local stores and SIMD architecture. In this paper, we present strategies for leveraging these features to develop a high performance BLAS library. We propos...

All-to-all communication is a well known performance bottleneck for many applications, such as the ones that use the Fast-Fourier-Transform (FFT) algorithm. We analyze the performance of all-to-all communication on the BlueGene/L torus interconnect that has link contention even for all-to-all operations with short messages. We observed that the per...