# Rezaul Chowdhury's research while affiliated with Stony Brook University and other places

**What is this page?**

This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.

It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.

If you're a ResearchGate member, you can follow this page to keep up with this author's work.

If you are this author, and you don't want us to display this page anymore, please let us know.

## Publications (94)

We study the binomial option pricing model and the Black-Scholes-Merton pricing model. In the binomial option pricing model, we concentrate on two widely-used call options: (1) European and (2) American. Under the Black-Scholes-Merton model, we investigate pricing American put options. Our contributions are two-fold: First, we transform the option...
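For orientation, the textbook binomial lattice that this line of work starts from can be sketched as follows. This is the standard quadratic-time backward induction under the Cox-Ross-Rubinstein parametrization, not the method of the paper; the function name `binomial_call` and its parameters are illustrative assumptions.

```python
import math

def binomial_call(spot, strike, rate, sigma, maturity, steps, american=False):
    """Price a call on a non-dividend-paying asset with a CRR binomial lattice."""
    dt = maturity / steps
    up = math.exp(sigma * math.sqrt(dt))
    down = 1.0 / up
    disc = math.exp(-rate * dt)
    p = (math.exp(rate * dt) - down) / (up - down)  # risk-neutral up probability

    # Option values at expiry, indexed by the number of up moves j.
    values = [max(spot * up**j * down**(steps - j) - strike, 0.0)
              for j in range(steps + 1)]

    # Backward induction: each pass moves one time step closer to today.
    for step in range(steps - 1, -1, -1):
        for j in range(step + 1):
            value = disc * (p * values[j + 1] + (1.0 - p) * values[j])
            if american:
                # Compare continuation value with immediate exercise.
                exercise = spot * up**j * down**(step - j) - strike
                value = max(value, exercise)
            values[j] = value
    return values[0]
```

With `american=True` the same loop also checks early exercise at every node; for a call on a non-dividend-paying asset the two prices should agree up to rounding.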

We present a work-efficient parallel level-synchronous Breadth First Search (BFS) algorithm for shared-memory architectures which achieves the theoretical lower bound on parallel running time. The optimality holds regardless of the shape of the graph. We also demonstrate the implication of this optimality for the energy consumption of the program e...
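A minimal serial sketch of the level-synchronous structure may help: the search expands one whole frontier before moving to the next level. The parallel algorithm of the paper is not reproduced here; `bfs_levels` and the adjacency-dictionary input are assumptions made for illustration.

```python
def bfs_levels(adj, source):
    """Return {vertex: BFS level}, expanding one whole frontier at a time."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # A parallel version would process the vertices of `frontier`
        # concurrently and synchronize once per level; here it is a plain loop.
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```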

Predicting protein side-chains is important for both protein structure prediction and protein design. Side-chain modeling tools such as SCWRL4 have become some of the most widely used of their type due to their fast and highly accurate predictions. Motivated by the recent success of AlphaFold2 in CASP14, our group adapted a 3D equivar...

We present two simple, intuitive and general algorithmic frameworks that can be used to design a wide variety of permutation generation algorithms. The frameworks can be used to produce 19 existing permutation algorithms, including the well-known algorithms of Heap, Wells, Langdon, Zaks, Tompkins and Lipski. We use the frameworks to design two new...
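As a point of reference, Heap's algorithm, one of the existing algorithms named above, can be written as a short recursive generator. This is the commonly published form of the algorithm, not the frameworks of the paper; the name `heap_permutations` is an assumption.

```python
def heap_permutations(items):
    """Yield every permutation of `items` using Heap's algorithm."""
    a = list(items)

    def generate(k):
        # Permute the first k elements of `a` in place.
        if k <= 1:
            yield tuple(a)
            return
        for i in range(k - 1):
            yield from generate(k - 1)
            # One swap between consecutive blocks; its position depends on the parity of k.
            if k % 2 == 0:
                a[i], a[k - 1] = a[k - 1], a[i]
            else:
                a[0], a[k - 1] = a[k - 1], a[0]
        yield from generate(k - 1)

    yield from generate(len(a))
```

Each permutation differs from the previous one by a single swap, which is why the algorithm is a common baseline for permutation generation.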

To respond to the intense computational load of deep neural networks, a plethora of domain-specific architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size. In order...

We present efficient parallel recursive divide-and-conquer algorithms for bubble sort, selection sort, and insertion sort. Our algorithms have excellent data locality and are highly parallel. The computational complexity of our insertion sort is $\mathcal{O}\left(n^{\log_2 3}\right)$ in contrast to $\mathcal{O}\left(n^2\right)$ of...

Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this pa...

The binary-forking model is a parallel computation model, formally defined by Blelloch et al. very recently, in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of $\Theta(\log n)$ to spawn or synchronize $n$ tasks or threads. The binary-forking model realistically captures the performance o...

Current wireless networks mainly focus on delay-tolerant applications while demands for latency-sensitive applications are rising with VR/AR technologies and machine-to-machine IoT applications. In this paper we consider multi-channel, multi-radio scheduling at the MAC layer to optimize for the performance of prioritized, delay-sensitive demands. O...

We argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms to run efficiently under both shared-memory (multi- and manycores) and distributed-memory settings. The depth-first recursive decomposition of tasks and data is known to allow computations with potentially high temporal locality, and automatic adaptivi...

A determinacy race occurs if two or more logically parallel instructions access the same memory location and at least one of them tries to modify its content. Races often lead to nondeterministic and incorrect program behavior. A data race is a special case of a determinacy race which can be eliminated by associating a mutual-exclusion lock or allo...

Recursive divide-&-conquer algorithms are known for solving dynamic programming (DP) problems efficiently on shared-memory multicore machines. In this work, we extend them to run efficiently also on manycore GPUs and distributed-memory machines without changing their basic structure.
Our GPU algorithms work efficiently even when the data is too lar...

We present the buffer heap, a cache-oblivious priority queue that supports Delete-Min, Delete, and a hybrid Insert/Decrease-Key operation in $O\left(\frac{1}{B}\log_2\frac{N}{M}\right)$ amortized block transfers from main memory, where M and B are the (unknown) cache size and block size, respectively, and N is the number of elements in the queue. We introduce the notion of a sl...

We present Autogen—an algorithm that for a wide class of dynamic programming (DP) problems automatically discovers highly efficient cache-oblivious parallel recursive divide-and-conquer algorithms from inefficient iterative descriptions of DP recurrences. Autogen analyzes the set of DP table locations accessed by the iterative algorithm when run on...

Iterative wavefront algorithms for evaluating dynamic programming recurrences exploit optimal parallelism but show poor cache performance. Tiled-iterative wavefront algorithms achieve optimal cache complexity and high parallelism but are cache-aware and hence are not portable and not cache-adaptive. On the other hand, standard cache-oblivious recur...

Standard cache-oblivious recursive divide-and-conquer algorithms for evaluating dynamic programming recurrences have optimal serial cache complexity but often have lower parallelism compared with iterative wavefront algorithms due to artificial dependencies among subtasks. Very recently cache-oblivious recursive wavefront (COW) algorithms have been...

We introduce a framework allowing domain experts to manipulate computational terms in the interest of deriving better, more efficient implementations. It employs deductive reasoning to generate provably correct efficient implementations from a very high-level specification of an algorithm, and inductive constraint-based synthesis to improve automati...

The Viterbi algorithm is used to find the most likely path through a hidden Markov model given an observed sequence, and has numerous applications. Due to its importance and high computational complexity, several algorithmic strategies have been developed to parallelize it on different parallel architectures. However, none of the existing Viterbi d...
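The serial dynamic program that such parallelizations start from can be sketched as below. This is the textbook Viterbi recurrence, not the algorithm of the paper; the function signature and the dictionary-based model representation are assumptions.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Choose the predecessor r that maximizes the path probability into s.
            prob, prev = max((best[r][0] * trans_p[r][s], r) for r in states)
            new_best[s] = (prob * emit_p[s][obs], best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])
```

Storing whole paths keeps the sketch short; a production version would usually keep back-pointers and work with log-probabilities to avoid underflow on long sequences.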

We present AUTOGEN — an algorithm that for a wide class of dynamic programming (DP) problems automatically discovers highly efficient cache-oblivious parallel recursive divide-and-conquer algorithms from inefficient iterative descriptions of DP recurrences.
AUTOGEN analyzes the set of DP table locations accessed by the iterative algorithm when run...

We present AUTOGEN---an algorithm that for a wide class of dynamic programming (DP) problems automatically discovers highly efficient cache-oblivious parallel recursive divide-and-conquer algorithms from inefficient iterative descriptions of DP recurrences. AUTOGEN analyzes the set of DP table locations accessed by the iterative algorithm when run...

We show that cache-oblivious recursive divide-and-conquer (CORDAC) algorithms for solving a class of dynamic programming problems are more energy efficient and have more flexibility in the runtime-power tradeoff than their corresponding iterative or tiled algorithms. Our experimental results show that CORDAC algorithms are more robust in terms of perf...

New generation sequencing technologies produce massive data sets of millions of reads, making the compression of sequence read files an important problem. The sequential order of the reads in these files typically conveys no biologically significant information, providing the freedom to reorder them so as to facilitate compression. Similarly, for m...

Motivation. Despite several reported acceleration successes of programmable GPUs (Graphics Processing Units) for molecular modeling and simulation tools, the general focus has been on fast computation with small molecules. This was primarily due to the limited memory size on the GPU. Moreover, simultaneous use of CPU and GPU cores for a single kern...

The availability of synonymous codons (codons that can translate the same amino acid into protein) enables a protein to be encoded by many different sequences of codons/tRNAs. Autocorrelation measures the reuse of a particular codon/tRNA in succession (instead of choosing a different synonymous one) during the translation of a protein sequence. Stu...

Dynamic Programming (DP) problems arise in a wide range of application areas spanning from logistics to computational biology. In this paper, we show how to obtain high-performing parallel implementations for a class of DP problems by reducing them to highly optimizable flexible kernels through cache-oblivious recursive divide-and-conquer (CORDAC)....

State-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best parallelism at the same time. While these algorithms achieve cache-optimality through the use of a recursi...

We show that cache-oblivious recursive divide-and-conquer (CORDAC) algorithms for solving some popular dynamic programming problems are more energy efficient and have more flexibility in the runtime-power tradeoff than their corresponding iterative or tiled algorithms. Our experimental results show that CORDAC algorithms are more robust in terms of...

We define the range 1 query (R1Q) problem as follows. Given a d-dimensional (d≥1) input bit matrix A (consisting of 0's and 1's), preprocess A so that for any given region R of A, efficiently answer queries asking if R contains a 1 or not. We consider both orthogonal and non-orthogonal shapes for R including rectangles, axis-parallel right-triangle...

The state-of-the-art "trapezoidal decomposition algorithm" for stencil computations on modern multicore machines uses recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity cache-obliviously. But the same DAC approach restricts parallelism by introducing artificial dependencies among subtasks in addition to those arisi...

Molecular mechanics and dynamics simulations use distance based cutoff approximations for faster computation of pairwise van der Waals and electrostatic energy terms. These approximations traditionally use a precalculated and periodically updated list of interacting atom pairs, known as the "nonbonded neighborhood lists" or nblists, in order to red...

Dynamic Programming (DP) provides optimal solutions to a problem by combining optimal solutions to many overlapping subproblems. DP algorithms exploit this overlapping property to explore otherwise exponential-sized problem spaces in polynomial time, making them central to many important applications spanning from logistics to computational...

This paper introduces the kissing problem: given a rectangular room with n people in it, what is the most efficient way for each pair of people to kiss each other goodbye? The room is viewed as a set of pixels that form a subset of the integer grid. At most one person can stand on a pixel at once, and people move horizontally or vertically. In orde...

Rapid developments of multicore processors in the last ten years have accelerated the advancements in concurrency platforms. Performance of bottom-up resolution algorithms used in logic programming and artificial intelligence systems can potentially be improved using the parallel programming constructs offered by these platforms (e.g., OpenMP, Cilk...

We address the design of algorithms for multicores that are oblivious to machine parameters. We propose HM, a multicore model consisting of a parallel shared-memory machine with hierarchical multi-level caching, and we introduce a multicore-oblivious approach to algorithms and schedulers for HM. A multicore-oblivious algorithm is specified with no...

Computing the polarization energy between a ligand (i.e., a small molecule such as a drug molecule) and a receptor (e.g., a virus molecule) is of utmost importance in drug design. We have designed and implemented distributed-memory and distributed-shared-memory parallel algorithms for approximating GB-polarization energy (e.g., polar part of free e...

Supplemental materials.
(PDF)

Motivation:
Computational simulation of protein-protein docking can expedite the process of molecular modeling and drug discovery. This paper reports on our new F²Dock protocol which improves the state of the art in initial stage rigid body exhaustive docking search, scoring and ranking by introducing improvements in the shape-complementarity a...

When a molecule experiences an electric field, its charge distribution is relaxed in response to that field. The energy associated with this relaxation is known as the polarization energy. Computing the polarization energy between a ligand (i.e., a small molecule such as a drug molecule) and a receptor (e.g., a virus molecule) is of utmost importa...

A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification o...
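The definition in the first sentence can be illustrated with a plain looping implementation of a 3-point stencil on a periodic one-dimensional grid. This is a naive sketch for illustration only, not Pochoir output or a trapezoidal decomposition; `heat_1d` and the coefficient `alpha` are assumptions.

```python
def heat_1d(grid, steps, alpha=0.25):
    """Apply a 3-point heat-equation stencil `steps` times on a periodic 1D grid."""
    n = len(grid)
    current = list(grid)
    for _ in range(steps):
        # Each point is updated from itself and its two nearest neighbors;
        # index -1 and the modulo wrap around the ends of the grid.
        current = [
            current[i] + alpha * (current[i - 1] - 2.0 * current[i] + current[(i + 1) % n])
            for i in range(n)
        ]
    return current
```

Every time step sweeps the whole grid, so once the grid exceeds the cache this loop order reuses little data between steps; that is the inefficiency cache-efficient stencil algorithms aim to avoid.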

The functions of proteins are often realized through their mutual interactions. Determining a relative transformation for a pair of proteins and their conformations which form a stable complex, reproducible in nature, is known as docking. It is an important step in drug design, structure determination, and understanding function and structure relat...

We present the 'Dynamic Packing Grid' (DPG), a neighborhood data structure for maintaining and manipulating flexible molecules and assemblies, for efficient computation of binding affinities in drug design or in molecular dynamics calculations.
DPG can efficiently maintain the molecular surface using only linear space and supports quasi-constant ti...

We consider triply-nested loops of the type that occur in the standard Gaussian elimination algorithm, which we denote by GEP (or the Gaussian Elimination Paradigm). We present two related cache-oblivious methods I-GEP and C-GEP, both of which reduce the number of cache misses incurred (or I/Os performed) by the computation over that performed by s...

Bio-molecules reach their stable configuration in solvent which is primarily water with a small concentration of salt ions. One approximation of the total free energy of a bio-molecule includes the classical molecular mechanical energy EMM (which is understood as the self intra-molecular energy in vacuum) and the solvation energy Gsol which is caus...

We present efficient cache-oblivious algorithms for some well-studied string problems in bioinformatics including the longest common subsequence, global pairwise sequence alignment and three-way sequence alignment (or median), both with affine gap costs, and RNA secondary structure prediction with simple pseudoknots. For each of these problems, we...

We address the design of algorithms for multicores that are oblivious to machine parameters. We propose HM, a multicore model consisting of a parallel shared-memory machine with hierarchical multi-level caching, and we introduce a multicore-oblivious (MO) approach to algorithms and schedulers for HM. An MO algorithm is specified with no mention of...

We address the design of parallel algorithms that are oblivious to machine parameters for two dominant machine configurations: the chip multiprocessor (or multicore) and the network of processors. First, and of independent interest, we propose HM, a hierarchical multi-level caching model for multicores, and we propose a multicore-oblivious approach...

We present cache-efficient chip multiprocessor (CMP) algorithms with good speed-up for some widely used dynamic programming algorithms. We consider three types of caching systems for CMPs: D-CMP with a private cache for each core, S-CMP with a single cache shared by all cores, and Multicore, which has private L1 caches and a shared L2 cache. We der...

This paper presents a multicore-cache model that reflects the reality that multicore processors have both per-processor private (L1) caches and a large shared (L2) cache on chip. We consider a broad class of parallel divide-and-conquer algorithms and present a new on-line scheduler, controlled-pdf, that is competitive with the standard sequential...

We consider the problem of preprocessing an edge-weighted directed graph G to answer queries that ask for the shortest distance from any given node x to any other node y avoiding an arbitrary failed node or link. We describe an oracle (i.e., a simple data structure) for such queries that can be stored in $O(n^2 \log n)$ space, and which allows queries t...

The Gaussian Elimination Paradigm (GEP) was introduced by the authors in [6] to represent the triply-nested loop computation that occurs in several important algorithms including Gaussian elimination without pivoting and Floyd-Warshall's all-pairs shortest paths algorithm. An efficient cache-oblivious algorithm for these instances of GEP was presen...
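The triply-nested loop shape referred to here is easiest to see in the iterative Floyd-Warshall algorithm, sketched below. This is the standard iterative form, not the cache-oblivious algorithm of the paper; the function name and the matrix input convention are assumptions.

```python
import math

def floyd_warshall(dist):
    """All-pairs shortest paths; `dist` is an n x n matrix with math.inf for missing edges."""
    n = len(dist)
    d = [row[:] for row in dist]  # work on a copy of the input matrix
    for k in range(n):            # outer loop: allowed intermediate vertex
        for i in range(n):
            for j in range(n):
                # Update d[i][j] from d[i][k] and d[k][j], the access pattern
                # shared by the triply-nested loops described above.
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```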

The cache-oblivious Gaussian Elimination Paradigm (GEP) was introduced by the authors in (6) to obtain efficient cache-oblivious algorithms for several important problems that have algorithms with triply-nested loops similar to those that occur in Gaussian elimination. These include Gaussian elimination and LU-decomposition without pivoting, all-p...

We consider a model of recommendation systems, where each member from a given set of players has a binary preference to each element in a given set of objects: intuitively, each player either likes or dislikes each object. However, the ...

We present efficient cache-oblivious algorithms for several fundamental dynamic programs. These include new algorithms with improved cache performance for longest common subsequence (LCS), edit distance, gap (i.e., edit distance with gaps), and least weight subsequence. We present a new cache-oblivious framework called the Gaussian Elimination...

We present several new external-memory algorithms for finding all-pairs shortest paths in a V-node, E-edge undirected graph. For all-pairs shortest paths and diameter in unweighted undirected graphs we present cache-oblivious algorithms with $O\left(V \cdot \frac{E}{B}\log_{M/B}\frac{E}{B}\right)$ I/Os, where B is the block-size and M is the size of internal memory. For weighted undi...

We present the Buffer Heap (BH), a cache-oblivious priority queue that supports Delete-Min, Delete, and Decrease-Key operations in $O\left(\frac{1}{B}\log_2\frac{N}{B}\right)$ amortized block transfers from external memory, where B is the (unknown) block-size and N is the maximum number of elements in the queue. As is common in cache-oblivious algorithms, we assume a 'ta...

In theory, increasing alias analysis precision should improve compiler optimizations on C programs. This paper compares alias analysis algorithms on scalar optimizations, including an analysis that assumes no aliases, to establish a very loose upper bound on optimization opportunities. We then measure optimization opportunities on thirty-six C p...

In this paper, we present improved algorithms for min-max pair heaps introduced by S. Olariu et al. (A Mergeable Double-ended Priority Queue - The Comp. J. 34, 423-427, 1991). We also show that in the worst case, this structure, though slightly costlier to create, is better than min-max heaps of Strothotte (Min-max Heaps and Generalized Priority Qu...

In this paper a new exact string-matching algorithm with sub-linear average case complexity has been presented. Unlike other sub-linear string-matching algorithms it never performs more than n text character comparisons while working on a text of length n. It requires only $O(m + s)$ extra pre-processing time and space, where m is the length of the pat...

We consider the problem of preprocessing an edge-weighted directed graph to answer queries that ask for the shortest path from any given vertex to another avoiding a failed link. We present two algorithms that improve on earlier results for this problem. Our first algorithm, which is a modification of an earlier method, improves the query time to a...

In this paper, lower and upper bounds for min-max pair heap construction have been presented. It has been shown that the construction of a min-max pair heap with n elements requires at least 2.07n element comparisons. A new algorithm for creating a min-max pair heap has been devised that lowers the upper bound to 2.43n.

We present a new data structure for Huffman coding in which in addition to sending symbols in order of their appearance in the Huffman tree one needs to send codes of all circular leaf nodes (nodes with two adjacent external nodes), the number of which is always bounded above by half the number of symbols. We decode the text by using the memory eff...

In this paper we present a new sorting algorithm for heaps which can sort $n$ ($= 2^{h+1} - 1$) elements using no more than $n\log_2(n+1) - (13/12)n - 1$ element comparisons in the worst case (including the heap creation phase). Experimental results show that this algorithm requires only $n\log_2(n+1) - 1.2n$ element comparisons in the average case. However it require...

In this paper we present a new data structure for double ended priority queue, called min-max fine heap, which combines the techniques used in fine heap and traditional min-max heap. The standard operations on this proposed structure are also presented, and their analysis indicates that the new structure outperforms the traditional one.

In this paper, we present a new mergesort algorithm which can sort $n$ ($= 2^{h+1} - 1$) elements using no more than element comparisons in the worst case. This algorithm includes the heap (fine heap) creation phase as a pre-processing step, and for each internal node v, its left and right subheaps are merged into a sorted list of the elements under that n...

McDiarmid and Reed (1989) presented a variant of BOTTOM-UP-HEAPSORT which requires $n\log_2 n + n$ element comparisons (for $n = 2^{h+1} - 1$) in the worst case, but requires an extra storage of n bits. Ingo Wegener (1992) has analyzed the average and worst case complexity of the algorithm which is very complex and long. In this paper we present a simplified co...

In this paper an iterative algorithm has been presented for calculating the square root of a real number with arbitrary order of convergence using formulae derived by applying binomial theorem. The primary objective is to reduce the number of division operations required.

This paper presents a theorem that asserts that average edge length of the minimum spanning tree of a complete graph on n + 1 vertices is less than or equal to the average edge length of all the n + 1 minimum spanning trees of the induced graph on n vertices. The result is also in compliance with results given by Frieze and Steele. © 1999 Elsevier Sci...

An iterative algorithm based on binary search has been presented for finding the mode of a sorted array and its frequency. Complexity of the algorithm has been deduced. Numerical experiments show its supremacy over the iterative implementation of Griffiths' algorithm.
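One way binary search can be used for this task is sketched below: each run of equal values is skipped with a single search for its right end. This is a hypothetical illustration and may differ from the algorithm of the paper; `sorted_mode` is an assumed name.

```python
from bisect import bisect_right

def sorted_mode(values):
    """Return (mode, frequency) of a non-empty sorted list."""
    best_value, best_count = values[0], 0
    start = 0
    while start < len(values):
        # Binary search for the first index past the run that begins at `start`.
        end = bisect_right(values, values[start], lo=start)
        if end - start > best_count:
            best_value, best_count = values[start], end - start
        start = end
    return best_value, best_count
```

The loop performs one binary search per distinct value, so it does the least work when the array contains few distinct values. Ties are resolved in favor of the smallest value.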

We study the impact of using different priority queues in the performance of Dijkstra's SSSP algorithm. We consider only general priority queues that can handle any type of keys (integer, floating point, etc.); the only exception is that we use as a benchmark the DIMACS Challenge SSSP code (1) which can handle only integer values for distances. Our...
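For context, a baseline Dijkstra implementation over a binary heap might look like the sketch below. It uses Python's `heapq` with lazy deletion in place of Decrease-Key and is not one of the priority queues evaluated in the study; the adjacency-dictionary format is an assumption.

```python
import heapq

def dijkstra(adj, source):
    """Single-source shortest paths; `adj` maps u to a list of (v, weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale entry: a shorter path to u was already settled
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                # Push a new entry instead of decreasing the key in place.
                heapq.heappush(heap, (nd, v))
    return dist
```

The lazy-deletion choice means the heap can hold several entries per vertex, one of the trade-offs that distinguishes priority queues with and without a native Decrease-Key.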

Not available Computer Sciences

We present theoretical and experimental results on cache-efficient and parallel algorithms for some well-studied string problems in bioinformatics: 1. Pairwise alignment. Optimal pairwise global sequence alignment using affine gap penalty; 2. Median. Optimal alignment of three sequences using affine gap penalty; 3. RNA secondary structure...

We present the results of an extensive computational study of an I/O-optimal cache-oblivious LCS (longest common subsequence) algorithm developed by Chowdhury and Ramachandran. Three variants of the algorithm were implemented (CO denoting the fastest variant) along with the widely used linear-space LCS algorithm by Dan Hirschberg (denoted Hi). Both...
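The LCS recurrence behind both algorithms can be sketched with a two-row table that uses linear space. This computes only the LCS length; it is neither the cache-oblivious algorithm nor Hirschberg's full sequence-recovering algorithm, and `lcs_length` is an assumed name.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of x and y, in O(min(|x|, |y|)) space."""
    if len(y) > len(x):
        x, y = y, x  # keep the shorter sequence along the row
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)  # extend the match diagonally
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```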