# Sardar Anisul Haque

Geomechanica Inc. (Canada) & Qassim University (KSA) · PhD, University of Western Ontario

About

RG Score: 5.21

- 14 Research items
- 828 Reads
- 46 Citations

Introduction

http://www.geomechanica.com/

**Skills and Expertise**

Algorithms · Parallel and Distributed Computing · High Performance Computing · Parallel Programming · Scientific Computing · Parallel Processing · GPU Programming · Scientific Computation · GPU · Computational Science · GPU Computing · Scientific Programming · Code Development · Parallel Algorithms · Parallel & Distributed Systems · Data-Intensive Computing · GPGPU · Compute Unified Device Architecture · Multicore and Manycore

Research Experience

Jun 2018

**Geomechanica Inc.**

- Toronto, Ontario, Canada

Position

- Visiting Scientist at Geomechanica Inc.


Nov 2015 - Feb 2017

**Geomechanica Inc.**

- High Performance Computing
- Toronto, Ontario, Canada

Position

- Ontario Centre of Excellence Postdoctoral Fellow

Description

- Applying high performance computing techniques to problems in rock mechanics. Technologies used: OpenCL and C++.

Education

Jan 2009 - Dec 2013

Sep 2007 - Dec 2008

Jan 1999 - Oct 2002

Current institution

Geomechanica Inc. (Canada) & Qassim University (KSA)

Current position

- Visiting Scientist at Geomechanica Inc. & Assistant Professor at Qassim University

Top co-authors

Network

Co-authors

Followers

Following

Projects

Projects (1)

Project

This R&D project focuses on the development, implementation and validation of a 3D geomechanical simulation software based on the finite-discrete element method (FDEM). With this software, named Irazu, the parallel processing power of general-purpose graphics processing units (GPGPUs) is leveraged to gain significant computational performance boosts compared to existing sequential FDEM codes.

Research

Research Items (14)

- Aug 2018

This paper presents a novel implementation of a hydro-mechanically coupled, finite-discrete element method (FDEM) optimized to exploit the computing parallelism of graphics processing units (GPUs). A co-processing approach is adopted with the control loop of FDEM executed serially on the CPU and compute-intensive tasks off-loaded to the GPU. A benchmarking study indicates speedups of up to 100× compared to sequential CPU execution. The implementation is validated by comparing 3D laboratory-scale rock fracturing simulations with experimental results. The effectiveness of the approach for practical rock engineering applications is demonstrated through the back analysis of a slope in a fractured rock mass.
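The CPU-GPU co-processing pattern described above can be sketched in miniature. The following Python stand-in (a 1D spring chain standing in for the FDEM solid; all names hypothetical, and plain loops playing the role of GPU kernels) shows the serial control loop repeatedly offloading the compute-intensive force step:

```python
def compute_forces(x, k):
    """Compute-intensive step that the paper offloads to the GPU;
    here: spring forces in a 1D chain of unit-spaced nodes."""
    ext = [x[i + 1] - x[i] - 1.0 for i in range(len(x) - 1)]
    f = [0.0] * len(x)
    for i, e in enumerate(ext):
        f[i] += k * e        # spring pulls the left node forward
        f[i + 1] -= k * e    # ...and the right node backward
    return f

def run(n=8, steps=100, dt=1e-3, k=100.0, m=1.0):
    """Serial CPU control loop: each iteration 'launches' the heavy
    force computation, then does an explicit integration step."""
    x = [float(i) for i in range(n)]  # nodes at rest spacing 1.0
    x[-1] += 0.1                      # perturb the last node
    v = [0.0] * n
    for _ in range(steps):
        f = compute_forces(x, k)      # stand-in for a kernel launch
        v = [vi + dt * fi / m for vi, fi in zip(v, f)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In the real code, only the control flow stays on the CPU; the per-element loops become data-parallel kernels.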

- Jan 2018

The CUDA Modular Polynomial (CUMODP) Library implements arithmetic operations for dense matrices and dense polynomials, primarily with modular integer coefficients. Some operations are available for integer or floating point coefficients. Similar to other software libraries targeting Graphics Processing Units (GPUs), like cuBLAS, CUMODP focuses on efficiency-critical routines and provides them in the form of device functions and CUDA kernels. Hence, these routines are primarily designed to offer GPU support to polynomial system solvers. A bivariate system solver is part of the library, as a proof of concept. Its implementation is presented in [10] and it has been integrated into Maple's Triangularize command since release 18 of Maple.
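To illustrate the kind of dense modular arithmetic CUMODP provides, here is a minimal serial Python sketch of plain polynomial multiplication over Z/pZ (not the library's CUDA implementation; coefficient lists are in ascending order):

```python
def poly_mul_mod(a, b, p):
    """Plain (schoolbook) product of dense polynomials with
    coefficients in Z/pZ; a[i] is the coefficient of x**i."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % p
    return c
```

Each coefficient of the product is independent of the others, which is exactly the fine-grained parallelism a GPU kernel exploits.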

- Jul 2017
- the International Workshop

We present multithreaded adaptations of the Euclidean plain division and Euclidean GCD algorithms to many-core GPU architectures. We report on an implementation with NVIDIA CUDA and a complexity analysis with an enhanced version of the PRAM model.
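The two building blocks can be sketched serially in Python (coefficients in ascending order, p assumed prime; this omits the multithreaded GPU mapping that the paper is actually about):

```python
def poly_divmod_mod(a, b, p):
    """Plain Euclidean division of dense polynomials over Z/pZ
    (a[i] is the coefficient of x**i). Returns (quotient, remainder)."""
    a = a[:]
    q = [0] * max(len(a) - len(b) + 1, 1)
    inv_lead = pow(b[-1], p - 2, p)   # inverse of the leading coefficient
    for k in range(len(a) - len(b), -1, -1):
        q[k] = a[k + len(b) - 1] * inv_lead % p
        for j in range(len(b)):
            a[k + j] = (a[k + j] - q[k] * b[j]) % p
    while len(a) > 1 and a[-1] == 0:  # strip leading zeros of the remainder
        a.pop()
    return q, a

def poly_gcd_mod(a, b, p):
    """Euclidean GCD over Z/pZ, normalized to be monic."""
    while len(b) > 1 or b[0] != 0:    # while b is not the zero polynomial
        _, r = poly_divmod_mod(a, b, p)
        a, b = b, r
    inv = pow(a[-1], p - 2, p)
    return [c * inv % p for c in a]
```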

- Sep 2014
- International Workshop on Computer Algebra in Scientific Computing

We propose parallel algorithms for operations on univariate polynomials (multi-point evaluation, interpolation) based on subproduct tree techniques and targeting many-core GPUs. On those architectures, we demonstrate the importance of adaptive algorithms, in particular the combination of parallel plain arithmetic and parallel FFT-based arithmetic. Experimental results illustrate the benefits of our algorithms.
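A serial Python sketch of the subproduct-tree technique for multi-point evaluation (for clarity it rebuilds subproducts at each recursion step instead of caching the tree, and it uses plain rather than FFT-based remaindering):

```python
def _mul(a, b, p):
    """Plain product of dense polynomials over Z/pZ."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % p
    return c

def _rem(a, b, p):
    """Remainder of a modulo b over Z/pZ (b monic-izable, p prime)."""
    a = a[:]
    inv = pow(b[-1], p - 2, p)
    for k in range(len(a) - len(b), -1, -1):
        q = a[k + len(b) - 1] * inv % p
        for j in range(len(b)):
            a[k + j] = (a[k + j] - q * b[j]) % p
    return a[:len(b) - 1] or [0]

def eval_tree(f, points, p):
    """Multi-point evaluation: split the points, reduce f modulo each
    half's subproduct prod(x - x_i), and recurse on the remainders."""
    if len(points) == 1:
        return [_rem(f, [-points[0] % p, 1], p)[0]]  # f mod (x - x0) = f(x0)
    mid = len(points) // 2
    lo, hi = points[:mid], points[mid:]
    m_lo, m_hi = [1], [1]
    for x in lo:
        m_lo = _mul(m_lo, [-x % p, 1], p)
    for x in hi:
        m_hi = _mul(m_hi, [-x % p, 1], p)
    return (eval_tree(_rem(f, m_lo, p), lo, p) +
            eval_tree(_rem(f, m_hi, p), hi, p))
```

The adaptivity the abstract mentions amounts to choosing plain or FFT-based `_mul`/`_rem` depending on the size of the node.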

- Aug 2014
- International Congress on Mathematical Software

CUMODP is a CUDA library for exact computations with dense polynomials over finite fields. A variety of operations like multiplication, division, computation of subresultants, multi-point evaluation, interpolation and many others are provided. These routines are primarily designed to offer GPU support to polynomial system solvers and a bivariate system solver is part of the library. Algorithms combine FFT-based and plain arithmetic, while the implementation strategy emphasizes reducing parallelism overheads and optimizing hardware usage.

- Feb 2014

We present a model of multithreaded computation, combining fork-join and single-instruction-multiple-data parallelisms, with an emphasis on estimating the parallelism overheads of programs written for modern many-core architectures. We establish a Graham-Brent theorem for this model so as to estimate the execution time of programs running on a given number of streaming multiprocessors. We evaluate the benefits of our model with four fundamental algorithms from scientific computing. In each case, our model is used to minimize parallelism overheads by determining an appropriate value range for a given program parameter; moreover, experimentation confirms the model's predictions.
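For reference, the classical Graham-Brent bound that such models refine states that a greedy schedule of a computation with work $T_1$ and span $T_{\infty}$ on $p$ processors completes in time

```latex
T_p \le \frac{T_1}{p} + T_{\infty}
```

The paper's theorem plays the analogous role for a given number of streaming multiprocessors, with overhead terms specific to many-core GPUs.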

- Feb 2014

The objective of high performance computing (HPC) is to ensure that the computational power of hardware resources is well utilized to solve a problem. Various techniques are employed to achieve this goal: improving algorithms to reduce the number of arithmetic operations, modifying data access patterns or rearranging data to reduce memory traffic, optimizing code at all levels, and designing parallel algorithms with smaller span or reduced overhead are some of the areas that HPC researchers work on.

In this thesis, we investigate HPC techniques for the implementation of basic routines in computer algebra targeting hardware acceleration technologies. We start with a sorting algorithm and its application to sparse matrix-vector multiplication, for which we focus on cache complexity issues. Since basic routines in computer algebra often provide a lot of fine-grain parallelism, we then turn our attention to many-core architectures, on which we consider dense polynomial and matrix operations ranging from plain to fast arithmetic. Most of these operations are combined within a bivariate system solver running entirely on a graphics processing unit (GPU).
- Oct 2012

As with serial code on CPUs, parallel code on GPUs for dense polynomial arithmetic relies on a combination of asymptotically fast and plain algorithms, employed for data of large and small size, respectively. Parallelizing both types of algorithms is required in order to achieve peak performance. In this paper, we show that plain dense polynomial multiplication can be efficiently parallelized on GPUs. Remarkably, it outperforms (highly optimized) FFT-based multiplication up to degree 2^12, while on CPUs the same threshold is usually around 2^6. We also report on a GPU implementation of the Euclidean algorithm which is both work-efficient and runs in linear time for input polynomials up to degree 2^18, thus exhibiting the performance of GCD algorithms based on systolic arrays.
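The FFT-based side of this trade-off can be illustrated with a serial number-theoretic transform (a Python sketch over the Fermat prime p = 257, which has primitive root 3; the paper's GPU kernels are far more elaborate):

```python
def ntt(a, w, p):
    """Recursive radix-2 number-theoretic transform of a over Z/pZ;
    w must be a primitive root of unity of order len(a)."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], w * w % p, p)
    odd = ntt(a[1::2], w * w % p, p)
    out, t = [0] * n, 1
    for i in range(n // 2):
        out[i] = (even[i] + t * odd[i]) % p
        out[i + n // 2] = (even[i] - t * odd[i]) % p
        t = t * w % p
    return out

def fft_mul(a, b, p, g):
    """FFT-based product mod p: pad to a power of two, transform,
    multiply pointwise, inverse-transform; g is a primitive root
    mod p and the product degree must fit the available 2-power."""
    n = 1
    while n < len(a) + len(b) - 1:
        n *= 2
    w = pow(g, (p - 1) // n, p)           # primitive n-th root of unity
    fa = ntt(a + [0] * (n - len(a)), w, p)
    fb = ntt(b + [0] * (n - len(b)), w, p)
    fc = [x * y % p for x, y in zip(fa, fb)]
    c = ntt(fc, pow(w, p - 2, p), p)      # inverse transform via w^(-1)
    inv_n = pow(n, p - 2, p)
    return [x * inv_n % p for x in c[:len(a) + len(b) - 1]]
```

The crossover the abstract reports is the input size below which the plain schoolbook product beats this O(n log n) scheme on a given architecture.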

- Feb 2012

We report on a GPU implementation of the condensation method designed by Abdelmalek Salem and Kouachi Said for computing the determinant of a matrix. We consider two types of coefficients: modular integers and floating point numbers. We evaluate the performance of our code by measuring its effective bandwidth and argue that it is numerically stable in the floating point case. In addition, we compare our code with serial implementations of determinant computation from well-known mathematical packages. Our results suggest that a GPU implementation of the condensation method has a large potential for improving those packages in terms of running time and numerical stability.
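The condensation idea can be illustrated with Chió's pivotal condensation, a close classical relative (the Salem-Kouachi formulation differs in detail; this serial Python sketch is for illustration only):

```python
def det_condensation(A):
    """Determinant by Chio pivotal condensation: repeatedly shrink an
    n x n matrix to the (n-1) x (n-1) matrix of 2 x 2 minors against
    the pivot, with det(A) = det(B) / pivot**(n-2) at each step."""
    A = [row[:] for row in A]
    n, sign, div = len(A), 1, 1.0
    while n > 1:
        if A[0][0] == 0:                  # swap in a row with nonzero pivot
            for r in range(1, n):
                if A[r][0] != 0:
                    A[0], A[r] = A[r], A[0]
                    sign = -sign
                    break
            else:
                return 0.0                # first column all zero
        piv = A[0][0]
        A = [[piv * A[i][j] - A[i][0] * A[0][j] for j in range(1, n)]
             for i in range(1, n)]
        div *= piv ** (n - 2)
        n -= 1
    return sign * A[0][0] / div
```

Each of the 2 x 2 minors is independent, which is why condensation steps map well onto a GPU.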

- Jan 2010
- Proceedings of the 4th International Workshop on Parallel Symbolic Computation, PASCO 2010, July 21-23, 2010, Grenoble, France

Sparse matrix-vector multiplication (SpMxV) is an important kernel in scientific computing; for example, it is the main computational step of the conjugate gradient method. Though the total number of arithmetic operations in SpMxV is fixed, reducing the probability of cache misses per operation is still a challenging area of research. In this work, we present a new column ordering algorithm for sparse matrices. We analyze the cache complexity of SpMxV when the matrix is ordered by our technique. Numerical experiments with very large test matrices clearly demonstrate the performance gains rendered by the proposed technique.
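For a matrix in compressed sparse row (CSR) form, the kernel in question is just the following loop; the irregular indexed reads into `x` are where ordering pays off:

```python
def spmv_csr(vals, col_idx, row_ptr, x):
    """y = A @ x with A in CSR form: vals holds the nonzeros row by
    row, col_idx their column indices, row_ptr the row boundaries."""
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += vals[k] * x[col_idx[k]]  # irregular access into x
        y.append(s)
    return y
```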

- Apr 2009
- Computing, Engineering and Information, 2009. ICC '09. International Conference on

We revisit ordering techniques as a preprocessing step for improving the performance of sparse matrix-vector multiplication (SpM$\times$V) on modern hierarchical memory computers. In computing SpM$\times$V, the main purpose of ordering columns (or rows) is to improve performance by enhancing data reuse. We present a new ordering technique based on binary reflected Gray codes and experimentally evaluate and compare it with other column ordering techniques from the literature. The results from numerical experiments with very large test matrices clearly demonstrate the performance gains rendered by our proposed technique.
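The heart of the technique is the binary reflected Gray code (BRGC), in which consecutive codes differ in a single bit. A simplified Python sketch (sorting columns by the BRGC rank of their row-sparsity bitmask is a stand-in for the paper's actual ordering):

```python
def gray(i):
    """i-th binary reflected Gray code."""
    return i ^ (i >> 1)

def inverse_gray(g):
    """Rank of bit pattern g in the BRGC sequence."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def brgc_column_order(masks):
    """Permutation of column indices sorted by the BRGC rank of each
    column's row-sparsity bitmask, so that consecutive columns tend
    to share nonzero rows (simplified stand-in for the paper's order)."""
    return sorted(range(len(masks)), key=lambda j: inverse_gray(masks[j]))
```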

- Jan 2004
- Evolutionary Computation, 2003. CEC '03. The 2003 Congress on

The paper presents a genetic algorithm with a fuzzy logic controller for determining an opportunistic replacement policy for deteriorating components of an equipment or system. An opportunistic replacement model has been formulated by considering the dynamics of the decision process of such a policy. In order to avoid the computational burden of completely enumerating all possible policies, a genetic algorithm is used to find a near-optimal solution by maximizing the net benefit gained from an opportunistic replacement. A fuzzy logic controller automatically adjusts the fine-tuning structure of the genetic algorithm parameters. The performance of the model and the solution procedure have been evaluated on a number of case problems, demonstrating that the proposed method is effective.
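A minimal GA skeleton with a crude adaptive mutation rate can sketch the idea (the adaptation rule here is a simplistic stand-in for the paper's fuzzy logic controller, and the opportunistic-replacement encoding and benefit function are omitted; all names hypothetical):

```python
import random

def ga_maximize(fitness, n_bits=12, pop_size=30, gens=60, seed=1):
    """Bit-string GA: truncation selection, one-point crossover,
    bit-flip mutation whose rate adapts to recent progress."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    mut = 0.05
    best = max(pop, key=fitness)
    for _ in range(gens):
        parents = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)          # one-point crossover
            child = [bit ^ (rng.random() < mut)     # bit-flip mutation
                     for bit in a[:cut] + b[cut:]]
            children.append(child)
        pop = children
        new_best = max(pop, key=fitness)
        if fitness(new_best) > fitness(best):
            best = new_best
            mut = max(0.01, mut * 0.9)  # progress: exploit (less mutation)
        else:
            mut = min(0.3, mut * 1.1)   # stagnation: explore more
    return best
```

In the paper, a fuzzy controller replaces the two fixed adaptation rules, inferring parameter adjustments from measures of population fitness.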

Abstract: The efficiency of linear algebra operations for sparse matrices on modern high performance computing systems is often constrained by the available memory bandwidth. We are interested in sparse matrices whose sparsity pattern is unknown. In this thesis, we study the efficiency of the major storage schemes for sparse matrices during multiplication with a dense vector. A proper reordering of columns or rows usually results in reduced memory traffic due to improved data reuse. This thesis also proposes an efficient column ordering algorithm based on the binary reflected Gray code. Computational experiments show that this ordering results in increased performance in computing the product of a sparse matrix with a dense vector.

Summary: Recent biological research has corroborated that gene sequence variants have a role in the development and progression of common diseases. Technological constraints restrict us from collecting haplotype data directly; instead we collect genotype data. To infer haplotype data from genotype data, Haplotype Inference by Pure Parsimony (HIPP), which minimizes the number of distinct haplotypes needed to explain a given set of genotypes, is a good option. HIPP can be reduced to an equivalent boolean satisfiability (SAT) problem. The performance of this approach depends dramatically on the choice of branching rules and preprocessing steps. In this paper, we experiment with different combinations of preprocessing choices and branching rules of a SAT solver. This paper proposes a solution to the HIPP problem based on this SAT model, implemented in a distributed environment. Keeping the complexity of the search problem in mind, we developed a SAT solver for a distributed environment. Finally, we tested a set of problem instances under the combination of six branching rules and three preprocessing choices to decide which variant of the SAT model is best for HIPP.